CSCI-GA 2590: Natural Language Processing
1 Data
Datasets. There are two datasets under the data directory.
1. Ice cream climatology dataset (ic). This is a synthetic toy dataset for you to test and debug your
code. We strongly encourage you to start with this easy dataset. The observations are the number of
ice cream cones eaten by Jason each day. The hidden states are the weather on each day (C for cold and
H for hot). You will use ictrain for training and ictest for testing. However, this dataset is mainly
for debugging purposes, and you do not need to submit any code for it.
2. English POS tagging dataset (en). This dataset consists of word sequences and the POS tag for
each word. We use coarse English tags in this homework, i.e., only the first character of the original
tag is used; e.g., the different types of nouns (NN, NNP, NNS) are all reduced to simply nouns (N). The tagset
is described in Table 1. You will develop your model using entrain and endev. We will run your code
for training and testing on a hidden test set.
File format. Each line has a single word/tag pair separated by the / character. Punctuation marks count
as words. The special word ### is used for sentence boundaries and is always tagged with ###. You can think
of it as playing the role of both the special start symbol * and the stop symbol STOP. When you generate ###, that represents a
decision to end a sentence (so ### = STOP). When you then generate the next tag, conditioning on ### as
the previous tag means that you are at the beginning of a new sentence (so ### = *).
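For concreteness, here is a minimal sketch of a reader for this format. The function name read_tagged and the flat list-of-pairs layout are our own choices, not part of the assignment.

    def read_tagged(path):
        """Read word/tag pairs, one per line, keeping the ### boundary markers."""
        pairs = []
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                # Split on the *last* slash so words that contain '/' still parse.
                word, _, tag = line.rpartition("/")
                pairs.append((word, tag))
        return pairs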
2 Evaluation
We will evaluate your program based on the model's accuracy and the training speed. We also suggest
printing additional metrics during model debugging and development.
accuracy percentage of test tokens that received the correct tag, excluding the sentence boundary markers
###.
known word accuracy consider only tokens that appear in the training data.
novel word accuracy consider only tokens that do not appear in the training data, in which case the
model can only use the context to predict the tag.

C    Coordinating conjunction or Cardinal number
D    Determiner
E    Existential there
F    Foreign word
I    Preposition or subordinating conjunction
J    Adjective
L    List item marker (a., b., c., …) (rare)
M    Modal (could, would, must, can, might, …)
N    Noun
P    Pronoun or Possessive ending (’s) or Predeterminer
R    Adverb or Particle
S    Symbol, mathematical (rare)
T    The word to
U    Interjection (rare)
V    Verb
W    wh-word (question word)
###  Boundary between sentences
,    Comma
.    Period
:    Colon, semicolon, or dash
-    Parenthesis
`    Open quotation mark
'    Close quotation mark
$    Currency symbol

Table 1: Tags in the en dataset.
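Putting the three metrics together, here is a hedged sketch of the evaluation. All names are our own; it assumes parallel lists of test words, gold tags, and predicted tags, plus the set of words seen in training.

    def accuracies(words, gold, pred, train_vocab):
        """Overall, known-word, and novel-word accuracy, skipping ### tokens."""
        total = correct = known = known_correct = novel = novel_correct = 0
        for w, g, p in zip(words, gold, pred):
            if w == "###":                # boundary markers are excluded
                continue
            hit = int(g == p)
            total += 1
            correct += hit
            if w in train_vocab:          # seen in training: "known"
                known += 1
                known_correct += hit
            else:                         # unseen in training: "novel"
                novel += 1
                novel_correct += hit
        return (correct / total,
                known_correct / known if known else 0.0,
                novel_correct / novel if novel else 0.0)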
3 Implementing the Tagger
Dependencies. Your code should run with python3. You can use any native Python functions and the
numpy package. Please do not import additional libraries, as doing so will break the auto-grader.
Input/output. Implement a tagger that runs as follows (on the ic dataset):
python tag.py --train ictrain --test ictest
It reads in the training and test files, learns an HMM tagger, and writes a prediction file called test-output.
test-output should use the same format as the test file (i.e., each line has a single word/tag pair). We will
run your code to train a model and evaluate the predictions in test-output. (Note that during grading
we may randomize the tags in the test file, so please do not simply write out the test file the program reads in.)
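One possible skeleton for the command-line interface and the output writer is sketched below; it reuses the hypothetical read_tagged reader from Section 1, and the placeholder predictions only illustrate the output format.

    import argparse

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--train", required=True)
        parser.add_argument("--test", required=True)
        args = parser.parse_args()

        train_pairs = read_tagged(args.train)   # hypothetical reader from Section 1
        test_pairs = read_tagged(args.test)
        words = [w for w, _ in test_pairs]
        # ... train the HMM on train_pairs and predict a tag for each word ...
        predicted = ["###"] * len(words)        # placeholder; replace with real tags

        # test-output mirrors the input format: one word/tag pair per line.
        with open("test-output", "w") as f:
            for word, tag in zip(words, predicted):
                f.write(f"{word}/{tag}\n")

    if __name__ == "__main__":
        main()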
Key steps. You are free to implement the tagger however you want. Here are the suggested steps:
1. Read the training data and store the word-tag and tag-tag counts.
2. Compute the emission and transition probabilities in HMM.
3. Read the test data and compute the tag sequence that maximizes p(words,tags) given by the HMM.
4. Compute and print the accuracy on the test data and other metrics or statistics.
Make sure to do all computation in log space; a sketch of the decoding step follows.
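As one concrete instance of step 3, here is a hedged sketch of Viterbi decoding in log space. The interface is our own assumption: log_trans(s, t) returns log p(t | s) and log_emit(t, w) returns log p(w | t), both finite thanks to smoothing (discussed below).

    def viterbi(words, tags, log_trans, log_emit):
        """Return the tag sequence maximizing log p(words, tags).

        words is assumed to start and end with the ### boundary marker.
        """
        n = len(words)
        best = [{} for _ in range(n)]   # best[i][t]: best log-prob ending in tag t
        back = [{} for _ in range(n)]   # back[i][t]: previous tag on that path
        best[0]["###"] = 0.0            # log 1: the initial ### is given
        for i in range(1, n):
            for t in tags:
                # Log-probabilities add where probabilities would multiply;
                # ties are broken arbitrarily (here, by tag name).
                score, prev = max((best[i - 1][s] + log_trans(s, t), s)
                                  for s in best[i - 1])
                best[i][t] = score + log_emit(t, words[i])
                back[i][t] = prev
        # The final word is ###, which by convention is tagged ###;
        # follow the backpointers from there.
        seq = ["###"]
        for i in range(n - 1, 0, -1):
            seq.append(back[i][seq[-1]])
        seq.reverse()
        return seq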
Check your implementation on the ic dataset. Implement a vanilla HMM tagger using unsmoothed
counts and Viterbi decoding, as covered in lecture 6. The program tag.py should produce an accuracy of 87.88%
or 90.91%. There are two possible accuracies because of arbitrary tie breaking during decoding. Ties may seem to be
a problem on this toy data, but it is very rare for two sequences to have the same probability in realistic
datasets, so breaking ties arbitrarily is common in practice. Next, let's improve the tagger to make
it work on the en dataset.
Smoothed probabilities. To make the tagger work on real data, the first thing you need to implement
is smoothed probabilities. It is common to encounter unseen word-tag and tag-tag pairs at test time; without
smoothing, their zero probabilities become -inf in log space, and you may see nan warnings during inference. For example, you can use simple add-1 smoothing, as sketched below.
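Here is a minimal sketch of add-1 smoothed estimation under our own naming; it produces the log_trans and log_emit callables assumed by the Viterbi sketch above.

    import math
    from collections import Counter, defaultdict

    def train_smoothed(pairs):
        """Add-1 smoothed log-probabilities from training (word, tag) pairs."""
        emit_counts = defaultdict(Counter)   # tag -> Counter of emitted words
        trans_counts = defaultdict(Counter)  # tag -> Counter of following tags
        for (w, t), (_, t2) in zip(pairs, pairs[1:]):
            emit_counts[t][w] += 1
            trans_counts[t][t2] += 1
        w_last, t_last = pairs[-1]
        emit_counts[t_last][w_last] += 1     # the zip above misses the last pair

        tags = sorted(emit_counts)
        vocab_size = len({w for w, _ in pairs}) + 1  # +1 leaves mass for unseen words
        emit_totals = {t: sum(emit_counts[t].values()) for t in tags}
        trans_totals = {t: sum(trans_counts[t].values()) for t in tags}

        def log_emit(t, w):
            # add-1: (count(t, w) + 1) / (count(t) + |V|), never zero
            return math.log((emit_counts[t][w] + 1) / (emit_totals[t] + vocab_size))

        def log_trans(s, t):
            # add-1 over the tagset for transitions
            return math.log((trans_counts[s][t] + 1) / (trans_totals[s] + len(tags)))

        return tags, log_trans, log_emit

With these, viterbi(test_words, tags, log_trans, log_emit) runs without -inf scores, since every smoothed probability is strictly positive.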
Pruning. Now, if you try to run the tagger on the en dataset, you may find that it is pretty slow even with
our reduced coarse tagset. You can speed up the tagger by ignoring word-tag pairs that do not appear in the
training set: derive a tag dictionary from the training set that records the set of allowed tags for
each word, and only consider the allowed tags for each word during Viterbi decoding. For unseen words, all tags
are allowed. With pruning, we may miss the actual best tag sequence, but it significantly improves
decoding speed.
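A sketch of such a tag dictionary is below (the names are ours). Inside the Viterbi loop, the iteration over all tags would then become a loop over allowed_tags(words[i], tag_dict, tags).

    from collections import defaultdict

    def build_tag_dict(pairs):
        """Map each training word to the set of tags it was observed with."""
        tag_dict = defaultdict(set)
        for w, t in pairs:
            tag_dict[w].add(t)
        return tag_dict

    def allowed_tags(word, tag_dict, all_tags):
        # Seen words: only the tags observed with them in training
        # (this also restricts ### to the ### tag automatically).
        # Unseen words: the full tagset.
        return tag_dict.get(word, all_tags)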