Homework 1

Instructions: Please complete the following assignment in the groups to which you have been assigned. The end product should be a Jupyter notebook with well-documented code and a discussion of results. Include in your submitted notebook any auxiliary files needed to make the notebook run. Place all material in a zip file and upload it by 4pm on Friday, 29 January.
To begin the assignment, your group needs to select a textual database of your choice to analyze. You can use one that is in public circulation or, if you prefer, one of your own. In either case, you will need to share the data with Yi and me so that we can verify the notebook runs; we will not circulate it further.
Your assignment is the following:
Pre-process the documents in your dataset following the steps discussed in lecture (at the very least: tokenizing, stemming, and stopword removal). Which words are the most common based on overall counts? And which words in the corpus receive the highest tf-idf scores?
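As a starting point, here is a minimal pure-Python sketch of this pipeline on a toy three-document corpus (the documents, stopword list, and crude suffix-stripping "stemmer" are all placeholders; in your notebook you would use your own data and a real stemmer such as NLTK's Porter stemmer):

```python
import math
import re
from collections import Counter

# Toy corpus standing in for your chosen documents (hypothetical data)
docs = [
    "The economy grew strongly and markets were confident.",
    "The war effort dominated the economy and the budget.",
    "Markets fell as inflation and the war weighed on confidence.",
]

# A tiny illustrative stopword list; use a full list (e.g. NLTK's) in practice
STOPWORDS = {"the", "and", "as", "on", "were", "was", "a", "an", "of", "in"}

def preprocess(text):
    """Tokenize, lowercase, drop stopwords, and crudely stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Naive suffix stripping stands in for a proper stemmer
    return [re.sub(r"(ly|ed|ing|s)$", "", t) for t in tokens]

tokenized = [preprocess(d) for d in docs]

# Most common words by overall counts across the corpus
overall = Counter(t for doc in tokenized for t in doc)
print(overall.most_common(5))

# tf-idf: term frequency in a document times log(N / document frequency)
N = len(tokenized)
df = Counter(t for doc in tokenized for t in set(doc))
tfidf = [
    {t: c * math.log(N / df[t]) for t, c in Counter(doc).items()}
    for doc in tokenized
]
print(sorted(tfidf[0].items(), key=lambda kv: -kv[1])[:3])
```

Note how a word like "economy", which appears in two of the three documents, gets a lower tf-idf weight than words unique to a single document.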
Identify a dictionary of interest to measure heterogeneity across documents. You can use an existing one or invent one of your own. Why did you select this dictionary?
Use your dictionary to provide a quantitative representation of each document using a simple count-based measure.
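One simple count-based measure is the share of a document's tokens that belong to the dictionary. A sketch, assuming a hypothetical "economic" dictionary and pre-tokenized input:

```python
# Hypothetical dictionary; replace with the one your group selects
ECON_DICT = {"economy", "inflation", "budget", "market", "recession"}

def dict_share(tokens):
    """Fraction of a document's tokens that appear in the dictionary."""
    if not tokens:
        return 0.0
    return sum(t in ECON_DICT for t in tokens) / len(tokens)

print(dict_share(["economy", "grew", "market", "confident"]))  # 0.5
```

Raw counts (rather than shares) are also fine, but normalizing by document length makes scores comparable across documents of different sizes.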
Find some external series of interest (e.g., for the state-of-the-union addresses: whether the US is in recession, whether it is engaged in a major war, the average inflation rate, etc.) that you think might correlate with your quantitative representation, and compute the correlation. Is the sign what you expected? Is it significant?
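The correlation and its significance can be computed with `scipy.stats.pearsonr`, which returns both the Pearson coefficient and a two-sided p-value. The series below are invented purely for illustration; you would substitute your per-document dictionary scores and your matched external series:

```python
import numpy as np
from scipy import stats

# Hypothetical series: per-document dictionary scores and a matching
# external indicator (e.g. average inflation in the year of each address)
scores = np.array([0.12, 0.30, 0.25, 0.05, 0.40, 0.22])
external = np.array([1.5, 4.2, 3.1, 0.9, 5.0, 2.8])

r, p = stats.pearsonr(scores, external)
print(f"correlation = {r:.3f}, p-value = {p:.4f}")
```

For a binary external series (e.g. a recession indicator), the same call still works, though you may also want to report a difference in means between the two groups.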
Now use the same dictionary, but compute the content of each document using term weighting as discussed in class. Do your answers to the previous question change if you use this alternative representation?
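One way to do this is to replace raw counts with tf-idf weights when computing the dictionary share. A sketch on a toy tokenized corpus (the dictionary and documents are hypothetical stand-ins):

```python
import math
from collections import Counter

ECON_DICT = {"economy", "inflation", "budget", "market"}  # hypothetical dictionary

# Toy tokenized corpus (stand-in for your pre-processed documents)
corpus = [
    ["economy", "grew", "market", "confident"],
    ["war", "budget", "economy", "war"],
    ["inflation", "market", "fell", "war"],
]

N = len(corpus)
df = Counter(t for doc in corpus for t in set(doc))

def weighted_share(tokens):
    """Dictionary share with tf-idf weights replacing raw counts."""
    w = {t: c * math.log(N / df[t]) for t, c in Counter(tokens).items()}
    total = sum(w.values())
    if total == 0:
        return 0.0
    return sum(v for t, v in w.items() if t in ECON_DICT) / total

print(weighted_share(corpus[0]))
```

Because common dictionary terms are down-weighted by idf, the weighted share for a document can differ noticeably from its raw count-based share, which is exactly what the question asks you to investigate.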