Statistical Computing: R - HW 2021

Koen Plevoets

For the R Homework you are going to compute some language statistics. The data can be found on the Github repository. More specifically, you are going to be working with the annotated versions of some tales by Edgar Allan Poe:

• morgue_TAG.txt: The Murders in the Rue Morgue
• roget_TAG.txt: The Mystery of Marie Roget
• purloined_TAG.txt: The Purloined Letter
• usher_TAG.txt: The Fall of the House of Usher

You find these files in the folder poe of the Github repository. If you click on any file, you will see its contents. Note, however, that the page on which Github shows the contents is not the URL from which you can download them. Downloading files from Github should be done from a URL starting with raw.githubusercontent.com/, which can be accessed by clicking on the Raw button. For instance, the file morgue_TAG.txt can be downloaded from the URL behind its Raw button.

With these instructions you should be able to solve the following exercises. As mentioned in Class 1, your solution should be an RStudio Project. Zip this RStudio Project and submit the zip file on Ufora. Your RStudio Project should essentially contain an R script with the code for solving the exercises. Give both your RStudio Project and your R script a name structured as RHW2021_First name_Last name, in which you replace First name by your (principal) first name and Last name by your surname. Last but not least, remember to pay attention to coding style.

1. Read the file morgue_TAG.txt from Github into a data frame called morgue. It should look as follows:

head(morgue)

##   doc_id   token  lemma  upos
## 1 morgue     THE    the   DET
## 2 morgue MURDERS murder  NOUN
## 3 morgue      IN     in   ADP
## 4 morgue     THE    the   DET
## 5 morgue     RUE    Rue PROPN
## 6 morgue  MORGUE Morgue PROPN

2. The files with the suffix _TAG on Github all have the same structure. The individual words or “tokens” of a text appear underneath each other and there are four character columns:

• doc_id: A unique identifier of each text (hence, the data sets for different texts can be combined)
• token: The actual token as it appears in the text
• lemma: The “dictionary form” of the token (i.e. the word without its inflectional variants)
• upos: The “universal part-of-speech tag” or word category of the token. The categories are based on the CoNLL-U standard in Natural Language Processing.

Create a new data frame called morgue_lan which only contains linguistic tokens, i.e. in which all punctuation is removed.

3. Create a new variable in morgue_lan called lemma_low which transforms all lemmas (in the column lemma) to lowercase.

4. Compute the frequencies of all lemmas in lemma_low and store them in an object called morgue_frq. This should look as follows:

head(morgue_frq, n = 10)

##      ’s      18    4000       a abandon ability    able  abound   about   above
##      16       1       2     360       5       3       2       1      30       3

tail(morgue_frq, n = 10)

##    writer     wrong    xerxes      yard      year yesterday       yet       you
##         1         2         2         3        10         2         7       106
##     young  yourself
##         2         4

5. The object morgue_frq is ordered alphabetically, but it is more informative to order the frequencies numerically. It is conventional to order word frequencies from large to small. Re-order the values in morgue_frq so it looks as follows:

head(morgue_frq, n = 10)

##  the   be   of    a   in  and   to have    i   it
## 1174  646  601  360  322  276  274  254  224  203

tail(morgue_frq, n = 10)

## wisdom wonder   wont   work worldly  worst worthy   wrap  wrath writer
##      1      1      1      1       1      1      1      1      1      1
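As an illustration, here is a minimal sketch of one way to tackle exercises 1-5. It is not the official solution: the repository path in the URL is a placeholder, and the sketch assumes the _TAG files are tab-separated with a header row (adjust the arguments of read.delim() if they are not).

# Exercise 1: read morgue_TAG.txt straight from the raw Github URL.
# NOTE: "<user>/<repo>/main" is a placeholder, not the real repository path.
morgue <- read.delim(
  "https://raw.githubusercontent.com/<user>/<repo>/main/poe/morgue_TAG.txt",
  stringsAsFactors = FALSE
)

# Exercise 2: drop punctuation, assuming it carries the CoNLL-U tag "PUNCT".
morgue_lan <- morgue[morgue$upos != "PUNCT", ]

# Exercise 3: lowercase all lemmas.
morgue_lan$lemma_low <- tolower(morgue_lan$lemma)

# Exercise 4: table() yields an alphabetically ordered frequency distribution.
morgue_frq <- table(morgue_lan$lemma_low)

# Exercise 5: re-order from the most to the least frequent lemma.
morgue_frq <- sort(morgue_frq, decreasing = TRUE)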
6. On the basis of a frequency distribution like morgue_frq you can compute the number of “types” and “tokens”:

• The number of types is the number of uniquely distinct lemmas and is usually denoted as V.
• The number of tokens is the total number of observations in the text and is usually denoted as N.

Compute the number of types and tokens in morgue_frq and store them in the objects morgue_V and morgue_N, respectively.

7. In the frequency distribution morgue_frq you see that there is more than one lemma which occurs only once. These are the “hapax legomena” or the types with token frequency 1. The total number of hapax legomena is usually denoted as V_1. Similarly, the types occurring twice are called “dis legomena” and their total number is denoted as V_2, the types occurring three times are called “tris legomena” and their total number is denoted as V_3, etc. Write a function V(x, n) which computes V_n in a frequency distribution x, i.e. the total number of types in x with a given token frequency n. It should contain the necessary safety checks:

• It should raise the error n has to be positive whenever n is not a positive value.
• It should raise the warning n is outside of frequency range in x when n is beyond the frequencies in x.

8. Rewrite V(x, n) so it can handle a vector of frequencies in n instead of a single number. The result should be a vector with the type counts corresponding to the frequencies in n.

9. The frequency spectrum is the complete overview of the type counts as a function of the token frequencies. In other words, it lists the type count V_n for every observed frequency n. Obtain the frequency spectrum of morgue_frq and store it in an object called morgue_spc. It should look as follows:

head(morgue_spc, n = 10)

##    1    2    3    4    5    6    7    8    9   10
## 1230  361  182  116   72   63   45   30   30   23

tail(morgue_spc, n = 10)

##  203  224  254  274  276  322  360  601  646 1174
##    1    1    1    1    1    1    1    1    1    1

Hint: The quickest way of computing the frequency spectrum does not make use of any loop or meta-function!

10. The type-token ratio is a well-known measure to represent the lexical diversity or lexical richness of a text. It is usually denoted as TTR and it is computed as:

TTR = V / N

Write a function TTR_freq(x) which calculates the TTR on the basis of a frequency distribution (e.g. morgue_frq) and write a function TTR_spec(x) which calculates the (same) TTR on the basis of a frequency spectrum (e.g. morgue_spc).

11. Write an S3 generic TTR(x) with appropriate S3 methods which calculates the TTR based on whether x is a frequency distribution or a frequency spectrum. To this end, you should set the class of a frequency distribution equal to freq and the class of a frequency spectrum equal to spec:

class(morgue_frq) <- "freq"
class(morgue_spc) <- "spec"

12. Now write the S3 “generator” functions freq(x) and spec(x) which return the frequency distribution and the frequency spectrum, respectively, of the individual tokens in x. In other words, freq(x) should return an object like morgue_frq and spec(x) should return an object like morgue_spc, where the argument x is in both cases a character vector like the column lemma_low in data frame morgue_lan. Make sure that the results are properly sorted. Apply both functions to obtain the frequency distributions and frequency spectra of the three other tales: The Mystery of Marie Roget, The Purloined Letter and The Fall of the House of Usher. You can store the downloaded data in objects called roget, purloined and usher. Their frequency distributions and frequency spectra can be stored in objects with the suffixes _frq and _spc, respectively.
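For exercises 6-9, a possible sketch is shown below. The vectorized V() uses vapply(), which is just one of several reasonable choices, and its warning fires when n exceeds the largest observed frequency, which is one reading of “beyond the frequencies in x”.

# Exercise 6: number of types (V) and number of tokens (N).
morgue_V <- length(morgue_frq)
morgue_N <- sum(morgue_frq)

# Exercises 7-8: type count V_n, vectorized over n, with safety checks.
V <- function(x, n) {
  if (any(n <= 0)) {
    stop("n has to be positive")
  }
  if (any(n > max(x))) {
    warning("n is outside of frequency range in x")
  }
  vapply(n, function(i) sum(x == i), numeric(1))
}

# Exercise 9: tabulating the frequency distribution itself gives the
# frequency spectrum in a single call, without any loop or meta-function.
morgue_spc <- table(morgue_frq)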
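Exercises 10-12 could then be sketched as follows, assuming (as in morgue_spc above) that a spectrum stores the token frequencies as its names and the type counts as its values, and that roget_lan, purloined_lan and usher_lan have been prepared in the same way as morgue_lan.

# Exercise 10: TTR = V / N from a distribution and from a spectrum.
TTR_freq <- function(x) length(x) / sum(x)
TTR_spec <- function(x) {
  n <- as.numeric(names(x))  # token frequencies
  sum(x) / sum(n * x)        # V = sum of type counts, N = sum of n * V_n
}

# Exercise 11: S3 generic dispatching on the classes "freq" and "spec".
TTR <- function(x) UseMethod("TTR")
TTR.freq <- function(x) TTR_freq(x)
TTR.spec <- function(x) TTR_spec(x)

# Exercise 12: generator functions that sort the result and set the class.
freq <- function(x) structure(sort(table(x), decreasing = TRUE), class = "freq")
spec <- function(x) structure(table(table(x)), class = "spec")

roget_frq <- freq(roget_lan$lemma_low)
roget_spc <- spec(roget_lan$lemma_low)
# ... and likewise for purloined and usher.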
13. Word frequencies are known to exhibit Zipf’s law. Denote the frequency of type j (= 1 ... V) by n_j. These frequencies can be ranked in reverse order (i.e. rank 1 is given to the highest value) and the reverse ranks are denoted as r_j. Zipf’s law states that the following (empirical) relationship holds between frequencies and reverse ranks:

n_j ∝ r_j^(−α)

This means that a decreasing line shows up if both the frequencies and ranks are log-transformed:

log(n_j) ∝ −α × log(r_j)

This line also shows up if the frequencies and ranks are plotted on logarithmic axes. For The Murders in the Rue Morgue (based on morgue_frq) this looks as follows:

[Figure: the Zipf curve of morgue_frq on logarithmic axes; x-axis r_j, y-axis n_j.]

Plot the Zipf curve of all four texts by Edgar Allan Poe in a single plot, which should look like this:

[Figure: a panel plot of the four Zipf curves, with panels morgue, purloined, roget and usher; x-axis Rank, y-axis Freq, both logarithmic.]

14. A word cloud is a visualization of the most important words/types in a text based on their frequency. This gives an indication of what the text is about. However, this makes most sense when only nouns are considered. Make frequency distributions of each of the four Poe tales which are restricted to nouns and proper nouns. Visualize each of these using the package wordcloud2. This package does not allow several word clouds to be visualized in one plot, so it is okay to make separate plots. Note that the wordcloud2 package expects the frequency distribution to be in a data frame with proper column names.

[Figures: word clouds for The Murders in the Rue Morgue, The Mystery of Marie Roget, The Purloined Letter and The Fall of the House of Usher.]
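The reference figure for exercise 13 resembles a lattice panel plot; the following sketch, which assumes the four sorted frequency distributions with suffix _frq already exist, is one way to produce something similar (the exact plotting method used in the handout is not specified).

library(lattice)

# Stack the four distributions into one data frame of ranks and frequencies.
tales <- list(morgue = morgue_frq, purloined = purloined_frq,
              roget = roget_frq, usher = usher_frq)
zipf <- do.call(rbind, Map(function(f, id) {
  data.frame(doc = id, rank = seq_along(f), freq = as.numeric(f))
}, tales, names(tales)))

# One Zipf curve per panel, with both axes on a log10 scale.
xyplot(freq ~ rank | doc, data = zipf, type = "l",
       scales = list(log = 10), xlab = "Rank", ylab = "Freq")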
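For exercise 14, a sketch for a single tale could look as follows (assuming morgue_lan from earlier; the column names word and freq are hypothetical but follow the convention that wordcloud2() expects the words in the first column and their frequencies in the second).

library(wordcloud2)

# Restrict the distribution to nouns and proper nouns.
morgue_nouns <- morgue_lan[morgue_lan$upos %in% c("NOUN", "PROPN"), ]
morgue_nouns_frq <- sort(table(morgue_nouns$lemma_low), decreasing = TRUE)

# wordcloud2() takes a data frame: words in column 1, frequencies in column 2.
wordcloud2(data.frame(word = names(morgue_nouns_frq),
                      freq = as.numeric(morgue_nouns_frq)))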