HW 6: Text Analysis of Harry Potter Books
Stat 133
In this assignment, you will build a shiny app to visualize the results from a text analysis
performed on the seven Harry Potter[1] books written by Joanne Kathleen Rowling:
1) Harry Potter and the Philosopher’s Stone
2) Harry Potter and the Chamber of Secrets
3) Harry Potter and the Prisoner of Azkaban
4) Harry Potter and the Goblet of Fire
5) Harry Potter and the Order of the Phoenix
6) Harry Potter and the Half-Blood Prince
7) Harry Potter and the Deathly Hallows
We are assuming that you have reviewed the learning materials of weeks 11, 12, and 13 (see
bCourses). Specifically, we are assuming that you have reviewed the text-mining tutorials
available in the readings/ folder:
Part I: Data
This section introduces the data for this assignment.
Harry Potter Data Files
The data for this assignment involves the text of Harry Potter books. This data is available
in two different presentations:
1) A single csv file (with the text of all seven Harry Potter books)
2) A set of seven rda[2] files (one file per book)
[1] Harry Potter is a series of seven fantasy novels. The novels chronicle the lives of a young wizard, Harry
Potter, and his friends Hermione Granger and Ron Weasley.
[2] An rda file is a binary file that uses R's native binary format (i.e. it can be opened only in R).
All these files are located in the hw6/ folder (see bCourses folder Files/hws/hw6).
For sake of redundancy (in case you experience any issue when trying to access the files
through bCourses), you can also find the associated files in the following github repository:
1.1) Single csv file harry_potter_books.csv
The data of all the books is available in csv format. You may want to use this file to perform
sentiment analysis, or word-trend analysis.
# assuming that the csv file is in your working directory
library(readr)  # provides read_csv()
hp_books <- read_csv("harry_potter_books.csv", col_types = "ccc")
This data set is fairly simple—in terms of its structure—although the text content is far from
being tidy. The dataset has 95,085 rows and 3 columns:
1) text: text content
2) book: title of associated book
3) chapter: associated chapter number
1.2) Seven rda files
The data of each book is also available in its own R-Data rda file. To be more precise, the
text of each book comes in a character vector with as many elements as chapters in a book.
You may want to use these files to perform bigram analysis (or other types of n-gram analysis).
To import these files use the load() function:
# assuming that the rda files are in your working directory
load("philosophers_stone.rda")
load("chamber_of_secrets.rda")
load("prisoner_of_azkaban.rda")
load("goblet_of_fire.rda")
load("order_of_the_phoenix.rda")
load("half_blood_prince.rda")
load("deathly_hallows.rda")
Consider the first book “Harry Potter and the Philosopher’s Stone”. Assuming that you’ve
loaded the file "philosophers_stone.rda", the text of this book is available in the character
vector of the same name, philosophers_stone:
length(philosophers_stone)
## [1] 17
As mentioned before, the number of elements in philosophers_stone corresponds to the
number of chapters in this book: 17 chapters.
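For tidy-style analyses it can help to turn such a chapter vector into a data frame with one row per chapter. The following is a minimal sketch, not part of the provided files: toy_book is a hypothetical three-element stand-in for a vector like philosophers_stone, and the column names chapter and text are assumptions.

```r
library(tibble)

# Hypothetical stand-in for a book vector such as philosophers_stone:
# one character string per chapter
toy_book <- c(
  "The boy who lived.",
  "The vanishing glass.",
  "The letters from no one."
)

# One row per chapter; the chapter number is the element's position
toy_chapters <- tibble(
  chapter = seq_along(toy_book),
  text    = toy_book
)

nrow(toy_chapters)
```

The same two lines that build toy_chapters should work on any of the loaded book vectors, since each of them has one element per chapter.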
Part II: Text Analysis
This section describes some suggested text analyses that you can perform for this
assignment.
Listed below are some text analysis ideas for you to get inspiration from. We are also
including recommended readings (some available in bCourses, some available in the book
“Text Mining with R”, by Silge & Robinson).
Out of the four listed types of text analysis (2A-2D) you will have to choose two
of them in order to create the shiny app.
2.A) Sentiment Analysis
Perhaps the broadest and richest type of analysis that can be performed on the Harry Potter
text data is sentiment analysis. Given the amount of text, spread across the seven
books and all their chapters, and the four sentiment lexicons (bing, afinn, nrc, loughran),
the options for sentiment analysis seem limitless.
For example, here are a few ideas (this is by no means an exhaustive list):
• Given a certain book, compute a sentiment score for each chapter. Note that different
scores can be obtained by using different lexicons.
• Given a certain book, compute sentiment scores and visualize them across a plot
trajectory. This is what Julia Silge refers to as the “track of narrative time in sections
of text”. To clarify, the notion of section of text does not necessarily have to match an
entire chapter.
• Compute a sentiment score for each book, and then rank the books from more positive
to more negative (or vice versa). Again, note that different scores can be obtained by
using different lexicons.
• Which chapters or books have “relatively large” positive scores? And/or what words
contribute the most to the score?
• Which chapters or books have “relatively large” negative scores? And/or what words
contribute the most to the score?
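As an illustration of the first idea, a per-chapter score can be sketched as the number of positive words minus the number of negative words. This is only one possible scoring rule, not a required one. The data below are hypothetical: toy_books mimics the columns of hp_books (text, book, chapter) and toy_bing mimics a two-column lexicon with word and sentiment.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data with the same columns as hp_books
toy_books <- tibble(
  text    = c("a happy and brave boy", "a dark and evil curse", "good friends win"),
  book    = "Toy Book",
  chapter = c("1", "2", "2")
)

# Hypothetical toy lexicon with the same columns as a word/sentiment lexicon
toy_bing <- tibble(
  word      = c("happy", "brave", "good", "win", "dark", "evil", "curse"),
  sentiment = c("positive", "positive", "positive", "positive",
                "negative", "negative", "negative")
)

chapter_scores <- toy_books %>%
  unnest_tokens(word, text) %>%          # one lowercase word per row
  inner_join(toy_bing, by = "word") %>%  # keep only words found in the lexicon
  group_by(book, chapter) %>%
  summarise(
    positive = sum(sentiment == "positive"),
    negative = sum(sentiment == "negative"),
    score    = positive - negative,
    .groups  = "drop"
  )

chapter_scores
```

Swapping toy_bing for another lexicon changes the scores, which is the point made in the bullets above. Lexicons that assign numeric values or several categories per word would need a different summary than this positive-minus-negative count.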
Sentiment Lexicons Files. For the sake of convenience, you can find rda files of all the
sentiment lexicons in the folder hws/hw6/sentiment-lexicons
# assuming that the rda files are in your working directory
load("bing.rda")
load("afinn.rda")
load("nrc.rda")
load("loughran.rda")
Assuming that you’ve loaded the file "bing.rda", the associated lexicon is available in the
tibble of the same name, bing:
bing
## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>
##  1 2-faces     negative
##  2 abnormal    negative
##  3 abolish     negative
##  4 abominable  negative
##  5 abominably  negative
##  6 abominate   negative
##  7 abomination negative
##  8 abort       negative
##  9 aborted     negative
## 10 aborts      negative
## # ... with 6,776 more rows
Suggested reading
• text-mining-4-sentiment-analysis.html (see Files/readings in bCourses)
• See also chapter 2 “Sentiment analysis with tidy data” (in “Text Mining with R”; link
below)
2.B) Word Trend Analysis
Another possibility consists of a word trend analysis.
One example of words could be the names of main characters: "harry", "ron", "hermione",
"dumbledore", "voldemort", "hagrid", etc.
With these names, one can compute their relative frequencies (i.e. proportion of occurrences)
across the books, and visualize their trend. See the tutorial listed below to get a better idea
of this kind of trend.
Alternatively, you can also look for one or more specific words, and see how they are used
across chapters of a book, or across the seven books. For instance, how is the word "love"
used across the chapters of the first book “The Philosopher’s Stone”? Or other words such
as "spell", "potion", "wand", "quidditch", to mention a few.
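The relative frequency of a few chosen words per chapter can be sketched as follows. The data are hypothetical: toy_books mimics the columns of hp_books, and target_words is an illustrative choice of words.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data with the same columns as hp_books
toy_books <- tibble(
  text    = c("harry met ron", "harry and hermione and harry", "ron left"),
  book    = "Toy Book",
  chapter = c("1", "2", "3")
)

# Illustrative choice of words to track
target_words <- c("harry", "ron")

word_trend <- toy_books %>%
  unnest_tokens(word, text) %>%         # one lowercase word per row
  count(chapter, word, name = "n") %>%  # occurrences of each word per chapter
  group_by(chapter) %>%
  mutate(prop = n / sum(n)) %>%         # proportion of all words in the chapter
  ungroup() %>%
  filter(word %in% target_words)

word_trend
```

Two things to keep in mind with this sketch: chapters in which a word never occurs have no row (so a line plot would skip them unless zeros are filled in), and because chapter is read as character, it may need to be converted with as.integer() before plotting so that chapter 10 does not sort before chapter 2.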
Suggested reading
• text-mining-5-pride-and-prejudice.html (see Files/readings in bCourses)
• See figure 5.4 in “Text Mining with R” (link below) to get a rough idea about this type
of trend over time. Obviously there is no time variable in Harry Potter, but you can use the
sequence of chapters as a proxy for time.
https://www.tidytextmining.com/dtm.html#tidying-dfm-objects
2.C) Word and Document Frequency (tf-idf)
You can also look at a term’s inverse document frequency (idf), which decreases the weight
for commonly used words and increases the weight for words that are not used very much in
a collection of documents. This can be combined with term frequency to calculate a term’s
tf-idf (the two quantities multiplied together), the frequency of a term adjusted for how
rarely it is used.
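A minimal sketch of this computation with bind_tf_idf() from the tidytext package, treating each book as one document. The two-row toy_books tibble is hypothetical data used only for illustration.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data: each book is treated as one document
toy_books <- tibble(
  text = c("the wand chose the wizard", "the potion and the cauldron"),
  book = c("Book A", "Book B")
)

book_tf_idf <- toy_books %>%
  unnest_tokens(word, text) %>%         # one lowercase word per row
  count(book, word, name = "n") %>%     # term counts per document
  bind_tf_idf(word, book, n) %>%        # adds columns tf, idf, tf_idf
  arrange(desc(tf_idf))

book_tf_idf
```

In this toy example "the" occurs in both documents, so its idf (and therefore its tf-idf) is zero, while words that occur in only one document get a positive tf-idf. Chapters could be used as documents instead of books by counting on chapter.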
Suggested reading
• See chapter 3 “Analyzing word and document frequency: tf-idf” (in “Text Mining with
R”; link below)
2.D) Bigram Analysis
Another type of analysis involves studying so-called bigrams (or n-grams in general) for
answering questions like:
• What kinds of words tend to be associated with other words?
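Counting bigrams can be sketched as follows, using unnest_tokens() with token = "ngrams". The toy_chapters tibble is hypothetical data with one row per chapter, and the column names word1 and word2 are illustrative.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical toy data: one row per chapter
toy_chapters <- tibble(
  chapter = 1:2,
  text    = c("the dark lord returned", "the dark lord vanished")
)

bigram_counts <- toy_chapters %>%
  # pairs of consecutive words, formed within each row (chapter)
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  # split each bigram into its two words
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  count(word1, word2, sort = TRUE)

bigram_counts
```

Filtering on word1 (or word2) then shows which words tend to follow (or precede) a given word. On the real text it is common to drop pairs that contain stop words before counting.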
Suggested reading
• text-mining-2-pride-and-prejudice.html (see Files/readings in bCourses)
• See chapter 4 “Relationships between words: n-grams and correlations” (in “Text
Mining with R”; link below)
Warning: visualizing graph networks of n-grams tends to be computationally expensive due
to the size of the text data.
Part III: Shiny App
This section describes generic specifications of the shiny app.
3) Shiny App
The main data product to be delivered for this assignment is a shiny app that allows the
user to explore the results of two types of text analysis.
For example, you can choose 1) a sentiment analysis, and 2) a word trend analysis. Keep in
mind that even if two (or more) students choose to work on the same type of analyses, there
is still enough room to approach them in slightly different ways, therefore producing different
shiny apps, with different scopes, and of course different data visualizations and outputs.
Important Note: It is possible to find—online—various types of text analysis on Harry
Potter data that different authors/analysts have performed in the past. You can definitely
get inspiration from them, but we expect that you do your own work, write your own code,
and conduct yourself with academic integrity.
3.1) Layout
[Diagram showing: the title of your app; input widgets arranged across four columns (use at least four different types of widgets); two tabs, analysis1 and analysis2; a graphing area; and numeric/text output to help in the interpretation of the displayed graphs.]
Figure 1: Diagram of the overall shiny app’s layout
You can find a template R script file app-template.R in the folder containing this pdf of
instructions (see bCourses folder Files/hws/hw6).
As you can tell from the above diagram, the layout of the app is different from that of the shiny app of
hw5. One big difference in the app for this assignment is that it uses two tabs:
1) Analysis1: this tab is for displaying the results for one type of text analysis (for
example: sentiment analysis)
2) Analysis2: this tab is for displaying the results for another type of text analysis (for
example: word trend analysis)
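A two-tab layout of this kind can be sketched with tabsetPanel() and tabPanel(). The code below is an illustrative skeleton only, not the contents of app-template.R: the title, all input and output ids, the widget labels, and the placeholder outputs are hypothetical.

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Text Analysis of Harry Potter Books"),  # hypothetical title
  # Four different types of widgets, one per column (ids and labels are made up)
  fluidRow(
    column(3, selectInput("book", "Book", choices = c("Book 1", "Book 2"))),
    column(3, sliderInput("top_n", "Top words", min = 1, max = 50, value = 10)),
    column(3, textInput("word", "Word", value = "love")),
    column(3, numericInput("sections", "Sections", value = 10, min = 1))
  ),
  hr(),
  # One tab per type of analysis, each with a graph and a numeric/text output
  tabsetPanel(
    type = "tabs",
    tabPanel("Analysis1", plotOutput("plot1"), tableOutput("table1")),
    tabPanel("Analysis2", plotOutput("plot2"), verbatimTextOutput("text2"))
  )
)

server <- function(input, output) {
  # Placeholder outputs that only echo the widget values
  output$plot1  <- renderPlot(plot(seq_len(input$top_n), main = input$book))
  output$table1 <- renderTable(data.frame(book = input$book, top_n = input$top_n))
  output$plot2  <- renderPlot(plot(seq_len(input$sections), main = input$word))
  output$text2  <- renderPrint(input$word)
}

app <- shinyApp(ui = ui, server = server)

# Launch only in an interactive R session
if (interactive()) runApp(app)
```

In a real app the placeholder render functions would be replaced by the plots and tables of the two chosen analyses.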
From the diagram above, note that there are four distinct sections in the layout (see
template file app-template.R):
• title: main title for your app (give it a meaningful name).
• input widgets: the template already contains various input widgets arranged in four
columns. You can change this configuration if you want, as well as the types and
number of widgets. The only condition is to use at least four different types of widgets
(e.g. slider, numeric, text, and select).