SECU0057: Applied Data Science
• N-grams
• Keywords-in-context analysis
• Parts-of-speech
• Sentiment analysis
• Trajectory analysis
Plan for today
Department of Security and Crime Science
• Sometimes a sequence of words carries more information than the individual words
• “ice cream”, “crime science”, “bus stop”, “stolen bicycle”, “metal theft”
• “was not”, “not good”, “not helpful” retain more information than the individual words “was”, “not”, “good”
• “I am going out tonight to see a play” vs “I play tennis every weekend”
• By tying a word to its surrounding words, we may retain more information
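The point about negation can be illustrated with a minimal sketch in base R. The two example sentences and the helper names `unigrams()` and `bigrams()` are assumptions made for illustration, not part of the module material. The sketch compares which single words and which adjacent word pairs distinguish a negated sentence from its positive counterpart.

```r
# Hypothetical helpers: split a sentence into lower-case words,
# then pair each word with the word that follows it.
unigrams <- function(text) strsplit(tolower(text), "\\s+")[[1]]
bigrams <- function(text) {
  words <- unigrams(text)
  paste(head(words, -1), tail(words, -1))
}

positive <- "the service was good"
negative <- "the service was not good"

# Single words: the only difference is "not"; "good" appears in both.
new_words <- setdiff(unigrams(negative), unigrams(positive))

# Word pairs: "was not" and "not good" keep the negation attached
# to the words it modifies.
new_pairs <- setdiff(bigrams(negative), bigrams(positive))

print(new_words)
print(new_pairs)
```

With single words alone, both sentences contain “good”; the word pairs make the negated meaning explicit.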
Sequence of words
• An n-gram is a consecutive sequence of n elements extracted from a text
• Elements can be words, syllables, characters and symbols
• In this module we are interested in n-grams of words
“Not all crime is reported to police”
• Unigram: n = 1
• Bigram: n = 2
• Trigram: n = 3
• Four-gram, five-gram, and so on for larger n
N-grams
DNA-Representation with N-grams
Unigrams: not | all | crime | is | reported | to | police
Bigrams: not all | all crime | crime is | is reported | reported to | to police
Trigrams: not all crime | all crime is | crime is reported | is reported to | reported to police
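The rows above can be reproduced with a sliding window of width n over the words of the sentence. The following is a minimal base R sketch; the function name `make_ngrams()` is a hypothetical helper introduced here, and splitting on whitespace is a simplifying assumption that ignores punctuation.

```r
# Hypothetical helper (not from the slides): build word n-grams by
# sliding a window of width n across a whitespace-split sentence.
make_ngrams <- function(text, n) {
  words <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(words) < n) return(character(0))  # sentence too short for this n
  starts <- seq_len(length(words) - n + 1)     # one window per start position
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

sentence <- "Not all crime is reported to police"
uni <- make_ngrams(sentence, 1)
bi  <- make_ngrams(sentence, 2)
tri <- make_ngrams(sentence, 3)

print(bi)
print(tri)
```

A sentence of m words yields m - n + 1 n-grams, so the seven-word example gives 7 unigrams, 6 bigrams and 5 trigrams.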
library(quanteda)
sentence <- tokens("Man jailed for life after knife attacks")
# Set n = 2 for bigrams, n = 3 for trigrams, n = 4 for four-grams, etc.
token_sentence <- tokens_ngrams(sentence, n = 2)
dfm(x = token_sentence, tolower = TRUE)
Document-feature matrix of: 1 document, 6 features (0.00% sparse) and 0 docvars.
       features
docs    man_jailed jailed_for for_life life_after after_knife knife_attacks
  text1          1          1        1          1           1             1
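As a possible extension of the example above, `tokens_ngrams()` also accepts a vector for `n`, so bigrams and trigrams can be generated in one call. This is a sketch under the assumption that the quanteda package is installed; the object names `token_mixed` and `dfm_mixed` are chosen here for illustration.

```r
library(quanteda)

sentence <- tokens("Man jailed for life after knife attacks")

# n = 2:3 requests both bigrams and trigrams; words within an n-gram
# are joined with "_" (the default concatenator).
token_mixed <- tokens_ngrams(sentence, n = 2:3)

# Build the document-feature matrix, lower-casing the features.
dfm_mixed <- dfm(token_mixed, tolower = TRUE)

# List the feature names and count them.
print(featnames(dfm_mixed))
print(nfeat(dfm_mixed))
```

The seven-word sentence has 6 bigrams and 5 trigrams, so the resulting matrix should contain 11 features.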
N-grams with Quanteda