GR5067
Natural Language Processing
Quantitative Methods – Social Sciences (QMSS)
Pipeline: Data Retrieval → Pre-Processing → Tokenization → Stemming → Vectorization → Train/Test & Validation → Classification → Dimension Reduction
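To make the pipeline concrete, here is a minimal Python sketch, assuming scikit-learn and NLTK are installed (and that the NLTK punkt tokenizer data has been downloaded). The toy documents, labels, and choice of classifier are illustrative only, and the dimension-reduction step is omitted.

```python
# Minimal sketch of the pipeline above (toy corpus; documents and labels are illustrative).
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Data retrieval (here: a hard-coded toy corpus)
docs = ["A great game of football", "The election was over",
        "A very clean match", "It was a close election"]
labels = ["sports", "not sports", "sports", "not sports"]

# Pre-processing: lowercase, tokenize, and stem each document
stemmer = PorterStemmer()
def preprocess(text):
    return " ".join(stemmer.stem(tok) for tok in word_tokenize(text.lower()))

processed = [preprocess(d) for d in docs]

# Vectorization: bag-of-words counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(processed)

# Train/test split and classification
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=1)
clf = MultinomialNB().fit(X_train, y_train)
print(clf.predict(X_test))
```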
Topical Extraction
● Determine topics/themes from corpus
● Unsupervised machine learning technique that assigns documents to topics
● Latent Dirichlet Allocation (LDA)
Bayes Theorem
● Conditional Probabilities
● Assumes predictors are independent of one another
● Let’s say we know the following about the past 100 days
• It was cloudy on 40 days: P(cloudy) = 40/100 = .40
• It rained on 30 days: P(rainy) = 30/100 = .30
• It was both raining and cloudy on 25 days: P(rainy|cloudy) = 25/40 = .625
● We can solve for the probability it was cloudy given that it rained
● P(cloudy|rainy) = P(rainy|cloudy) * P(cloudy) / P(rainy) = (.625 * .40) / .30 ≈ 0.833
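The same calculation in Python, using only the counts from the slide:

```python
# Bayes' theorem with the counts above: 100 days, 40 cloudy, 30 rainy, 25 both.
p_cloudy = 40 / 100               # P(cloudy) = 0.40
p_rainy = 30 / 100                # P(rainy)  = 0.30
p_rainy_given_cloudy = 25 / 40    # P(rainy | cloudy) = 0.625

# P(cloudy | rainy) = P(rainy | cloudy) * P(cloudy) / P(rainy)
p_cloudy_given_rainy = p_rainy_given_cloudy * p_cloudy / p_rainy
print(round(p_cloudy_given_rainy, 3))   # 0.833
```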
Bayes Theorem
Word Count Sports = 11
Word Count Not Sports = 9
Unique Words Count = 14
We add 1 to each word count so that no probability is ever zero: this is called Laplace (Laplacian) smoothing
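A minimal sketch of Laplace smoothing in Python, using the totals from the slide (11 words in Sports, 9 in Not Sports, 14 unique words); the per-word counts in the example are hypothetical.

```python
# Laplace (add-one) smoothing for Naive Bayes word likelihoods.
# Totals come from the slide; the per-word counts below are hypothetical examples.
total_sports = 11        # total word count in the "Sports" class
total_not_sports = 9     # total word count in the "Not Sports" class
vocab_size = 14          # number of unique words across both classes

def smoothed_likelihood(word_count, class_total):
    # Add 1 to the count so unseen words never get probability zero.
    return (word_count + 1) / (class_total + vocab_size)

# Example: a word that appears twice in Sports and never in Not Sports
print(smoothed_likelihood(2, total_sports))       # (2+1)/(11+14) = 0.12
print(smoothed_likelihood(0, total_not_sports))   # (0+1)/(9+14)  ≈ 0.043
```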
What is Latent Dirichlet Allocation (LDA)?
A generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of latent topics. Each observed word originates from a topic that we do not directly observe. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities.
What is it used for? The fitted model can be used to estimate the similarity between documents, as well as between documents and a set of specified keywords, using an additional layer of latent variables which are referred to as topics.
How is it related to text mining and other machine learning techniques?
Topic models can be seen as classical text mining or natural language processing tools. Fitting topic models to the data structures produced by text mining is usually done by considering the problem of modeling text corpora and other collections of discrete data. One advantage of LDA over related latent variable models is that it provides well-defined inference procedures for previously unseen documents (LSI, by contrast, uses a singular value decomposition).
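A minimal LDA sketch, assuming gensim is installed; the toy corpus and number of topics are illustrative. It also shows the inference step for a previously unseen document mentioned above.

```python
# Minimal LDA sketch with gensim; the toy corpus is illustrative.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [
    ["football", "game", "team", "score"],
    ["election", "vote", "government", "policy"],
    ["match", "player", "goal", "team"],
    ["senate", "vote", "bill", "policy"],
]

dictionary = Dictionary(texts)                      # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]     # bag-of-words per document

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=1)

# Topic mixture for a previously unseen document (LDA's well-defined inference step)
new_doc = dictionary.doc2bow(["team", "vote"])
print(lda.get_document_topics(new_doc))
print(lda.print_topics(num_words=4))
```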
Latent Dirichlet Allocation
● LDA models each document as a mixture of topics
● p(topic t | document d)
○ Similar to a transformation, except we determine the probability that document d expresses topic t (the proportion of words in d currently assigned to topic t)
■ How would you go about calculating this?
● p(word w | topic t)
○ Captures how strongly word w is associated with topic t (the proportion of assignments to topic t, across all documents, that come from word w)
● Probability that word w belongs to topic z:
p(word w with topic z) = p(topic z | document d) * p(word w | topic z)
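A toy numeric illustration of this assignment probability for a single word in a single document; all probabilities below are made up for illustration.

```python
# Toy illustration of p(word w with topic z) = p(topic z | doc d) * p(word w | topic z).
# All probabilities below are made-up values for one word ("team") in one document.
p_topic_given_doc = {"sports": 0.7, "politics": 0.3}     # p(topic z | document d)
p_word_given_topic = {"sports": 0.05, "politics": 0.01}  # p(word w | topic z)

scores = {z: p_topic_given_doc[z] * p_word_given_topic[z] for z in p_topic_given_doc}
total = sum(scores.values())
assignment_probs = {z: s / total for z, s in scores.items()}  # normalize over topics
print(assignment_probs)   # e.g. {'sports': 0.921..., 'politics': 0.078...}
```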
Latent Dirichlet Allocation
● Measures
○ Topic Coherence: score a single topic by measuring the degree of semantic similarity between high scoring words in
the topic
■ C_v measure is based on a sliding window, one-set segmentation of the top words and an indirect confirmation
measure that uses normalized pointwise mutual information (NPMI) and the cosine similarity
■ C_p is based on a sliding window, one-preceding segmentation of the top words and the confirmation measure of
Fitelson’s coherence
■ C_uci measure is based on a sliding window and the pointwise mutual information (PMI) of all word pairs of the given
top words
■ C_umass is based on document cooccurrence counts, a one-preceding segmentation and a logarithmic conditional
probability as confirmation measure
■ C_npmi is an enhanced version of the C_uci coherence using the normalized pointwise mutual information (NPMI)
■ C_a is based on a context window, a pairwise comparison of the top words and an indirect confirmation measure that
uses normalized pointwise mutual information (NPMI) and the cosine similarity
○ Perplexity Score: measures how well the model predicts held-out data; the lower the score, the better
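A minimal sketch of computing coherence and perplexity with gensim, reusing the toy lda, texts, dictionary, and corpus objects from the LDA sketch above; gensim's CoherenceModel supports the c_v and u_mass measures shown here.

```python
# Coherence and perplexity for a fitted gensim LDA model
# (continues the toy `lda`, `texts`, `dictionary`, and `corpus` from the sketch above).
from gensim.models import CoherenceModel

# C_v coherence: sliding window + NPMI + cosine similarity over the top topic words
cv = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
print("c_v coherence:", cv.get_coherence())

# U_mass coherence: document co-occurrence counts + log conditional probability
umass = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary, coherence="u_mass")
print("u_mass coherence:", umass.get_coherence())

# Perplexity: gensim reports a per-word log-likelihood bound; lower perplexity is better
print("log perplexity bound:", lda.log_perplexity(corpus))
```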