ATHK1001 WEEK 5: Inferential statistics
Lecture 14: Analysis for categorical data
Copyright Regulations 1969
WARNING
This material has been reproduced and communicated to
you by or on behalf of the University of Sydney
pursuant to Part VB of the Copyright Act 1968 (the
Act).
The material in this communication may be subject
to copyright under the Act. Any further reproduction or
communication of this material by you may be the
subject of copyright protection under the Act.
Do not remove this notice.
The University of Sydney
Lecture 13:
Central Limit Theorem and
the Sampling Distribution
Week 5 : Inferential statistics
Lecture 14:
Analysis for categorical data
Lecture 15:
Regression analysis
WEEK 5 LIVE REVIEW
Lecture 14 overview
Hypothesis testing for categorical data
1. What is categorical data?
2. How do we test for differences between categories?
3. The chi-square test statistic and how we use it
Four types of data
1. Ordinal
Data that orders values, e.g., “On a scale of 1-7 how confident are you in your
answer?”
2. Interval
Numeric scales in which we know both the order and the exact differences
between the values, e.g., temperature in Celsius
3. Ratio
Interval scales that have an absolute zero, e.g., height
4. Categorical data, also called “Nominal” or “qualitative”
Data consisting of labels, e.g., Hair colour
For categorical data we cannot
calculate meaningful means, just
frequencies. So how do we test
hypotheses about categories?
Hypothesis testing for categorical data
We could ask different questions about a drug study
A) Did participants given the drug recover faster than those given the placebo?
B) Were participants given the drug more likely to recover within three days?
These questions refer to different types of data:
A) is a question about means
B) is about numbers of people in a category
Hypothesis testing for categorical
data has the same logic as for means,
but uses different statistical tests
(we can't just treat frequencies as if they
were means, because that would violate
the assumptions of those tests).
Hypothesis testing for categorical data: Example
Political polls are categorical data: How many people will vote for each party?
H0 (null hypothesis):
– There is no difference in how many people will vote for each party.
H1 (alternative hypothesis):
– More people will vote for one party than the other.
H0 is certain to be wrong, because one party is going to win. But when can we
reject the null and predict which party will win?
Test the hypotheses by calculating the
95% confidence interval and asking
whether the difference between the two
parties' votes falls outside this "margin of error".
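As a rough illustration of the margin-of-error idea, here is a minimal sketch using the normal approximation for a proportion. The poll numbers (520 of 1000 respondents favouring one party) are hypothetical and do not come from the lecture, and the function name `proportion_ci` is just an illustrative choice.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    z = 1.96 gives an approximately 95% interval.
    """
    p_hat = successes / n
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical poll (not from the lecture): 520 of 1000 favour party A.
lower, upper = proportion_ci(520, 1000)
print(f"95% CI for party A's share: {lower:.3f} to {upper:.3f}")

# If 50% lies inside the interval, the lead is within the margin of error,
# so we cannot reject the H0 of an even split.
if lower <= 0.5 <= upper:
    print("Cannot reject H0")
else:
    print("Reject H0")
```

With these made-up numbers the interval includes 50%, so a 52/48 split in a poll of this size would not be enough to call the result.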
Hypothesis testing for categorical data
The null hypothesis is not always 50%
If we are trying to see whether frequencies differ from those expected by chance,
then we need a way to test splits other than 50/50.
An example: teenagers in a town think police are handing out more tickets to them,
so we look at 72 tickets issued:
– Teens 11
– Adults 61
There are a lot fewer tickets for teen
drivers, but there are also a lot fewer teen drivers.
The percentage of licenced drivers who
are teens is only 8%.
Hypothesis testing for categorical data
H0 (null hypothesis): The proportion of teens receiving tickets is equal to their
proportion of drivers
H1 (alternative hypothesis): The proportion of teens receiving tickets is
different to their proportion of drivers
Basically we are asking: Are teens and non-teens from the same population
(in terms of how they are treated by police) or different populations?
We want to calculate the
probability that we would get
the observed proportions if
the H0 (they are from the
same population) was correct.
Hypothesis testing for categorical data
We wouldn't expect the number of tickets
to match the proportion of drivers
exactly.
So we compare the observed frequencies
with the expected frequencies (i.e., those
expected if H0 was correct) and
calculate the probability that we
would observe this data if H0 was
correct.
                    Teenagers   Non-teenagers
Observed frequency  11          61
Expected frequency  5.76        66.24
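The expected frequencies in this table follow directly from H0: each group's share of the 72 tickets should equal its share of licenced drivers (8% teens, so 92% non-teens). A minimal sketch of that arithmetic, with variable names chosen here for illustration:

```python
# Observed ticket counts from the example.
observed = {"Teenagers": 11, "Non-teenagers": 61}

# Proportions expected under H0: tickets follow the share of licenced drivers.
h0_proportions = {"Teenagers": 0.08, "Non-teenagers": 0.92}

total_tickets = sum(observed.values())  # 72 tickets in total

# Expected frequency = total count x proportion expected under H0.
expected = {group: total_tickets * p for group, p in h0_proportions.items()}

for group in observed:
    print(f"{group}: observed {observed[group]}, expected {expected[group]:.2f}")
```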
Hypothesis testing for categorical data: Chi-square
For each category, take the difference
between the observed frequency and the
expected frequency, square it, and divide
by the expected frequency. Then sum
across all categories:
$$\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$$
k = number of categories
o = observed frequency
e = expected frequency
Then compare to the chi-square distribution to
find the p-value.
For the driving data, χ²(1) = 5.18, p < 0.05.
So reject H0 and support H1: police
might be targeting teen drivers.
Note: You don't need to know this calculation for the exam!
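For anyone curious how the number for the driving data comes about, the calculation can be sketched as follows. This is only an illustration, not something required for the exam. The p-value line uses a shortcut that is valid only for 1 degree of freedom (a chi-square with 1 df is a squared standard normal), and the function name `chi_square_statistic` is an illustrative choice.

```python
import math

def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Ticket data from the example: teenagers, non-teenagers.
observed = [11, 61]
expected = [5.76, 66.24]

chi2 = chi_square_statistic(observed, expected)
df = len(observed) - 1  # degrees of freedom = number of categories - 1

# Shortcut valid only when df == 1: P(chi-square > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi-square({df}) = {chi2:.2f}, p = {p_value:.3f}")
```

The statistic rounds to 5.18 and the p-value falls below 0.05, which agrees with the result reported on the slide.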
Hypothesis testing for two categorical variables
With more than one categorical variable we need cross-tabulated frequencies
We often have hypotheses about whether the categories are independent
Same basic idea as when you test one category:
1) Work out expected frequencies if no relationship
2) Add up differences between expected and observed
3) Use chi-square test to find probability of the data if H0 was correct.
H0 (null hypothesis): The pattern of
frequencies in one variable is the same
across all levels of the other variable; there
is no association between the two categories.
H1 (alternative hypothesis): The pattern of
frequencies is different across levels of the
other variable; there is an association
between the two categories
Testing two categorical variables: Example
For a previous
ATHK1001 class we can
test whether there was an
association between
passing ATHK1001 and
handing in Assignment
1 on time.
Look at the results for 741
students.
Observed   On time   Late   Total
Pass       491       71     562
Not pass   62        117    179
Total      553       188    741
Testing two categorical variables: Example
Research question: Is there an association between passing
ATHK1001 and handing in Assignment 1 on time?
Step 1: Write the hypotheses
H0: There is no association between handing Assignment 1 in late and passing
ATHK1001
H1: There is an association between handing Assignment 1 in late and
passing ATHK1001
Testing two categorical variables: Example
Step 2: What are the expected frequencies if H0 was correct?
Expected   On time   Late    Total
Pass       419.4     142.6   562
Not pass   133.6     45.4    179
Total      553       188     741
TIP:
You don’t need to be able to calculate this
for the exam - Just focus on understanding
the concept and the process
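To make the concept concrete: if passing and handing in on time were unrelated, each cell's expected count would be its row total times its column total, divided by the grand total. The sketch below applies that standard rule to the observed counts from the example. It is offered as an illustration of the process rather than exam material.

```python
# Observed counts: rows = Pass, Not pass; columns = On time, Late.
observed = [
    [491, 71],
    [62, 117],
]

row_totals = [sum(row) for row in observed]        # [562, 179]
col_totals = [sum(col) for col in zip(*observed)]  # [553, 188]
grand_total = sum(row_totals)                      # 741

# Under H0 (no association): expected = row total x column total / grand total.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

for label, row in zip(["Pass", "Not pass"], expected):
    print(label, [round(value, 1) for value in row])
```

Rounded to one decimal place, these values match the expected frequencies in the table for Step 2.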
Testing two categorical variables: Example
Step 3: Calculate the test statistic
TIP:
You don’t need to be able to calculate this
for the exam - Just focus on understanding
the concept and the process
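As a sketch of the process only, the same observed-versus-expected sum used for a single variable can be applied to every cell of the cross-tabulation. The code below assumes the plain Pearson chi-square with no continuity correction, which may differ from what statistical software reports by default for a 2x2 table. The helper name `chi_square_independence` is an illustrative choice.

```python
def chi_square_independence(observed):
    """Pearson chi-square statistic and degrees of freedom for a table of counts.

    Assumes no continuity correction is applied.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count for this cell if there is no association.
            e = row_totals[i] * col_totals[j] / grand_total
            chi2 += (o - e) ** 2 / e

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

# Rows = Pass, Not pass; columns = On time, Late.
chi2, df = chi_square_independence([[491, 71], [62, 117]])
print(f"chi-square({df}) = {chi2:.1f}")

# 3.84 is the critical value of the chi-square distribution for df = 1 at p = 0.05.
print("Reject H0" if chi2 > 3.84 else "Cannot reject H0")
```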
Cell Observed