STAT 3450 Assignment 1 (35 points)
Quan Yuan
Banner: B00923505
Intro
In this assignment, you will use three datasets:
1. a1d1.csv
2. a1d2.csv
3. a1d3.csv
Please download them from BS and place them in the same folder as this Rmd file
before working on your assignment.
This assignment is composed of 4 parts:
1. Problem-1 (logistic regression): [10 pts]
2. Problem-2 (knn and naive bayes): [8 pts]
3. Problem-3 (discriminant analysis): [5 pts]
4. Problem-4 (logistic regression): [12 pts]
You will need to load a few libraries. These libraries have already been used in the
lectures. Make sure they are installed on your computer.
For example, the glm function belongs to the stats package, so I load this
package in order to fit generalized linear models. Setting the argument
quietly = TRUE suppresses some extra messages, which makes the html/pdf report a
bit tidier. I also add some flags to the cell to reduce verbosity in the report:
library(stats, quietly = TRUE)
Problem 1 [10 pts]
TOPIC: Logistic regression
Question 1 [1 pt] [1/35]
Load and prepare the data.
The data used in this problem records the number of failures (column NFailed)
observed in a series of machines of the same type, at various temperatures (in
Fahrenheit) and pressures.
Read the data from file a1d1.csv. Call the resulting dataframe dc, and use str to print
the structure of dc.
dc <- read.csv("a1d1.csv")
str(dc)
Convert the Pressure column to a factor.
dc$Pressure <- as.factor(dc$Pressure)
Add a new column called Failure to dataframe dc, which contains the result of the
boolean question: is NFailed positive?
dc$Failure <- dc$NFailed > 0
Question 2 [1 pt] [2/35]
Fit a logistic regression model of the variable Failure with Temperature as the only
predictor. Call the resulting model lrmod1, save the summary of lrmod1 to an
object called s1 and print s1.
lrmod1 <- glm(Failure~Temperature, family = "binomial", data = dc)
s1 <- summary(lrmod1)
s1
##
## Call:
## glm(formula = Failure ~ Temperature, family = "binomial", data = dc)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.0611  -0.7613  -0.3783   0.4524   2.2175
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  15.0429     7.3786   2.039   0.0415 *
## Temperature  -0.2322     0.1082  -2.145   0.0320 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 20.315 on 21 degrees of freedom
## AIC: 24.315
##
## Number of Fisher Scoring iterations: 5
Extract the p-value of the summary for the regression coefficient of the Temperature
predictor. You can do this by extracting the second row and fourth column of the
coef function applied to s1.
s1$coefficients[2,4]
## [1] 0.0319561
Is the temperature effect statistically significant at the 5% level?
Yes. The p-value for Temperature (about 0.032) is less than 0.05.
Is the temperature effect statistically significant at the 1% level?
No. The p-value for Temperature (about 0.032) is greater than 0.01.
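The extraction and the two significance checks can be reproduced end to end on toy data. This is a minimal sketch: the data frame `toy` and its coefficients are simulated for illustration and are not the a1d1.csv data, so the printed p-value will differ from the one reported above.

```r
# Simulated data for illustration only; NOT the a1d1.csv data.
set.seed(1)
toy <- data.frame(x = seq(50, 85, length.out = 40))
toy$y <- rbinom(40, size = 1, prob = plogis(10 - 0.15 * toy$x))

fit <- glm(y ~ x, family = "binomial", data = toy)
s <- summary(fit)

# Row 2 is the slope; column 4 is Pr(>|z|).
p_slope <- coef(s)[2, 4]

# coef(s) and s$coefficients give the same coefficient matrix.
stopifnot(isTRUE(all.equal(coef(s), s$coefficients)))

# Compare the p-value against the 5% and 1% levels.
print(p_slope)
print(c(sig_5pct = p_slope < 0.05, sig_1pct = p_slope < 0.01))
```

Either `coef(s)[2, 4]` or `s$coefficients[2, 4]` can be used, since both index the same matrix.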
Question 3 [1 pt] [3/35]
Does the sign of the regression coefficient associated with the Temperature variable
indicate that:
a. failure risk increases when temperature increases?
b. failure risk decreases when temperature increases?
Report the correct choice (a or b):
b
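A short derivation shows why the sign alone settles the question. The notation is assumed here: $p(x)$ is the failure probability at temperature $x$, and $b_0$, $b_1$ are the intercept and Temperature coefficient of the univariate model.

```latex
\log\frac{p(x)}{1 - p(x)} = b_0 + b_1 x
\quad\Longrightarrow\quad
\frac{dp}{dx} = b_1 \, p(x)\,\bigl(1 - p(x)\bigr)
```

Because $p(x)\,(1 - p(x)) > 0$ for every $x$, the slope $dp/dx$ has the same sign as $b_1$. The estimate reported above is negative ($-0.2322$), so the predicted failure probability decreases as temperature increases.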
Question 4 [1 pt] [4/35]
Compute and print the estimated odds ratios and 95% confidence intervals of the
coefficients of the model lrmod1.
exp(coefficients(lrmod1))
##  (Intercept)  Temperature
## 3.412315e+06 7.928171e-01
exp(confint(lrmod1, level=0.95))
## Waiting for profiling to be done...
##                  2.5 %       97.5 %
## (Intercept) 27.9546841 8.214986e+14
## Temperature  0.5972188 9.409919e-01
By how much do the odds of failure change (increase or decrease: choose the
correct option) if we decrease the temperature by 10°F?
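# computation here:
# exp(b1)^10 is the odds multiplier for a 10°F increase in temperature;
# its reciprocal is the multiplier for a 10°F decrease
exp(coefficients(lrmod1))[2]^10
## Temperature
## 0.09811378
increase or decrease ? by a factor of ?
Increase, by a factor of 1/0.09811378 ≈ 10.19. The value 0.09811378 is the factor
for a 10°F increase, so lowering the temperature by 10°F multiplies the odds of
failure by its reciprocal.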
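The direction of the change can be checked numerically. The sketch below assumes the rounded Temperature estimate of -0.2322 from the summary of lrmod1; `odds_multiplier` is a hypothetical helper introduced for illustration, not part of the assignment. In the fitted session, `coef(lrmod1)[2]` would replace the rounded value.

```r
# Assumed input: rounded Temperature estimate from the printed summary.
b1 <- -0.2322

# Hypothetical helper: odds multiplier for a change of `delta` degrees F.
odds_multiplier <- function(b1, delta) exp(b1 * delta)

up10   <- odds_multiplier(b1, 10)    # 10°F warmer: multiplier below 1
down10 <- odds_multiplier(b1, -10)   # 10°F colder: multiplier above 1

print(c(up10 = up10, down10 = down10))

# The two multipliers are reciprocals of each other.
stopifnot(isTRUE(all.equal(up10 * down10, 1)))
```

With a negative coefficient, a temperature decrease always multiplies the odds by a factor greater than 1.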
Question 5 [2 pts] [6/35]
We want to produce a plot that shows the following elements:
1. the temperatures in F are on the x-axis
2. observations (Failure=0 or 1) are shown on the plot
3. the curve of the probability of Failure=1, as a function of Temperature
4. a horizontal red line passing through all failure observations (Failure=1)
5. a horizontal green line passing through all non-failure observations
(Failure=0)
6. 3 vertical lines from 0 to 1 passing through the temperatures 31, 41 and 51
(Fahrenheit)
HINT-1: to add the curve, you can call the function curve like this:
curve(f(x), add = TRUE)
where you will have to replace f(x) by the function predicted by the model.
HINT-2: Remember that, for a univariate logistic regression model with parameters
b0 and b1, the prediction formula is:
$$P(Y = 1 \mid X) = \frac{\exp(b_0 + b_1 X)}{1 + \exp(b_0 + b_1 X)}$$
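The prediction formula translates directly into an R function that can be passed to curve. This is a minimal sketch: the values of b0 and b1 are the rounded estimates from the printed summary of lrmod1, and `p_failure` is a hypothetical helper name. In the fitted session, `coef(lrmod1)` would supply the unrounded values.

```r
# Assumed inputs: rounded estimates from the printed summary of lrmod1.
b0 <- 15.0429
b1 <- -0.2322

# Hypothetical helper: predicted failure probability at temperature x (°F).
p_failure <- function(x, b0, b1) {
  eta <- b0 + b1 * x            # linear predictor
  exp(eta) / (1 + exp(eta))     # logistic transform
}

# Evaluate at the three temperatures marked by the vertical lines.
temps <- c(31, 41, 51)
probs <- p_failure(temps, b0, b1)
print(round(probs, 4))

# plogis() is R's built-in logistic function and gives the same values.
stopifnot(isTRUE(all.equal(probs, plogis(b0 + b1 * temps))))
```

Once a plot exists, a call such as `curve(p_failure(x, b0, b1), add = TRUE)` would overlay the fitted curve on it.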
# call the plot function
# use this label for the x-axis: Temperature (°F)
# use this label for the y-axis: Failure probability
# use the temperature range 25,90
# use a point choice (pch) of 16
plot(dc$Temperature, dc$Failure, xlab = "Temperature (°F)",
     ylab = "Failure probability", xlim = c(25, 90), pch = 16)