17. Regression
Regression is performed in R using the linear model function lm(). This can perform single and multiple linear regression analyses. Similar to the other analyses, it’s best to create an object to store the results, and then ask for a summary. Note that we should use the factors we created earlier for boot type (bootsf) and gender (genderf). Suppose we want to know whether age, gender and type of boots can predict boot comfort score:
fit_comfort_age_boots_gender <- lm(comfort ~ age + genderf + bootsf, data = data)
summary(fit_comfort_age_boots_gender)
We see that overall, the type of boot worn is strongly associated with boot comfort. Interestingly, we see that female recruits give lower comfort scores than males. This is because when this study was undertaken in about 2000, female boots were actually made from male boots lasts, and so provided a sub-optimal fit. This has now been fixed, and female recruits have their own lasts.
Other useful aspects of the analysis include:
coefficients(fit_comfort_age_boots_gender) # model coefficients
confint(fit_comfort_age_boots_gender, level=0.95) # Confidence Intervals for model parameters, here set to 95%
fitted(fit_comfort_age_boots_gender) # predicted values
residuals(fit_comfort_age_boots_gender) # residuals
anova(fit_comfort_age_boots_gender) # anova table
vcov(fit_comfort_age_boots_gender) # covariance matrix for model parameters
influence(fit_comfort_age_boots_gender) # regression diagnostics
To undertake regression diagnostics, try this:
layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page
plot(fit_comfort_age_boots_gender)
Which looks reasonable, but we can also undertake a Shapiro-Wilks test on the residuals to check if they are normally distributed. First you will need to create an object containing the residuals. This is done using the command we used above, but making it produce and store the result as an object. the residuals
fit_res <- residuals(fit_comfort_age_boots_gender)
shapiro.test(fit_res)
Which shows that the residuals aren’t normally distributed, but this isn’t uncommon with smaller datasets.