Introduction to Regression in R (Part 2: Regression Diagnostics)

IDRE Statistical Consulting Group

2.0 Regression Diagnostics

In the previous part, we learned how to run an ordinary linear regression in R. Without verifying that the data meet the assumptions underlying OLS regression, the results of a regression analysis may be misleading.

Here we will explore how you can use R to check how well your data meet the assumptions of OLS regression. In particular, we will consider the assumptions and related issues addressed in the sections below.

Additionally, there are issues that can arise during the analysis that, while not strictly assumptions of regression, are nonetheless of great concern to data analysts.

Many graphical methods and numerical tests have been developed over the years for regression diagnostics.

R provides many of these methods in the stats package, which is installed and loaded by default. Other tools live in additional packages that we can use after installing and loading them into our R environment.

install.packages("car")

# install packages for part 2, Regression Diagnostics

#install.packages("car") 
#install.packages("alr3")
#install.packages("faraway")

#loading packages into working environment
library(car)
## Warning: package 'car' was built under R version 3.5.2
## Warning: package 'carData' was built under R version 3.5.2
library(alr3)
## Warning: package 'alr3' was built under R version 3.5.2
library(faraway)
## Warning: package 'faraway' was built under R version 3.5.2

2.1 Unusual and Influential data

A single observation that is substantially different from all other observations can make a large difference in the results of your regression analysis. If a single observation (or small group of observations) substantially changes your results, you would want to know about this and investigate further. There are three ways that an observation can be unusual: it can be an outlier (an unusual value of the outcome given its predictors), a high-leverage point (an unusual combination of predictor values), or an influential point (one that substantially changes the coefficient estimates). We look at each of these below.

We use model m2 from our last multiple regression, in which we predicted api00 from the independent variables enroll, meals, and full.

#multiple regression model of DV api00 on IVs enroll, meals, and full
m2 <- lm(api00 ~  enroll + meals + full, data = d)

Let’s make a scatterplot matrix of api00 with enroll, meals, and full so we can get a better view of these pairwise relationships.

scatterplotMatrix(~ api00 + enroll + meals + full, data = d)

The graphs of api00 with the other variables show some potential problems. In every plot, we see one or more data points that are far away from the rest of the data.

We will go over some graphical methods to identify influential points and potential outliers.

We are not going to discuss or justify the underlying methods and formulas here.

Studentized residuals can be used to identify outliers. In R we use the rstandard() function to compute (internally) studentized residuals.

res.std <- rstandard(m2) #standardized (internally studentized) residuals stored in vector res.std
#plot standardized residuals on the y axis; the x axis is the index (row number)
plot(res.std, ylab="Standardized Residual", ylim=c(-3.5,3.5))
#add horizontal lines at -3, 0, and 3 to identify extreme values
abline(h =c(-3,0,3), lty = 2)
#find out which data points are outside of the 3 standard deviation cut-off
#index holds the row numbers of those points
index <- which(res.std > 3 | res.std < -3)
#add the school number next to points that have extreme values
text(index-20, res.std[index] , labels = d$snum[index])

#print row number of values that are out of bounds
print(index)
## 226 
## 226
#print school names of points that are out of bounds
print(d$snum[index])
## [1] 211

We should pay attention to studentized residuals that exceed +2 or -2, become more concerned about residuals that exceed +2.5 or -2.5, and still more concerned about residuals that exceed +3 or -3. These results show that school number 211 is the most worrisome observation.
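
As a quick sketch of how these rules of thumb could be applied in code (using the res.std vector computed above):

#count residuals beyond each rule-of-thumb cutoff (a small sketch)
sum(abs(res.std) > 2)    #worth a closer look
sum(abs(res.std) > 2.5)  #more concerning
sum(abs(res.std) > 3)    #most worrisome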

Note: To get a residual with a t-distribution we use externally studentized residuals, which exclude each point in turn when calculating the standard deviation of the residuals. In R we can use the function rstudent(model) to get the externally studentized residuals.
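
For comparison, a minimal sketch contrasting the two kinds of residuals for the first few observations:

res.int <- rstandard(m2)  #internally studentized (standardized) residuals
res.ext <- rstudent(m2)   #externally studentized residuals
head(cbind(internal = res.int, external = res.ext))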

We can run a formal test for outliers using the function outlierTest(m2) from the package "car".

#Bonferroni p-values for testing outliers
outlierTest(m2)
## No Studentized residuals with Bonferonni p < 0.05
## Largest |rstudent|:
##      rstudent unadjusted p-value Bonferonni p
## 226 -3.151186            0.00175      0.70001

This is a test of the null hypothesis that we have no outliers against the alternative that we have at least one outlier.
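
To see where these numbers come from, here is a rough sketch of the Bonferroni calculation behind outlierTest(), assuming the usual t-distribution with n - p - 1 degrees of freedom for externally studentized residuals (p counting the intercept):

r <- rstudent(m2)                    #externally studentized residuals
n <- nrow(d); p <- length(coef(m2))  #sample size and number of coefficients
p.unadj <- 2 * pt(abs(r), df = n - p - 1, lower.tail = FALSE)  #two-sided p-values
min(p.unadj)              #unadjusted p-value of the largest |rstudent|
min(1, min(p.unadj) * n)  #Bonferroni adjustment: multiply by the number of tests (capped at 1)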

Now let’s look at the leverage values to identify observations that could have great influence on the regression coefficient estimates.

One way to find points with high leverage is to use a half-normal plot from the package "faraway". First we need to compute the diagonal of the hat matrix as a measure of leverage.

#a vector containing the diagonal of the 'hat' matrix
h <- influence(m2)$hat
#half-normal plot of leverage from package faraway
halfnorm(h, ylab = "leverage")
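
As a small follow-up sketch, we can also print the largest leverage values directly from the vector h computed above:

#largest hat values (leverage), sorted in decreasing order
head(sort(h, decreasing = TRUE))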

Cook’s distance is a measure of influence. A point with a large Cook’s distance is considered a highly influential point.

A cutoff value for Cook’s distance can be calculated as

\[ \frac{4}{n - p - 1} \] where n is the sample size and p is the number of predictors.

We can plot Cook’s distance using the following code:

#the cutoff value for Cook's distance
cutoff <- 4/((nrow(d)-length(m2$coefficients)-2))
plot(m2, which = 4, cook.levels = cutoff)
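
A short sketch that uses the cutoff computed above to list the observations with a large Cook's distance:

#Cook's distance for each observation, from the stats package
cooks <- cooks.distance(m2)
#row numbers of observations exceeding the cutoff
which(cooks > cutoff)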

We can use the influencePlot() function in the package "car" to identify influential points. It plots studentized residuals against leverage, with circle size proportional to Cook’s distance.

#cook's distance, studentized residuals, and leverage in the same plot
influencePlot(m2, main="Influence Plot", 
              sub="Circle size is proportional to Cook's Distance" )

##         StudRes        Hat         CookD
## 8    0.18718812 0.08016299 0.00076527787
## 93   2.76307269 0.02940688 0.05687488372
## 210  0.03127861 0.06083329 0.00001588292
## 226 -3.15118603 0.01417076 0.03489753461
## 346 -2.83932062 0.00412967 0.00821116981

Finally, the infIndexPlot() function gives us a series of index plots that we can use to investigate influential points.

#4 diagnostic plots to identify influential points
infIndexPlot(m2)

2.2 Checking Homoscedasticity

One of the main assumptions for the ordinary least squares regression is the homogeneity of variance of the residuals. If the model is well-fitted, there should be no pattern to the residuals plotted against the fitted values. If the variance of the residuals is non-constant then the residual variance is said to be “heteroscedastic.” There are graphical and non-graphical methods for detecting heteroscedasticity. A commonly used graphical method is to plot the residuals versus fitted (predicted) values.

#residuals vs. fitted values plot to check homoscedasticity
plot(m2$resid ~ m2$fitted.values)
#add horizontal line at 0
abline(h = 0, lty = 2)
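
As a non-graphical complement, the car package provides ncvTest(), a score test of the null hypothesis of constant error variance; a minimal sketch:

#score test of constant error variance against variance changing with the fitted values
car::ncvTest(m2)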

2.3 Checking Linearity

To check linearity, the residuals should be plotted against the fitted values as well as against the other predictors. If any of these plots show systematic shapes, then the linear model is not appropriate and some nonlinear terms may need to be added. In the package car, the function residualPlots() produces those plots. It also gives us a test of the linear model against a model that adds a quadratic term for each variable (a test for curvature).

#residual vs. fitted value and all predictors plus test for curvature
residualPlots(m2)

##            Test stat Pr(>|Test stat|)
## enroll        0.0111           0.9911
## meals        -0.6238           0.5331
## full          1.1565           0.2482
## Tukey test   -0.8411           0.4003

2.4 Issues of Independence

A simple visual check is to plot the residuals against a time or ordering variable. Our data are not a time series, so here we use the school number as an ordering variable.

plot(m2$resid ~ d$snum)   #residual plot vs. school id 
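
If the data did have a natural ordering, a common non-graphical check for serial correlation in the residuals is the Durbin-Watson test from the car package; a sketch, keeping in mind that our data are not a true time series:

#Durbin-Watson test for autocorrelation of the residuals (car package)
car::durbinWatsonTest(m2)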

2.5 Checking Normality of Residuals

Many researchers believe that multiple regression requires normality. This is not the case. Normality of residuals is only required for valid hypothesis testing, that is, the normality assumption assures that the p-values for the t-tests and F-test will be valid. Normality is not required in order to obtain unbiased estimates of the regression coefficients. OLS regression merely requires that the residuals (errors) be independently and identically distributed. Furthermore, there is no assumption or requirement that the predictor variables be normally distributed. If this were the case, then we would not be able to use dummy-coded variables in our models.

Furthermore, by large-sample theory, if we have a large enough sample size we do not even need the residuals to be normally distributed.

However, for small sample sizes the normality assumption is required.

To check normality we use a normal quantile-quantile (Q-Q) plot of the residuals.

qqnorm(m2$resid)  #Normal Quantile to Quantile plot
qqline(m2$resid)  
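
As an optional extra check, one can also look at a histogram of the residuals or run a formal test such as the Shapiro-Wilk test from the stats package (which tends to be very sensitive in large samples); a small sketch:

hist(m2$resid, breaks = 20)  #histogram of the residuals
shapiro.test(residuals(m2))  #Shapiro-Wilk normality test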

2.6 Checking for Multicollinearity

When there is a perfect linear relationship among the predictors, the estimates for a regression model cannot be uniquely computed. The term collinearity implies that two variables are near perfect linear combinations of one another. When more than two variables are involved it is often called multicollinearity, although the two terms are often used interchangeably.

The primary concern is that as the degree of multicollinearity increases, the regression model estimates of the coefficients become unstable and the standard errors for the coefficients can get wildly inflated.

VIF, variance inflation factor, is used to measure the degree of multicollinearity. As a rule of thumb, a variable whose VIF values are greater than 10 may merit further investigation. Tolerance, defined as 1/VIF, is used by many researchers to check on the degree of collinearity. A tolerance value lower than 0.1 is comparable to a VIF of 10. It means that the variable could be considered as a linear combination of other independent variables.

In R we can use the function vif() from the package car or faraway. Since there are two packages in our environment that include this function, we use car::vif() to tell R that we want to run the function from the package car.

car::vif(m2)  #variance inflation factor
##   enroll    meals     full 
## 1.135733 1.394279 1.482305
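
Since tolerance is defined as 1/VIF, we can compute it directly from the same output; a minimal sketch:

1 / car::vif(m2)  #tolerance: values below 0.1 suggest problematic collinearity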

2.7 Model Specification

A model specification error can occur when one or more relevant variables are omitted from the model or one or more irrelevant variables are included in the model. If relevant variables are omitted from the model, the common variance they share with included variables may be wrongly attributed to those variables, and the error term is inflated. On the other hand, if irrelevant variables are included in the model, the common variance they share with included variables may be wrongly attributed to them. Model specification errors can substantially affect the estimate of regression coefficients.

There are many methods and statistical tests to check whether or not a variable is relevant to the model. To visualize the effect of each variable in the model we can use an added-variable plot, also called a partial-regression plot. The added-variable plot is a scatter plot of the residuals of the model that excludes one variable, plotted against the residuals of a model that uses the excluded variable as the dependent variable predicted by the other predictors. The slope of the simple regression between those two sets of residuals equals the coefficient of the excluded variable in the full model. If the slope is not close to zero, we conclude that the variable is relevant to the model.
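
As an illustrative sketch (using only the variables already in model m2), we can verify this slope property by hand for enroll:

#residuals of the model that excludes enroll
r.y <- residuals(lm(api00 ~ meals + full, data = d))
#residuals of the model that predicts enroll from the other predictors
r.x <- residuals(lm(enroll ~ meals + full, data = d))
#slope of the simple regression of r.y on r.x ...
coef(lm(r.y ~ r.x))["r.x"]
#... equals the coefficient of enroll in the full model m2
coef(m2)["enroll"]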

Added-variable plots are also useful for detecting influential points.

avPlots(m2) #added variable plots from package "car"