Introduction to Regression in R (Part 3: Regression with Categorical Predictors)

IDRE Statistical Consulting Group

3.0 Introduction

In the previous two chapters, we focused on regression analyses using continuous variables. It is also possible to include categorical predictors in a regression analysis, but doing so requires some extra work in performing the analysis and in properly interpreting the results. This chapter will illustrate how you can include categorical predictors in your analysis in R and describe how to interpret the results of such analyses.

This chapter will use the elemapi2v2 data that you have seen in the prior chapters. We will focus on three variables: api00 as the dependent variable, and yr_rnd and mealcat as categorical independent variables. mealcat is a categorical variable that breaks meals up into 3 categories.

Let’s have a quick look at these variables.

class(d$api00)   #class of api00
## [1] "integer"
class(d$yr_rnd)  #class of yr_rnd
## [1] "integer"
class(d$mealcat) #class of mealcat
## [1] "integer"

The class of all three variables is integer. However, since yr_rnd and mealcat take only 2 and 3 distinct values, respectively, their summary statistics are not very informative.

#summary of api00, yr_rnd, and mealcat
summary(d$api00)   
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   369.0   523.8   643.0   647.6   762.2   940.0
summary(d$yr_rnd)  
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00    0.23    0.00    1.00
summary(d$mealcat) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   2.000   2.015   3.000   3.000

The variable api00 is a continuous measure of the performance of the schools, while yr_rnd and mealcat are categorical variables.

Let’s look at the frequency tables of yr_rnd and mealcat.

table(d$yr_rnd)    #frequency table for yr_rnd
## 
##   0   1 
## 308  92
table(d$mealcat)   #frequency table for mealcat
## 
##   1   2   3 
## 131 132 137

The variable yr_rnd takes the value 0 for a school that is not year-round and 1 for a school that is year-round.

The variable meals is the percentage of students who are receiving state-sponsored free meals and can be used as an indicator of poverty. It was broken into 3 categories (to make roughly equally sized groups), creating the variable mealcat as follows:

mealcat     meals
  1         0-46% free meals
  2         47-80% free meals
  3         81-100% free meals
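
mealcat comes pre-made in the data, but a variable like this can be built from a percentage variable with cut(). A minimal sketch on made-up values (the cutpoints follow the table above; the data are not the real meals values):

# toy sketch: binning a percentage into 3 categories with cut()
meals_toy <- c(10, 46, 47, 80, 81, 100)            # made-up percentages
mealcat_toy <- cut(meals_toy,
                   breaks = c(-Inf, 46, 80, Inf),  # 0-46, 47-80, 81-100
                   labels = c(1, 2, 3))
table(mealcat_toy)
## mealcat_toy
## 1 2 3 
## 2 2 2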

In R, categorical variables are best stored as the class factor, and if we want lm() to treat a variable as categorical, we need to convert it to a factor.

The function factor() is used to encode a variable (a vector) as a factor. We use this function to create factor versions of yr_rnd and mealcat and include them as two new columns in the data frame d.

#create two new variables as factors and add them to the data frame
d$yr_rnd_F <- factor(d$yr_rnd)
d$mealcat_F <- factor(d$mealcat)

Now let’s look at the class of new variables

#class of new categorical variables
class(d$yr_rnd_F)
## [1] "factor"
class(d$mealcat_F)
## [1] "factor"

The function levels() returns the levels attribute of a factor.

#levels of factor
levels(d$yr_rnd_F)
## [1] "0" "1"
levels(d$mealcat_F)
## [1] "1" "2" "3"
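
Optionally, factor() can attach descriptive labels to the levels. A toy sketch (the label strings here are our own choice, not from the data):

# toy example: giving factor levels descriptive labels
yr_toy <- c(0, 1, 0, 0, 1)
yr_toy_F <- factor(yr_toy, levels = c(0, 1),
                   labels = c("not year-round", "year-round"))
levels(yr_toy_F)
## [1] "not year-round" "year-round"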

Calling summary() on a factor now returns a frequency table.

#summary of factor
summary(d$yr_rnd_F)
##   0   1 
## 308  92
summary(d$mealcat_F)
##   1   2   3 
## 131 132 137

3.1 Regression with a two-level factor predictor

The simplest example of a categorical predictor in a regression analysis is a 0/1 variable, also called a dummy variable. In R, when we include a factor as a predictor in a model, R generates dummy variables for the categories of the factor. Let’s use yr_rnd_F as the predictor variable and api00 as the response variable.

#regression of api00 with yr_rnd_F
m3 <- lm(api00 ~ yr_rnd_F, data = d)
summary(m3)
## 
## Call:
## lm(formula = api00 ~ yr_rnd_F, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -273.539  -95.662    0.967  103.341  297.967 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   684.54       7.14   95.88   <2e-16 ***
## yr_rnd_F1    -160.51      14.89  -10.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 125.3 on 398 degrees of freedom
## Multiple R-squared:  0.226,  Adjusted R-squared:  0.2241 
## F-statistic: 116.2 on 1 and 398 DF,  p-value: < 2.2e-16
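
To see the dummy coding R creates behind the scenes, we can inspect the design matrix with model.matrix(). A toy sketch with a small two-level factor (the data are made up):

# toy illustration: the design matrix R builds for a two-level factor
f <- factor(c(0, 1, 1, 0))
model.matrix(~ f)[, ]  # subsetting drops the contrast attributes for a cleaner printout
##   (Intercept) f1
## 1           1  0
## 2           1  1
## 3           1  1
## 4           1  0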

This may seem odd at first, but this is a legitimate analysis. But what does this mean? Let’s go back to basics and write out the regression equation that this model implies. yr_rnd_F1 is a dummy variable for category (level) “1” of yr_rnd_F: it is 0 when yr_rnd_F is not at level “1” and 1 when it is.

api00 = Intercept + Byr_rnd * yr_rnd_F1

where Intercept is the intercept (or constant) and we use Byr_rnd to represent the coefficient for variable yr_rnd_F1. Filling in the values from the regression equation, we get

api00 = 684.539 + -160.5064 * yr_rnd_F1

If a school is not a year-round school (i.e. yr_rnd_F1 is 0) the regression equation would simplify to

api00 = Intercept    + 0 * Byr_rnd 
api00 = 684.539      + 0 * -160.5064  
api00 = 684.539

If a school is a year-round school, the regression equation would simplify to

api00 = Intercept + 1 * Byr_rnd 
api00 = 684.539   + 1 * -160.5064 
api00 = 524.0326
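
This simplification can be checked on toy numbers: with a single two-level factor, the fitted values from predict() are exactly the two group means. A sketch on made-up data:

# toy check: fitted values from a dummy-variable regression are the group means
y <- c(5, 7, 9, 11)
g <- factor(c(0, 0, 1, 1))
fit <- lm(y ~ g)
unique(predict(fit))  # one prediction per group
## [1]  6 10
tapply(y, g, mean)    # matches the group means
##  0  1 
##  6 10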

We can graph the observed values and the predicted values by plotting api00 against the original integer variable yr_rnd. Although yr_rnd takes only 2 values, we can still draw a regression line showing the relationship between yr_rnd and api00. Based on the results above, the predicted value for non-year-round schools is 684.539, the predicted value for year-round schools is 524.0326, and the slope of the line is negative, which makes sense since the coefficient for yr_rnd is negative (-160.5064).

*Note that if you plot api00 against the factor yr_rnd_F you will get side-by-side boxplots instead.

#scatter plot api00 against yr_rnd
plot(api00 ~ yr_rnd, data = d)
abline(m3)  # add regression line to the plot

Let’s compare these predicted values to the mean api00 scores for the year-round and non-year-round schools.

# mean of api00 when yr_rnd_F is at level "0". school is not year round
mean(d$api00[d$yr_rnd_F == "0"])
## [1] 684.539
# mean of api00 when yr_rnd_F is at level "1". school is year round
mean(d$api00[d$yr_rnd_F == "1"])
## [1] 524.0326

You can use the function aggregate() to do the same.

#using aggregate to find the mean for each group of year school round
aggregate(api00 ~ yr_rnd_F, FUN=mean, data = d)
##   yr_rnd_F    api00
## 1        0 684.5390
## 2        1 524.0326

As you see, the regression equation predicts that the value of api00 will be the mean value, depending on whether a school is a year round school or non-year round school.

Let’s relate these predicted values back to the regression equation. For the non-year-round schools, their mean is the same as the intercept (684.539). The coefficient for yr_rnd_F1 is the amount we must add to the intercept to get the mean for the year-round schools, i.e., we add -160.5064 to 684.539 to get 524.0326, the mean for the year-round schools. In other words, Byr_rnd is the mean api00 score for the year-round schools minus the mean api00 score for the non-year-round schools, i.e., mean(year-round) - mean(non-year-round).
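
This identity is easy to verify on toy numbers: the dummy coefficient equals the difference in group means (the data below are made up):

# toy check: the dummy coefficient is mean(group 1) - mean(group 0)
y <- c(2, 4, 9, 13)
g <- factor(c(0, 0, 1, 1))
fit <- lm(y ~ g)
coef(fit)["g1"]
## g1 
##  8
mean(y[g == "1"]) - mean(y[g == "0"])
## [1] 8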

It may be surprising to note that this regression analysis with a single dummy variable is the same as doing a t-test comparing the mean api00 for the year-round schools with the non-year-round schools (see below). You can see that the t value below is the same (up to sign) as the t value for yr_rnd in the regression above. This is because Byr_rnd compares the year-round and non-year-round schools (since the coefficient is mean(year-round) - mean(non-year-round)).

#t test for equality of mean of api00 for two group of year round and not year round 
#with equal variance assumption
t.test(api00 ~ yr_rnd_F, data = d, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  api00 by yr_rnd_F
## t = 10.782, df = 398, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  131.2390 189.7737
## sample estimates:
## mean in group 0 mean in group 1 
##        684.5390        524.0326

Since a t-test is the same as doing an anova, we can get the same results using the anova command as well.

#anova table
anova(m3)
## Analysis of Variance Table
## 
## Response: api00
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## yr_rnd_F    1 1825001 1825001  116.24 < 2.2e-16 ***
## Residuals 398 6248671   15700                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If we square the t-value from the t-test, we get the same value as the F-value from the anova.

#square of t value from the t-test is the same as F value from anova
10.7815 ^ 2
## [1] 116.2407

3.2 Regression with a multicategory predictor

Say that we would like to examine the relationship between the amount of poverty and api scores. We don’t have a direct measure of poverty, but we can use mealcat as a proxy. We have created a factor version of mealcat named mealcat_F, and we can use it to regress api00 on mealcat_F.

#regression model of api00 against categorical variable mealcat_F with 3 levels
m4 <- lm(api00 ~ mealcat_F, data = d)
summary(m4)
## 
## Call:
## lm(formula = api00 ~ mealcat_F, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -253.394  -47.883    0.282   52.282  185.620 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  805.718      6.169  130.60   <2e-16 ***
## mealcat_F2  -166.324      8.708  -19.10   <2e-16 ***
## mealcat_F3  -301.338      8.629  -34.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 70.61 on 397 degrees of freedom
## Multiple R-squared:  0.7548, Adjusted R-squared:  0.7536 
## F-statistic: 611.1 on 2 and 397 DF,  p-value: < 2.2e-16

The F-test on the last line of the output tests the overall differences among the three group means, and it shows that these differences are significant.

The interpretation of the coefficients is much like that for the binary variables. Group 1 is the omitted group, so Intercept is the mean for group 1. The coefficient for mealcat_F2 is the mean for group 2 minus the mean of the omitted group (group 1). And the coefficient for mealcat_F3 is the mean of group 3 minus the mean of group 1. You can verify this by comparing the coefficients with the means of the groups.

#aggregate the mean of api00 for each category in mealcat_F
aggregate(api00 ~ mealcat_F, FUN=mean, data = d)
##   mealcat_F    api00
## 1         1 805.7176
## 2         2 639.3939
## 3         3 504.3796
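
The group means above line up with the m4 coefficients: each non-reference mean is the intercept plus its coefficient (numbers copied from the summary above, so they are rounded):

# reconstructing the group means from the m4 coefficients
805.718                # group 1 (reference): the intercept
## [1] 805.718
805.718 + (-166.324)   # group 2: intercept + mealcat_F2
## [1] 639.394
805.718 + (-301.338)   # group 3: intercept + mealcat_F3
## [1] 504.38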

Based on these results, we can say that the three groups differ in their api00 scores; in particular, group 2 is significantly different from group 1 (because mealcat_F2 was significant) and group 3 is significantly different from group 1 (because mealcat_F3 was significant).

We can change the reference group by releveling the factor mealcat_F.

#relevel factor mealcat_F and make group "3" as the reference group
d$mealcat_F <- relevel(d$mealcat_F, ref = "3")
m5 <- lm(api00 ~ mealcat_F, data = d)
summary(m5)
## 
## Call:
## lm(formula = api00 ~ mealcat_F, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -253.394  -47.883    0.282   52.282  185.620 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  504.380      6.033   83.61   <2e-16 ***
## mealcat_F1   301.338      8.629   34.92   <2e-16 ***
## mealcat_F2   135.014      8.612   15.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 70.61 on 397 degrees of freedom
## Multiple R-squared:  0.7548, Adjusted R-squared:  0.7536 
## F-statistic: 611.1 on 2 and 397 DF,  p-value: < 2.2e-16

With group 3 omitted, the intercept is now the mean of group 3, the coefficient for mealcat_F1 is mean(group 1) - mean(group 3), and mealcat_F2 is mean(group 2) - mean(group 3). We see that both coefficients are significant, indicating that group 1 is significantly different from group 3 and group 2 is significantly different from group 3.

We can also combine continuous and categorical predictors, use two or more categorical variables, or add interactions between predictors in a regression model. However, interpreting those more complicated models is harder and is beyond the scope of this seminar.
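
For reference only, the formula syntax for those richer models looks like this. A toy sketch on made-up data (interpretation is left for a later seminar):

# toy sketch of formula syntax for two factors and their interaction
y  <- c(5, 7, 9, 11, 6, 8)
f1 <- factor(c(0, 0, 1, 1, 0, 1))
f2 <- factor(c(1, 2, 1, 2, 1, 2))
names(coef(lm(y ~ f1 + f2)))  # additive model: main effects only
names(coef(lm(y ~ f1 * f2)))  # also includes the f1:f2 interaction term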