IDRE Statistical Consulting Group
a) A quick overview of RStudio environment
Before we begin, let’s take a look at the RStudio environment. RStudio is an integrated development environment (IDE) that makes R easier to use. The RStudio interface has four panels: the Console, where commands are typed and executed; the R Script (source editor), where code is written and saved; the Environment/History panel, which lists the objects in the workspace and the command history; and the panel containing the Files, Plots, Packages, and Help tabs, described below.
To run a command or a set of commands from the R Script, select them first and press Ctrl + Enter.
The Files tab is for viewing and managing the directories and files on the computer's hard drive. From the Files tab we can navigate to a directory and set it as the working directory.
The Plots tab is for viewing and managing plots. By clicking Export we can save our plots as JPEG or PDF files.
The Packages tab shows a list of all the R packages installed on the computer. We can load a package by checking its box, and we can also install a new package from this tab.
Another way to manage R packages is to run the function install.packages("package name") in the R console or from an R script to install a new package, and the function library(package name) to load an installed package.
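For example (using ggplot2 purely as an illustrative package name):
#install a package once, then load it in each new R session
install.packages("ggplot2") #downloads and installs the package from CRAN
library(ggplot2) #loads the installed package into the current session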
The Help tab is the help window for viewing the documentation of R functions. We can search the help files from this tab, or we can use ?function name or help(function name) from the console; the help documentation will then appear in the Help tab.
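For example, to view the documentation for the read.csv() function that we will use below:
?read.csv #opens the help page for read.csv in the Help tab
help("read.csv") #equivalent to the above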
b) Loading the data in RStudio
Now let’s read our dataset into the R workspace. The dataset used in this portion of the seminar is located here: elemapi2v2
You can download the data, save it on your local computer, and load it from there, or you can read it directly from the web URL. The data is in .csv format and we use the read.csv() function to read it and load it into R. The resulting value will be a data.frame object that will appear in the Environment tab. By clicking on the data.frame object we can view it as a spreadsheet. The following code reads and loads the data into the R workspace, first from the URL and then from the local computer:
d <- read.csv(
"https://stats.idre.ucla.edu/wp-content/uploads/2019/02/elemapi2v2.csv")
#We can also read the data from the local computer
#we can set the working directory by setwd()
#d <- read.csv("c:/.../elemapi2v2.csv",
# header = TRUE, sep = ",")
class(d) #class of object d, returns data.frame
## [1] "data.frame"
names(d) #returns names of variables (columns)
## [1] "snum" "dnum" "api00" "api99" "growth" "meals"
## [7] "ell" "yr_rnd" "mobility" "acs_k3" "acs_46" "not_hsg"
## [13] "hsg" "some_col" "col_grad" "grad_sch" "avg_ed" "full"
## [19] "emer" "enroll" "mealcat" "collcat" "abv_hsg" "lgenroll"
dim(d) #returns number of rows and columns of data frame
## [1] 400 24
As we mentioned, the result of read.csv() is a data frame object. A data frame is a list of variables with the same number of rows and with unique row names, given the class “data.frame”. In the data frame d we have 24 variables and 400 observations.
c) Exploring the data
As researchers, we first need to make sure that the data we have cleaned holds plausible values. Descriptive analysis helps us understand the data better and investigate whether there are any problems in it. In this workshop we skip this part, but in a real study descriptive analysis is an important part of good data analysis.
Here we take a look at the structure of the data and summary statistics.
str(d) #returns structure of variables in the data frame d
## 'data.frame': 400 obs. of 24 variables:
## $ snum : int 906 889 887 876 888 4284 4271 2910 2899 2887 ...
## $ dnum : int 41 41 41 41 41 98 98 108 108 108 ...
## $ api00 : int 693 570 546 571 478 858 918 831 860 737 ...
## $ api99 : int 600 501 472 487 425 844 864 791 838 703 ...
## $ growth : int 93 69 74 84 53 14 54 40 22 34 ...
## $ meals : int 67 92 97 90 89 10 5 2 5 29 ...
## $ ell : int 9 21 29 27 30 3 2 3 6 15 ...
## $ yr_rnd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mobility: int 11 33 36 27 44 10 16 44 10 17 ...
## $ acs_k3 : int 16 15 17 20 18 20 19 20 20 21 ...
## $ acs_46 : int 22 32 25 30 31 33 28 31 30 29 ...
## $ not_hsg : int 0 0 0 36 50 1 1 0 2 8 ...
## $ hsg : int 0 0 0 45 50 8 4 4 9 25 ...
## $ some_col: int 0 0 0 9 0 24 18 16 15 34 ...
## $ col_grad: int 0 0 0 9 0 36 34 50 42 27 ...
## $ grad_sch: int 0 0 0 0 0 31 43 30 33 7 ...
## $ avg_ed : num NA NA NA 1.91 1.5 ...
## $ full : int 76 79 68 87 87 100 100 96 100 96 ...
## $ emer : int 24 19 29 11 13 0 0 2 0 7 ...
## $ enroll : int 247 463 395 418 520 343 303 1513 660 362 ...
## $ mealcat : int 2 3 3 3 3 1 1 1 1 1 ...
## $ collcat : int 1 1 1 1 1 2 2 2 2 3 ...
## $ abv_hsg : int 100 100 100 64 50 99 99 100 98 92 ...
## $ lgenroll: num 2.39 2.67 2.6 2.62 2.72 ...
summary(d) #Summary statistics of variables in d
## snum dnum api00 api99
## Min. : 58 Min. : 41.0 Min. :369.0 Min. :333.0
## 1st Qu.:1720 1st Qu.:395.0 1st Qu.:523.8 1st Qu.:484.8
## Median :3008 Median :401.0 Median :643.0 Median :602.0
## Mean :2867 Mean :457.7 Mean :647.6 Mean :610.2
## 3rd Qu.:4198 3rd Qu.:630.0 3rd Qu.:762.2 3rd Qu.:731.2
## Max. :6072 Max. :796.0 Max. :940.0 Max. :917.0
##
## growth meals ell yr_rnd
## Min. :-69.00 Min. : 0.00 Min. : 0.00 Min. :0.00
## 1st Qu.: 19.00 1st Qu.: 31.00 1st Qu.: 9.75 1st Qu.:0.00
## Median : 36.00 Median : 67.50 Median :25.00 Median :0.00
## Mean : 37.41 Mean : 60.31 Mean :31.45 Mean :0.23
## 3rd Qu.: 53.25 3rd Qu.: 90.00 3rd Qu.:50.25 3rd Qu.:0.00
## Max. :134.00 Max. :100.00 Max. :91.00 Max. :1.00
##
## mobility acs_k3 acs_46 not_hsg
## Min. : 2.00 Min. :14.00 Min. :20.00 Min. : 0.00
## 1st Qu.:13.00 1st Qu.:18.00 1st Qu.:27.00 1st Qu.: 4.00
## Median :17.00 Median :19.00 Median :29.00 Median : 14.00
## Mean :18.25 Mean :19.16 Mean :29.69 Mean : 21.25
## 3rd Qu.:22.00 3rd Qu.:20.00 3rd Qu.:31.00 3rd Qu.: 34.00
## Max. :47.00 Max. :25.00 Max. :50.00 Max. :100.00
## NA's :1 NA's :2 NA's :3
## hsg some_col col_grad grad_sch
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.000
## 1st Qu.: 17.00 1st Qu.:12.00 1st Qu.: 7.0 1st Qu.: 1.000
## Median : 26.00 Median :19.00 Median : 16.0 Median : 4.000
## Mean : 26.02 Mean :19.71 Mean : 19.7 Mean : 8.637
## 3rd Qu.: 34.00 3rd Qu.:28.00 3rd Qu.: 30.0 3rd Qu.:10.000
## Max. :100.00 Max. :67.00 Max. :100.0 Max. :67.000
##
## avg_ed full emer enroll
## Min. :1.000 Min. : 37.00 Min. : 0.00 Min. : 130.0
## 1st Qu.:2.070 1st Qu.: 76.00 1st Qu.: 3.00 1st Qu.: 320.0
## Median :2.600 Median : 88.00 Median :10.00 Median : 435.0
## Mean :2.668 Mean : 84.55 Mean :12.66 Mean : 483.5
## 3rd Qu.:3.220 3rd Qu.: 97.00 3rd Qu.:19.00 3rd Qu.: 608.0
## Max. :4.620 Max. :100.00 Max. :59.00 Max. :1570.0
## NA's :19
## mealcat collcat abv_hsg lgenroll
## Min. :1.000 Min. :1.00 Min. : 0.00 Min. :2.114
## 1st Qu.:1.000 1st Qu.:1.00 1st Qu.: 66.00 1st Qu.:2.505
## Median :2.000 Median :2.00 Median : 86.00 Median :2.638
## Mean :2.015 Mean :2.02 Mean : 78.75 Mean :2.640
## 3rd Qu.:3.000 3rd Qu.:3.00 3rd Qu.: 96.00 3rd Qu.:2.784
## Max. :3.000 Max. :3.00 Max. :100.00 Max. :3.196
##
d) Variables description
a) What is a linear regression model?
Regression analysis is a statistical modeling tool that is used to explain a response (criterion or dependent) variable as a function of one or more predictor (independent) variables. For example:
Employee efficiency may be related to years of training, educational background, and age of the employee.
The amount of time until a headache is gone when taking a painkiller may be related to the dosage, age, and gender.
The number of votes for a presidential candidate may be related to gender, income, and state.
In our data example, students’ performance may be related to the poverty level of students in the school or percentage of teachers with full credential in the school or number of students in the class.
Simple regression is the study of a single response as a function of a single predictor:
\(X\): The predictor. \(x_1, x_2, \dots, x_n\) are \(n\) observations from \(X\).
\(Y\): The response. \(y_1, y_2, \dots, y_n\) are \(n\) observations from \(Y\).
In a simple linear regression model we assume this relationship is a linear function.
b) Linear function
A linear function is determined by the intercept and the slope of the line.
The figure below shows a plot of the linear function
\[y = f(x) = 1 + 0.5 \times x\]
The intercept is equal to 1 and the slope of the line is equal to 0.5.
The intercept represents the value of y when x = 0.
The slope represents the increase in y for a one-unit increase in x.
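We can reproduce such a plot in R with the base function curve() (a minimal sketch):
#plot of the linear function y = 1 + 0.5 * x
curve(1 + 0.5 * x, from = -2, to = 6, xlab = "x", ylab = "y")
abline(h = 0, v = 0, lty = 3) #dotted axes; the line crosses x = 0 at y = 1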
c) Linear regression as a statistical model
In a real-life situation we usually cannot explain \(Y\) as an exact function of \(X\). There is always some error or noise in the dependent variable that cannot be explained by the relationship between the independent and the dependent variable. We call this the error or residual term of the model. We represent the formula for a simple linear model as:
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]
where \(\beta_0\) is the intercept, \(\beta_1\) is the slope of the model, and \(\epsilon_i\) is the residual term.
The quantities \(\beta_0\) and \(\beta_1\) are called model parameters and they are estimated from the data.
We assume the linear function can predict the mean value of the outcome for a given value of predictor variable. The mean value in mathematical statistics is noted as \(\text{E}(Y | X = x)\) and reads the expected value of \(Y\) given \(X = x\).
In a simple linear regression, the mean function of outcome is given by
\[\text{mean of } Y \text{ given a value of }X \text{ equal to } x = E(Y |X = x) = \beta_0 +\beta_1x\]
The goal of the linear regression model is to estimate the parameters \(\beta_0\) and \(\beta_1\) so that the line fits the data as well as possible.
The method used to achieve this best-fitting linear function is called the Ordinary Least Squares (OLS) method.
In OLS we estimate the parameters of the model (the intercept and the slope) by minimizing the sum of squared residuals.
The estimated model parameters are denoted \(\hat{\beta}_0\) and \(\hat{\beta}_1\).
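In other words, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are the values of the intercept and the slope that minimize the residual sum of squares:
\[(\hat{\beta}_0, \hat{\beta}_1) = \underset{\beta_0,\ \beta_1}{\text{arg min}} \sum_{i = 1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2\]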
The estimated model will be
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\]
*Note: The predictor variable may be continuous, meaning that it may assume all values within a range, for example, age or height. Or it may be dichotomous, meaning that it may assume only one of two values, for example, 0 or 1, or categorical with more than two levels. There is only one response or dependent variable, and it is continuous.
The use of categorical variables will be covered in section 3.
d) The first linear model in R
Now that we have reviewed the linear model, let’s use R to run a simple regression model.
In our data example we are interested in studying the relationship between students’ academic performance and some characteristics of their school life. For example, we can use the variable api00 as the outcome and the variable enroll, the number of students in the school, as the predictor.
We expect that better academic performance is associated with a lower number of students in the school.
In R we use the function lm() to run a linear regression model. Let’s look at the R help documentation for lm():
help(lm) #shows R Documentation for function lm()
Now we run our model in R, with api00 as the dependent variable and enroll as the independent variable:
#simple regression model of DV api00 and IV enroll
m1 <- lm(api00 ~ enroll, data = d)
print(m1) #Prints coefficients of the model
##
## Call:
## lm(formula = api00 ~ enroll, data = d)
##
## Coefficients:
## (Intercept) enroll
## 744.2514 -0.1999
The first argument of the function lm() is the formula of the model, which is itself a function. The response variable goes on the left side of the "~" and the predictor variable on the right side. api00 ~ enroll is read as: api00 is modeled by the linear predictor enroll. The argument data = d tells lm() to look up the variable names in the data frame d; remember that the objects api00 and enroll do not exist in our workspace. The result of lm() is assigned to m1 as an object of class lm, and print() prints the estimated coefficients of the model.
The estimated regression line is:
\[\widehat{\text{api00}} = 744.2514 - 0.1999 \times \text{enroll}\]
The coefficient for enroll is -0.1999, or approximately -0.2, meaning that for a one-unit increase in enroll we would expect a 0.2-unit decrease in api00. In other words, a school with 1000 students would be expected (on average) to have an api score 20 units higher than a school with 1100 students.
The constant is 744.2514, and this is the predicted value when enroll equals zero. In most cases, the constant is not very interesting.
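As a quick check of this interpretation, we can use the predict() function on the fitted model to compare the expected scores of two hypothetical schools (a minimal sketch, assuming m1 from above):
#predicted api00 for hypothetical schools with 1000 and 1100 students
predict(m1, newdata = data.frame(enroll = c(1000, 1100)))
#about 544.4 and 524.4; the difference is -0.1999 * 100, about 20 points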
e) Summary and components of the lm object
To observe the result of the lm object in more detail, we use the summary() function:
summary(m1) #Prints summary of the model
##
## Call:
## lm(formula = api00 ~ enroll, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -285.50 -112.55 -6.70 95.06 389.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 744.25141 15.93308 46.711 < 2e-16 ***
## enroll -0.19987 0.02985 -6.695 7.34e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 135 on 398 degrees of freedom
## Multiple R-squared: 0.1012, Adjusted R-squared: 0.09898
## F-statistic: 44.83 on 1 and 398 DF, p-value: 7.339e-11
After the model call, the summary shows the five-number summary of the residuals,
\[ e_i = y_i - \hat{y}_i \]
In the coefficients part we have a table of the estimated coefficients, their standard errors, t-values, and the p-values for the hypothesis test that each coefficient equals zero.
If the p-value is smaller than 0.001 it is marked with three stars, indicating strong evidence that the coefficient is not zero. For a p-value between 0.001 and 0.01 we get two stars, between 0.01 and 0.05 one star (“*”), and between 0.05 and 0.1 a period (“.”).
Here both the intercept and the slope are significant.
In the last portion of the output we see the residual standard error, which is 135 with 398 degrees of freedom.
Multiple R-squared is the R-squared of the model, equal to 0.1012; the adjusted R-squared, 0.09898, adjusts for the number of predictors.
In a simple linear regression model, the R-squared is equal to the square of the correlation between the response and the predictor variable. We can run the function cor() to see that this is true:
r <- cor(d$api00, d$enroll) #correlation coefficient of api00 and enroll
r ^ 2 #this is equal to r-squared in simple regression
## [1] 0.1012335
The last line gives the overall significance of the model against the null model, which is the model with only an intercept. The F-statistic is 44.83 with 1 and 398 degrees of freedom, with a p-value of 7.339e-11. In a simple regression model this p-value is exactly equal to the p-value of the slope. Why? (Hint: with a single predictor, the F statistic is the square of the slope’s t statistic.)
As we said, the result of the lm() function is an object of class lm. We can use the function ls() to list the components included in the object m1:
ls(m1) #list of components in object class lm
## [1] "assign" "call" "coefficients" "df.residual"
## [5] "effects" "fitted.values" "model" "qr"
## [9] "rank" "residuals" "terms" "xlevels"
Each component can be extracted using the $ sign after the lm object:
m1$coefficients #returns coefficients of the model
## (Intercept) enroll
## 744.2514142 -0.1998674
m1$fitted.values[1:10] #a vector of fitted values here we show the first 10
## 1 2 3 4 5 6 7 8
## 694.8842 651.7128 665.3038 660.7068 640.3203 675.6969 683.6916 441.8520
## 9 10
## 612.3389 671.8994
We can store residuals in a new object.
residuals <- m1$residuals #a vector of residuals
Some of the components can be extracted using a function.
coefficients(m1)
## (Intercept) enroll
## 744.2514142 -0.1998674
There are several R functions that can be applied to an lm object.
To get confidence intervals for the coefficients of the model, we use confint():
confint(m1) # returns a matrix of Confidence Interval for coefficients
## 2.5 % 97.5 %
## (Intercept) 712.9279024 775.5749259
## enroll -0.2585532 -0.1411817
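For illustration, the interval for the slope can also be computed by hand from the coefficient table, using the 97.5% quantile of the t distribution with 398 degrees of freedom (a sketch, assuming m1 from above):
est <- coef(summary(m1))["enroll", "Estimate"] #slope estimate
se <- coef(summary(m1))["enroll", "Std. Error"] #its standard error
est + c(-1, 1) * qt(0.975, df = 398) * se #reproduces the confint() row for enroll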
In addition to getting the regression table and statistics, it can be useful to see a scatterplot of the predictor and outcome variables with the regression line plotted.
plot(api00 ~ enroll, data = d) #scatter plot of api00 vs. enroll
abline(m1, col = "blue") #add regression line to the scatter plot
As you see, some of the points appear to be outliers. If you add text with labels = d$snum to the scatterplot, you can see the school number for each point. This allows us to see, for example, that one of the outliers is school 2910.
#adding labels(school number) to the scatter plot
plot(api00 ~ enroll, data = d)
text(d$enroll, d$api00+20, labels = d$snum, cex = .7)
abline(m1, col = "blue")
f) Breaking down the sum of squares and ANOVA
If we use only the intercept to model the response variable, the regression line will be a horizontal line at the mean of the response variable. In our example this is the mean of api00, the line \(y = 647.6225\):
mean(d$api00)
## [1] 647.6225
The residual for this line is \(y_i - \bar{y}\). We can break this error down into two parts using the predicted value from the regression of api00 on the predictor variable enroll.
This can be written as:
\[ y_i - \bar{y} = y_i - \hat{y_i} + \hat{y_i} - \bar{y} \]
where \(\bar{y}\) is the mean of the response, \(\hat{y_i}\) is the fitted value from the regression model including the predictor variable, and \(y_i\) is the observed response.
The graph below shows these error parts.
It can be shown that the following equation holds:
\[ \sum_{i = 1}^{n} (y_i - \bar{y})^2 = \sum_{i = 1}^{n}(y_i - \hat{y_i})^2 + \sum_{i = 1}^{n}(\hat{y_i} - \bar{y})^2 \]
The left-hand side of this equation is called the Total Sum of Squares (SST). The first term on the right-hand side is called the Residual Sum of Squares (RSS) and the second term is called the Regression Sum of Squares (SSreg). We have:
\[SST = RSS + SSreg\]
The ratio of SSreg to SST is called R-squared. Expressed as a percentage, R-squared is the percentage of the total variation that can be explained by the model.
\[R^2 = \frac{SSreg}{SST} = 1 - \frac{RSS}{SST}\]
We can use the function anova() to observe the sums of squares of the model:
anova(m1) #anova table
## Analysis of Variance Table
##
## Response: api00
## Df Sum Sq Mean Sq F value Pr(>F)
## enroll 1 817326 817326 44.829 7.339e-11 ***
## Residuals 398 7256346 18232
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the anova table we have the sum of squares of the regression model, in the row named enroll, and the sum of squares of the residuals, each with its degrees of freedom. It also shows the F statistic that we saw before in the summary of the model. Mean squares are the sums of squares divided by their degrees of freedom.
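As a check on the earlier formula, we can recompute the R-squared of model m1 from the sums of squares in its anova table (a small sketch):
ss <- anova(m1)$"Sum Sq" #SSreg (enroll row) and RSS (Residuals row)
ss[1] / sum(ss) #SSreg / SST, about 0.1012 as reported by summary(m1)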
a) Adding more predictors to a simple regression model
Now, let’s look at an example of multiple regression, in which we have one outcome (dependent) variable and multiple predictors. The percentage of variability explained by the variable enroll was only 10.12%. In order to improve the percentage of variability accounted for by the model, we can add more predictors. We add the percentage of students who receive free meals, as an indicator of socioeconomic status, and the percentage of teachers with full credentials to our model. In R we do this by simply adding + variable name to the formula in our lm() function.
#multiple regression model of DV api00 and IVs enroll, meals, and full
m2 <- lm(api00 ~ enroll + meals + full, data = d)
summary(m2) #summary of model m2
##
## Call:
## lm(formula = api00 ~ enroll + meals + full, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -181.721 -40.802 1.129 39.983 158.774
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 801.82983 26.42660 30.342 < 2e-16 ***
## enroll -0.05146 0.01384 -3.719 0.000229 ***
## meals -3.65973 0.10880 -33.639 < 2e-16 ***
## full 1.08109 0.23945 4.515 8.37e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.73 on 396 degrees of freedom
## Multiple R-squared: 0.8308, Adjusted R-squared: 0.8295
## F-statistic: 648.2 on 3 and 396 DF, p-value: < 2.2e-16
anova(m2) #anova table of model m2
## Analysis of Variance Table
##
## Response: api00
## Df Sum Sq Mean Sq F value Pr(>F)
## enroll 1 817326 817326 236.947 < 2.2e-16 ***
## meals 1 5820066 5820066 1687.263 < 2.2e-16 ***
## full 1 70313 70313 20.384 8.369e-06 ***
## Residuals 396 1365967 3449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at the results of the multiple regression model, we see that based on the F test the overall model is significant. In addition, all of the coefficients in the model are significantly different from zero.
The R-squared is 0.8308, meaning that approximately 83% of the variability of api00 is accounted for by the variables in the model. The adjusted R-squared shows that, after taking into account the number of predictors in the model, the R-squared is still about 0.83.
The coefficient for each of the variables indicates the amount of change one could expect in api00 given a one-unit change in the value of that variable, given that all other variables in the model are held constant. For example, consider the variable meals: we would expect a decrease of about 3.66 in the api00 score for every one-unit increase in percent free meals, assuming that all other variables in the model are held constant.
We see quite a difference in the coefficient of variable enroll compared to the simple linear regression. Before the coefficient for variable enroll was -.1999 and now it is -0.05146.
The anova table shows the sum of squares explained by adding each variable, one at a time, to the model. Equivalently, it is the reduction in the residual sum of squares achieved by each additional variable.
For example, the variable enroll reduces the total error by 817326. By adding the variable meals we remove an additional 5820066 from the residuals, and by adding the variable full we reduce the error by 70313. Finally, we have 1365967 left as unexplained error. The total sum of squares is all of those sums of squares added together. To get the total sum of squares of the variable api00 we can also multiply its variance by \((n-1)\):
sum(anova(m2)$Sum) #sum of RSS and SSreg
## [1] 8073672
(400 - 1) * var(d$api00) #total sum of squares
## [1] 8073672
b) Standardized regression coefficients
Some researchers are interested in comparing the relative strength of the various predictors within the model. Since each variable has a different unit, we cannot compare the coefficients to one another directly. To address this problem we use standardized regression coefficients, which can be obtained by transforming the outcome and predictor variables to their standard scores, also called z-scores, before running the regression.
In R we use the function scale() to do this for each variable.
#Standardized regression model
m2.sd <- lm(scale(api00) ~ scale(enroll) + scale(meals) + scale(full), data = d)
summary(m2.sd) #coefficients are standardized
##
## Call:
## lm(formula = scale(api00) ~ scale(enroll) + scale(meals) + scale(full),
## data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27749 -0.28683 0.00793 0.28108 1.11617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.454e-16 2.064e-02 0.000 1.000000
## scale(enroll) -8.191e-02 2.203e-02 -3.719 0.000229 ***
## scale(meals) -8.210e-01 2.441e-02 -33.639 < 2e-16 ***
## scale(full) 1.136e-01 2.517e-02 4.515 8.37e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4129 on 396 degrees of freedom
## Multiple R-squared: 0.8308, Adjusted R-squared: 0.8295
## F-statistic: 648.2 on 3 and 396 DF, p-value: < 2.2e-16
In the summary of the standardized regression we see that the intercept is zero and that the t-statistics for all the other coefficients are exactly the same as in the original model.
Because the coefficients are all in the same standardized units you can compare these coefficients to assess the relative strength of each of the predictors. In this example, meals has the largest Beta coefficient, -0.821.
Thus, a one standard deviation increase in meals leads to a 0.821 standard deviation decrease in predicted api00, with the other variables held constant.
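Equivalently, a standardized coefficient can be recovered from the unstandardized model by multiplying the raw coefficient by the ratio of the predictor's standard deviation to the outcome's standard deviation (a quick sketch, assuming m2 from above):
#standardized coefficient of meals from the unstandardized model m2
coef(m2)["meals"] * sd(d$meals) / sd(d$api00) #about -0.821, matching the scale() fit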
In this part we have discussed the basics of how to perform simple and multiple regression in R, the basics of interpreting the output, and some related commands. There are more results that can be extracted from the regression model, which we have skipped in this workshop.
In the next part we will go into a more thorough discussion of the assumptions of linear regression and how you can use R to assess these assumptions for your data. In particular, the next part will address the following issues.