IDRE Statistical Consulting Group
a) A quick overview of RStudio environment
Before we begin, let’s take a look at the RStudio environment. RStudio is an integrated development environment (IDE) that makes R easier to use. The RStudio interface has four panels: the Console, where commands are typed and executed; the R Script (source editor), where code is written and saved; the Environment/History panel, which lists the objects in the workspace and the command history; and the panel containing the Files, Plots, Packages, and Help tabs, described below.
To run a command or a set of commands from the R Script, select them first and press Ctrl + Enter.
The Files tab is for viewing and managing the directories and files on the computer's hard drive. From the Files tab we can navigate to a directory and set it as the working directory.
The Plots tab is for viewing and managing plots. By clicking Export we can save our plots as JPEG or PDF files.
The Packages tab shows a list of all the R packages installed on the computer. We can load a package by checking its box, and we can also install a new package from this tab.
Another way to manage R packages is to run the function install.packages("package name") in the R console or from an R script to install a new package, and the function library(package name) to load an installed package.
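For example (using ggplot2 purely as an illustrative package name):
#install a package once, then load it in each new R session
install.packages("ggplot2") #downloads and installs the package from CRAN
library(ggplot2) #loads the installed package into the current session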
The Help tab is the help window for viewing the documentation of R functions. We can search the help files from this tab, or we can use ?function name or help(function name) from the console; the help documentation will then appear in the Help tab.
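For example, to view the documentation for the read.csv() function that we will use below:
?read.csv #opens the help page for read.csv in the Help tab
help("read.csv") #equivalent to the above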
b) Loading the data in RStudio
Now let’s read our dataset into the R workspace. The dataset used in this portion of the seminar is located here: elemapi2v2
You can download the data, save it on your local computer, and load it from there, or you can read it directly from the web URL. The data is in .csv format and we use the read.csv() function to read it and load it into R. The resulting value will be a data.frame object that will appear in the Environment tab. By clicking on the data.frame object we can view it as a spreadsheet. The following code reads and loads the data into the R workspace, first from the URL and then from the local computer:
d <- read.csv(
"https://stats.idre.ucla.edu/wp-content/uploads/2019/02/elemapi2v2.csv")
#We can also read the data from the local computer
#we can set the working directory by setwd()
#d <- read.csv("c:/.../elemapi2v2.csv",
# header = TRUE, sep = ",")
class(d) #class of object d, returns data.frame
## [1] "data.frame"
names(d) #returns names of variables (columns)
## [1] "snum" "dnum" "api00" "api99" "growth" "meals"
## [7] "ell" "yr_rnd" "mobility" "acs_k3" "acs_46" "not_hsg"
## [13] "hsg" "some_col" "col_grad" "grad_sch" "avg_ed" "full"
## [19] "emer" "enroll" "mealcat" "collcat" "abv_hsg" "lgenroll"
dim(d) #returns number of rows and columns of data frame
## [1] 400 24
As we mentioned, the result of read.csv() is a data frame object. A data frame is a list of variables with the same number of rows and with unique row names, given the class “data.frame”. In the data frame d we have 24 variables and 400 observations.
c) Exploring the data
As researchers, we first need to make sure that the data we have cleaned holds plausible values. Descriptive analysis helps us understand the data better and investigate whether there are any problems in it. In this workshop we skip this part, but in a real study descriptive analysis is an important part of good data analysis.
Here we take a look at the structure of the data and summary statistics.
str(d) #returns structure of variables in the data frame d
## 'data.frame': 400 obs. of 24 variables:
## $ snum : int 906 889 887 876 888 4284 4271 2910 2899 2887 ...
## $ dnum : int 41 41 41 41 41 98 98 108 108 108 ...
## $ api00 : int 693 570 546 571 478 858 918 831 860 737 ...
## $ api99 : int 600 501 472 487 425 844 864 791 838 703 ...
## $ growth : int 93 69 74 84 53 14 54 40 22 34 ...
## $ meals : int 67 92 97 90 89 10 5 2 5 29 ...
## $ ell : int 9 21 29 27 30 3 2 3 6 15 ...
## $ yr_rnd : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mobility: int 11 33 36 27 44 10 16 44 10 17 ...
## $ acs_k3 : int 16 15 17 20 18 20 19 20 20 21 ...
## $ acs_46 : int 22 32 25 30 31 33 28 31 30 29 ...
## $ not_hsg : int 0 0 0 36 50 1 1 0 2 8 ...
## $ hsg : int 0 0 0 45 50 8 4 4 9 25 ...
## $ some_col: int 0 0 0 9 0 24 18 16 15 34 ...
## $ col_grad: int 0 0 0 9 0 36 34 50 42 27 ...
## $ grad_sch: int 0 0 0 0 0 31 43 30 33 7 ...
## $ avg_ed : num NA NA NA 1.91 1.5 ...
## $ full : int 76 79 68 87 87 100 100 96 100 96 ...
## $ emer : int 24 19 29 11 13 0 0 2 0 7 ...
## $ enroll : int 247 463 395 418 520 343 303 1513 660 362 ...
## $ mealcat : int 2 3 3 3 3 1 1 1 1 1 ...
## $ collcat : int 1 1 1 1 1 2 2 2 2 3 ...
## $ abv_hsg : int 100 100 100 64 50 99 99 100 98 92 ...
## $ lgenroll: num 2.39 2.67 2.6 2.62 2.72 ...
summary(d) #Summary statistics of variables in d
## snum dnum api00 api99
## Min. : 58 Min. : 41.0 Min. :369.0 Min. :333.0
## 1st Qu.:1720 1st Qu.:395.0 1st Qu.:523.8 1st Qu.:484.8
## Median :3008 Median :401.0 Median :643.0 Median :602.0
## Mean :2867 Mean :457.7 Mean :647.6 Mean :610.2
## 3rd Qu.:4198 3rd Qu.:630.0 3rd Qu.:762.2 3rd Qu.:731.2
## Max. :6072 Max. :796.0 Max. :940.0 Max. :917.0
##
## growth meals ell yr_rnd
## Min. :-69.00 Min. : 0.00 Min. : 0.00 Min. :0.00
## 1st Qu.: 19.00 1st Qu.: 31.00 1st Qu.: 9.75 1st Qu.:0.00
## Median : 36.00 Median : 67.50 Median :25.00 Median :0.00
## Mean : 37.41 Mean : 60.31 Mean :31.45 Mean :0.23
## 3rd Qu.: 53.25 3rd Qu.: 90.00 3rd Qu.:50.25 3rd Qu.:0.00
## Max. :134.00 Max. :100.00 Max. :91.00 Max. :1.00
##
## mobility acs_k3 acs_46 not_hsg
## Min. : 2.00 Min. :14.00 Min. :20.00 Min. : 0.00
## 1st Qu.:13.00 1st Qu.:18.00 1st Qu.:27.00 1st Qu.: 4.00
## Median :17.00 Median :19.00 Median :29.00 Median : 14.00
## Mean :18.25 Mean :19.16 Mean :29.69 Mean : 21.25
## 3rd Qu.:22.00 3rd Qu.:20.00 3rd Qu.:31.00 3rd Qu.: 34.00
## Max. :47.00 Max. :25.00 Max. :50.00 Max. :100.00
## NA's :1 NA's :2 NA's :3
## hsg some_col col_grad grad_sch
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.000
## 1st Qu.: 17.00 1st Qu.:12.00 1st Qu.: 7.0 1st Qu.: 1.000
## Median : 26.00 Median :19.00 Median : 16.0 Median : 4.000
## Mean : 26.02 Mean :19.71 Mean : 19.7 Mean : 8.637
## 3rd Qu.: 34.00 3rd Qu.:28.00 3rd Qu.: 30.0 3rd Qu.:10.000
## Max. :100.00 Max. :67.00 Max. :100.0 Max. :67.000
##
## avg_ed full emer enroll
## Min. :1.000 Min. : 37.00 Min. : 0.00 Min. : 130.0
## 1st Qu.:2.070 1st Qu.: 76.00 1st Qu.: 3.00 1st Qu.: 320.0
## Median :2.600 Median : 88.00 Median :10.00 Median : 435.0
## Mean :2.668 Mean : 84.55 Mean :12.66 Mean : 483.5
## 3rd Qu.:3.220 3rd Qu.: 97.00 3rd Qu.:19.00 3rd Qu.: 608.0
## Max. :4.620 Max. :100.00 Max. :59.00 Max. :1570.0
## NA's :19
## mealcat collcat abv_hsg lgenroll
## Min. :1.000 Min. :1.00 Min. : 0.00 Min. :2.114
## 1st Qu.:1.000 1st Qu.:1.00 1st Qu.: 66.00 1st Qu.:2.505
## Median :2.000 Median :2.00 Median : 86.00 Median :2.638
## Mean :2.015 Mean :2.02 Mean : 78.75 Mean :2.640
## 3rd Qu.:3.000 3rd Qu.:3.00 3rd Qu.: 96.00 3rd Qu.:2.784
## Max. :3.000 Max. :3.00 Max. :100.00 Max. :3.196
##
d) Variables description
a) What is a linear regression model?
Regression analysis is a statistical modeling tool that is used to explain a response (criterion or dependent) variable as a function of one or more predictor (independent) variables. For example:
Employee efficiency may be related to years of training, educational background, and age of the employee.
The amount of time until a headache is gone when taking a painkiller may be related to the dosage, age, and gender.
The number of votes for a presidential candidate may be related to gender, income, and state.
In our data example, students’ performance may be related to the poverty level of students in the school or percentage of teachers with full credential in the school or number of students in the class.
Simple regression is the study of a single response as a function of a single predictor:
\(X\): The predictor. \(x_1, x_2, \dots, x_n\) are \(n\) observations from \(X\).
\(Y\): The response. \(y_1, y_2, \dots, y_n\) are \(n\) observations from \(Y\).
In a simple linear regression model we assume this relationship is a linear function.
b) Linear function
A linear function is determined by the intercept and the slope of the line.
The figure below shows a plot of the linear function
\[y = f(x) = 1 + 0.5 \times x\]
The intercept is equal to 1 and the slope of the line is equal to 0.5.
The intercept represents the value of y when x = 0.
The slope represents the increase in y for a one-unit increase in x.
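We can reproduce such a plot in R with the base function curve() (a minimal sketch):
#plot of the linear function y = 1 + 0.5 * x
curve(1 + 0.5 * x, from = -2, to = 6, xlab = "x", ylab = "y")
abline(h = 0, v = 0, lty = 3) #dotted axes; the line crosses x = 0 at y = 1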
c) Linear regression as a statistical model
In a real-life situation we usually cannot explain \(Y\) as an exact function of \(X\). There is always some error or noise in the dependent variable that cannot be explained by the relationship between the independent and the dependent variable. We call this the error or residual term of the model. We represent the formula for a simple linear model as:
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]
where \(\beta_0\) is the intercept, \(\beta_1\) is the slope of the model, and \(\epsilon_i\) is the residual term.
The quantities \(\beta_0\) and \(\beta_1\) are called model parameters and they are estimated from the data.
We assume the linear function can predict the mean value of the outcome for a given value of predictor variable. The mean value in mathematical statistics is noted as \(\text{E}(Y | X = x)\) and reads the expected value of \(Y\) given \(X = x\).
In a simple linear regression, the mean function of outcome is given by
\[\text{mean of } Y \text{ given a value of }X \text{ equal to } x = E(Y |X = x) = \beta_0 +\beta_1x\]
The goal of the linear regression model is to estimate the parameters \(\beta_0\) and \(\beta_1\) so that the line fits the data as well as possible.
The method used to achieve this best-fitting linear function is called the Ordinary Least Squares (OLS) method.
In OLS we estimate the parameters of the model (the intercept and the slope) by minimizing the sum of squared residuals.
The estimated model parameters are denoted \(\hat{\beta}_0\) and \(\hat{\beta}_1\).
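In other words, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are the values of the intercept and the slope that minimize the residual sum of squares:
\[(\hat{\beta}_0, \hat{\beta}_1) = \underset{\beta_0,\ \beta_1}{\text{arg min}} \sum_{i = 1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2\]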
The estimated model will be
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\]
*Note: The predictor variable may be continuous, meaning that it may assume all values within a range, for example, age or height. Or it may be dichotomous, meaning that it may assume only one of two values, for example, 0 or 1, or categorical with more than two levels. There is only one response or dependent variable, and it is continuous.
The use of categorical variables will be covered in section 3.
d) The first linear model in R
Now that we have reviewed the linear model, let’s use R to run a simple regression model.
In our data example we are interested in studying the relationship between students’ academic performance and some characteristics of their school life. For example, we can use the variable api00 as the outcome and the variable enroll, the number of students in the school, as the predictor.
We expect that better academic performance is associated with a lower number of students in the school.
In R we use the function lm() to run a linear regression model. Let’s look at the R help documentation for lm():
help(lm) #shows R Documentation for function lm()
Now we run our model in R, with api00 as the dependent variable and enroll as the independent variable:
#simple regression model of DV api00 and IV enroll
m1 <- lm(api00 ~ enroll, data = d)
print(m1) #Prints coefficients of the model
##
## Call:
## lm(formula = api00 ~ enroll, data = d)
##
## Coefficients:
## (Intercept) enroll
## 744.2514 -0.1999
The first argument of the function lm() is the formula of the model, which is itself a function. The response variable goes on the left side of the "~" and the predictor variable on the right side. api00 ~ enroll is read as: api00 is modeled by the linear predictor enroll. The argument data = d tells lm() to look up the variable names in the data frame d; remember that the objects api00 and enroll do not exist in our workspace. The result of lm() is assigned to m1 as an object of class lm, and print() prints the estimated coefficients of the model.
The estimated regression line is:
\[\widehat{\text{api00}} = 744.2514 - 0.1999 \times \text{enroll}\]
The coefficient for enroll is -0.1999, or approximately -0.2, meaning that for a one-unit increase in enroll we would expect a 0.2-unit decrease in api00. In other words, a school with 1000 students would be expected (on average) to have an api score 20 units higher than a school with 1100 students.
The constant is 744.2514, and this is the predicted value when enroll equals zero. In most cases, the constant is not very interesting.
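As a quick check of this interpretation, we can use the predict() function on the fitted model to compare the expected scores of two hypothetical schools (a minimal sketch, assuming m1 from above):
#predicted api00 for hypothetical schools with 1000 and 1100 students
predict(m1, newdata = data.frame(enroll = c(1000, 1100)))
#about 544.4 and 524.4; the difference is -0.1999 * 100, about 20 points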
e) Summary and components of the lm object
To observe the result of the lm object in more detail, we use the summary() function:
summary(m1) #Prints summary of the model
##
## Call:
## lm(formula = api00 ~ enroll, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -285.50 -112.55 -6.70 95.06 389.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 744.25141 15.93308 46.711 < 2e-16 ***
## enroll -0.19987 0.02985 -6.695 7.34e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 135 on 398 degrees of freedom
## Multiple R-squared: 0.1012, Adjusted R-squared: 0.09898
## F-statistic: 44.83 on 1 and 398 DF, p-value: 7.339e-11
After the model call, the summary shows the five-number summary of the residuals,
\[ e_i = y_i - \hat{y}_i \]
In the coefficients part we have a table of the estimated coefficients, their standard errors, t-values, and the p-values for the hypothesis test that each coefficient equals zero.
If the p-value is smaller than 0.001 it is marked with three stars, indicating strong evidence that the coefficient is not zero. For a p-value between 0.001 and 0.01 we get two stars, between 0.01 and 0.05 one star (“*”), and between 0.05 and 0.1 a period (“.”).
Here both the intercept and the slope are significant.
In the last portion of the output we see the residual standard error, which is 135 with 398 degrees of freedom.
Multiple R-squared is the R-squared of the model, equal to 0.1012; the adjusted R-squared, 0.09898, adjusts for the number of predictors.
In a simple linear regression model, the R-squared is equal to the square of the correlation between the response and the predictor variable. We can run the function cor() to see that this is true:
r <- cor(d$api00, d$enroll) #correlation coefficient of api00 and enroll
r ^ 2 #this is equal to r-squared in simple regression
## [1] 0.1012335
The last line gives the overall significance of the model against the null model, which is the model with only an intercept. The F-statistic is 44.83 with 1 and 398 degrees of freedom, with a p-value of 7.339e-11. In a simple regression model this p-value is exactly equal to the p-value of the slope. Why? (Hint: with a single predictor, the F statistic is the square of the slope’s t statistic.)
As we said, the result of the lm() function is an object of class lm. We can use the function ls() to list the components included in the object m1:
ls(m1) #list of components in object class lm
## [1] "assign" "call" "coefficients" "df.residual"
## [5] "effects" "fitted.values" "model" "qr"
## [9] "rank" "residuals" "terms" "xlevels"
Each component can be extracted using the $ sign after the lm object:
m1$coefficients #returns coefficients of the model
## (Intercept) enroll
## 744.2514142 -0.1998674
m1$fitted.values[1:10] #a vector of fitted values here we show the first 10
## 1 2 3 4 5 6 7 8
## 694.8842 651.7128 665.3038 660.7068 640.3203 675.6969 683.6916 441.8520
## 9 10
## 612.3389 671.8994
We can store residuals in a new object.
residuals <- m1$residuals #a vector of residuals
Some of the components can be extracted using a function.
coefficients(m1)
## (Intercept) enroll
## 744.2514142 -0.1998674
There are several R functions that can be applied to an lm object.
To get confidence intervals for the coefficients of the model, we use confint():
confint(m1) # returns a matrix of Confidence Interval for coefficients
## 2.5 % 97.5 %
## (Intercept) 712.9279024 775.5749259
## enroll -0.2585532 -0.1411817
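For illustration, the interval for the slope can also be computed by hand from the coefficient table, using the 97.5% quantile of the t distribution with 398 degrees of freedom (a sketch, assuming m1 from above):
est <- coef(summary(m1))["enroll", "Estimate"] #slope estimate
se <- coef(summary(m1))["enroll", "Std. Error"] #its standard error
est + c(-1, 1) * qt(0.975, df = 398) * se #reproduces the confint() row for enroll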
In addition to getting the regression table and statistics, it can be useful to see a scatterplot of the predictor and outcome variables with the regression line plotted.
plot(api00 ~ enroll, data = d) #scatter plot of api00 vs. enroll
abline(m1, col = "blue") #add regression line to the scatter plot
As you see, some of the points appear to be outliers. If you add text with labels = d$snum to the scatterplot, you can see the school number for each point. This allows us to see, for example, that one of the outliers is school 2910.
#adding labels(school number) to the scatter plot
plot(api00 ~ enroll, data = d)
text(d$enroll, d$api00+20, labels = d$snum, cex = .7)
abline(m1, col = "blue")
f) Breaking down the sum of squares and ANOVA
If we use only the intercept to model the response variable, the regression line will be a horizontal line at the mean of the response variable. In our example this is the mean of api00, the line \(y = 647.6225\):
mean(d$api00)
## [1] 647.6225
The residual for this line is \(y_i - \bar{y}\). We can break this error down into two parts using the predicted value from the regression of api00 on the predictor variable enroll.
This can be written as:
\[ y_i - \bar{y} = y_i - \hat{y_i} + \hat{y_i} - \bar{y} \]
where \(\bar{y}\) is the mean of the response, \(\hat{y_i}\) is the fitted value from the regression model including the predictor variable, and \(y_i\) is the observed response.
The graph below shows these error parts.
It can be shown that the following equation holds:
\[ \sum_{i = 1}^{n} (y_i - \bar{y})^2 = \sum_{i = 1}^{n}(y_i - \hat{y_i})^2 + \sum_{i = 1}^{n}(\hat{y_i} - \bar{y})^2 \]
The left-hand side of this equation is called the Total Sum of Squares (SST). The first term on the right-hand side is called the Residual Sum of Squares (RSS) and the second term is called the Regression Sum of Squares (SSreg). We have:
\[SST = RSS + SSreg\]
The ratio of SSreg to SST is called R-squared. Expressed as a percentage, R-squared is the percentage of the total variation that can be explained by the model.
\[R^2 = \frac{SSreg}{SST} = 1 - \frac{RSS}{SST}\]
We can use the function anova() to observe the sums of squares of the model:
anova(m1) #anova table
## Analysis of Variance Table
##
## Response: api00
## Df Sum Sq Mean Sq F value Pr(>F)
## enroll 1 817326 817326 44.829 7.339e-11 ***
## Residuals 398 7256346 18232
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the anova table we have the sum of squares of the regression model, in the row named enroll, and the sum of squares of the residuals, each with its degrees of freedom. It also shows the F statistic that we saw before in the summary of the model. Mean squares are the sums of squares divided by their degrees of freedom.
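As a check on the earlier formula, we can recompute the R-squared of model m1 from the sums of squares in its anova table (a small sketch):
ss <- anova(m1)$"Sum Sq" #SSreg (enroll row) and RSS (Residuals row)
ss[1] / sum(ss) #SSreg / SST, about 0.1012 as reported by summary(m1)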
a) Adding more predictors to a simple regression model
Now, let’s look at an example of multiple regression, in which we have one outcome (dependent) variable and multiple predictors. The percentage of variability explained by the variable enroll was only 10.12%. In order to improve the percentage of variability accounted for by the model, we can add more predictors. We add the percentage of students who receive free meals, as an indicator of socioeconomic status, and the percentage of teachers with full credentials to our model. In R we do this by simply adding + variable name to the formula in our lm() function.
#multiple regression model of DV api00 and IVs enroll, meals, and full
m2 <- lm(api00 ~ enroll + meals + full, data = d)
summary(m2) #summary of model m2
##
## Call:
## lm(formula = api00 ~ enroll + meals + full, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -181.721 -40.802 1.129 39.983 158.774
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 801.82983 26.42660 30.342 < 2e-16 ***
## enroll -0.05146 0.01384 -3.719 0.000229 ***
## meals -3.65973 0.10880 -33.639 < 2e-16 ***
## full 1.08109 0.23945 4.515 8.37e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.73 on 396 degrees of freedom
## Multiple R-squared: 0.8308, Adjusted R-squared: 0.8295
## F-statistic: 648.2 on 3 and 396 DF, p-value: < 2.2e-16
anova(m2) #anova table of model m2
## Analysis of Variance Table
##
## Response: api00
## Df Sum Sq Mean Sq F value Pr(>F)
## enroll 1 817326 817326 236.947 < 2.2e-16 ***
## meals 1 5820066 5820066 1687.263 < 2.2e-16 ***
## full 1 70313 70313 20.384 8.369e-06 ***
## Residuals 396 1365967 3449
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at the results of the multiple regression model, we see that based on the F test the overall model is significant. In addition, all of the coefficients in the model are significantly different from zero.
The R-squared is 0.8308, meaning that approximately 83% of the variability of api00 is accounted for by the variables in the model. The adjusted R-squared shows that, after taking into account the number of predictors in the model, the R-squared is still about 0.83.
The coefficient for each of the variables indicates the amount of change one could expect in api00 given a one-unit change in the value of that variable, given that all other variables in the model are held constant. For example, consider the variable meals: we would expect a decrease of about 3.66 in the api00 score for every one-unit increase in percent free meals, assuming that all other variables in the model are held constant.
We see quite a difference in the coefficient of variable enroll compared to the simple linear regression. Before the coefficient for variable enroll was -.1999 and now it is -0.05146.
The anova table shows the sum of squares explained by adding each variable, one at a time, to the model. Equivalently, it is the reduction in the residual sum of squares achieved by each additional variable.
For example, the variable enroll reduces the total error by 817326. By adding the variable meals we remove an additional 5820066 from the residuals, and by adding the variable full we reduce the error by 70313. Finally, we have 1365967 left as unexplained error. The total sum of squares is all of those sums of squares added together. To get the total sum of squares of the variable api00 we can also multiply its variance by \((n-1)\):
sum(anova(m2)$Sum) #sum of RSS and SSreg
## [1] 8073672
(400 - 1) * var(d$api00) #total sum of squares
## [1] 8073672
b) Standardized regression coefficients
Some researchers are interested in comparing the relative strength of the various predictors within the model. Since each variable has a different unit, we cannot compare the coefficients to one another directly. To address this problem we use standardized regression coefficients, which can be obtained by transforming the outcome and predictor variables to their standard scores, also called z-scores, before running the regression.
In R we use the function scale() to do this for each variable.
#Standardized regression model
m2.sd <- lm(scale(api00) ~ scale(enroll) + scale(meals) + scale(full), data = d)
summary(m2.sd) #coefficients are standardized
##
## Call:
## lm(formula = scale(api00) ~ scale(enroll) + scale(meals) + scale(full),
## data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27749 -0.28683 0.00793 0.28108 1.11617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.454e-16 2.064e-02 0.000 1.000000
## scale(enroll) -8.191e-02 2.203e-02 -3.719 0.000229 ***
## scale(meals) -8.210e-01 2.441e-02 -33.639 < 2e-16 ***
## scale(full) 1.136e-01 2.517e-02 4.515 8.37e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4129 on 396 degrees of freedom
## Multiple R-squared: 0.8308, Adjusted R-squared: 0.8295
## F-statistic: 648.2 on 3 and 396 DF, p-value: < 2.2e-16
In the summary of the standardized regression we see that the intercept is zero and that the t-statistics for all the other coefficients are exactly the same as in the original model.
Because the coefficients are all in the same standardized units you can compare these coefficients to assess the relative strength of each of the predictors. In this example, meals has the largest Beta coefficient, -0.821.
Thus, a one standard deviation increase in meals leads to a 0.821 standard deviation decrease in predicted api00, with the other variables held constant.
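Equivalently, a standardized coefficient can be recovered from the unstandardized model by multiplying the raw coefficient by the ratio of the predictor's standard deviation to the outcome's standard deviation (a quick sketch, assuming m2 from above):
#standardized coefficient of meals from the unstandardized model m2
coef(m2)["meals"] * sd(d$meals) / sd(d$api00) #about -0.821, matching the scale() fit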
In this part we have discussed the basics of how to perform simple and multiple regression in R, the basics of interpreting the output, and some related commands. There are more results that can be extracted from the regression model, which we have skipped in this workshop.
In the next part we will go into a more thorough discussion of the assumptions of linear regression and how you can use R to assess these assumptions for your data. In particular, the next part will address the following issues.