Introduction to Regression in R (Part 1, Simple and Multiple Regression)

IDRE Statistical Consulting Group

1.1 Reading the data into RStudio/R

a) A quick overview of RStudio environment

Before we begin, let’s take a look at the RStudio environment. RStudio is an integrated development environment (IDE) that makes R easier to use. The RStudio interface has four panels:

  • Source code editor or R script: to type, edit, and manage R commands. The first time you open RStudio the source code panel is hidden; it appears once we open a new or existing R script. R scripts are text files with the .R extension. You can save an R script file to keep the R code you have written.
  • To run a command or a set of commands from the R script, select them and press Ctrl + Enter.

  • Console: the console is linked to the R script; every time we run a command from the R script it is evaluated in the R console, and outputs and results (if any) appear in the console. We can also type a command in the console and press Enter to run it directly.
  • There are two other panels, each with several tabs. One panel usually includes Environment and History. The Environment tab shows the R data objects we have defined in the current R session, for example vectors, matrices, and data frames. The History tab simply shows all commands that have been run and evaluated in the R console.
  • The other panel includes Files, Plots, Packages, and Help. We can always customize the R panels and include the tabs that we need.
  • The Files tab is for viewing and managing directories and files on the computer’s hard drive. From the Files tab we can navigate to a directory and set it as the working directory.

    The Plots tab is for viewing and managing plots. By clicking Export we can save our plots as JPEG or PDF files.

    The Packages tab shows a list of all the R packages installed on the computer. We can load a package by checking its box, and we can also install a new package from this tab.

    Another way to manage R packages, however, is from the R console or an R script: the function install.packages("package name") installs a new package, and the function library(package name) loads an installed package.
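
    For example, to install and then load a package from the console (ggplot2 is used here purely as an illustration; any package name works the same way):

    install.packages("ggplot2")  #download and install the package once
    library(ggplot2)             #load the installed package in each session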

    The Help tab is the help window for viewing the documentation of R functions. We can use the search box in this tab, or run ?function_name or help(function_name) from the console; the help documentation then appears in the Help tab.

    RStudio panel

    b) Loading the data in RStudio

    Now let’s read our dataset into the R workspace. The dataset used in this portion of the seminar is located here: elemapi2v2

    You can download the data, save it to your local computer, and load it from there, or you can read it directly from the web URL. The data is in .csv format, and we use the read.csv() function to read and load it into R. The resulting value will be a data.frame object that appears in the Environment tab. By clicking on the data.frame object we can view it as a spreadsheet. The following code reads the data into the R workspace, first from the URL and then (commented out) from the local computer:

    d <- read.csv(
    "https://stats.idre.ucla.edu/wp-content/uploads/2019/02/elemapi2v2.csv")
    #We can also read the data from the local computer
    #we can set the working directory by setwd()
    #d <- read.csv("c:/.../elemapi2v2.csv",
    #                header = TRUE, sep = ",")
    class(d)  #class of object d, returns data.frame
    ## [1] "data.frame"
    names(d)  #returns names of variables (columns)
    ##  [1] "snum"     "dnum"     "api00"    "api99"    "growth"   "meals"   
    ##  [7] "ell"      "yr_rnd"   "mobility" "acs_k3"   "acs_46"   "not_hsg" 
    ## [13] "hsg"      "some_col" "col_grad" "grad_sch" "avg_ed"   "full"    
    ## [19] "emer"     "enroll"   "mealcat"  "collcat"  "abv_hsg"  "lgenroll"
    dim(d)    #returns number of rows and columns of data frame
    ## [1] 400  24

    As we mentioned, the result of read.csv() is a data frame object. A data frame is a list of variables with the same number of rows and unique row names, given the class “data.frame”. In the data frame d we have 24 variables and 400 observations.

    c) Exploring the data

    As researchers, we first need to make sure that the cleaned data hold plausible values. Descriptive analysis helps us understand the data better and investigate whether there are any problems in the data. We skip this step in this workshop, but in a real study descriptive analysis is an important part of a good data analysis.

    Here we take a look at the structure of the data and at summary statistics.

    str(d)      #returns structure of variables in the data frame d
    ## 'data.frame':    400 obs. of  24 variables:
    ##  $ snum    : int  906 889 887 876 888 4284 4271 2910 2899 2887 ...
    ##  $ dnum    : int  41 41 41 41 41 98 98 108 108 108 ...
    ##  $ api00   : int  693 570 546 571 478 858 918 831 860 737 ...
    ##  $ api99   : int  600 501 472 487 425 844 864 791 838 703 ...
    ##  $ growth  : int  93 69 74 84 53 14 54 40 22 34 ...
    ##  $ meals   : int  67 92 97 90 89 10 5 2 5 29 ...
    ##  $ ell     : int  9 21 29 27 30 3 2 3 6 15 ...
    ##  $ yr_rnd  : int  0 0 0 0 0 0 0 0 0 0 ...
    ##  $ mobility: int  11 33 36 27 44 10 16 44 10 17 ...
    ##  $ acs_k3  : int  16 15 17 20 18 20 19 20 20 21 ...
    ##  $ acs_46  : int  22 32 25 30 31 33 28 31 30 29 ...
    ##  $ not_hsg : int  0 0 0 36 50 1 1 0 2 8 ...
    ##  $ hsg     : int  0 0 0 45 50 8 4 4 9 25 ...
    ##  $ some_col: int  0 0 0 9 0 24 18 16 15 34 ...
    ##  $ col_grad: int  0 0 0 9 0 36 34 50 42 27 ...
    ##  $ grad_sch: int  0 0 0 0 0 31 43 30 33 7 ...
    ##  $ avg_ed  : num  NA NA NA 1.91 1.5 ...
    ##  $ full    : int  76 79 68 87 87 100 100 96 100 96 ...
    ##  $ emer    : int  24 19 29 11 13 0 0 2 0 7 ...
    ##  $ enroll  : int  247 463 395 418 520 343 303 1513 660 362 ...
    ##  $ mealcat : int  2 3 3 3 3 1 1 1 1 1 ...
    ##  $ collcat : int  1 1 1 1 1 2 2 2 2 3 ...
    ##  $ abv_hsg : int  100 100 100 64 50 99 99 100 98 92 ...
    ##  $ lgenroll: num  2.39 2.67 2.6 2.62 2.72 ...
    summary(d)  #Summary statistics of variables in d
    ##       snum           dnum           api00           api99      
    ##  Min.   :  58   Min.   : 41.0   Min.   :369.0   Min.   :333.0  
    ##  1st Qu.:1720   1st Qu.:395.0   1st Qu.:523.8   1st Qu.:484.8  
    ##  Median :3008   Median :401.0   Median :643.0   Median :602.0  
    ##  Mean   :2867   Mean   :457.7   Mean   :647.6   Mean   :610.2  
    ##  3rd Qu.:4198   3rd Qu.:630.0   3rd Qu.:762.2   3rd Qu.:731.2  
    ##  Max.   :6072   Max.   :796.0   Max.   :940.0   Max.   :917.0  
    ##                                                                
    ##      growth           meals             ell            yr_rnd    
    ##  Min.   :-69.00   Min.   :  0.00   Min.   : 0.00   Min.   :0.00  
    ##  1st Qu.: 19.00   1st Qu.: 31.00   1st Qu.: 9.75   1st Qu.:0.00  
    ##  Median : 36.00   Median : 67.50   Median :25.00   Median :0.00  
    ##  Mean   : 37.41   Mean   : 60.31   Mean   :31.45   Mean   :0.23  
    ##  3rd Qu.: 53.25   3rd Qu.: 90.00   3rd Qu.:50.25   3rd Qu.:0.00  
    ##  Max.   :134.00   Max.   :100.00   Max.   :91.00   Max.   :1.00  
    ##                                                                  
    ##     mobility         acs_k3          acs_46         not_hsg      
    ##  Min.   : 2.00   Min.   :14.00   Min.   :20.00   Min.   :  0.00  
    ##  1st Qu.:13.00   1st Qu.:18.00   1st Qu.:27.00   1st Qu.:  4.00  
    ##  Median :17.00   Median :19.00   Median :29.00   Median : 14.00  
    ##  Mean   :18.25   Mean   :19.16   Mean   :29.69   Mean   : 21.25  
    ##  3rd Qu.:22.00   3rd Qu.:20.00   3rd Qu.:31.00   3rd Qu.: 34.00  
    ##  Max.   :47.00   Max.   :25.00   Max.   :50.00   Max.   :100.00  
    ##  NA's   :1       NA's   :2       NA's   :3                       
    ##       hsg            some_col        col_grad        grad_sch     
    ##  Min.   :  0.00   Min.   : 0.00   Min.   :  0.0   Min.   : 0.000  
    ##  1st Qu.: 17.00   1st Qu.:12.00   1st Qu.:  7.0   1st Qu.: 1.000  
    ##  Median : 26.00   Median :19.00   Median : 16.0   Median : 4.000  
    ##  Mean   : 26.02   Mean   :19.71   Mean   : 19.7   Mean   : 8.637  
    ##  3rd Qu.: 34.00   3rd Qu.:28.00   3rd Qu.: 30.0   3rd Qu.:10.000  
    ##  Max.   :100.00   Max.   :67.00   Max.   :100.0   Max.   :67.000  
    ##                                                                   
    ##      avg_ed           full             emer           enroll      
    ##  Min.   :1.000   Min.   : 37.00   Min.   : 0.00   Min.   : 130.0  
    ##  1st Qu.:2.070   1st Qu.: 76.00   1st Qu.: 3.00   1st Qu.: 320.0  
    ##  Median :2.600   Median : 88.00   Median :10.00   Median : 435.0  
    ##  Mean   :2.668   Mean   : 84.55   Mean   :12.66   Mean   : 483.5  
    ##  3rd Qu.:3.220   3rd Qu.: 97.00   3rd Qu.:19.00   3rd Qu.: 608.0  
    ##  Max.   :4.620   Max.   :100.00   Max.   :59.00   Max.   :1570.0  
    ##  NA's   :19                                                       
    ##     mealcat         collcat        abv_hsg          lgenroll    
    ##  Min.   :1.000   Min.   :1.00   Min.   :  0.00   Min.   :2.114  
    ##  1st Qu.:1.000   1st Qu.:1.00   1st Qu.: 66.00   1st Qu.:2.505  
    ##  Median :2.000   Median :2.00   Median : 86.00   Median :2.638  
    ##  Mean   :2.015   Mean   :2.02   Mean   : 78.75   Mean   :2.640  
    ##  3rd Qu.:3.000   3rd Qu.:3.00   3rd Qu.: 96.00   3rd Qu.:2.784  
    ##  Max.   :3.000   Max.   :3.00   Max.   :100.00   Max.   :3.196  
    ## 

    d) Variable descriptions

    snum          school number
    dnum          district number 
    api00         api 2000, Academic performance index in 2000
    api99         api 1999, Academic performance index in 1999
    growth        growth from 1999 to 2000
    meals         percentage of students who get free meals
    ell           english language learners
    yr_rnd        year round school
    mobility      percentage 1st year in school
    acs_k3        avg class size k-3
    acs_46        avg class size 4-6
    not_hsg       parent not hsg
    hsg           parent hsg
    some_col      parent some college
    col_grad      parent college grad
    grad_sch      parent grad school
    avg_ed        avg parent ed
    full          percentage teachers with full credential
    emer          percentage of teachers with emergency credential
    enroll        number of students
    mealcat       Percentage free meals in 3 categories
    ...

    1.2 Simple Linear Regression

    a) What is a linear regression model?

    Regression Analysis is a statistical modeling tool that is used to explain a response (criterion or dependent) variable as a function of one or more predictor (independent) variables.

    Simple regression is the study of a single response as a function of a single predictor.

    \(X\): The predictor. \(x_1, x_2, \dots, x_n\) are \(n\) observations from \(X\).

    \(Y\): The response. \(y_1, y_2, \dots, y_n\) are \(n\) observations from \(Y\).

    In a simple linear regression model we assume this relationship is a linear function.

    b) Linear function

    A linear function is determined by the intercept and the slope of the line.

    The figure below shows the plot of the linear function

    \[y = f(x) = 1 + 0.5 \times x\]

    The intercept is equal to 1 and the slope of the line is equal to 0.5.
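
    As a small sketch, we could draw this line ourselves in R with the curve() function:

    #plot the linear function y = 1 + 0.5 * x
    curve(1 + 0.5 * x, from = -4, to = 4,
          xlab = "x", ylab = "y")
    abline(h = 0, v = 0, lty = 3)  #dotted reference axes through the origin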

    c) Linear regression as a statistical model

    In a real-life situation we usually cannot explain \(Y\) as an exact function of \(X\). There is always some error or noise in the dependent variable that cannot be explained by the relationship between the independent and dependent variables. We call this the error or residual term of the model. We write the formula for a simple linear model as:

    \[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]

    where \(\beta_0\) is the intercept, \(\beta_1\) is the slope of the model, and \(\epsilon_i\) is the residual term.

    The regression line represents the mean of \(Y\) given a value \(x\) of \(X\):

    \[E(Y \mid X = x) = \beta_0 + \beta_1 x\]

    The fitted line, with the intercept and slope estimated from the data, gives the predicted value of the response:

    \[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\]

    *Note: The predictor variable may be continuous, meaning that it may assume all values within a range, for example age or height. It may be dichotomous, meaning that it may assume only one of two values, for example 0 or 1. Or it may be a categorical variable with more than two levels. There is only one response or dependent variable, and it is continuous.

    The use of categorical variables will be covered in section 3.

    d) The first linear model in R

    Now that we have reviewed the linear model, let’s use R to run a simple regression model.

    help(lm)   #shows R Documentation for function lm()

    Now we run our model in R, with api00 as the dependent variable and enroll as the independent variable:

    #simple regression model of DV api00 on IV enroll
    m1 <- lm(api00 ~ enroll, data = d)
    print(m1)    #Prints coefficients of the model
    ## 
    ## Call:
    ## lm(formula = api00 ~ enroll, data = d)
    ## 
    ## Coefficients:
    ## (Intercept)       enroll  
    ##    744.2514      -0.1999

    \[\widehat{\text{api00}} = 744.2514 - 0.1999 \times \text{enroll}\]
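
    For example, plugging a hypothetical enrollment of 500 students into this fitted equation gives a predicted api00 of about 644.3; the predict() function returns the same value:

    744.2514 - 0.1999 * 500                          #prediction by hand
    predict(m1, newdata = data.frame(enroll = 500))  #prediction with predict()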

    e) Summary and components of the lm object

    To examine the results of the lm object in more detail, we use the summary() function:

    summary(m1)    #Prints summary of the model
    ## 
    ## Call:
    ## lm(formula = api00 ~ enroll, data = d)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -285.50 -112.55   -6.70   95.06  389.15 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) 744.25141   15.93308  46.711  < 2e-16 ***
    ## enroll       -0.19987    0.02985  -6.695 7.34e-11 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 135 on 398 degrees of freedom
    ## Multiple R-squared:  0.1012, Adjusted R-squared:  0.09898 
    ## F-statistic: 44.83 on 1 and 398 DF,  p-value: 7.339e-11

    The residuals reported in the summary are the differences between the observed and the fitted values:

    \[e_i = y_i - \hat{y}_i\]
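
    As a quick check (using model m1 from above), the residual quantiles printed in the summary can be reproduced directly:

    quantile(residuals(m1))  #matches Min, 1Q, Median, 3Q, and Max above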

    r <- cor(d$api00, d$enroll) #correlation coefficient of api00 and enroll
    r ^ 2                       #this is equal to r-squared in simple regression 
    ## [1] 0.1012335
    ls(m1)        #list of components in object class lm
    ##  [1] "assign"        "call"          "coefficients"  "df.residual"  
    ##  [5] "effects"       "fitted.values" "model"         "qr"           
    ##  [9] "rank"          "residuals"     "terms"         "xlevels"
    m1$coefficients        #returns coefficients of the model
    ## (Intercept)      enroll 
    ## 744.2514142  -0.1998674
    m1$fitted.values[1:10] #a vector of fitted values here we show the first 10
    ##        1        2        3        4        5        6        7        8 
    ## 694.8842 651.7128 665.3038 660.7068 640.3203 675.6969 683.6916 441.8520 
    ##        9       10 
    ## 612.3389 671.8994

    We can store residuals in a new object.

    residuals <- m1$resid  #a vector of residuals (m1$resid partially matches m1$residuals)

    Some of the components can be extracted using a function.

    coefficients(m1)
    ## (Intercept)      enroll 
    ## 744.2514142  -0.1998674
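
    Other components have analogous extractor functions; a short sketch using m1:

    head(fitted(m1))     #fitted values, same as m1$fitted.values
    head(residuals(m1))  #residuals
    df.residual(m1)      #residual degrees of freedom, 398 here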

    There are several R functions that can be applied to an lm object.

    To get confidence intervals for the coefficients of the model, we use confint():

    confint(m1) # returns a matrix of Confidence Interval for coefficients
    ##                   2.5 %      97.5 %
    ## (Intercept) 712.9279024 775.5749259
    ## enroll       -0.2585532  -0.1411817

    In addition to getting the regression table and statistics, it can be useful to see a scatterplot of the predictor and outcome variables with the regression line plotted.

    plot(api00 ~ enroll, data = d) #scatter plot of api00 vs. enroll
    abline(m1, col = "blue")       #add regression line to the scatter plot

    As you see, some of the points appear to be outliers. If you add text with labels = d$snum to the scatterplot, you can see the school number for each point. This allows us to see, for example, that one of the outliers is school 2910.

    #adding labels(school number) to the scatter plot
    plot(api00 ~ enroll, data = d) 
    text(d$enroll, d$api00+20, labels = d$snum, cex = .7)
    abline(m1, col = "blue")
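
    To take a closer look at that outlier, we can subset the data frame by school number (the values below come from the data shown earlier):

    d[d$snum == 2910, c("snum", "api00", "enroll")]  #inspect school 2910
    ##   snum api00 enroll
    ## 8 2910   831   1513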

    f) Breaking down the sum of squares and ANOVA

    If we use only the intercept to model the response variable, the regression line will be a horizontal line at the mean of the response variable. In our example this is the mean of api00, i.e. the line \(y = 647.6225\).

    mean(d$api00)
    ## [1] 647.6225
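
    As a sketch (using d and m1 from above), we can visualize this intercept-only baseline together with the fitted regression line:

    plot(api00 ~ enroll, data = d)                   #scatterplot of api00 vs. enroll
    abline(h = mean(d$api00), col = "red", lty = 2)  #intercept-only model: line at the mean
    abline(m1, col = "blue")                         #fitted regression line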

    The residuals from this line are \(y_i - \bar{y}\). We can break this error down into two parts using the predicted values from the regression of api00 on the predictor variable enroll.

    This can be written as:

    \[ y_i - \bar{y} = y_i - \hat{y_i} + \hat{y_i} - \bar{y} \]

    Where \(\bar{y}\) is the mean of response, \(\hat{y_i}\) is the fitted value from the regression model including predictor variable and \(y_i\) is the observed response.

    The graph below shows these two error components.

    It can be shown that the following equation holds:

    \[ \sum_{i = 1}^{n} (y_i - \bar{y})^2 = \sum_{i = 1}^{n}(y_i - \hat{y_i})^2 + \sum_{i = 1}^{n}(\hat{y_i} - \bar{y})^2 \]

    The left-hand side of this equation is called the total sum of squares (SST). The first term on the right-hand side is called the residual sum of squares (RSS), and the second term is called the regression sum of squares (SSreg). We have:

    \[SST = RSS + SSreg\]

    The ratio of SSreg to SST is called R-squared. Expressed as a percentage, R-squared is the percentage of the total variability in the response that can be explained by the model.

    \[R^2 = \frac{SSreg}{SST} = 1 - \frac{RSS}{SST}\]

    We can use the function anova() to view the sums of squares of the model.

    anova(m1)  #anova table
    ## Analysis of Variance Table
    ## 
    ## Response: api00
    ##            Df  Sum Sq Mean Sq F value    Pr(>F)    
    ## enroll      1  817326  817326  44.829 7.339e-11 ***
    ## Residuals 398 7256346   18232                      
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    In the ANOVA table we have the sum of squares of the regression model in the row named enroll and the sum of squares of the residuals, each with its degrees of freedom. The table also shows the F statistic that we saw before in the summary of the model. Mean squares are sums of squares divided by their degrees of freedom.
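
    As a check, we can recover the R-squared reported by summary(m1) from these sums of squares:

    ss <- anova(m1)$"Sum Sq"  #sums of squares: regression (enroll) and residuals
    ss[1] / sum(ss)           #SSreg / SST = 817326 / (817326 + 7256346), about 0.1012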

    1.3 Multiple Regression

    a) Adding more predictors to a simple regression model

    Now, let’s look at an example of multiple regression, in which we have one outcome (dependent) variable and multiple predictors. The percentage of variability explained by the variable enroll was only 10.12%. In order to improve the percentage of variability accounted for by the model, we can add more predictors. We add the percentage of students who get free meals, as an indicator of socioeconomic status, and the percentage of teachers with a full credential to our model. In R we can do this by simply adding + variable name to the formula of our lm() function.

    #multiple regression model of DV api00 and IVs enroll, meals, and full
    m2 <- lm(api00 ~ enroll + meals + full, data = d)
    summary(m2)   #summary of model m2
    ## 
    ## Call:
    ## lm(formula = api00 ~ enroll + meals + full, data = d)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -181.721  -40.802    1.129   39.983  158.774 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) 801.82983   26.42660  30.342  < 2e-16 ***
    ## enroll       -0.05146    0.01384  -3.719 0.000229 ***
    ## meals        -3.65973    0.10880 -33.639  < 2e-16 ***
    ## full          1.08109    0.23945   4.515 8.37e-06 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 58.73 on 396 degrees of freedom
    ## Multiple R-squared:  0.8308, Adjusted R-squared:  0.8295 
    ## F-statistic: 648.2 on 3 and 396 DF,  p-value: < 2.2e-16
    anova(m2)     #anova table of model m2 
    ## Analysis of Variance Table
    ## 
    ## Response: api00
    ##            Df  Sum Sq Mean Sq  F value    Pr(>F)    
    ## enroll      1  817326  817326  236.947 < 2.2e-16 ***
    ## meals       1 5820066 5820066 1687.263 < 2.2e-16 ***
    ## full        1   70313   70313   20.384 8.369e-06 ***
    ## Residuals 396 1365967    3449                       
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    For example, the variable enroll reduces the total error by 817326. By adding the variable meals we remove an additional 5820066 from the residuals, and by adding the variable full we reduce the error by another 70313. Finally, we have 1365967 left as unexplained error. The total sum of squares is all of these sums of squares added together. To get the total sum of squares of the variable api00 we can also multiply its variance by \((n-1)\).

    sum(anova(m2)$Sum)          #sum of RSS and SSreg
    ## [1] 8073672
    (400 - 1) * var(d$api00)    #total sum of squares
    ## [1] 8073672
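
    The same decomposition gives the R-squared of model m2, matching the value reported by summary(m2):

    ss2 <- anova(m2)$"Sum Sq"  #sums of squares: enroll, meals, full, residuals
    1 - ss2[4] / sum(ss2)      #1 - RSS / SST, about 0.8308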

    b) Standardized regression coefficients

    Some researchers are interested in comparing the relative strength of the various predictors within a model. Since each variable has different units, we cannot compare the coefficients to one another directly. To address this problem we use standardized regression coefficients, which can be obtained by transforming the outcome and predictor variables to their standard scores (also called z-scores) before running the regression.

    In R we use the function scale() to do this for each variable.

    #Standardized regression model
    m2.sd <- lm(scale(api00) ~  scale(enroll) + scale(meals) + scale(full), data = d)
    summary(m2.sd) #coefficients are standardized
    ## 
    ## Call:
    ## lm(formula = scale(api00) ~ scale(enroll) + scale(meals) + scale(full), 
    ##     data = d)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.27749 -0.28683  0.00793  0.28108  1.11617 
    ## 
    ## Coefficients:
    ##                 Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)    4.454e-16  2.064e-02   0.000 1.000000    
    ## scale(enroll) -8.191e-02  2.203e-02  -3.719 0.000229 ***
    ## scale(meals)  -8.210e-01  2.441e-02 -33.639  < 2e-16 ***
    ## scale(full)    1.136e-01  2.517e-02   4.515 8.37e-06 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.4129 on 396 degrees of freedom
    ## Multiple R-squared:  0.8308, Adjusted R-squared:  0.8295 
    ## F-statistic: 648.2 on 3 and 396 DF,  p-value: < 2.2e-16

    In the summary of the standardized regression we see that the intercept is zero and that the t-statistics for the other coefficients are exactly the same as in the original model.

    Because the coefficients are all in the same standardized units you can compare these coefficients to assess the relative strength of each of the predictors. In this example, meals has the largest Beta coefficient, -0.821.

    Thus, a one standard deviation increase in meals leads to a 0.821 standard deviation decrease in predicted api00, with the other variables held constant.
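
    Equivalently, each standardized coefficient is the raw coefficient multiplied by the ratio of the predictor’s standard deviation to the outcome’s standard deviation; for example, for meals (a quick check using m2 and d):

    coef(m2)["meals"] * sd(d$meals) / sd(d$api00)  #about -0.821, as above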

    1.4 Summary

    In this part we have discussed the basics of how to perform simple and multiple regression in R, the basics of interpreting output, and some related commands. There are more results that can be extracted from a regression model, but we have skipped them in this workshop.

    In the next part we will go into a more thorough discussion of the assumptions of linear regression and how you can use R to assess these assumptions for your data. In particular, the next part will address the following issues.