Generalized Linear Regression Models

IDRE Statistical Consulting Group

Table of contents

Part 1 Introduction

1.1 Review of linear regression model

Regression Analysis is a statistical modeling tool that is used to explain a response (criterion or dependent) variable as a function of one or more predictor (independent) variables.

\(X\): The predictor. \(x_1, x_2, \dots, x_n\) are \(n\) observations from \(X\).

\(Y\): The response. \(y_1, y_2, \dots, y_n\) are \(n\) observations from \(Y\).

In a Linear Regression Model we assume this relationship is a linear function.

We can represent a simple linear model (a linear model with only one predictor) as:

\[y = \beta_0 + \beta_1 x + \epsilon\]

Where \(\beta_0\) is the intercept of the line, \(\beta_1\) is the slope of the line, and \(\epsilon\) is the error term that represents unexplained uncertainty.

Scatter plot

index  Performance  enroll
1      693          247
2      570          463
3      395          546
...    ...          ...
n      478          520
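The sketch below simulates stand-in values for the performance and enroll columns (made-up data for illustration, not the original data set) and draws the scatter plot in R:

set.seed(123)
# hypothetical enrollments and performance scores, for illustration only
enroll <- round(runif(30, min = 200, max = 600))
performance <- round(800 - 0.5 * enroll + rnorm(30, sd = 50))
plot(performance ~ enroll, main = "Scatter plot of performance vs. enroll")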

Ordinary Least Squares

\[y = \overbrace{\beta_0 + \beta_1 x }^\text{linear function}+ \overbrace{\epsilon}^\text{error}\]

The predicted mean is:

\[\text{E}(Y | x_1, x_2, \cdots, x_p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_p x_p\]

Ordinary Least Squares (cont.)

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\]

\[ e_i = y_i - \hat{y_i}\]

*note: The predictor variable may be continuous, meaning that it may assume all values within a range, for example, age or height. Or it may be dichotomous, meaning that the variable may assume only one of two values, for example, 0 or 1, or it may be a categorical variable with more than two levels. There is only one response or dependent variable, and it is continuous.

Ordinary Least Squares in a graph

\[ \min \sum_{i = 1}^{n} (y_i - \hat{y_i}) ^ 2\]
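As an aside, the minimizing values have a familiar closed form. Below is a minimal sketch with simulated data (our own example, not from the seminar) showing that the closed-form estimates agree with lm():

# OLS slope and intercept from the closed-form solution
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)   # simulated data with beta0 = 2, beta1 = 3
b1 <- cov(x, y) / var(x)      # slope
b0 <- mean(y) - b1 * mean(x)  # intercept
c(b0, b1)
coef(lm(y ~ x))               # lm() returns the same estimates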

Assumptions in OLS

In regression analysis we want to:

1 - Estimate the model parameters (i.e., the intercept and slope of the fitted line, and a measure of uncertainty).

2 - Make inference about the model parameters.

To do this validly, OLS relies on the following assumptions:

1 - Errors have mean equal to zero.

2 - Errors are independent.

3 - Errors have equal or constant variance (Homoscedasticity or Homogeneity of variance).

4 - Errors follow a normal distribution.

5 - The relationship between the dependent variable and the independent variables is linear.

Briefly, the OLS assumptions are that the error terms in the model are independent, identically distributed random variables from a normal distribution with mean equal to zero, for each randomly sampled outcome \(y\) at any given value of \(x\).

R example for linear regression

Open R or RStudio and look at the help for the data set attitude.

These data are from the book Regression Analysis by Example by Chatterjee, S. and Price, B. (1977). The data are aggregated from the questionnaires of approximately 35 employees in each of 30 (randomly selected) departments.

#help("attitude")
#copy and paste example and run it!
data(attitude)
#summary statistics of variables.
summary(attitude)
##      rating        complaints     privileges       learning         raises     
##  Min.   :40.00   Min.   :37.0   Min.   :30.00   Min.   :34.00   Min.   :43.00  
##  1st Qu.:58.75   1st Qu.:58.5   1st Qu.:45.00   1st Qu.:47.00   1st Qu.:58.25  
##  Median :65.50   Median :65.0   Median :51.50   Median :56.50   Median :63.50  
##  Mean   :64.63   Mean   :66.6   Mean   :53.13   Mean   :56.37   Mean   :64.63  
##  3rd Qu.:71.75   3rd Qu.:77.0   3rd Qu.:62.50   3rd Qu.:66.75   3rd Qu.:71.00  
##  Max.   :85.00   Max.   :90.0   Max.   :83.00   Max.   :75.00   Max.   :88.00  
##     critical        advance     
##  Min.   :49.00   Min.   :25.00  
##  1st Qu.:69.25   1st Qu.:35.00  
##  Median :77.50   Median :41.00  
##  Mean   :74.77   Mean   :42.93  
##  3rd Qu.:80.00   3rd Qu.:47.75  
##  Max.   :92.00   Max.   :72.00

We use a graph to investigate the pairwise relationships between all variables.

pairs(attitude, main = "attitude data")

For the linear regression model with rating (overall rating) as the outcome and complaints (handling of employee complaints) as the predictor, we use the function lm().

fm2 <- lm(rating ~ complaints, data = attitude)
summary(fm2)
## 
## Call:
## lm(formula = rating ~ complaints, data = attitude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.8799  -5.9905   0.1783   6.2978   9.6294 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.37632    6.61999   2.172   0.0385 *  
## complaints   0.75461    0.09753   7.737 1.99e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.993 on 28 degrees of freedom
## Multiple R-squared:  0.6813, Adjusted R-squared:  0.6699 
## F-statistic: 59.86 on 1 and 28 DF,  p-value: 1.988e-08
  • The slope of the estimated line is 0.75461 and the intercept is 14.37632.
  • To interpret the slope we can say: for a one-unit increase in the score for handling of employee complaints, on average we expect the overall rating to increase by 0.75461.
  • The intercept is the expected overall rating when the score for handling of employee complaints is zero.
  • Linear regression model diagnostics for the model of rating vs. complaints:

    opar <- par(mfrow = c(2, 2), 
                mar = c(4.1, 4.1, 2.1, 1.1))
    plot(fm2)

    par(opar)
  • The plot in the upper left corner shows the residuals vs. fitted values.
  • This plot shows no strong evidence that the assumptions of zero mean and constant variance of the errors are violated.

  • If the errors have mean zero, the estimated parameters are unbiased, and as we increase the sample size the estimates get closer to the true parameter values (i.e., the \(\beta\)s).
  • The constant variance assumption is needed to ensure that OLS achieves the minimal variability among the estimated parameters.
  • The normal Q-Q plot shows whether the errors are distributed according to a normal distribution, using the estimated errors (residuals).
  • The normality assumption is needed for statistical tests of the parameters, especially for small samples. For large sample sizes, this assumption is not essential.
  • What should we do if we do not have a continuous outcome and the errors do not have equal variance? What if the predicted value from the linear model is not plausible because of the range of the outcome?
1.2 Discrete Random Variable and Probability Mass Function

  • A discrete random variable is a random variable that can only take values from a countable set.
  • For example, the number of times we flip a coin until we observe a head for the first time. This value is in the set \(\{1, 2, 3, \dots\}\).

    The sum of the outcomes of rolling two dice is a random variable that takes values in the set \(\{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12\}\).

  • A Bernoulli random variable, taking values 0 and 1, is a discrete random variable.
Probability Mass Function

  • A probability mass function (PMF) is a function that gives the probability of a discrete random variable at each and every value of that random variable.
  • For example, the PMF of the outcome of rolling a die, with possible outcomes \(\{1, 2, 3, 4, 5, 6\}\), is:

    \(P(X = x) = 1/6\)

    for \(x \in \{1, 2, 3, 4, 5, 6\}\).

    Bernoulli distribution

    \[f(x;p)=\begin{cases} 1 - p & {\text{if }}x=0,\\p & {\text{if }}x=1.\end{cases}\]

    The mean of a Bernoulli random variable is:

    \(\mu_X = E(X) = 1 \times p + 0 \times (1 - p) = p\)

    and the variance is

    \(\sigma^2_X = var(X) = E(X- \mu)^2 = (1 - p) ^ 2 \times p + (0 - p) ^ 2 \times (1 - p) \\ = p(1 - p) ( 1 - p + p) = p(1 - p)\)
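    We can verify these formulas by simulation in R (the choice \(p = 0.3\) is arbitrary, our own example):

    # 100,000 Bernoulli(0.3) draws; mean and variance should be near p and p(1 - p)
    x <- rbinom(1e5, size = 1, prob = 0.3)
    mean(x)  # close to 0.3
    var(x)   # close to 0.3 * 0.7 = 0.21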

    Binomial distribution

  • The Binomial is another popular discrete random variable, defined as the number of successes in \(n\) independent Bernoulli trials, each with probability of success \(p\).
  • The PMF of Binomial(n, p) is: (optional) \[\displaystyle f(x; n,p)=\Pr(x;n,p)=\Pr(X=x)=\frac{n!}{x!(n-x)!} p^{x}(1-p)^{n-x}\] For \(x = 0, 1, 2, 3, \cdots, n\).

  • The sum of \(n\) independent Bernoulli random variables with the same parameter \(p\) has a Binomial distribution with parameters \(n\) and \(p\).
  • If we know the PMF of a random variable, we can calculate the probability of observing any given value by plugging that value into the PMF.
  • For example for a Binomial random variable with \(p = 0.2\) and \(n = 10\) we can calculate probabilities for each value of the random variable:
  • \(P(X = 0) = [10! / (0!(10 - 0)!)]\, 0.2 ^ 0 (1 - 0.2) ^ {10} = 0.8 ^ {10} \approx 0.107\)

  • The graph below shows the plot of PMF of a Binomial random variable with \(p = 0.2\) and \(n = 10\).
  • The mean of the binomial is \(np\) and the variance is \(np(1 - p)\).
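    The probabilities and the plot just described can be reproduced with dbinom():

    # PMF of a Binomial random variable with n = 10 and p = 0.2
    x <- 0:10
    round(dbinom(x, size = 10, prob = 0.2), 3)  # P(X = 0) is about 0.107
    plot(x, dbinom(x, size = 10, prob = 0.2), type = "h",
         xlab = "x", ylab = "P(X = x)", main = "Binomial(10, 0.2) PMF")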
Natural logarithm and exponential function

  • Before we go over the Poisson distribution, we review the logarithm and exponential functions.
  • The logarithm is the inverse function of exponentiation.
  • \(\log_{10}(1000) = 3\)

    We say the logarithm base 10 of 1000 is 3, so

    \(10^3 = 1000\)

    \(\log_{2}(64) = 6\)

    \(2^6 = 64\)

    Euler’s number or the number \(e\) is a mathematical constant that is the base of the natural logarithm.

    \[\displaystyle e=\sum \limits _{n=0}^{\infty }{\frac {1}{n!}}={\frac {1}{1}}+{\frac {1}{1}}+{\frac {1}{1\times 2}}+{\frac {1}{1\times 2\times 3}}+\cdots \approx 2.718282.\]

    If \(e^6 = a\) Then \(\log_e(a) = ?\)

    The answer is 6.

    \(e^x\) can also be written as \(\exp(x)\), and \(\log_e(x)\) is also written as \(\ln(x)\).

Multiplication rule

Since \(\exp(x + y) = \exp(x)\exp(y)\), we have

    \(\log(xy) = \log(x) + \log(y)\) and,

    \(\log(x/y) = \log(x) - \log(y)\)

Change of base

\[\ln(x) = \log_{e}x= {\dfrac {\log_{10}x}{\log_{10}e}}\]

    Poisson Random variable and Poisson distribution

  • The discrete random variable \(Y\), defined as the number of events recorded during a unit of time, has a Poisson distribution with rate parameter \(\mu > 0\) if \(Y\) has PMF
  • \[Pr(Y = y) = \dfrac{e^{-\mu}(\mu)^y}{y!}, \text{ for } y = 0, 1, 2, \dots .\]

  • The mean of the Poisson distribution is \(\mu\), and its variance is also \(\mu\).

  • In the Poisson we have only one parameter: the mean and the variance of the Poisson distribution are equal.
  • For a large number of Bernoulli trials with a very small probability of success, the total number of successes can be approximated by a Poisson distribution. This is called the law of rare events.
  • It can be shown that the Binomial distribution converges to a Poisson distribution as \(n\) goes to infinity and \(p\) goes to zero in such a way that \(np \rightarrow \mu\). A quick numerical check is shown below.
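    A sketch of this convergence (the values \(n = 1000\) and \(p = 0.002\) are our own choice, so that \(np = 2\)):

    # compare Binomial(1000, 0.002) with its Poisson(2) approximation
    n <- 1000; p <- 0.002
    rbind(binomial = dbinom(0:5, size = n, prob = p),
          poisson  = dpois(0:5, lambda = n * p))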

    An Example of a Poisson Random Variable

    Suppose the number of accidents on a sunny day on freeway 405 between Santa Monica Blvd and Wilshire Blvd is Poisson with a rate of 2 accidents per day. Then the probabilities of 0, 1, 2, …, 8 accidents can be calculated using the PMF of the Poisson:

    \(P(Y = 0| \mu = 2) = \dfrac{e^{-2}(2)^0}{0!} = 1/e^2 \approx 0.1353\)

    \(P(Y = 1| \mu = 2) = \dfrac{e^{-2}(2)^1}{1!} = 2/e^2 \approx 0.2707\)

    \(P(Y = 2| \mu = 2) = \dfrac{e^{-2}(2)^2}{2!} = 2/e^2 \approx 0.2707\)

    \(P(Y = 3| \mu = 2) = \dfrac{e^{-2}(2)^3}{3!} = 8/(6e^2) \approx 0.1804\)

    \(P(Y = 4| \mu = 2) = \dfrac{e^{-2}(2)^4}{4!} = 16/(24e^2) \approx 0.0902\)

    \(P(Y = 5| \mu = 2) = \dfrac{e^{-2}(2)^5}{5!} = 32/(120e^2) \approx 0.0361\)

    \(P(Y = 6| \mu = 2) \approx 0.0120\)

    \(P(Y = 7| \mu = 2) \approx 0.0034\)

    \(P(Y = 8| \mu = 2) \approx 0.0009\)
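    These hand calculations can be checked in R with the dpois() function:

    round(dpois(0:8, lambda = 2), 4)
    ## [1] 0.1353 0.2707 0.2707 0.1804 0.0902 0.0361 0.0120 0.0034 0.0009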

    The following graph shows the plot of the PMF of a Poisson random variable with \(\mu = 2\).

    This looks like the binomial, but it is not exactly the same!

    Now assume the number of accidents on a rainy day on freeway 405 between Santa Monica Blvd and Wilshire Blvd is Poisson with a rate of 5.5 accidents per day. Then the probability of each \(x = 0, 1, 2, \cdots, 8\) accidents using R is:

     dpois(0:8, lambda = 5.5)
    ## [1] 0.004086771 0.022477243 0.061812418 0.113322766 0.155818804 0.171400684
    ## [7] 0.157117294 0.123449302 0.084871395
  • The ratio of the rates of accidents is 5.5 / 2 = 2.75.
  • We can also say that the mean number of accidents per day increases by a factor of 2.75 on a rainy day compared to a sunny day.

  • We can draw a sample of the numbers 0 to 8 with the probabilities that we calculated from Poisson(2).
  • A sample of 20 numbers will look like this:

    prob <- c(0.1353, 0.2707, 0.2707, 0.1804, 0.0902, 0.0361, 0.0120, 0.0034, 0.0009)
    domain <- seq(0,8,1)
    set.seed(1001)
    y1 <- sample(x = domain, size = 20, replace = TRUE, prob = prob)
    y1
    ##  [1] 6 2 2 2 2 4 1 1 2 0 2 1 4 2 1 4 1 2 1 0
    mean(y1)
    ## [1] 2
    var(y1)
    ## [1] 2.210526

    Now let's add some zeros and some larger values to our sample:

    y2 <- c(y1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                0, 0, 0, 0, 0, 0, 5, 6, 8, 5, 6, 7, 4, 3, 6, 10, 15, 20)
    mean(y2)
    ## [1] 2.213115
    var(y2)
    ## [1] 14.07049
    par(mfrow = c(1,2))
    hist(y1, breaks = 10)
    hist(y2, breaks = 10)
  • We can see that for sample y2 the variance is larger than the mean. This is called overdispersion.
  • The way we drew our sample is close to Poisson with \(\mu = 2\) but not exact. If we want to sample from a true Poisson we can use the rpois() function in R.
  • y <- rpois(1000, 2)
    mean(y)
    ## [1] 1.958
    var(y)
    ## [1] 2.038274
    hist(y, breaks = 15)

    Poisson distribution for unit time and time t

    The basic Poisson distribution assumes that each count occurs over one unit of time. However, we often count events over a longer period of time or over various areas. For example, if accidents occur on average 2 times per day, then over the course of a month (30 days) the expected number of accidents will be \(2 \times 30 = 60\), while the rate is still 2 accidents per day. The rate parameterization of the Poisson distribution is:

    \[Pr(Y = y) = \dfrac{e^{-t\mu}(t\mu)^y}{y!}, \text{ for } y = 0, 1, 2, \dots\]

    where \(t\) represents the length of time. We will return to this when we talk about exposure in Poisson regression.
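    For example, with a rate of 2 accidents per day and an exposure of \(t = 30\) days, the monthly count is Poisson with mean \(t\mu = 60\):

    # probability of exactly 60 accidents in a 30-day month at 2 per day
    dpois(60, lambda = 30 * 2)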

    Continuous random variable

  • A continuous random variable is a random variable which can take infinitely many values in an interval.
  • For example weight of newborn babies.

    The graph below shows the plot of the PDF of a normal distribution with \(\mu = 2\) and \(\sigma = 1\).

    Part 2 Logistic Regression

    2.1 An example of logistic regression

    We start with an example:

    A researcher is interested in how variables such as GRE (Graduate Record Exam scores), GPA (grade point average) and prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don't admit, is a binary variable.

    We can get basic descriptive statistics for the entire data set by using summary. To get the standard deviations, we use sapply to apply the sd function to each variable in the data set.

    #reading the data from idre stat website:
    mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
    #summary statistics of variables
    summary(mydata)
    ##      admit             gre             gpa             rank      
    ##  Min.   :0.0000   Min.   :220.0   Min.   :2.260   Min.   :1.000  
    ##  1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.130   1st Qu.:2.000  
    ##  Median :0.0000   Median :580.0   Median :3.395   Median :2.000  
    ##  Mean   :0.3175   Mean   :587.7   Mean   :3.390   Mean   :2.485  
    ##  3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:3.670   3rd Qu.:3.000  
    ##  Max.   :1.0000   Max.   :800.0   Max.   :4.000   Max.   :4.000
    #standard deviation of variables
    sapply(mydata, sd)
    ##       admit         gre         gpa        rank 
    ##   0.4660867 115.5165364   0.3805668   0.9444602

    Estimation of the mean of admit without any predictor:

    \[ \hat{p} = \dfrac{X}{n} = \dfrac{\sum_{i=1}^nx_i}{n} \]

    Where \(X\) is total number of people admitted and \(n\) is total number of students and \(x_i\) is value of admit for the \(i\)-th student.

    We can calculate the variance directly from the mean. Note that this variable is binary, and assuming the binary outcome is the result of a Bernoulli trial, \(\hat\mu_{admit} = \hat p = \sum_{i=1}^{n} x_i /n\).

    Since in the Bernoulli distribution the variance is \(var(x) = p(1-p)\), we can use this formula to calculate the variance.

    mean(mydata$admit)
    ## [1] 0.3175
    #the variance can be calculated using p hat.
    mean(mydata$admit) *(1-mean(mydata$admit))
    ## [1] 0.2166937
    #but this is the population variance and needs to be adjusted to get the sample variance
    (mean(mydata$admit) *(1-mean(mydata$admit))) * 400 / (400 - 1)
    ## [1] 0.2172368
    #which is exactly equal to the sample variance
    var(mydata$admit)
    ## [1] 0.2172368

    Why do we talk about this?

    Regression model of admit predicted by gre

    lm.admit <- lm(admit ~ gre, data = mydata)
    summary(lm.admit)        
    ## 
    ## Call:
    ## lm(formula = admit ~ gre, data = mydata)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -0.4755 -0.3415 -0.2522  0.5989  0.8966 
    ## 
    ## Coefficients:
    ##               Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) -0.1198407  0.1190510  -1.007 0.314722    
    ## gre          0.0007442  0.0001988   3.744 0.000208 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.4587 on 398 degrees of freedom
    ## Multiple R-squared:  0.03402,    Adjusted R-squared:  0.03159 
    ## F-statistic: 14.02 on 1 and 398 DF,  p-value: 0.0002081
    opar <- par(mfrow = c(2, 2), 
                mar = c(4.1, 4.1, 2.1, 1.1))
    plot(lm.admit)

    par(opar)
    
    phat <- lm.admit$fitted.values
    
    plot(phat * (1 - phat) ~ phat)

    As we can see the variance of the predicted outcome is systematically related to the predicted outcome and thus is not constant.

    Another issue with this model is that the predicted values from a linear model are not restricted to the plausible range of the outcome, here between 0 and 1.

    We therefore need a transformation function that relates the outcome to a value between 0 and 1.

    2.2 Understanding the odds

    If the probability of being admitted is \(p = 0.3\), then:

    \[ \tt{odds(admit)} = \dfrac{p}{1 - p} = \dfrac{0.3}{1 - 0.3} = 0.4285714 \]

    \[ \tt{odds(not \ admit)} = \dfrac{p}{1 - p} = \dfrac{0.7}{1 - 0.7} = 2.333333 \]

    \(odds = \dfrac{p}{1 - p}\)

    Let's calculate the probability that the Lakers go to the playoffs, assuming the odds are 9 (i.e., 9 to 1). From the odds formula:

    \(p = odds \times (1 - p)\), so \(p + p \times odds = odds\) and

    \(p = \dfrac{odds} {1 + odds} = 9 / (9+1) = 0.9\)

    Then the probability of not going to the playoffs will be 0.1.
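    The same arithmetic in R (using the assumed odds of 9 from above):

    odds <- 9
    p <- odds / (1 + odds)  # 0.9
    1 - p                   # 0.1, probability of not making the playoffs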

    Another useful function is the log. If we have a value between zero and infinity, then the log of that value is between negative and positive infinity.

    The inverse of the log function is the exponential function. We usually use the natural log and the exponential function.

    In summary, we can use the log(odds) transformation to transform a value between zero and one to a value between negative infinity and positive infinity. This function, log(odds), is called the logit function.

    This suggests that the logit of the outcome can be modeled as a linear function of the predictors.
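    A small sketch of the logit and its inverse (base R also provides these as qlogis() and plogis()):

    logit <- function(p) log(p / (1 - p))      # maps (0, 1) to (-Inf, Inf)
    invlogit <- function(z) 1 / (1 + exp(-z))  # maps back to (0, 1)
    logit(0.3)            # -0.8472979
    invlogit(logit(0.3))  # recovers 0.3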

    Logistic regression model example

    glm.mod <- glm(admit ~ gre, family = binomial(link = "logit"), data = mydata)
    
    summary(glm.mod)
    ## 
    ## Call:
    ## glm(formula = admit ~ gre, family = binomial(link = "logit"), 
    ##     data = mydata)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -1.1623  -0.9052  -0.7547   1.3486   1.9879  
    ## 
    ## Coefficients:
    ##              Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -2.901344   0.606038  -4.787 1.69e-06 ***
    ## gre          0.003582   0.000986   3.633  0.00028 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 499.98  on 399  degrees of freedom
    ## Residual deviance: 486.06  on 398  degrees of freedom
    ## AIC: 490.06
    ## 
    ## Number of Fisher Scoring iterations: 4
    coef(summary(glm.mod))
    ##                 Estimate   Std. Error   z value     Pr(>|z|)
    ## (Intercept) -2.901344270 0.6060376103 -4.787400 1.689561e-06
    ## gre          0.003582212 0.0009860051  3.633056 2.800845e-04
    confint(glm.mod)
    ##                    2.5 %       97.5 %
    ## (Intercept) -4.119988259 -1.739756286
    ## gre          0.001679963  0.005552748
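    A common follow-up, not shown above, is to exponentiate the coefficients and their Wald confidence limits to obtain odds ratios:

    # odds ratios with Wald confidence intervals
    exp(cbind(OR = coef(glm.mod), confint.default(glm.mod)))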

    Part 3 Regression of count data

    3.1 What are count data?

  • Count data are observations that take only nonnegative integer values.
  • Counts can range from zero to some greater, undetermined value.
  • Theoretically counts can range from zero to infinity, but in practice they are always limited to some lesser value.
  • Some types of count data:

  • A count of items or events occurring within a period of time.
  • A count of items or events occurring in a given geographical or spatial area.
  • A count of the number of people having a particular disease, adjusted by the size of the population.
  • Examples of count data:

  • Number of accidents on a highway.
  • Number of patients who died at a hospital within 48 hours of having a myocardial infarction.
  • Count of the number of people having a particular disease, adjusted by the size of the population.

    3.2 Poisson Regression Model

  • A statistical model describes the relationship between one or more variables and another variable or variables.
  • Usually we are interested in studying the relationship between one (response, dependent, or outcome) variable and one or more (explanatory, independent, or predictor) variables.
Linear regression model

    The traditional linear regression model assumes this relationship is linear and that the randomness of the relationship comes from a Gaussian (normal) probability distribution. The simplest case can be formalized as follows: \[ Y = \beta_0 + \beta_1 X + \epsilon\]
  • By estimating the model we try to find the mean of the response for a given predictor as a linear relationship. This process is also known as the ordinary least squares method.
  • \[ \text{mean}(Y) = E(Y|X) = \hat\beta_0 + \hat\beta_1 X\]

    Regression model for count data

  • Regression models for count data are regression models in which the response variable is a non-negative integer.
  • Can we model the count response as a continuous random variable and estimate the parameters by ordinary least squares?
  • There are two problems with this approach:

  • In a linear OLS model we may get a negative value for the mean response. This is not valid for count data.
  • In a linear OLS model we assume the variance of the response for any given predictor is constant. This may not be correct for count data; in fact, it almost never is.
  • The Poisson regression model is the standard model for count data.
  • In the Poisson regression model we assume that the response variable for a given set of predictor variables has a Poisson distribution with parameter \(\mu\) that depends on the values of the predictors.
  • Since the parameter \(\mu\) should be greater than zero, for \(p\) predictors we use the exponential function to keep the mean positive (equivalently, the log link).
  • Recall that in the OLS model the response variable is assumed to have a normal distribution.
  • Nearly all of the count models have the basic structure of the linear model. The difference is that the left-hand side of the model equation is in log form.
    Formalization of the count model

    A count model with one predictor can be formalized as:

    \[\text{log of mean response} = \ln(E(Y|X)) = \ln(\mu) = \beta_0 + \beta_1 X\]

    From this model the expected count or mean response will be:

    \[\text{mean response} = \mu = E(Y|X) = \exp( \beta_0 + \beta_1 X)\]

    This ensures \(\mu\) will be positive for any value of \(\beta_0\), \(\beta_1\), or \(X\).

    The Poisson model

    The Poisson model assumes that the count \(Y|x\) has a Poisson distribution with the above mean. For an observation \(y_i\) with a single predictor \(x_i\) we have:

    \[ f(y_i | x_i; \beta_0, \beta_1) = \dfrac{e^{-\exp( \beta_0 + \beta_1 x_i)}(\exp( \beta_0 + \beta_1 x_i))^{y_i}}{y_i!}\]

    The plot below shows the distribution of the response for a given predictor under the Poisson regression model.

    Poisson Regression Model (cont.)

    Back to our toy example of the number of accidents on the 405. Let \(X\) be a binary variable: 0 for a sunny day and 1 for a rainy day.

    We have a sample of 40 days: 20 sunny days drawn from a Poisson with rate 2 and 20 rainy days drawn from a Poisson with rate 5.

    In fact we can artificially generate this sample using R:

    y <- c(rpois(20, lambda = 2), rpois(20, lambda = 5))
    x <- rep(c(0,1), each = 20)
    print(data.frame(y = y,x = x))
    ##     y x
    ## 1   1 0
    ## 2   3 0
    ## 3   3 0
    ## 4   2 0
    ## 5   5 0
    ## 6   0 0
    ## 7   5 0
    ## 8   4 0
    ## 9   4 0
    ## 10  4 0
    ## 11  1 0
    ## 12  3 0
    ## 13  0 0
    ## 14  0 0
    ## 15  1 0
    ## 16  1 0
    ## 17  0 0
    ## 18  1 0
    ## 19  1 0
    ## 20  4 0
    ## 21  7 1
    ## 22  3 1
    ## 23  4 1
    ## 24  4 1
    ## 25  4 1
    ## 26  7 1
    ## 27  5 1
    ## 28 10 1
    ## 29  6 1
    ## 30  9 1
    ## 31  8 1
    ## 32  2 1
    ## 33  5 1
    ## 34  0 1
    ## 35  5 1
    ## 36  3 1
    ## 37  3 1
    ## 38  5 1
    ## 39  6 1
    ## 40  4 1
    accidents_model <- glm(y ~ x, family = poisson(link = "log"))
    summary(accidents_model)
    ## 
    ## Call:
    ## glm(formula = y ~ x, family = poisson(link = "log"))
    ## 
    ## Deviance Residuals: 
    ##      Min        1Q    Median        3Q       Max  
    ## -3.16228  -0.87696  -0.05176   0.84298   1.96544  
    ## 
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept)   0.7655     0.1525   5.020 5.18e-07 ***
    ## x             0.8440     0.1824   4.628 3.69e-06 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for poisson family taken to be 1)
    ## 
    ##     Null deviance: 82.391  on 39  degrees of freedom
    ## Residual deviance: 59.027  on 38  degrees of freedom
    ## AIC: 172.29
    ## 
    ## Number of Fisher Scoring iterations: 5

    What is this?

    Let’s exponentiate the coefficients:

    exp(accidents_model$coeff)
    ## (Intercept)           x 
    ##    2.150000    2.325581

    Can we find the rate of accidents for each group (sunny or rainy)? How?

    mu0 <- mean(y[x==0])
    mu1 <- mean(y[x==1])
    c(mu0, mu1, mu1/mu0)
    ## [1] 2.150000 5.000000 2.325581
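    Equivalently, the fitted means from the model reproduce the two group means, since the model is saturated for the two groups:

    # predicted mean counts for a sunny (x = 0) and a rainy (x = 1) day
    predict(accidents_model, newdata = data.frame(x = c(0, 1)), type = "response")
    # returns 2.15 and 5.00, matching mu0 and mu1 above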

    Poisson model log-likelihood function

    This part is just for your information and is optional.

    If we have more than one predictor, then for the \(i\)-th observation we have:

    \[\mu_i = E(y_i | x_{i1} , x_{i2}, \dots, x_{ip};\beta) = \exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots+ \beta_p x_{ip})\]

    For simplicity we use \(\boldsymbol{x'\beta}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots+ \beta_p x_{ip}\) and therefore we have

    \[P(Y = y_i|\boldsymbol{x}_i, \boldsymbol{\beta}) = \dfrac{e^{-\exp(\boldsymbol{x'\beta}_i)}(\exp(\boldsymbol{x'\beta}_i))^{y_i}}{y_i!}\] and the log-likelihood function will be:

    \[\cal L(\boldsymbol{\beta}) = \sum_{i=1}^n [y_i \boldsymbol{x'\beta}_i - \exp(\boldsymbol{x'\beta}_i) - \ln y_i!]\]
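    As a check, we can evaluate this log-likelihood at the fitted coefficients of the accidents model above and compare it with logLik(); loglik_pois below is our own helper, not part of any package:

    # Poisson log-likelihood at a coefficient vector beta
    loglik_pois <- function(beta, X, y) {
      eta <- X %*% beta                        # linear predictor x'beta
      sum(y * eta - exp(eta) - lgamma(y + 1))  # lgamma(y + 1) = log(y!)
    }
    X <- model.matrix(accidents_model)
    loglik_pois(coef(accidents_model), X, y)   # matches logLik(accidents_model)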

    Analysis of count data model

  • For this part we will use the German national health registry data set as an example of Poisson regression. The data set is also available in the package COUNT in R. We are going to use only year 1984.
  • Source: German Health Reform Registry, years pre-reform 1984-1988, from Hilbe and Greene (2008).
  • The response variable is the number of doctor visits (docvis).
  • We are using outwork and age as predictors. outwork is a binary variable with 1 = not working and 0 = working. age is a continuous variable ranging from 25 to 64.
  • Since ages range from 25 to 64 and age 0 is not a valid value, to interpret the intercept we center the variable age (see the R code below).


    R code for loading the data and descriptive statistics

    # read in the data from a csv file
    rwm1984 <- read.csv("rwm1984.csv")
    # alternatively, the data set is available in the package COUNT
    library(COUNT)
    # data("rwm1984")
    head(rwm1984)
    ##   docvis hospvis edlevel age outwork female married kids hhninc educ self
    ## 1      1       0       3  54       0      0       1    0  3.050 15.0    0
    ## 2      0       0       1  44       1      1       1    0  3.050  9.0    0
    ## 3      0       0       1  58       1      1       0    0  1.434 11.0    0
    ## 4      7       2       1  64       0      0       0    0  1.500 10.5    0
    ## 5      6       0       3  30       1      0       0    0  2.400 13.0    0
    ## 6      9       0       3  26       1      0       0    0  1.050 13.0    0
    ##   edlevel1 edlevel2 edlevel3 edlevel4
    ## 1        0        0        1        0
    ## 2        1        0        0        0
    ## 3        1        0        0        0
    ## 4        1        0        0        0
    ## 5        0        0        1        0
    ## 6        0        0        1        0
    summary(rwm1984)
    ##      docvis           hospvis           edlevel          age    
    ##  Min.   :  0.000   Min.   : 0.0000   Min.   :1.00   Min.   :25  
    ##  1st Qu.:  0.000   1st Qu.: 0.0000   1st Qu.:1.00   1st Qu.:35  
    ##  Median :  1.000   Median : 0.0000   Median :1.00   Median :44  
    ##  Mean   :  3.163   Mean   : 0.1213   Mean   :1.38   Mean   :44  
    ##  3rd Qu.:  4.000   3rd Qu.: 0.0000   3rd Qu.:1.00   3rd Qu.:54  
    ##  Max.   :121.000   Max.   :17.0000   Max.   :4.00   Max.   :64  
    ##     outwork           female          married            kids       
    ##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
    ##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
    ##  Median :0.0000   Median :0.0000   Median :1.0000   Median :0.0000  
    ##  Mean   :0.3665   Mean   :0.4793   Mean   :0.7891   Mean   :0.4491  
    ##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
    ##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
    ##      hhninc            educ            self            edlevel1     
    ##  Min.   : 0.015   Min.   : 7.00   Min.   :0.00000   Min.   :0.0000  
    ##  1st Qu.: 2.000   1st Qu.:10.00   1st Qu.:0.00000   1st Qu.:1.0000  
    ##  Median : 2.800   Median :10.50   Median :0.00000   Median :1.0000  
    ##  Mean   : 2.969   Mean   :11.09   Mean   :0.06118   Mean   :0.8136  
    ##  3rd Qu.: 3.590   3rd Qu.:11.50   3rd Qu.:0.00000   3rd Qu.:1.0000  
    ##  Max.   :25.000   Max.   :18.00   Max.   :1.00000   Max.   :1.0000  
    ##     edlevel2         edlevel3         edlevel4      
    ##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
    ##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
    ##  Median :0.0000   Median :0.0000   Median :0.00000  
    ##  Mean   :0.0524   Mean   :0.0746   Mean   :0.05937  
    ##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
    ##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000
    #centering the age predictor
    rwm1984$cage <- rwm1984$age - mean(rwm1984$age)

    Run a Poisson model with docvis as the dependent variable and outwork and age as predictors:

    pois_m <- glm(docvis ~ outwork + age, family = poisson(link = "log"), data = rwm1984)
    summary(pois_m)
    ## 
    ## Call:
    ## glm(formula = docvis ~ outwork + age, family = poisson(link = "log"), 
    ##     data = rwm1984)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -3.4573  -2.2475  -1.2366   0.2863  25.1320  
    ## 
    ## Coefficients:
    ##               Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -0.0335170  0.0391815  -0.855    0.392    
    ## outwork      0.4079314  0.0188447  21.647   <2e-16 ***
    ## age          0.0220842  0.0008377  26.362   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for poisson family taken to be 1)
    ## 
    ##     Null deviance: 25791  on 3873  degrees of freedom
    ## Residual deviance: 24190  on 3871  degrees of freedom
    ## AIC: 31279
    ## 
    ## Number of Fisher Scoring iterations: 6

    Use the centered age (cage) and run the model again:

    pois_m1 <- glm(docvis ~ outwork + cage, family = poisson(link = "log"), data = rwm1984)
    summary(pois_m1)
    ## 
    ## Call:
    ## glm(formula = docvis ~ outwork + cage, family = poisson(link = "log"), 
    ##     data = rwm1984)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -3.4573  -2.2475  -1.2366   0.2863  25.1320  
    ## 
    ## Coefficients:
    ##              Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) 0.9380965  0.0127571   73.53   <2e-16 ***
    ## outwork     0.4079314  0.0188447   21.65   <2e-16 ***
    ## cage        0.0220842  0.0008377   26.36   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for poisson family taken to be 1)
    ## 
    ##     Null deviance: 25791  on 3873  degrees of freedom
    ## Residual deviance: 24190  on 3871  degrees of freedom
    ## AIC: 31279
    ## 
    ## Number of Fisher Scoring iterations: 6
    pr <- sum(residuals(pois_m1, type = "pearson") ^ 2)
    # Pearson dispersion statistic: Pearson chi-square over the residual df
    pr / pois_m1$df.residual
    ## [1] 11.34322
    pois_m1$aic / (pois_m1$df.null + 1)
    ## [1] 8.074027
    COUNT::modelfit(pois_m1)
    ## $AIC
    ## [1] 31278.78
    ## 
    ## $AICn
    ## [1] 8.074027
    ## 
    ## $BIC
    ## [1] 31297.57
    ## 
    ## $BICqh
    ## [1] 8.07418

    Interpreting coefficients and rate ratios

  • Coefficients are on the log scale and we can directly interpret them in terms of the log response. In general we say: the log mean of the response variable increases by \(\beta\) for a one-unit increase in the corresponding predictor variable, with the other predictors held constant.
  • In our results, the expected log number of visits for patients who are out of work is 0.4 more than the log number of visits for working patients of the same age.
  • For each one-year increase in age, there is an increase in the expected log count of visits to the doctor of 0.022, holding outwork constant.
  • In order to have a change in a predictor reflect a change in actual visits to a physician, we must exponentiate the coefficients and the confidence intervals of the coefficients.

    The standard error of an exponentiated coefficient can be approximately calculated by the delta method:

    Delta Method:

    \[se(\exp(\hat\beta)) \approx \exp(\hat\beta) \times se(\hat\beta)\]


    exp(coef(pois_m1))
    ## (Intercept)     outwork        cage 
    ##    2.555113    1.503704    1.022330
    exp(coef(pois_m1)) * sqrt(diag(vcov(pois_m1))) #delta method
    ##  (Intercept)      outwork         cage 
    ## 0.0325958199 0.0283368154 0.0008564317
    exp(confint.default(pois_m1))
    ##                2.5 %   97.5 %
    ## (Intercept) 2.492018 2.619805
    ## outwork     1.449178 1.560282
    ## cage        1.020653 1.024010
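    Exponentiated coefficients are incidence rate ratios (IRRs): holding age constant, patients who are out of work are expected to have about 1.50 times as many doctor visits as working patients, and each additional year of (centered) age multiplies the expected number of visits by about 1.02.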