Regression Models for Count Data

IDRE Statistical Consulting Group

1.1 What are count data?

  • Count data are observations that have only nonnegative integer values.
  • Counts can range from zero to some greater, undetermined value.
  • Theoretically, counts can range from zero to infinity, but in practice they are always limited to some lesser value.
  • Some types of count data:

  • A count of items or events occurring within a period of time.
  • A count of items or events occurring in a given geographical or spatial area.
  • A count of the number of people having a particular disease, adjusted by the size of the population.
  • Examples of count data:

  • Number of accidents on a highway.
  • Number of patients who died at a hospital within 48 hours of having a myocardial infarction.
    1.2 Discrete Random Variables and Probability Mass Functions

  • A discrete random variable is a random variable that can only take values from a countable set.
  • For example, the number of times we flip a coin until we observe a head for the first time is a discrete random variable taking values in the set \(\{1, 2, 3, ...\}\).

    The sum of the outcomes of rolling two dice is a random variable that takes values in the set \(\{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12\}\).

  • Bernoulli random variable with values of 0 and 1 is a discrete random variable.
  • Probability Mass Function

  • A probability mass function (PMF) is a function that gives the probability of a discrete random variable at each and every value of that random variable.
  • For example, the PMF of a Bernoulli random variable with outcomes zero and one is:

    \[f(x;p)=\begin{cases} 1 - p & {\text{if }}x=0,\\p & {\text{if }}x=1.\end{cases}\]

  • The Binomial is another popular discrete random variable, defined as the number of successes in \(n\) independent Bernoulli trials, each with probability of success \(p\).
  • The PMF of Binomial(n, p) is: (optional) \[\displaystyle f(x; n,p)=\Pr(x;n,p)=\Pr(X=x)=\frac{n!}{x!(n-x)!} p^{x}(1-p)^{n-x}\] For \(x = 0, 1, 2, 3, \cdots, n\).

    If we know the PMF of a random variable, we can calculate the probability of observing any given value by plugging that value into the PMF.

    For example, for a Binomial random variable with \(p = 0.2\) and \(n = 10\), we can calculate the probability of each value of the random variable:

    \(P(x = 0) = [10! / 0!(10 - 0)!] 0.2 ^ 0 (1 - 0.2) ^ {10} = 0.8 ^ {10} \approx 0.107\)

    The graph below shows the plot of PMF of a Binomial random variable with \(p = 0.2\) and \(n = 10\).
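    These probabilities (and the plot) can be reproduced in R with the dbinom() function; a minimal sketch:

```r
# PMF of Binomial(n = 10, p = 0.2) at each possible value
x <- 0:10
pmf <- dbinom(x, size = 10, prob = 0.2)
round(pmf, 3)
# P(X = 0) matches the hand calculation 0.8^10
dbinom(0, size = 10, prob = 0.2)  # 0.1073742
# plot the PMF
plot(x, pmf, type = "h", xlab = "x", ylab = "P(X = x)")
```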

  • A continuous random variable is a random variable which can take infinitely many values in an interval.
  • For example, the weight of newborn babies.

    The graph below shows the plot of the PDF of a normal distribution with \(\mu = 2\) and \(\sigma = 1\).

    1.3 Poisson Random Variable and Poisson Distribution

    The discrete random variable \(Y\), defined as the number of events recorded during a unit of time, has a Poisson distribution with rate parameter \(\mu > 0\) if \(Y\) has PMF

    \[Pr(Y = y) = \dfrac{e^{-\mu}(\mu)^y}{y!}, \text{ for } y = 0, 1, 2, \dots .\]

  • What is \(e\), and what is its relationship to the log?
  • \(\log_{10}(1000) = 3\)

    \(10^3 = 1000\)

    \(\log_{2}(64) = 6\)

    \(2^6 = 64\)

    Euler’s number or the number \(e\) is a mathematical constant that is the base of the natural logarithm.

    \[\displaystyle e=\sum \limits _{n=0}^{\infty }{\frac {1}{n!}}={\frac {1}{1}}+{\frac {1}{1}}+{\frac {1}{1\cdot 2}}+{\frac {1}{1\cdot 2\cdot 3}}+\cdots \approx 2.718282.\]

    If \(e^6 = a\) Then \(\log_e(a) = ?\)

    \(e^x\) can also be written as \(\exp(x)\), and \(\log_e(x)\) is also written as \(\ln(x)\).
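    These identities are easy to verify in R:

```r
log10(1000)               # 3
log2(64)                  # 6
exp(1)                    # Euler's number, about 2.718282
log(exp(6))               # the natural log undoes exp, so this is 6
sum(1 / factorial(0:10))  # partial sum of the series above, already close to e
```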

    Example: number of accidents on a highway

    Suppose the number of accidents on a sunny day on freeway 405 between Santa Monica and Wilshire Blvd is Poisson with a rate of 2 accidents per day. Then the probability of 0 accidents can be calculated using the formula above:

    \(P(Y = 0| \mu = 2) = \dfrac{e^{-2}(2)^0}{0!} = 1/e^2 \approx 0.1353\)

    \(P(Y = 1| \mu = 2) = \dfrac{e^{-2}(2)^1}{1!} = 2/e^2 \approx 0.2707\)

    \(P(Y = 2| \mu = 2) = \dfrac{e^{-2}(2)^2}{2!} = 2/e^2 \approx 0.2707\)

    \(P(Y = 3| \mu = 2) = \dfrac{e^{-2}(2)^3}{3!} = 8/(6e^2) \approx 0.1804\)

    \(P(Y = 4| \mu = 2) = \dfrac{e^{-2}(2)^4}{4!} = 16/(24e^2) \approx 0.0902\)

    \(P(Y = 5| \mu = 2) = \dfrac{e^{-2}(2)^5}{5!} = 32/(120e^2) \approx 0.0361\)

    \(P(Y = 6| \mu = 2) \approx 0.0120\)

    \(P(Y = 7| \mu = 2) \approx 0.0034\)

    \(P(Y = 8| \mu = 2) \approx 0.0009\)
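    The hand calculations above can be reproduced in R with dpois():

```r
# P(Y = y | mu = 2) for y = 0, ..., 8
round(dpois(0:8, lambda = 2), 4)
# 0.1353 0.2707 0.2707 0.1804 0.0902 0.0361 0.0120 0.0034 0.0009
```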

    The following graph shows the plot of PMF of a Poisson random variable with \(\mu = 2\)

    This looks like the Binomial, but it is not exactly the same!
  • In the Poisson we have only one parameter. The mean and variance of the Poisson distribution are equal.
  • In a large number of Bernoulli trials with a very small probability of success, the total number of successes can be approximated by a Poisson distribution. This is called the law of rare events.
  • It can be shown that a Binomial distribution converges to a Poisson distribution as \(n\) goes to infinity and \(p\) goes to zero such that \(np \rightarrow \mu\).
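    The law of rare events can be checked numerically: with \(n = 1000\) and \(p = 0.002\) (so \(np = 2\)), the Binomial PMF is already very close to the Poisson(2) PMF:

```r
n <- 1000
p <- 0.002                       # np = 2
binom <- dbinom(0:8, size = n, prob = p)
pois  <- dpois(0:8, lambda = n * p)
round(cbind(binom, pois), 4)     # the columns agree to about 3 decimal places
max(abs(binom - pois))           # largest pointwise difference, on the order of 1e-4
```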

    Now assume the number of accidents on a rainy day on freeway 405 between Santa Monica and Wilshire Blvd is Poisson with a rate of 5.5 accidents per day. The probabilities of 0 to 10 accidents can then be computed in R:

     dpois(0:10, lambda = 5.5)
    ##  [1] 0.004086771 0.022477243 0.061812418 0.113322766 0.155818804 0.171400684
    ##  [7] 0.157117294 0.123449302 0.084871395 0.051865853 0.028526219
  • The ratio of accident rates is 5.5 / 2 = 2.75.
  • We can also say the mean number of accidents per day on a rainy day is 2.75 times that of a sunny day.

  • We can draw a sample of the numbers 0 to 8 with the probabilities we calculated from Poisson(2).
  • A sample of 20 numbers will look like this:

    prob <- c(0.1353, 0.2707, 0.2707, 0.1804, 0.0902, 0.0361, 0.0120, 0.0034, 0.0009)
    domain <- seq(0,8,1)
    set.seed(1001)
    y1 <- sample(x = domain, size = 20, replace = TRUE, prob = prob)
    y1
    ##  [1] 6 2 2 2 2 4 1 1 2 0 2 1 4 2 1 4 1 2 1 0
    mean(y1)
    ## [1] 2
    var(y1)
    ## [1] 2.210526

    Now let's add some zeros and some larger values to our sample:

    y2 <- c(y1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                0, 0, 0, 0, 0, 0, 5, 6, 8, 5, 6, 7, 4, 3, 6, 10, 15, 20)
    mean(y2)
    ## [1] 2.213115
    var(y2)
    ## [1] 14.07049
    par(mfrow = c(1,2))
    hist(y1, breaks = 10)
    hist(y2, breaks = 10)
  • We can see that for sample y2 the variance is larger than the mean. This is called overdispersion.
  • The way we drew our sample is close to Poisson with \(\mu = 2\), but not exactly. If we want to sample from a true Poisson we can use the rpois() function in R.
    y <- rpois(1000, 2)
    mean(y)
    ## [1] 1.958
    var(y)
    ## [1] 2.038274
    hist(y, breaks = 15)

    Exposure

    The basic Poisson distribution assumes that each count occurs over one unit of time. However, we often count events over longer periods of time or over varying areas. For example, if accidents occur on average 2 times per day, then over the course of a month (30 days) the expected number of accidents will be \(2 \times 30 = 60\). The rate is still 2 accidents per day. The rate parameterization of the Poisson distribution is:

    \[Pr(Y = y) = \dfrac{e^{-t\mu}(t\mu)^y}{y!}, \text{ for } y = 0, 1, 2, \dots .\] where \(t\) represents the length of time, or exposure.
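    For example, with a rate of \(\mu = 2\) accidents per day and an exposure of \(t = 30\) days, a quick check in R:

```r
mu <- 2      # accidents per day
t  <- 30     # exposure: 30 days
t * mu       # expected count over the month: 60
dpois(60, lambda = t * mu)   # P(Y = 60) is about 0.0514
```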

    1.4 Poisson Regression Model

  • A statistical model describes the relationship between one or more variables and another variable or variables.
  • Usually we are interested in studying the relationship between one variable (the response, dependent, or outcome variable) and one or more other variables (explanatory, independent, or predictor variables).
  • Linear regression model

    The traditional linear regression model assumes this relationship is linear and that the randomness comes from a Gaussian (normal) probability distribution. The simplest case can be formulated as: \[ Y = \beta_0 + \beta_1 X + \epsilon\]
  • By estimating the model we find the mean of the response for a given predictor as a linear function; this process is also known as the ordinary least squares (OLS) method.
  • \[ \text{mean}(Y) = E(Y|X) = \hat\beta_0 + \hat\beta_1 X\]

    Regression model for count data

  • A regression model for count data is a regression model in which the response variable is a non-negative integer.
  • Can we model the count response as a continuous random variable and estimate the parameters by ordinary least squares?
  • There are two problems with this approach:

  • In a linear OLS model we may get a negative mean response. This is not valid for count data.
  • In a linear OLS model we assume the variance of the response is constant for any given predictor. This may not hold for count data, and in fact it almost never does.
  • The Poisson regression model is the standard model for count data.
  • In the Poisson regression model we assume that the response variable, for a given set of predictor values, has a Poisson distribution with a parameter \(\mu\) that depends on the values of the predictors.
  • Since the parameter \(\mu\) must be greater than zero, we use the exponential function as the (inverse) link, which is always positive.
  • Recall that in the OLS model the response variable is assumed to have a normal distribution.
  • Nearly all count models have the basic structure of the linear model. The difference is that the left-hand side of the model equation is in log form.
  • Formalization of count model

    A count model with one predictor can be formulated as:

    \[\text{log of mean response} = \ln E(Y|X) = \ln(\mu) = \beta_0 + \beta_1 X\]

    From this model the expected count or mean response will be:

    \[\text{mean response} = \mu =E(Y|X) = \exp( \beta_0 + \beta_1 X)\] This ensures \(\mu\) will be positive for any value of \(\beta_0\), \(\beta_1\), or \(X\).

    The Poisson model

    The Poisson model assumes that the count \(Y|x\) has a Poisson distribution with the above mean. For an observation \(y_i\) and a single predictor \(x_i\) we have:

    \[ f(y_i | x_i; \beta_0, \beta_1) = \dfrac{e^{-\exp( \beta_0 + \beta_1 x_i)}(\exp( \beta_0 + \beta_1 x_i))^{y_i}}{y_i!}\]
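    This density is just the Poisson PMF evaluated at a mean that depends on \(x\), so it can be computed with dpois(). The coefficients below (\(\beta_0 = 0.7\), \(\beta_1 = 0.9\)) are made up for illustration:

```r
# hypothetical coefficients, for illustration only
beta0 <- 0.7
beta1 <- 0.9
mu_x <- function(x) exp(beta0 + beta1 * x)      # mean response at x
f <- function(y, x) dpois(y, lambda = mu_x(x))  # P(Y = y | x)
mu_x(0)    # mean when x = 0: exp(0.7), about 2.01
mu_x(1)    # mean when x = 1: exp(1.6), about 4.95
f(3, 1)    # probability of observing y = 3 when x = 1
```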

    The plot below shows the distribution of the response for a given predictor under the Poisson regression model.

    Poisson Regression Model (continued)

    Back to our toy example of the number of accidents on the 405. Let \(X\) be a binary variable: 0 for a sunny day and 1 for a rainy day.

    We have a sample of 40 days, 20 sunny and 20 rainy, drawn from Poisson distributions with rates 2 and 5 respectively.

    In fact, we can artificially generate this sample using R:

    y <- c(rpois(20, lambda = 2), rpois(20, lambda = 5))
    x <- rep(c(0,1), each = 20)
    print(data.frame(y = y,x = x))
    ##     y x
    ## 1   1 0
    ## 2   3 0
    ## 3   3 0
    ## 4   2 0
    ## 5   5 0
    ## 6   0 0
    ## 7   5 0
    ## 8   4 0
    ## 9   4 0
    ## 10  4 0
    ## 11  1 0
    ## 12  3 0
    ## 13  0 0
    ## 14  0 0
    ## 15  1 0
    ## 16  1 0
    ## 17  0 0
    ## 18  1 0
    ## 19  1 0
    ## 20  4 0
    ## 21  7 1
    ## 22  3 1
    ## 23  4 1
    ## 24  4 1
    ## 25  4 1
    ## 26  7 1
    ## 27  5 1
    ## 28 10 1
    ## 29  6 1
    ## 30  9 1
    ## 31  8 1
    ## 32  2 1
    ## 33  5 1
    ## 34  0 1
    ## 35  5 1
    ## 36  3 1
    ## 37  3 1
    ## 38  5 1
    ## 39  6 1
    ## 40  4 1
    accidents_model <- glm(y ~ x, family = poisson(link = "log"))
    summary(accidents_model)
    ## 
    ## Call:
    ## glm(formula = y ~ x, family = poisson(link = "log"))
    ## 
    ## Deviance Residuals: 
    ##      Min        1Q    Median        3Q       Max  
    ## -3.16228  -0.87696  -0.05176   0.84298   1.96544  
    ## 
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept)   0.7655     0.1525   5.020 5.18e-07 ***
    ## x             0.8440     0.1824   4.628 3.69e-06 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for poisson family taken to be 1)
    ## 
    ##     Null deviance: 82.391  on 39  degrees of freedom
    ## Residual deviance: 59.027  on 38  degrees of freedom
    ## AIC: 172.29
    ## 
    ## Number of Fisher Scoring iterations: 5

    What is this?

    Let’s exponentiate the coefficients:

    exp(accidents_model$coeff)
    ## (Intercept)           x 
    ##    2.150000    2.325581

    Can we find the rate of accidents for each group (sunny or rainy)? How?

    mu0 <- mean(y[x==0])
    mu1 <- mean(y[x==1])
    c(mu0, mu1, mu1/mu0)
    ## [1] 2.150000 5.000000 2.325581
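    With a single binary predictor, the Poisson GLM reproduces the two group means exactly: \(\exp(\hat\beta_0)\) is the mean for \(x = 0\) and \(\exp(\hat\beta_0 + \hat\beta_1)\) the mean for \(x = 1\). A self-contained sketch (with its own seed, so the numbers differ from the output above):

```r
set.seed(42)  # arbitrary seed for illustration
y <- c(rpois(20, lambda = 2), rpois(20, lambda = 5))
x <- rep(c(0, 1), each = 20)
m <- glm(y ~ x, family = poisson(link = "log"))
b <- coef(m)
c(exp(b[1]), mean(y[x == 0]))          # fitted and sample mean, sunny days
c(exp(b[1] + b[2]), mean(y[x == 1]))   # fitted and sample mean, rainy days
exp(b[2])                              # rate ratio, rainy vs sunny
```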

    Poisson model log-likelihood function

    This part is just for your information and is optional.

    If we have more than one predictor, then for the ith observation we have:

    \[\mu_i = E(y_i | x_{i1} , x_{i2}, \dots, x_{ip};\beta) = \exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots+ \beta_p x_{ip})\]

    For simplicity we write \(\boldsymbol{x'\beta}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots+ \beta_p x_{ip}\), and therefore we have

    \[P(Y = y_i|\boldsymbol{x}_i, \boldsymbol{\beta}) = \dfrac{e^{-\exp(\boldsymbol{x'\beta}_i)}(\exp(\boldsymbol{x'\beta}_i))^{y_i}}{y_i!}\] and the log-likelihood function will be:

    \[\cal L(\boldsymbol{\beta}) = \sum_{i=1}^n [y_i \boldsymbol{x'\beta}_i - \exp(\boldsymbol{x'\beta}_i) - \ln y_i!]\]
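    The formula can be checked numerically: at the fitted coefficients, a hand-coded version of this log-likelihood matches logLik() from glm() (simulated data, chosen for illustration):

```r
set.seed(7)  # simulated data for illustration
x <- runif(100)
y <- rpois(100, lambda = exp(0.5 + 1.2 * x))
m <- glm(y ~ x, family = poisson(link = "log"))
b <- coef(m)
xb <- b[1] + b[2] * x                        # linear predictor x'beta
ll <- sum(y * xb - exp(xb) - lfactorial(y))  # the log-likelihood above
c(ll, as.numeric(logLik(m)))                 # the two values agree
```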

    Part 2: Analysis of count data models

  • For this part we will use the German national health registry data set as an example of Poisson regression. The data set is also available in the R package COUNT. We are going to use only year 1984.
  • Source: German Health Reform Registry, years pre-reform 1984-1988, from Hilbe and Greene (2008).
  • The response variable is the number of visits to the doctor (docvis).
  • We are using outwork and age as predictors. outwork is a binary variable with 1 = not working and 0 = working. age is a continuous variable ranging from 25 to 64.
  • . use "C:\rwm1984.dta", clear

    Since ages range from 25 to 64 and age 0 is not a valid value, to interpret the intercept we center the age variable.

    . gen cage = age - r(mean)

    R code for loading the data and descriptive statistics

    # the response 
    rwm1984 <- read.csv("rwm1984.csv")
    #you can load the data from package COUNT
    library(COUNT)
    #rwm1984 <- data("rwm1984")
    head(rwm1984)
    ##   docvis hospvis edlevel age outwork female married kids hhninc educ self
    ## 1      1       0       3  54       0      0       1    0  3.050 15.0    0
    ## 2      0       0       1  44       1      1       1    0  3.050  9.0    0
    ## 3      0       0       1  58       1      1       0    0  1.434 11.0    0
    ## 4      7       2       1  64       0      0       0    0  1.500 10.5    0
    ## 5      6       0       3  30       1      0       0    0  2.400 13.0    0
    ## 6      9       0       3  26       1      0       0    0  1.050 13.0    0
    ##   edlevel1 edlevel2 edlevel3 edlevel4
    ## 1        0        0        1        0
    ## 2        1        0        0        0
    ## 3        1        0        0        0
    ## 4        1        0        0        0
    ## 5        0        0        1        0
    ## 6        0        0        1        0
    summary(rwm1984)
    ##      docvis           hospvis           edlevel          age    
    ##  Min.   :  0.000   Min.   : 0.0000   Min.   :1.00   Min.   :25  
    ##  1st Qu.:  0.000   1st Qu.: 0.0000   1st Qu.:1.00   1st Qu.:35  
    ##  Median :  1.000   Median : 0.0000   Median :1.00   Median :44  
    ##  Mean   :  3.163   Mean   : 0.1213   Mean   :1.38   Mean   :44  
    ##  3rd Qu.:  4.000   3rd Qu.: 0.0000   3rd Qu.:1.00   3rd Qu.:54  
    ##  Max.   :121.000   Max.   :17.0000   Max.   :4.00   Max.   :64  
    ##     outwork           female          married            kids       
    ##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
    ##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
    ##  Median :0.0000   Median :0.0000   Median :1.0000   Median :0.0000  
    ##  Mean   :0.3665   Mean   :0.4793   Mean   :0.7891   Mean   :0.4491  
    ##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
    ##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
    ##      hhninc            educ            self            edlevel1     
    ##  Min.   : 0.015   Min.   : 7.00   Min.   :0.00000   Min.   :0.0000  
    ##  1st Qu.: 2.000   1st Qu.:10.00   1st Qu.:0.00000   1st Qu.:1.0000  
    ##  Median : 2.800   Median :10.50   Median :0.00000   Median :1.0000  
    ##  Mean   : 2.969   Mean   :11.09   Mean   :0.06118   Mean   :0.8136  
    ##  3rd Qu.: 3.590   3rd Qu.:11.50   3rd Qu.:0.00000   3rd Qu.:1.0000  
    ##  Max.   :25.000   Max.   :18.00   Max.   :1.00000   Max.   :1.0000  
    ##     edlevel2         edlevel3         edlevel4      
    ##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
    ##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
    ##  Median :0.0000   Median :0.0000   Median :0.00000  
    ##  Mean   :0.0524   Mean   :0.0746   Mean   :0.05937  
    ##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
    ##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000
    #centering the age predictor
    rwm1984$cage <- rwm1984$age - mean(rwm1984$age)

    2.1 Running a count data model

  • In Stata we use the command glm to fit a Poisson regression:
  • The first part of the output shows the optimization process. The optimization method by default is ML.
  • In the middle of the output, the estimated coefficients appear with their standard errors and confidence intervals.
  • The R code for running a Poisson model is very similar to Stata:

    pois_m <- glm(docvis ~ outwork + age, family = poisson(link = "log"), data = rwm1984)
    summary(pois_m)
    ## 
    ## Call:
    ## glm(formula = docvis ~ outwork + age, family = poisson(link = "log"), 
    ##     data = rwm1984)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -3.4573  -2.2475  -1.2366   0.2863  25.1320  
    ## 
    ## Coefficients:
    ##               Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -0.0335170  0.0391815  -0.855    0.392    
    ## outwork      0.4079314  0.0188447  21.647   <2e-16 ***
    ## age          0.0220842  0.0008377  26.362   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for poisson family taken to be 1)
    ## 
    ##     Null deviance: 25791  on 3873  degrees of freedom
    ## Residual deviance: 24190  on 3871  degrees of freedom
    ## AIC: 31279
    ## 
    ## Number of Fisher Scoring iterations: 6

    And if we use centered age:

    pois_m1 <- glm(docvis ~ outwork + cage, family = poisson(link = "log"), data = rwm1984)
    summary(pois_m1)
    ## 
    ## Call:
    ## glm(formula = docvis ~ outwork + cage, family = poisson(link = "log"), 
    ##     data = rwm1984)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -3.4573  -2.2475  -1.2366   0.2863  25.1320  
    ## 
    ## Coefficients:
    ##              Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) 0.9380965  0.0127571   73.53   <2e-16 ***
    ## outwork     0.4079314  0.0188447   21.65   <2e-16 ***
    ## cage        0.0220842  0.0008377   26.36   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for poisson family taken to be 1)
    ## 
    ##     Null deviance: 25791  on 3873  degrees of freedom
    ## Residual deviance: 24190  on 3871  degrees of freedom
    ## AIC: 31279
    ## 
    ## Number of Fisher Scoring iterations: 6
    pr <- sum(residuals(pois_m1, type = "pearson") ^ 2)
    pr / pois_m1$df.residual
    ## [1] 11.34322
    pois_m1$aic / (pois_m1$df.null + 1)
    ## [1] 8.074027
    COUNT::modelfit(pois_m1)
    ## $AIC
    ## [1] 31278.78
    ## 
    ## $AICn
    ## [1] 8.074027
    ## 
    ## $BIC
    ## [1] 31297.57
    ## 
    ## $BICqh
    ## [1] 8.07418

    2.2 Interpreting Coefficients and rate ratio

  • Coefficients are on the log scale and we can interpret them directly in terms of the log response. In general we say: the log mean of the response variable increases by \(\beta\) for a one unit increase in the corresponding predictor variable, with the other predictors held constant.
  • In our results, the expected log number of visits for patients who are out of work is 0.4 more than the log number of visits for working patients of the same age.
  • For each one year increase in age, there is an increase in the expected log count of doctor visits of 0.022, holding outwork constant.
  • In order to have a change in predictor value reflect a change in actual visits to a physician, we must exponentiate the coefficients and their confidence intervals.

    The standard error can be approximately calculated by the delta method.

    Delta Method:

    \[se(\exp(\beta)) \approx \exp(\beta) \times se(\beta)\]

    . * calculating irr

    . di exp(_b[outwork])

    1.5037041

    . di exp(_b[outwork]) * _se[outwork]

    .02833683

    We can get the Rate Ratios directly from the output:

    exp(coef(pois_m1))
    ## (Intercept)     outwork        cage 
    ##    2.555113    1.503704    1.022330
    exp(coef(pois_m1)) * sqrt(diag(vcov(pois_m1))) #delta method
    ##  (Intercept)      outwork         cage 
    ## 0.0325958199 0.0283368154 0.0008564317
    exp(confint.default(pois_m1))
    ##                2.5 %   97.5 %
    ## (Intercept) 2.492018 2.619805
    ## outwork     1.449178 1.560282
    ## cage        1.020653 1.024010

    2.3 Marginal effects

  • Marginal effect at the mean: the number of visits made to a doctor increases by about 0.066 for each additional year of age, when the predictors are set at their means.
  • Average marginal effect: for each additional year of age, there are on average about 0.07 additional doctor visits.
  • Discrete change, or partial effect, is used to evaluate the change in the predicted response when a binary predictor changes value from 0 to 1.
    #marginal effect at the mean
    beta <- coef(pois_m)
    mout <- mean(rwm1984$outwork)
    mage <- mean(rwm1984$age)
    xb <- sum(beta * c(1,mout,mage))
    dfdxb <- exp(xb) * beta[3]
    dfdxb
    ##        age 
    ## 0.06552846
    #average marginal effect
    mean(rwm1984$docvis) * beta[3]
    ##        age 
    ## 0.06984968

    2.4 Exposure

    Sometimes the counts do not all occur over the same period of time, area, or volume. In this case we can use an offset in the model to adjust counts of events for differing time periods, areas, or volumes.

    \[ \exp(\boldsymbol{x'\beta}) = \mu / t\]

    \[ \mu = \exp(\boldsymbol{x'\beta} + \log(t))\] The term \(\log(t)\) is referred to as the offset.

  • For example, we use the data set from the Canadian National Cardiovascular Disease registry called FASTRAK.
  • In this data set:

    die: number of deaths

    cases: number of cases

    anterior: whether the patient had a previous anterior myocardial infarction

    hcabg: whether the patient had a history of CABG procedure

    killip: a summary indicator of the cardiovascular health of the patient

    The dispersion appears relatively low at 1.40, but notice that the total number of observations (cases) is 5388.

    R code for exposure

    data("fasttrakg")
    head(fasttrakg)
    ##   die cases anterior hcabg killip kk1 kk2 kk3 kk4
    ## 1   5    19        0     0      4   0   0   0   1
    ## 2  10    83        0     0      3   0   0   1   0
    ## 3  15   412        0     0      2   0   1   0   0
    ## 4  28  1864        0     0      1   1   0   0   0
    ## 5   1     1        0     1      4   0   0   0   1
    ## 6   0     3        0     1      3   0   0   1   0
    summary(fast <- glm(die ~ anterior + hcabg +factor(killip) , family = poisson, 
                        offset = log(cases), data = fasttrakg))
    ## 
    ## Call:
    ## glm(formula = die ~ anterior + hcabg + factor(killip), family = poisson, 
    ##     data = fasttrakg, offset = log(cases))
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -1.4791  -0.5271  -0.1766   0.4337   2.3317  
    ## 
    ## Coefficients:
    ##                 Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept)      -4.0698     0.1459 -27.893  < 2e-16 ***
    ## anterior          0.6749     0.1596   4.229 2.34e-05 ***
    ## hcabg             0.6614     0.3267   2.024   0.0429 *  
    ## factor(killip)2   0.9020     0.1724   5.234 1.66e-07 ***
    ## factor(killip)3   1.1133     0.2513   4.430 9.44e-06 ***
    ## factor(killip)4   2.5126     0.2743   9.160  < 2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for poisson family taken to be 1)
    ## 
    ##     Null deviance: 116.232  on 14  degrees of freedom
    ## Residual deviance:  10.932  on  9  degrees of freedom
    ## AIC: 73.992
    ## 
    ## Number of Fisher Scoring iterations: 5
    exp(coef(fast))
    ##     (Intercept)        anterior           hcabg factor(killip)2 factor(killip)3 
    ##      0.01708133      1.96376577      1.93746487      2.46463342      3.04434885 
    ## factor(killip)4 
    ##     12.33745552
    exp(coef(fast)) * sqrt(diag(vcov(fast))) #delta method
    ##     (Intercept)        anterior           hcabg factor(killip)2 factor(killip)3 
    ##     0.002492295     0.313359487     0.632970849     0.424784164     0.765119601 
    ## factor(killip)4 
    ##     3.384215172
    exp(confint.default(fast))
    ##                     2.5 %      97.5 %
    ## (Intercept)     0.0128329  0.02273622
    ## anterior        1.4363585  2.68482829
    ## hcabg           1.0212823  3.67554592
    ## factor(killip)2 1.7581105  3.45508317
    ## factor(killip)3 1.8602297  4.98221265
    ## factor(killip)4 7.2067174 21.12096274
    modelfit(fast)
    ## $AIC
    ## [1] 73.9917
    ## 
    ## $AICn
    ## [1] 4.93278
    ## 
    ## $BIC
    ## [1] 78.24
    ## 
    ## $BICqh
    ## [1] 5.566187

    2.5 Overdispersion

  • Overdispersion is the key problem to consider in count models.
  • As we noted for the Poisson random variable, overdispersion occurs when the variance is greater than the mean.
  • In a count model, overdispersion means the variance of the response conditional on the predictors is greater than the conditional mean of the response.
  • There are a variety of tests to determine whether the model fits the data.
  • The first and most natural test used in Poisson regression is called the deviance goodness-of-fit test.
  • The deviance is defined as twice the difference between the saturated model log-likelihood and the full model log-likelihood.
    dev <- deviance(pois_m)
    df <- df.residual(pois_m)
    p_value <- 1 - pchisq(dev, df)
    print(matrix(c("Deviance GOF ", " ", 
                   "D =", round(dev,4) , 
                   "df = ", df, 
                   "P-value =  ", p_value), byrow = T,ncol = 2))
    ##      [,1]            [,2]        
    ## [1,] "Deviance GOF " " "         
    ## [2,] "D ="           "24190.3581"
    ## [3,] "df = "         "3871"      
    ## [4,] "P-value =  "   "0"
  • With p < 0.05, the deviance GOF test indicates that we reject the hypothesis that the model fits the data well.
  • Another test statistic is the Pearson goodness-of-fit test, defined as the sum of squared Pearson residuals.
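    A sketch of the Pearson goodness-of-fit test on simulated (truly Poisson) data; the mechanics are the same as the deviance test, with the Pearson statistic in place of the deviance:

```r
set.seed(123)  # simulated data for illustration
x <- runif(200)
y <- rpois(200, lambda = exp(0.3 + 0.8 * x))
m <- glm(y ~ x, family = poisson(link = "log"))
pearson <- sum(residuals(m, type = "pearson")^2)  # Pearson chi-square statistic
df <- df.residual(m)
p_value <- 1 - pchisq(pearson, df)
c(pearson = pearson, df = df, p_value = p_value)
# a small p-value would indicate lack of fit, often due to overdispersion
```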
  • Overdispersion can be modeled in different ways:
  • Quasi-likelihood models
  • Sandwich or robust variance estimators
  • Bootstrap standard errors
  • Negative binomial 1 (NB1)
  • Negative binomial 2 (NB2)
  • In Poisson Mean = \(\mu\)
    Variance = \(\mu\)

    In Negative Binomial1 (NB1)

    Mean = \(\mu\)
    Variance = \(\mu + \alpha \mu\) or

    Variance = \(\mu (1 + \alpha) = \mu \omega\)

    In Negative Binomial2 (NB2)

    Mean = \(\mu\)
    Variance = \(\mu + \alpha \mu ^2\)
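    The NB2 variance formula can be checked by simulation. rnbinom() with the (mu, size) parameterization uses size = 1/\(\alpha\), so its variance is \(\mu + \alpha \mu^2\) (values below chosen for illustration):

```r
set.seed(2024)  # arbitrary seed for illustration
mu <- 2
alpha <- 2                          # NB2 variance: mu + alpha * mu^2 = 10
y <- rnbinom(1e5, mu = mu, size = 1 / alpha)
c(mean(y), var(y))                  # close to 2 and 10
mu + alpha * mu^2                   # theoretical NB2 variance
```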

  • Poisson regression ML variance estimators
  • Poisson regression ML with robust variance estimators
  • Bootstrap standard errors
  • Poisson regression ML, standard errors scaled by the square root of the Pearson X2 dispersion
  • Negbin2 MLE with default standard errors
  • Negbin2 MLE with robust standard errors
  • Negbin1 MLE with default standard errors
  • Negbin1 MLE with robust standard errors
    # Poisson default standard errors (variance equals mean) based on MLE
    poiss_mle <- glm(docvis ~ outwork + age, data = rwm1984, family=poisson()) 
    print(summary(poiss_mle),digits=3)
    ## 
    ## Call:
    ## glm(formula = docvis ~ outwork + age, family = poisson(), data = rwm1984)
    ## 
    ## Deviance Residuals: 
    ##    Min      1Q  Median      3Q     Max  
    ## -3.457  -2.248  -1.237   0.286  25.132  
    ## 
    ## Coefficients:
    ##              Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -0.033517   0.039181   -0.86     0.39    
    ## outwork      0.407931   0.018845   21.65   <2e-16 ***
    ## age          0.022084   0.000838   26.36   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for poisson family taken to be 1)
    ## 
    ##     Null deviance: 25791  on 3873  degrees of freedom
    ## Residual deviance: 24190  on 3871  degrees of freedom
    ## AIC: 31279
    ## 
    ## Number of Fisher Scoring iterations: 6
    # Robust sandwich based on poiss_mle
    library(sandwich)
    #this gives the robust covariance matrix of the estimated model
    cov.robust <- vcovHC(poiss_mle, type="HC0")
    #diagonal is the variances
    se.robust <- sqrt(diag(cov.robust))
    coeffs <- coef(poiss_mle)
    z_.025 <- qnorm(.975)
    t.robust <- coeffs / se.robust
    poiss_mle_robust <- cbind(coeffs, se.robust, t.robust, pvalue=round(2*(1-pnorm(abs(coeffs/se.robust))),5), 
                              lower=coeffs - z_.025 * se.robust, upper=coeffs+ z_.025 * se.robust)    
    print(poiss_mle_robust,digits=3)
    ##              coeffs se.robust t.robust pvalue   lower  upper
    ## (Intercept) -0.0335   0.12333   -0.272  0.786 -0.2752 0.2082
    ## outwork      0.4079   0.06907    5.906  0.000  0.2726 0.5433
    ## age          0.0221   0.00283    7.808  0.000  0.0165 0.0276
    # Poisson Bootstrap standard errors (variance equals mean)
    # install.packages("boot")
    library(boot)
    boot.poiss <- function(data, indices) {
      data <- data[indices, ] 
      model <- glm(docvis ~ outwork + age, family=poisson(), data=data)
      coefficients(model)
    }
    set.seed(10101)
    # To speed up we use only 400 bootstrap replications.
    summary.poissboot <- boot(rwm1984, boot.poiss, 400)
    print(summary.poissboot,digits=3)
    ## 
    ## ORDINARY NONPARAMETRIC BOOTSTRAP
    ## 
    ## 
    ## Call:
    ## boot(data = rwm1984, statistic = boot.poiss, R = 400)
    ## 
    ## 
    ## Bootstrap Statistics :
    ##     original    bias    std. error
    ## t1*  -0.0335  0.007078     0.12531
    ## t2*   0.4079  0.000811     0.07318
    ## t3*   0.0221 -0.000168     0.00294
    # Poisson with NB1 standard errors (assumes variance is a multiple of the mean) w = theta*mu or w = mu*(1 + alpha)
    #also called Poisson Quasi-MLE
    poiss_qmle <- glm(docvis ~ outwork + age, data = rwm1984, family=quasipoisson()) 
    print(summary(poiss_qmle),digits=4)
    ## 
    ## Call:
    ## glm(formula = docvis ~ outwork + age, family = quasipoisson(), 
    ##     data = rwm1984)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -3.4573  -2.2475  -1.2366   0.2863  25.1320  
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) -0.033517   0.131962  -0.254      0.8    
    ## outwork      0.407931   0.063468   6.427 1.46e-10 ***
    ## age          0.022084   0.002821   7.827 6.39e-15 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for quasipoisson family taken to be 11.34324)
    ## 
    ##     Null deviance: 25791  on 3873  degrees of freedom
    ## Residual deviance: 24190  on 3871  degrees of freedom
    ## AIC: NA
    ## 
    ## Number of Fisher Scoring iterations: 6
    # MASS does just NB2
    # install.packages("MASS")
    library(MASS)
    # Note: MASS reports theta = 1/alpha as the dispersion parameter, not alpha
    # In this example R reports theta = 0.4353, so alpha = 1/0.4353 = 2.297
    # NB2 with default standard errors 
    NB2.MASS <- glm.nb(docvis ~ outwork + age, data = rwm1984) 
    print(summary(NB2.MASS),digits=3)
    ## 
    ## Call:
    ## glm.nb(formula = docvis ~ outwork + age, data = rwm1984, init.theta = 0.4353451729, 
    ##     link = log)
    ## 
    ## Deviance Residuals: 
    ##    Min      1Q  Median      3Q     Max  
    ## -1.532  -1.282  -0.489   0.104   5.047  
    ## 
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept)  -0.0396     0.1067   -0.37     0.71    
    ## outwork       0.4147     0.0554    7.48  7.5e-14 ***
    ## age           0.0221     0.0024    9.24  < 2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for Negative Binomial(0.4353) family taken to be 1)
    ## 
    ##     Null deviance: 4100.8  on 3873  degrees of freedom
    ## Residual deviance: 3909.6  on 3871  degrees of freedom
    ## AIC: 16674
    ## 
    ## Number of Fisher Scoring iterations: 1
    ## 
    ## 
    ##               Theta:  0.4353 
    ##           Std. Err.:  0.0135 
    ## 
    ##  2 x log-likelihood:  -16665.5250