IDRE Statistical Consulting Group
Some types of count data:
Examples of count data:
For example, the number of times we flip a coin until we observe a head for the first time. This value lies in the set \(\{1, 2, 3, ...\}\).
The sum of the outcomes of rolling two dice is a random variable that takes values in the set \(\{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12\}\).
Probability Mass Function
For example, the PMF of a Bernoulli random variable with outcomes zero and one is:
\[f(x;p)=\begin{cases} 1 - p & {\text{if }}x=0,\\p & {\text{if }}x=1.\end{cases}\]
The PMF of Binomial(n, p) is: (optional) \[\displaystyle f(x; n,p)=\Pr(x;n,p)=\Pr(X=x)=\frac{n!}{x!(n-x)!} p^{x}(1-p)^{n-x}\] For \(x = 0, 1, 2, 3, \cdots, n\).
If we know the PMF of a random variable, we can calculate the probability of observing any given value by plugging that value into the PMF.
For example, for a Binomial random variable with \(p = 0.2\) and \(n = 10\), we can calculate the probability of each value of the random variable:
\(P(X = 0) = [10! / (0!\,(10 - 0)!)]\; 0.2^0 (1 - 0.2)^{10} = 0.8^{10} \approx 0.107\)
The graph below shows the plot of PMF of a Binomial random variable with \(p = 0.2\) and \(n = 10\).
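These probabilities (and the plot) can be reproduced in R with `dbinom`, the Binomial PMF in base R:

```r
# PMF of Binomial(n = 10, p = 0.2) evaluated at each possible value
x <- 0:10
px <- dbinom(x, size = 10, prob = 0.2)
round(px, 4)
# P(X = 0) equals 0.8^10
dbinom(0, size = 10, prob = 0.2)
# plot(x, px, type = "h") reproduces the graph of the PMF
```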
For example, the weight of newborn babies.
The graph below shows the plot of the PDF of a normal distribution with \(\mu = 2\) and \(\sigma = 1\).
The discrete random variable \(Y\), defined as the number of events recorded during a unit of time, has a Poisson distribution with rate parameter \(\mu > 0\) if \(Y\) has PMF
\[Pr(Y = y) = \dfrac{e^{-\mu}(\mu)^y}{y!}, \text{ for } y = 0, 1, 2, \dots .\]
\(\log_{10}(1000) = 3\)
\(10^3 = 1000\)
\(\log_{2}(64) = 6\)
\(2^6 = 64\)
Euler’s number or the number \(e\) is a mathematical constant that is the base of the natural logarithm.
\[\displaystyle e=\sum \limits _{n=0}^{\infty }{\frac {1}{n!}}={\frac {1}{1}}+{\frac {1}{1}}+{\frac {1}{1\cdot 2}}+{\frac {1}{1\cdot 2\cdot 3}}+\cdots \approx 2.718282.\]
If \(e^6 = a\), then \(\log_e(a) = ?\)
\(e^x\) can also be written as \(\exp(x)\), and \(\log_e(x)\) is also written as \(\ln(x)\).
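These identities are easy to check in R, where `log()` takes a `base` argument and `log2()` and `exp()` are base functions:

```r
log(1000, base = 10)   # 3, since 10^3 = 1000
log2(64)               # 6, since 2^6 = 64
exp(1)                 # Euler's number, approximately 2.718282
log(exp(6))            # 6: the natural log undoes exp(), answering the question above
```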
Example: number of accidents on a highway
Suppose the number of accidents on a sunny day on freeway 405 between Santa Monica Blvd. and Wilshire Blvd. is Poisson with a rate of 2 accidents per day. Then the probability of 0 accidents can be calculated using the formula above:
\(P(Y = 0| \mu = 2) = \dfrac{e^{-2}(2)^0}{0!} = 1/e^2 \approx 0.1353\)
\(P(Y = 1| \mu = 2) = \dfrac{e^{-2}(2)^1}{1!} = 2/e^2 \approx 0.2707\)
\(P(Y = 2| \mu = 2) = \dfrac{e^{-2}(2)^2}{2!} = 2/e^2 \approx 0.2707\)
\(P(Y = 3| \mu = 2) = \dfrac{e^{-2}(2)^3}{3!} = 8/(6e^2) \approx 0.1804\)
\(P(Y = 4| \mu = 2) = \dfrac{e^{-2}(2)^4}{4!} = 16/(24e^2) \approx 0.0902\)
\(P(Y = 5| \mu = 2) = \dfrac{e^{-2}(2)^5}{5!} = 32/(120e^2) \approx 0.0361\)
\(P(Y = 6| \mu = 2) \approx 0.0120\)
\(P(Y = 7| \mu = 2) \approx 0.0034\)
\(P(Y = 8| \mu = 2) \approx 0.0009\)
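The hand calculations above can be reproduced with `dpois()`, the Poisson PMF in base R:

```r
# P(Y = y | mu = 2) for y = 0, ..., 8
round(dpois(0:8, lambda = 2), 4)
# 0.1353 0.2707 0.2707 0.1804 0.0902 0.0361 0.0120 0.0034 0.0009
```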
The following graph shows the plot of the PMF of a Poisson random variable with \(\mu = 2\).
This looks similar to a Binomial distribution, but it is not exactly the same! It can be shown that a Binomial distribution converges to a Poisson distribution as \(n\) goes to infinity and \(p\) goes to zero such that \(np \rightarrow \mu\).
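We can illustrate this convergence numerically. The choice \(n = 1000\), \(p = 0.002\) is an illustration (any large \(n\) with \(np = 2\) behaves similarly):

```r
# Binomial(n, p) with np = 2 is already very close to Poisson(2)
n <- 1000; p <- 0.002
round(dbinom(0:5, size = n, prob = p), 4)
round(dpois(0:5, lambda = n * p), 4)
max(abs(dbinom(0:8, n, p) - dpois(0:8, n * p)))  # discrepancy is tiny
```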
Now assume the number of accidents on a rainy day on freeway 405 between Santa Monica Blvd. and Wilshire Blvd. is Poisson with a rate of 5.5 accidents per day. The probabilities of 0 to 10 accidents, computed in R, are:
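The output below comes from `dpois()`; presumably the call was:

```r
# probabilities of 0 to 10 accidents on a rainy day (rate 5.5)
dpois(0:10, lambda = 5.5)
```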
## [1] 0.004086771 0.022477243 0.061812418 0.113322766 0.155818804 0.171400684
## [7] 0.157117294 0.123449302 0.084871395 0.051865853 0.028526219
We can also say the mean number of accidents per day on a rainy day is \(5.5 / 2 = 2.75\) times the mean on a sunny day.
A sample of 20 numbers from this distribution will look like this:
prob <- c(0.1353, 0.2707, 0.2707, 0.1804, 0.0902, 0.0361, 0.0120, 0.0034, 0.0009)
domain <- seq(0,8,1)
set.seed(1001)
y1 <- sample(x = domain, size = 20, replace = TRUE, prob = prob)
y1
## [1] 6 2 2 2 2 4 1 1 2 0 2 1 4 2 1 4 1 2 1 0
mean(y1)
## [1] 2
var(y1)
## [1] 2.210526
Now let's add some zeros and some larger values to our sample:
y2 <- c(y1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 5, 6, 8, 5, 6, 7, 4, 3, 6, 10, 15, 20)
mean(y2)
## [1] 2.213115
var(y2)
## [1] 14.07049
We can generate a random sample from a Poisson distribution using the rpois() function in R. The mean and variance of a simulated Poisson(2) sample are both close to 2:
## [1] 1.958
## [1] 2.038274
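A sketch of generating such a sample (the seed used for the output above is not shown, so these numbers will differ slightly):

```r
set.seed(1234)              # hypothetical seed; the original is not shown
ysim <- rpois(1000, lambda = 2)
mean(ysim)                  # close to 2
var(ysim)                   # also close to 2: mean = variance for a Poisson
```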
Exposure
The basic Poisson distribution assumes that each count occurs over one unit of time. However, we often count events over longer periods of time or over varying areas. For example, if accidents occur on average 2 times per day, then over the course of a month (30 days) the expected number of accidents is \(2 \times 30 = 60\). The rate is still 2 accidents per day. The rate parameterization of the Poisson distribution is:
\[Pr(Y = y) = \dfrac{e^{-t\mu}(t\mu)^y}{y!}, \text{ for } y = 0, 1, 2, \dots .\] Where \(t\) represents the length of time, or exposure.
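For example, with a rate of 2 accidents per day and an exposure of \(t = 30\) days, the monthly count is Poisson with mean 60; a quick check with base R's `dpois` and `ppois`:

```r
mu <- 2; t <- 30
dpois(0, lambda = t * mu)    # P(no accidents in a month) = exp(-60), essentially 0
ppois(60, lambda = t * mu)   # P(at most 60 accidents in a month)
```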
Linear regression model
The traditional linear regression model assumes this relationship is linear and that the randomness of the relationship comes from a Gaussian (normal) probability distribution. The simplest case can be formalized as follows: \[ Y = \beta_0 + \beta_1 X + \epsilon\]\[ \text{mean}(Y) = E(Y|X) = \hat\beta_0 + \hat\beta_1 X\]
Regression model for count data
There are two problems with this approach: the predicted mean \(E(Y|X)\) can be negative, which is impossible for a count, and the assumption of normally distributed errors with constant variance does not hold for count data.
Formalization of count model
A count model with one predictor can be formulated as:
\[\text{log of mean response} = \ln(E(Y|X)) = \ln(\mu) = \beta_0 + \beta_1 X\]
From this model the expected count or mean response will be:
\[\text{mean response} = \mu =E(Y|X) = \exp( \beta_0 + \beta_1 X)\] This ensures \(\mu\) will be positive for any value of \(\beta_0\), \(\beta_1\), or \(X\).
The Poisson model
The Poisson model assumes that the count \(Y|x\) has a Poisson distribution with the above mean. For an observation \(y_i\) and a single predictor \(x_i\) we have:
\[ f(y_i | x_i; \beta_0, \beta_1) = \dfrac{e^{-\exp( \beta_0 + \beta_1 x_i)}(\exp( \beta_0 + \beta_1 x_i))^{y_i}}{y_i!}\]
The plot below shows the distribution of the response for a given value of the predictor in a Poisson regression model.
Poisson Regression Model (continued)
Back to our toy example of the number of accidents on the 405. Let \(X\) be a binary variable: 0 for a sunny day and 1 for a rainy day.
We have a sample of 40 days, 20 sunny and 20 rainy, drawn from Poisson distributions with rates 2 and 5 respectively.
In fact, we can artificially generate this sample using R:
y <- c(rpois(20, lambda = 2), rpois(20, lambda = 5))
x <- rep(c(0,1), each = 20)
print(data.frame(y = y,x = x))
## y x
## 1 1 0
## 2 3 0
## 3 3 0
## 4 2 0
## 5 5 0
## 6 0 0
## 7 5 0
## 8 4 0
## 9 4 0
## 10 4 0
## 11 1 0
## 12 3 0
## 13 0 0
## 14 0 0
## 15 1 0
## 16 1 0
## 17 0 0
## 18 1 0
## 19 1 0
## 20 4 0
## 21 7 1
## 22 3 1
## 23 4 1
## 24 4 1
## 25 4 1
## 26 7 1
## 27 5 1
## 28 10 1
## 29 6 1
## 30 9 1
## 31 8 1
## 32 2 1
## 33 5 1
## 34 0 1
## 35 5 1
## 36 3 1
## 37 3 1
## 38 5 1
## 39 6 1
## 40 4 1
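The summary output below was produced by a call like the following (a sketch: the seed used to generate y and x above is not shown, so refitting on freshly simulated data gives slightly different estimates):

```r
set.seed(1001)   # hypothetical seed; the original is not shown
y <- c(rpois(20, lambda = 2), rpois(20, lambda = 5))
x <- rep(c(0, 1), each = 20)
m1 <- glm(y ~ x, family = poisson(link = "log"))
summary(m1)
```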
##
## Call:
## glm(formula = y ~ x, family = poisson(link = "log"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.16228 -0.87696 -0.05176 0.84298 1.96544
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.7655 0.1525 5.020 5.18e-07 ***
## x 0.8440 0.1824 4.628 3.69e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 82.391 on 39 degrees of freedom
## Residual deviance: 59.027 on 38 degrees of freedom
## AIC: 172.29
##
## Number of Fisher Scoring iterations: 5
What is this?
Let’s exponentiate the coefficients:
## (Intercept) x
## 2.150000 2.325581
Can we find the rate of accidents for each group (sunny or rainy)? How?
## [1] 2.150000 5.000000 2.325581
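The three numbers above are, respectively, the sunny-day rate \(\exp(\hat\beta_0)\), the rainy-day rate \(\exp(\hat\beta_0 + \hat\beta_1)\), and the rate ratio \(\exp(\hat\beta_1)\). A sketch using the estimates copied from the summary output:

```r
b0 <- 0.7655; b1 <- 0.8440   # estimates taken from the summary above
exp(b0)                      # rate for sunny days (x = 0): about 2.15
exp(b0 + b1)                 # rate for rainy days (x = 1): about 5.00
exp(b1)                      # rate ratio, rainy vs sunny: about 2.33
```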
Poisson model log-likelihood function
This part is just for your information and is optional.
If we have more than one predictor, then for the \(i\)th observation we have:
\[\mu_i = E(y_i | x_{i1} , x_{i2}, \dots, x_{ip};\beta) = \exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots+ \beta_p x_{ip})\]
For simplicity we use \(\boldsymbol{x'\beta}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots+ \beta_p x_{ip}\) and therefore we have
\[P(Y = y_i|\boldsymbol{x}_i, \boldsymbol{\beta}) = \dfrac{e^{-\exp(\boldsymbol{x'\beta}_i)}(\exp(\boldsymbol{x'\beta}_i))^{y_i}}{y_i!}\] and the log-likelihood function will be:
\[\cal L(\boldsymbol{\beta}) = \sum_{i=1}^n [y_i \boldsymbol{x'\beta}_i - \exp(\boldsymbol{x'\beta}_i) - \ln y_i!]\]
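A direct R sketch of this log-likelihood (the function name and the tiny data set here are illustrative, not from the seminar):

```r
# log-likelihood of a Poisson regression at parameter vector beta
pois_loglik <- function(beta, y, X) {
  eta <- X %*% beta                        # linear predictor x'beta for each i
  sum(y * eta - exp(eta) - lgamma(y + 1))  # lgamma(y + 1) = log(y!)
}

# check against R's built-in Poisson density on made-up data
y <- c(1, 0, 3); X <- cbind(1, c(0, 1, 2)); beta <- c(0.1, 0.2)
all.equal(pois_loglik(beta, y, X),
          sum(dpois(y, lambda = exp(X %*% beta), log = TRUE)))
```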
Part 2: Analysis of count data models
. use "C:\rwm1984.dta", clear
Since ages range from 25 to 64 and age 0 is not a valid value for age, we center the age variable so that the intercept is interpretable.
. gen cage = age - r(mean)
R code for loading the data and descriptive statistics
# the response variable is docvis, the number of doctor visits
rwm1984 <- read.csv("rwm1984.csv")
# alternatively, the data are available in the COUNT package:
library(COUNT)
# data(rwm1984)
head(rwm1984)
## docvis hospvis edlevel age outwork female married kids hhninc educ self
## 1 1 0 3 54 0 0 1 0 3.050 15.0 0
## 2 0 0 1 44 1 1 1 0 3.050 9.0 0
## 3 0 0 1 58 1 1 0 0 1.434 11.0 0
## 4 7 2 1 64 0 0 0 0 1.500 10.5 0
## 5 6 0 3 30 1 0 0 0 2.400 13.0 0
## 6 9 0 3 26 1 0 0 0 1.050 13.0 0
## edlevel1 edlevel2 edlevel3 edlevel4
## 1 0 0 1 0
## 2 1 0 0 0
## 3 1 0 0 0
## 4 1 0 0 0
## 5 0 0 1 0
## 6 0 0 1 0
## docvis hospvis edlevel age
## Min. : 0.000 Min. : 0.0000 Min. :1.00 Min. :25
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.:1.00 1st Qu.:35
## Median : 1.000 Median : 0.0000 Median :1.00 Median :44
## Mean : 3.163 Mean : 0.1213 Mean :1.38 Mean :44
## 3rd Qu.: 4.000 3rd Qu.: 0.0000 3rd Qu.:1.00 3rd Qu.:54
## Max. :121.000 Max. :17.0000 Max. :4.00 Max. :64
## outwork female married kids
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :1.0000 Median :0.0000
## Mean :0.3665 Mean :0.4793 Mean :0.7891 Mean :0.4491
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## hhninc educ self edlevel1
## Min. : 0.015 Min. : 7.00 Min. :0.00000 Min. :0.0000
## 1st Qu.: 2.000 1st Qu.:10.00 1st Qu.:0.00000 1st Qu.:1.0000
## Median : 2.800 Median :10.50 Median :0.00000 Median :1.0000
## Mean : 2.969 Mean :11.09 Mean :0.06118 Mean :0.8136
## 3rd Qu.: 3.590 3rd Qu.:11.50 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :25.000 Max. :18.00 Max. :1.00000 Max. :1.0000
## edlevel2 edlevel3 edlevel4
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.0524 Mean :0.0746 Mean :0.05937
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000
2.1 Running a count data model
The R code for running a Poisson model is very similar to Stata:
pois_m <- glm(docvis ~ outwork + age, family = poisson(link = "log"), data = rwm1984)
summary(pois_m)
##
## Call:
## glm(formula = docvis ~ outwork + age, family = poisson(link = "log"),
## data = rwm1984)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4573 -2.2475 -1.2366 0.2863 25.1320
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.0335170 0.0391815 -0.855 0.392
## outwork 0.4079314 0.0188447 21.647 <2e-16 ***
## age 0.0220842 0.0008377 26.362 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 25791 on 3873 degrees of freedom
## Residual deviance: 24190 on 3871 degrees of freedom
## AIC: 31279
##
## Number of Fisher Scoring iterations: 6
And if we use centered age (first creating cage in R):
rwm1984$cage <- rwm1984$age - mean(rwm1984$age)
pois_m1 <- glm(docvis ~ outwork + cage, family = poisson(link = "log"), data = rwm1984)
summary(pois_m1)
##
## Call:
## glm(formula = docvis ~ outwork + cage, family = poisson(link = "log"),
## data = rwm1984)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4573 -2.2475 -1.2366 0.2863 25.1320
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.9380965 0.0127571 73.53 <2e-16 ***
## outwork 0.4079314 0.0188447 21.65 <2e-16 ***
## cage 0.0220842 0.0008377 26.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 25791 on 3873 degrees of freedom
## Residual deviance: 24190 on 3871 degrees of freedom
## AIC: 31279
##
## Number of Fisher Scoring iterations: 6
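The statistics below (the Pearson dispersion, the AIC per observation, and the AIC/BIC listing) can be obtained as follows. This is a sketch: `modelfit()` comes from the COUNT package, and the model is refit here so the block is self-contained:

```r
library(COUNT)                  # provides the rwm1984 data and modelfit()
data(rwm1984)
rwm1984$cage <- rwm1984$age - mean(rwm1984$age)
pois_m1 <- glm(docvis ~ outwork + cage, family = poisson(link = "log"),
               data = rwm1984)
# Pearson dispersion statistic: Pearson chi-square / residual df
sum(residuals(pois_m1, type = "pearson")^2) / df.residual(pois_m1)  # about 11.34
# AIC per observation
AIC(pois_m1) / nobs(pois_m1)                                        # about 8.07
# AIC/BIC listing from the COUNT package
modelfit(pois_m1)
```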
## [1] 11.34322
## [1] 8.074027
## $AIC
## [1] 31278.78
##
## $AICn
## [1] 8.074027
##
## $BIC
## [1] 31297.57
##
## $BICqh
## [1] 8.07418
2.2 Interpreting coefficients and rate ratios
In order to interpret a change in a predictor in terms of actual visits to a physician, we must exponentiate the coefficients and their confidence intervals.
The standard error of the exponentiated coefficient can be approximated by the delta method.
Delta Method:
\[se(\exp(\hat\beta)) \approx \exp(\hat\beta) \times se(\hat\beta)\]
. * calculating irr
. di exp(_b[outwork])
1.5037041
. di exp(_b[outwork]) * _se[outwork]
.02833683
We can get the Rate Ratios directly from the output:
## (Intercept) outwork cage
## 2.555113 1.503704 1.022330
## (Intercept) outwork cage
## 0.0325958199 0.0283368154 0.0008564317
## 2.5 % 97.5 %
## (Intercept) 2.492018 2.619805
## outwork 1.449178 1.560282
## cage 1.020653 1.024010
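The three blocks of output above can be produced as follows (a sketch; `confint.default()` gives Wald intervals on the coefficient scale, which are then exponentiated; the model is refit so the block is self-contained):

```r
library(COUNT)
data(rwm1984)
rwm1984$cage <- rwm1984$age - mean(rwm1984$age)
pois_m1 <- glm(docvis ~ outwork + cage, family = poisson(link = "log"),
               data = rwm1984)
exp(coef(pois_m1))                               # rate ratios
exp(coef(pois_m1)) * sqrt(diag(vcov(pois_m1)))   # delta-method standard errors
exp(confint.default(pois_m1))                    # Wald 95% CIs, exponentiated
```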
2.3 Marginal effects
#marginal effect at mean
beta <- coef(pois_m)
mout <- mean(rwm1984$outwork)
mage <- mean(rwm1984$age)
xb <- sum(beta * c(1,mout,mage))
dfdxb <- exp(xb) * beta[3]
dfdxb
## age
## 0.06552846
The average marginal effect of age is:
## age
## 0.06984968
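The first value (0.0655) is the marginal effect of age at the means, computed above; the second (0.0698) is the average marginal effect, \(\hat\beta_{age}\) times the mean of the fitted means. A sketch, refitting pois_m from section 2.1 so the block is self-contained:

```r
library(COUNT)
data(rwm1984)
pois_m <- glm(docvis ~ outwork + age, family = poisson(link = "log"),
              data = rwm1984)
beta <- coef(pois_m)
# marginal effect of age at the mean of the predictors
exp(sum(beta * c(1, mean(rwm1984$outwork), mean(rwm1984$age)))) * beta[["age"]]
# average marginal effect of age: beta_age times the mean of the fitted means
mean(fitted(pois_m)) * beta[["age"]]
```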
2.4 Exposure
Sometimes data points do not all occur over the same period of time, area, or volume. In this case we can use an offset in the model to adjust counts of events for differing time periods, areas, or volumes.
\[ \exp(\boldsymbol{x'\beta}) = \mu / t\]
\[ \mu = \exp(\boldsymbol{x'\beta} + \log(t))\] The term \(\log(t)\) is referred to as the offset.
In this data set:
die: number of deaths
cases: number of cases
anterior: whether the patient had a previous anterior myocardial infarction
hcabg: whether the patient had a history of a CABG procedure
killip: a summary indicator of the cardiovascular health of the patient
The dispersion appears relatively low at 1.40, but notice that the total number of observations is 5,388.
R code for exposure
## die cases anterior hcabg killip kk1 kk2 kk3 kk4
## 1 5 19 0 0 4 0 0 0 1
## 2 10 83 0 0 3 0 0 1 0
## 3 15 412 0 0 2 0 1 0 0
## 4 28 1864 0 0 1 1 0 0 0
## 5 1 1 0 1 4 0 0 0 1
## 6 0 3 0 1 3 0 0 1 0
summary(fast <- glm(die ~ anterior + hcabg + factor(killip), family = poisson,
    offset = log(cases), data = fasttrakg))
##
## Call:
## glm(formula = die ~ anterior + hcabg + factor(killip), family = poisson,
## data = fasttrakg, offset = log(cases))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4791 -0.5271 -0.1766 0.4337 2.3317
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.0698 0.1459 -27.893 < 2e-16 ***
## anterior 0.6749 0.1596 4.229 2.34e-05 ***
## hcabg 0.6614 0.3267 2.024 0.0429 *
## factor(killip)2 0.9020 0.1724 5.234 1.66e-07 ***
## factor(killip)3 1.1133 0.2513 4.430 9.44e-06 ***
## factor(killip)4 2.5126 0.2743 9.160 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 116.232 on 14 degrees of freedom
## Residual deviance: 10.932 on 9 degrees of freedom
## AIC: 73.992
##
## Number of Fisher Scoring iterations: 5
## (Intercept) anterior hcabg factor(killip)2 factor(killip)3
## 0.01708133 1.96376577 1.93746487 2.46463342 3.04434885
## factor(killip)4
## 12.33745552
## (Intercept) anterior hcabg factor(killip)2 factor(killip)3
## 0.002492295 0.313359487 0.632970849 0.424784164 0.765119601
## factor(killip)4
## 3.384215172
## 2.5 % 97.5 %
## (Intercept) 0.0128329 0.02273622
## anterior 1.4363585 2.68482829
## hcabg 1.0212823 3.67554592
## factor(killip)2 1.7581105 3.45508317
## factor(killip)3 1.8602297 4.98221265
## factor(killip)4 7.2067174 21.12096274
## $AIC
## [1] 73.9917
##
## $AICn
## [1] 4.93278
##
## $BIC
## [1] 78.24
##
## $BICqh
## [1] 5.566187
2.5 Overdispersion
dev <- deviance(pois_m)
df <- df.residual(pois_m)
p_value <- 1 - pchisq(dev, df)
print(matrix(c("Deviance GOF ", " ",
"D =", round(dev,4) ,
"df = ", df,
"P-value = ", p_value), byrow = T,ncol = 2))
## [,1] [,2]
## [1,] "Deviance GOF " " "
## [2,] "D =" "24190.3581"
## [3,] "df = " "3871"
## [4,] "P-value = " "0"
In Poisson Mean = \(\mu\)
Variance = \(\mu\)
In Negative Binomial1 (NB1)
Mean = \(\mu\)
Variance = \(\mu + \alpha \mu\) or
Variance = \(\mu (1 + \alpha) = \mu \omega\)
In Negative Binomial2 (NB2)
Mean = \(\mu\)
Variance = \(\mu + \alpha \mu ^2\)
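We can verify the NB2 mean-variance relationship by simulation: `rnbinom()` parameterized with `mu` and `size = 1/alpha` draws from NB2 (the values mu = 3 and alpha = 0.5 are arbitrary illustrations):

```r
set.seed(123)
mu <- 3; alpha <- 0.5
ynb <- rnbinom(1e5, mu = mu, size = 1 / alpha)   # NB2: Var = mu + alpha * mu^2
c(mean = mean(ynb), variance = var(ynb), theory = mu + alpha * mu^2)
# the sample variance is far above the Poisson variance, which would equal mu
```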
# Poisson Default standard errors (variance equals mean) based on MLE
poiss_mle <- glm(docvis ~ outwork + age, data = rwm1984, family=poisson())
print(summary(poiss_mle),digits=3)
##
## Call:
## glm(formula = docvis ~ outwork + age, family = poisson(), data = rwm1984)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.457 -2.248 -1.237 0.286 25.132
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.033517 0.039181 -0.86 0.39
## outwork 0.407931 0.018845 21.65 <2e-16 ***
## age 0.022084 0.000838 26.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 25791 on 3873 degrees of freedom
## Residual deviance: 24190 on 3871 degrees of freedom
## AIC: 31279
##
## Number of Fisher Scoring iterations: 6
# Robust sandwich based on poiss_mle
library(sandwich)
#robust covariance matrix of the estimated model
cov.robust <- vcovHC(poiss_mle, type="HC0")
#diagonal is the variances
se.robust <- sqrt(diag(cov.robust))
coeffs <- coef(poiss_mle)
z_.025 <- qnorm(.975)
t.robust <- coeffs / se.robust
poiss_mle_robust <- cbind(coeffs, se.robust, t.robust, pvalue=round(2*(1-pnorm(abs(coeffs/se.robust))),5),
lower=coeffs - z_.025 * se.robust, upper=coeffs+ z_.025 * se.robust)
print(poiss_mle_robust,digits=3)
## coeffs se.robust t.robust pvalue lower upper
## (Intercept) -0.0335 0.12333 -0.272 0.786 -0.2752 0.2082
## outwork 0.4079 0.06907 5.906 0.000 0.2726 0.5433
## age 0.0221 0.00283 7.808 0.000 0.0165 0.0276
# Poisson Bootstrap standard errors (variance equals mean)
# install.packages("boot")
library(boot)
boot.poiss <- function(data, indices) {
data <- data[indices, ]
model <- glm(docvis ~ outwork + age, family=poisson(), data=data)
coefficients(model)
}
set.seed(10101)
# To speed up we use only 400 bootstrap replications.
summary.poissboot <- boot(rwm1984, boot.poiss, 400)
print(summary.poissboot,digits=3)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = rwm1984, statistic = boot.poiss, R = 400)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -0.0335 0.007078 0.12531
## t2* 0.4079 0.000811 0.07318
## t3* 0.0221 -0.000168 0.00294
# Poisson with NB1 standard errors (assumes variance is a multiple of the mean) w = theta*mu or w = mu*(1 + alpha)
#also called Poisson Quasi-MLE
poiss_qmle <- glm(docvis ~ outwork + age, data = rwm1984, family=quasipoisson())
print(summary(poiss_qmle),digits=4)
##
## Call:
## glm(formula = docvis ~ outwork + age, family = quasipoisson(),
## data = rwm1984)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4573 -2.2475 -1.2366 0.2863 25.1320
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.033517 0.131962 -0.254 0.8
## outwork 0.407931 0.063468 6.427 1.46e-10 ***
## age 0.022084 0.002821 7.827 6.39e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 11.34324)
##
## Null deviance: 25791 on 3873 degrees of freedom
## Residual deviance: 24190 on 3871 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
# MASS does just NB2
# install.packages("MASS")
library(MASS)
# Note: MASS reports theta = 1/alpha as the dispersion parameter, not alpha
# In this example R reports theta = 0.4353, so alpha = 1/0.4353 = 2.297
# NB2 with default standard errors
NB2.MASS <- glm.nb(docvis ~ outwork + age, data = rwm1984)
print(summary(NB2.MASS),digits=3)
##
## Call:
## glm.nb(formula = docvis ~ outwork + age, data = rwm1984, init.theta = 0.4353451729,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.532 -1.282 -0.489 0.104 5.047
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.0396 0.1067 -0.37 0.71
## outwork 0.4147 0.0554 7.48 7.5e-14 ***
## age 0.0221 0.0024 9.24 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.4353) family taken to be 1)
##
## Null deviance: 4100.8 on 3873 degrees of freedom
## Residual deviance: 3909.6 on 3871 degrees of freedom
## AIC: 16674
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.4353
## Std. Err.: 0.0135
##
## 2 x log-likelihood: -16665.5250