IDRE Statistical Consulting Group
Regression Analysis is a statistical modeling tool that is used to explain a response (criterion or dependent) variable as a function of one or more predictor (independent) variables.
\(X\): The predictor. \(x_1, x_2, \dots, x_n\) are \(n\) observations from \(X\).
\(Y\): The response. \(y_1, y_2, \dots, y_n\) are \(n\) observations from \(Y\).
In a Linear Regression Model we assume this relationship is a linear function.
We can represent a simple linear model (a linear model with only one predictor) as:
\[y = \beta_0 + \beta_1 x + \epsilon\]
Where \(\beta_0\) is the intercept of the line, \(\beta_1\) is the slope of the line, and \(\epsilon\) is the error term that represents unexplained uncertainty.
index | Performance | enroll |
---|---|---|
1 | 693 | 247 |
2 | 570 | 463 |
3 | 395 | 546 |
⋮ | ⋮ | ⋮ |
n | 478 | 520 |
To fit the best linear function we can use a method called Ordinary Least Squares (OLS).
The regression model has two parts: a linear function and an error term.
\[y = \overbrace{\beta_0 + \beta_1 x }^\text{linear function}+ \overbrace{\epsilon}^\text{error}\]
The linear function predicts the mean value of the outcome for a given value of the predictor. The mean value in mathematical statistics is denoted \(\text{E}(Y | X = x)\) and reads as the expected value of \(Y\) given \(X = x\).
In general, when we have multiple predictors \(X_1, X_2, X_3, \cdots, X_p\),
the predicted mean is:
\[\text{E}(Y | x_1, x_2 , \cdots, x_p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ... + \beta_p x_p\]
The error part represents the uncertainty in the observation and is a random variable with mean equal to zero and unknown variance \(\sigma ^2\).
The error term is usually, but not necessarily, assumed to have a normal distribution.
The goal is to estimate the parameters \(\beta_0\) and \(\beta_1\), the intercept and the slope of the line, so that the line best fits the data.
The estimated model parameters are denoted \(\hat{\beta}_0\) and \(\hat{\beta}_1\).
The estimated model will be
\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\]
\(\hat{y}\) represents the predicted mean of the outcome for a given value of \(x\). We also call it the fitted value.
The predicted value for each observation can be calculated from the fitted line as: \(\hat{y_i} = \hat{\beta}_0 + \hat{\beta}_1 x_i\)
The difference between the observed value \(y_i\) and the predicted value \(\hat{y_i}\) is called the residual:
\[ e_i = y_i - \hat{y_i}\]
*note: The predictor variable may be continuous, meaning that it may assume all values within a range, for example, age or height. Or it may be dichotomous, meaning that the variable may assume only one of two values, for example, 0 or 1, or it may be a categorical variable with more than two levels. There is only one response or dependent variable, and it is continuous.
OLS estimates the parameters by minimizing the sum of squared residuals:
\[ \min \sum_{i = 1}^{n} (y_i - \hat{y_i}) ^ 2\]
1 - Estimate the model parameters (i.e. the intercept and slope of the fitted line, and a measure of uncertainty).
2 - Make inferences about the model parameters.
1 - Errors have mean equal to zero.
2 - Errors are independent.
3 - Errors have equal or constant variance (Homoscedasticity or Homogeneity of variance).
4 - Errors follow a normal distribution.
5 - The relationship between the dependent variable and the independent variables is linear.
Briefly, the assumption of OLS is that the error terms in the model are independent and identically distributed random variables from a normal distribution with mean equal to zero, for each randomly sampled outcome \(y\) at any given value of \(x\).
Open R or R Studio and look at the help for data set attitude.
This data set is from the book Regression Analysis by Example by Chatterjee, S. and Price, B. (1977). The data are aggregated from the questionnaires of approximately 35 employees in each of 30 (randomly selected) departments.
#help("attitude")
#copy and paste example and run it!
data(attitude)
#summary statistics of variables.
summary(attitude)
## rating complaints privileges learning raises
## Min. :40.00 Min. :37.0 Min. :30.00 Min. :34.00 Min. :43.00
## 1st Qu.:58.75 1st Qu.:58.5 1st Qu.:45.00 1st Qu.:47.00 1st Qu.:58.25
## Median :65.50 Median :65.0 Median :51.50 Median :56.50 Median :63.50
## Mean :64.63 Mean :66.6 Mean :53.13 Mean :56.37 Mean :64.63
## 3rd Qu.:71.75 3rd Qu.:77.0 3rd Qu.:62.50 3rd Qu.:66.75 3rd Qu.:71.00
## Max. :85.00 Max. :90.0 Max. :83.00 Max. :75.00 Max. :88.00
## critical advance
## Min. :49.00 Min. :25.00
## 1st Qu.:69.25 1st Qu.:35.00
## Median :77.50 Median :41.00
## Mean :74.77 Mean :42.93
## 3rd Qu.:80.00 3rd Qu.:47.75
## Max. :92.00 Max. :72.00
We use graphics to investigate the pairwise relationships between all the variables.
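For example, the pairwise scatter plots can be produced with the pairs() function:
pairs(attitude)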
For the linear regression model with rating (overall rating) as the outcome and complaints (handling of employee complaints) as the predictor, we use the function lm().
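The model is fit and summarized as follows (the object name m1 is arbitrary):
m1 <- lm(rating ~ complaints, data = attitude)
summary(m1)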
##
## Call:
## lm(formula = rating ~ complaints, data = attitude)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.8799 -5.9905 0.1783 6.2978 9.6294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.37632 6.61999 2.172 0.0385 *
## complaints 0.75461 0.09753 7.737 1.99e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.993 on 28 degrees of freedom
## Multiple R-squared: 0.6813, Adjusted R-squared: 0.6699
## F-statistic: 59.86 on 1 and 28 DF, p-value: 1.988e-08
This plot shows no strong evidence against the assumptions that the errors have zero mean and constant variance.
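A residuals-versus-fitted plot, which can be used to check these assumptions, can be produced with, for example:
plot(m1, which = 1)  # residuals vs fitted values for the model stored in m1 above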
For example, the number of times we flip a coin until we observe a head for the first time. This value could be in the set \(\{1, 2, 3, ...\}\).
The sum of the outcomes of rolling two dice is a random variable that takes values in the set \(\{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12\}\).
Probability Mass Function
For example, the PMF of the random variable for the outcome of rolling a die, with outcomes \(\{1, 2, 3, 4, 5, 6\}\), is:
\(p(X = x) = 1/6\)
for \(x \in \{1, 2, 3, 4, 5, 6\}\).
A Bernoulli random variable is a random variable that can take only two possible outcomes: 1 with probability \(p\) and 0 with probability \(q = 1 - p\).
The PMF of Bernoulli with zero and one outcomes is:
\[f(x;p)=\begin{cases} 1 - p & {\text{if }}x=0,\\p & {\text{if }}x=1.\end{cases}\]
The mean of a Bernoulli random variable is:
\(\mu_X = E(X) = 1 \times p + 0 \times (1 - p) = p\)
and the variance is
\(\sigma^2_X = var(X) = E(X- \mu)^2 = (1 - p) ^ 2 \times p + (0 - p) ^ 2 \times (1 - p) \\ = p(1 - p) ( 1 - p + p) = p(1 - p)\)
The PMF of Binomial(n, p) is (optional): \[\displaystyle f(x; n,p)=\Pr(x;n,p)=\Pr(X=x)=\frac{n!}{x!(n-x)!} p^{x}(1-p)^{n-x}\] For \(x = 0, 1, 2, 3, \cdots, n\).
For example, with \(n = 10\) and \(p = 0.2\), the probability of zero successes is:
\(P(X = 0) = [10! / 0!(10 - 0)!] \, 0.2 ^ 0 (1 - 0.2) ^ {10} = 0.8 ^ {10} \approx 0.107\)
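We can verify this in R with dbinom():
dbinom(0, size = 10, prob = 0.2)  # approximately 0.107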
\(\log_{10}(1000) = 3\)
We say the logarithm base 10 of 1000 is 3, since
\(10^3 = 1000\)
\(\log_{2}(64) = 6\)
\(2^6 = 64\)
Euler’s number or the number \(e\) is a mathematical constant that is the base of the natural logarithm.
\[\displaystyle e=\sum \limits _{n=0}^{\infty }{\frac {1}{n!}}={\frac {1}{1}}+{\frac {1}{1}}+{\frac {1}{1\times 2}}+{\frac {1}{1\times 2\times 3}}+\cdots \approx 2.718282.\]
If \(e^6 = a\) Then \(\log_e(a) = ?\)
The answer is 6.
\(e^x\) can also be noted as \(\exp(x)\). \(\log_e(x)\) is also shown as \(\ln(x)\).
Since \(\exp(x + y) = \exp(x)\exp(y)\)
\(\log(xy) = \log(x) + \log(y)\) and,
\(\log(x/y) = \log(x) - \log(y)\)
\[\ln(x) = \log_{e}x= {\dfrac {\log_{10}x}{\log_{10}e}}\]
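A few quick checks of these facts in R:
log10(1000)   # 3
log2(64)      # 6
exp(1)        # Euler's number, about 2.718282
log(exp(6))   # the natural log inverts exp, giving 6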
A Poisson random variable \(Y\) with mean \(\mu\) has PMF:
\[Pr(Y = y) = \dfrac{e^{-\mu}(\mu)^y}{y!}, \text{ for } y = 0, 1, 2, \dots .\]
The mean of Poisson distribution is \(\mu\) and the variance of it is also \(\mu\).
It can be shown that a Binomial distribution will converge to a Poisson distribution as \(n\) goes to infinity and \(p\) goes to zero such that \(np \rightarrow \mu\).
Suppose the number of accidents on a sunny day on freeway 405 between Santa Monica and Wilshire Blvd is Poisson with a rate of 2 accidents per day. Then the probabilities of 0, 1, 2, …, 8 accidents can be calculated using the PMF of the Poisson distribution:
\(P(Y = 0| \mu = 2) = \dfrac{e^{-2}(2)^0}{0!} = 1/e^2 \approx 0.1353\)
\(P(Y = 1| \mu = 2) = \dfrac{e^{-2}(2)^1}{1!} = 2/e^2 \approx 0.2707\)
\(P(Y = 2| \mu = 2) = \dfrac{e^{-2}(2)^2}{2!} = 2/e^2 \approx 0.2707\)
\(P(Y = 3| \mu = 2) = \dfrac{e^{-2}(2)^3}{3!} = 8/(6e^2) \approx 0.1804\)
\(P(Y = 4| \mu = 2) = \dfrac{e^{-2}(2)^4}{4!} = 16/(24e^2) \approx 0.0902\)
\(P(Y = 5| \mu = 2) = \dfrac{e^{-2}(2)^5}{5!} = 32/(120e^2) \approx 0.0361\)
\(P(Y = 6| \mu = 2) \approx 0.0120\)
\(P(Y = 7| \mu = 2) \approx 0.0034\)
\(P(Y = 8| \mu = 2) \approx 0.0009\)
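These probabilities can also be computed in R with dpois():
dpois(0:8, lambda = 2)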
The following graph shows the plot of the PMF of a Poisson random variable with \(\mu = 2\).
This looks like a binomial distribution, but it is not exactly the same!
Now assume the number of accidents on a rainy day on freeway 405 between Santa Monica and Wilshire Blvd is Poisson with a rate of 5.5 accidents per day. Then the probability of each of \(x = 0, 1, 2, \cdots, 8\) accidents, computed in R, is:
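For example, using dpois():
dpois(0:8, lambda = 5.5)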
## [1] 0.004086771 0.022477243 0.061812418 0.113322766 0.155818804 0.171400684
## [7] 0.157117294 0.123449302 0.084871395
We can also say that the mean number of accidents per day is \(5.5 / 2 = 2.75\) times larger on a rainy day compared to a sunny day.
A sample of 20 numbers from the rate-2 distribution will look like this:
prob <- c(0.1353, 0.2707, 0.2707, 0.1804, 0.0902, 0.0361, 0.0120, 0.0034, 0.0009)
domain <- seq(0,8,1)
set.seed(1001)
y1 <- sample(x = domain, size = 20, replace = TRUE, prob = prob)
y1
## [1] 6 2 2 2 2 4 1 1 2 0 2 1 4 2 1 4 1 2 1 0
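The sample mean and sample variance of y1 are:
mean(y1)
var(y1)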
## [1] 2
## [1] 2.210526
Now let's add some zeros and some larger values to our sample:
y2 <- c(y1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 5, 6, 8, 5, 6, 7, 4, 3, 6, 10, 15, 20)
mean(y2)
## [1] 2.213115
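The sample variance is now much larger than the mean:
var(y2)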
## [1] 14.07049
We can also generate random samples from a Poisson distribution using the rpois() function in R.
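For example (the sample size here is an assumption, so the exact numbers below will differ slightly):
y3 <- rpois(1000, lambda = 2)  # 1000 draws from a Poisson with mean 2; size chosen arbitrarily
mean(y3)
var(y3)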
## [1] 1.958
## [1] 2.038274
Poisson distribution for unit time and time t
The basic Poisson distribution assumes that each count in the distribution occurs over a unit of time. However, we often count events over a period of time or over various areas. For example, if accidents occur on average 2 times per day, over the course of a month (30 days) the average number of accidents will be \(2 \times 30 = 60\). The rate of accidents is still 2 accidents per day. The rate parameterization of the Poisson distribution is:
\[Pr(Y = y) = \dfrac{e^{-t\mu}(t\mu)^y}{y!}, \text{ for } y = 0, 1, 2, \dots .\] Where \(t\) represents the length of time. We will talk about this when we discuss exposure in Poisson regression.
For example, the weight of newborn babies.
The graph below shows the plot of the PDF of a normal distribution with \(\mu = 2\) and \(\sigma = 1\).
In a regression model, the linear prediction is a linear combination of the predictors.
Recall the linear regression model, where the outcome is continuous and the error has a normal distribution; the linear prediction was:
\[ \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + ... + \beta_p x_p\] The linear prediction can take values from \(-\infty\) to \(+\infty\).
But if the outcome comes from a count distribution such as the Poisson, which has a positive mean, or from a Bernoulli distribution, where the mean is between 0 and 1, the linear prediction might not be plausible.
In generalized linear regression we use a link function to transform the linear prediction so that we get a plausible expected outcome.
The table below shows the link functions for the distributions and regression models used in this seminar.
Distribution | Regression model | Link function |
---|---|---|
Gaussian (normal) | Linear regression | Identity |
Bernoulli / Binomial | Logistic regression | Logit: \(\log(p / (1 - p))\) |
Poisson | Poisson regression | Log: \(\log(\mu)\) |
We start with an example:
A researcher is interested in how variables such as GRE (Graduate Record Exam scores), GPA (grade point average) and the prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don’t admit, is a binary variable.
This data set has a binary response (outcome, dependent) variable called admit.
There are three predictor variables: gre, gpa and rank.
We will treat the variables gre and gpa as continuous.
The variable rank takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest.
We can get basic descriptives for the entire data set by using summary. To get the standard deviations, we use sapply to apply the sd function to each variable in the data set.
#reading the data from idre stat website:
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
#summary statistics of variables
summary(mydata)
## admit gre gpa rank
## Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 1st Qu.:2.000
## Median :0.0000 Median :580.0 Median :3.395 Median :2.000
## Mean :0.3175 Mean :587.7 Mean :3.390 Mean :2.485
## 3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670 3rd Qu.:3.000
## Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
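The standard deviations of each variable:
sapply(mydata, sd)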
## admit gre gpa rank
## 0.4660867 115.5165364 0.3805668 0.9444602
This can also be seen as a regression model with intercept only.
The variable admit is binary: it has the value 1 if a person was admitted and 0 if not. For the variable admit, the mean is the same as the proportion of 1s, i.e. the estimated probability of getting admitted to a graduate school.
\[ \hat{p} = \dfrac{X}{n} = \dfrac{\sum_{i=1}^nx_i}{n} \]
Where \(X\) is total number of people admitted and \(n\) is total number of students and \(x_i\) is value of admit for the \(i\)-th student.
We can calculate the standard deviation directly from the mean. Note that this variable is binary; assuming each binary outcome is the result of a Bernoulli trial, \(\hat\mu_{admit} = \hat p = \sum_{i=1}^{n} x_i /n\).
Since in Bernoulli distribution the variance is \(var(x) = p(1-p)\) we can use this formula to calculate the variance.
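In R:
mean(mydata$admit)                               # estimated probability p-hat
mean(mydata$admit) * (1 - mean(mydata$admit))    # p-hat * (1 - p-hat)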
## [1] 0.3175
## [1] 0.2166937
#but this is the population variance and needs to be adjusted to get the sample variance
(mean(mydata$admit) *(1-mean(mydata$admit))) * 400 / (400 - 1)
## [1] 0.2172368
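This matches the sample variance computed directly:
var(mydata$admit)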
## [1] 0.2172368
Why do we talk about this?
Remember that in OLS, where the error (and therefore the outcome) is assumed to be normally distributed, the variance is assumed to be constant; here, however, the variance depends on the predicted mean value.
Of course, this is not an issue in the intercept-only model, because the expected mean is constant!
Now let's run a regression model of admit predicted by gre, treating admit as a continuous variable.
We are trying to estimate the mean of admit for a given value of gre. Since admit is 0 or 1, this can be seen as the probability of admission.
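The linear model below treats admit as continuous (the object name lm_admit is arbitrary):
lm_admit <- lm(admit ~ gre, data = mydata)
summary(lm_admit)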
##
## Call:
## lm(formula = admit ~ gre, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4755 -0.3415 -0.2522 0.5989 0.8966
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.1198407 0.1190510 -1.007 0.314722
## gre 0.0007442 0.0001988 3.744 0.000208 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4587 on 398 degrees of freedom
## Multiple R-squared: 0.03402, Adjusted R-squared: 0.03159
## F-statistic: 14.02 on 1 and 398 DF, p-value: 0.0002081
As we can see, the variance of the predicted outcome is systematically related to the predicted outcome and thus is not constant.
Another issue with this model is that:
In a linear model we allow the predicted outcome to be any number between negative infinity and positive infinity.
But a probability cannot be a negative number or a number greater than one. To overcome this issue, and also the violation of the constant-variance assumption of OLS, we need to use a different model.
We also need to use a transformation function that relates the outcome to a value between 0 and 1.
In the previous example we estimated that the probability of getting admitted to a graduate school, regardless of gpa or gre, is on average 0.3175.
This is an estimate of the unconditional population mean (probability or proportion) or, in regression terms, it is the estimate of the intercept for an intercept-only model.
Suppose the actual true population proportion is \(p = 0.3\). Then we define the odds of getting into graduate school to be:
\[ \tt{odds(admit)} = \dfrac{p}{1 - p} = \dfrac{0.3}{1 - 0.3} = 0.4285714 \]
\[ \tt{odds(not \ admit)} = \dfrac{1 - p}{p} = \dfrac{0.7}{0.3} = 2.333333 \]
As we can see, the odds is a positive value between 0 and infinity. The inverse transformation, \(p = \text{odds} / (1 + \text{odds})\), maps the odds back to a number between 0 and 1.
Question: CNN analysts predict that the Los Angeles Lakers have odds of 9 to 1 to go to the playoffs in the next NBA season. What is the probability that the Los Angeles Lakers do not make the playoffs in the next NBA season?
\(odds = \dfrac{p}{1 - p}\)
Let's calculate the probability that the Lakers go to the playoffs. From the odds formula:
\(p = odds \times (1 - p)\), so \(p + p \times odds = odds\) and
\(p = \dfrac{odds} {1 + odds} = 9 / (9+1) = 0.9\)
Then probability of not going to play off will be 0.1.
Another useful function is log. If we have a value between zero and infinity then the log of that value is between negative and positive infinity.
The inverse log function is an exponential function. We usually use natural log and exponential function.
In summary, we can use log(odds) transformation to transform a value between zero and one to a value between negative infinity to positive infinity. This function, log(odds) is called logit function.
This suggests that the logit transformation can be modeled as a linear function of the predictors.
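That is, the logistic regression model is:
\[\text{logit}(p) = \log\left(\dfrac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p\]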
We use the function glm to run a generalized linear model. The distribution of the outcome is defined by the argument family, with an option specifying the link function.
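For the admissions data, the model below regresses admit on gre (the object name logit_m is arbitrary):
logit_m <- glm(admit ~ gre, family = binomial(link = "logit"), data = mydata)
summary(logit_m)
##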
## Call:
## glm(formula = admit ~ gre, family = binomial(link = "logit"),
## data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.1623 -0.9052 -0.7547 1.3486 1.9879
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.901344 0.606038 -4.787 1.69e-06 ***
## gre 0.003582 0.000986 3.633 0.00028 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 486.06 on 398 degrees of freedom
## AIC: 490.06
##
## Number of Fisher Scoring iterations: 4
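The coefficient table can be extracted with, for example (assuming the fitted model was stored as logit_m):
coef(summary(logit_m))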
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.901344270 0.6060376103 -4.787400 1.689561e-06
## gre 0.003582212 0.0009860051 3.633056 2.800845e-04
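Profile-likelihood confidence intervals for the coefficients can be obtained with confint():
confint(logit_m)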
## 2.5 % 97.5 %
## (Intercept) -4.119988259 -1.739756286
## gre 0.001679963 0.005552748
Some types of count data:
Examples of count data:
Linear regression model
The traditional linear regression model assumes this relationship is linear and that the randomness of the relationship comes from a Gaussian (normal) probability distribution. The simplest case can be formulated as follows: \[ Y = \beta_0 + \beta_1 X + \epsilon\]\[ \text{mean}(Y) = E(Y|X) = \hat\beta_0 + \hat\beta_1 X\] Regression model for count data
There are two problems with this approach: the predicted mean can be negative, while a count outcome must be non-negative, and the variance of a count outcome depends on its mean, so the constant-variance assumption of OLS does not hold.
Formalization of count model
A count model with one predictor can be formulated as:
\[\text{log of mean response} = \ln(E(Y|X)) = \ln(\mu) = \beta_0 + \beta_1 X\]
From this model the expected count or mean response will be:
\[\text{mean response} = \mu =E(Y|X) = \exp( \beta_0 + \beta_1 X)\] This ensures \(\mu\) will be positive for any value of \(\beta_0\), \(\beta_1\), or \(X\).
The Poisson model
The Poisson model assumes that the count \(Y|x\) has a Poisson distribution with the above mean. For an observation \(y_i\) and a single predictor \(x_i\) we have:
\[ f(y_i | x_i; \beta_0, \beta_1) = \dfrac{e^{-\exp( \beta_0 + \beta_1 x_i)}(\exp( \beta_0 + \beta_1 x_i))^{y_i}}{y_i!}\]
The plot below shows the distribution of the response for the Poisson regression model at given values of the predictor.
Back to our toy example of the number of accidents on the 405. Let \(X\) be a binary variable equal to 0 for a sunny day and 1 for a rainy day.
We have a sample of 40 days, with 20 sunny and 20 rainy days, drawn from Poisson distributions with rates 2 and 5 respectively.
In fact we can artificially generate this sample using R:
y <- c(rpois(20, lambda = 2), rpois(20, lambda = 5))  # 20 sunny days (rate 2), then 20 rainy days (rate 5)
x <- rep(c(0, 1), each = 20)                           # 0 = sunny, 1 = rainy
print(data.frame(y = y, x = x))
## y x
## 1 1 0
## 2 3 0
## 3 3 0
## 4 2 0
## 5 5 0
## 6 0 0
## 7 5 0
## 8 4 0
## 9 4 0
## 10 4 0
## 11 1 0
## 12 3 0
## 13 0 0
## 14 0 0
## 15 1 0
## 16 1 0
## 17 0 0
## 18 1 0
## 19 1 0
## 20 4 0
## 21 7 1
## 22 3 1
## 23 4 1
## 24 4 1
## 25 4 1
## 26 7 1
## 27 5 1
## 28 10 1
## 29 6 1
## 30 9 1
## 31 8 1
## 32 2 1
## 33 5 1
## 34 0 1
## 35 5 1
## 36 3 1
## 37 3 1
## 38 5 1
## 39 6 1
## 40 4 1
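We now fit the Poisson regression of y on x (the object name pois_toy is arbitrary):
pois_toy <- glm(y ~ x, family = poisson(link = "log"))
summary(pois_toy)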
##
## Call:
## glm(formula = y ~ x, family = poisson(link = "log"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.16228 -0.87696 -0.05176 0.84298 1.96544
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.7655 0.1525 5.020 5.18e-07 ***
## x 0.8440 0.1824 4.628 3.69e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 82.391 on 39 degrees of freedom
## Residual deviance: 59.027 on 38 degrees of freedom
## AIC: 172.29
##
## Number of Fisher Scoring iterations: 5
What is this?
Let’s exponentiate the coefficients:
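Using the model object from the sketch above:
exp(coef(pois_toy))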
## (Intercept) x
## 2.150000 2.325581
Can we find the rate of accidents for each group (sunny or rainy)? How?
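One way, using the coefficients from the fitted model above:
b <- coef(pois_toy)
unname(c(exp(b[1]), exp(b[1] + b[2]), exp(b[2])))  # rate for sunny days, rate for rainy days, rate ratio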
## [1] 2.150000 5.000000 2.325581
This part is just for your information and is optional.
If we have more than one predictor, then for the \(i\)th observation we will have:
\[\mu_i = E(y_i | x_{i1} , x_{i2}, \dots, x_{ip};\beta) = \exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots+ \beta_p x_{ip})\]
For simplicity we use \(\boldsymbol{x'\beta}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots+ \beta_p x_{ip}\) and therefore we have
\[P(Y = y_i|\boldsymbol{x}_i, \boldsymbol{\beta}) = \dfrac{e^{-\exp(\boldsymbol{x'\beta}_i)}(\exp(\boldsymbol{x'\beta}_i))^y}{y!}\] and log-likelihood function will be:
\[\cal L(\boldsymbol{\beta}) = \sum_{i=1}^n [y_i \boldsymbol{x'\beta}_i - \exp(\boldsymbol{x'\beta}_i) - \ln y_i!]\]
Since ages range from 25 to 64 and age 0 is not a valid value for age, to interpret the intercept we center the variable age.
. gen cage = age - r(mean)
R code for loading the data and descriptive statistics
# read the data; the response is docvis (number of doctor visits)
rwm1984 <- read.csv("rwm1984.csv")
# alternatively, the data set is available in the package COUNT
library(COUNT)
# data("rwm1984")
head(rwm1984)
## docvis hospvis edlevel age outwork female married kids hhninc educ self
## 1 1 0 3 54 0 0 1 0 3.050 15.0 0
## 2 0 0 1 44 1 1 1 0 3.050 9.0 0
## 3 0 0 1 58 1 1 0 0 1.434 11.0 0
## 4 7 2 1 64 0 0 0 0 1.500 10.5 0
## 5 6 0 3 30 1 0 0 0 2.400 13.0 0
## 6 9 0 3 26 1 0 0 0 1.050 13.0 0
## edlevel1 edlevel2 edlevel3 edlevel4
## 1 0 0 1 0
## 2 1 0 0 0
## 3 1 0 0 0
## 4 1 0 0 0
## 5 0 0 1 0
## 6 0 0 1 0
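Summary statistics of the variables:
summary(rwm1984)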
## docvis hospvis edlevel age
## Min. : 0.000 Min. : 0.0000 Min. :1.00 Min. :25
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.:1.00 1st Qu.:35
## Median : 1.000 Median : 0.0000 Median :1.00 Median :44
## Mean : 3.163 Mean : 0.1213 Mean :1.38 Mean :44
## 3rd Qu.: 4.000 3rd Qu.: 0.0000 3rd Qu.:1.00 3rd Qu.:54
## Max. :121.000 Max. :17.0000 Max. :4.00 Max. :64
## outwork female married kids
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :1.0000 Median :0.0000
## Mean :0.3665 Mean :0.4793 Mean :0.7891 Mean :0.4491
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## hhninc educ self edlevel1
## Min. : 0.015 Min. : 7.00 Min. :0.00000 Min. :0.0000
## 1st Qu.: 2.000 1st Qu.:10.00 1st Qu.:0.00000 1st Qu.:1.0000
## Median : 2.800 Median :10.50 Median :0.00000 Median :1.0000
## Mean : 2.969 Mean :11.09 Mean :0.06118 Mean :0.8136
## 3rd Qu.: 3.590 3rd Qu.:11.50 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :25.000 Max. :18.00 Max. :1.00000 Max. :1.0000
## edlevel2 edlevel3 edlevel4
## Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.0000 Median :0.00000
## Mean :0.0524 Mean :0.0746 Mean :0.05937
## 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0.00000
## Max. :1.0000 Max. :1.0000 Max. :1.00000
Run a Poisson model with docvis as the dependent variable and outwork and age as predictors:
pois_m <- glm(docvis ~ outwork + age, family = poisson(link = "log"), data = rwm1984)
summary(pois_m)
##
## Call:
## glm(formula = docvis ~ outwork + age, family = poisson(link = "log"),
## data = rwm1984)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4573 -2.2475 -1.2366 0.2863 25.1320
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.0335170 0.0391815 -0.855 0.392
## outwork 0.4079314 0.0188447 21.647 <2e-16 ***
## age 0.0220842 0.0008377 26.362 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 25791 on 3873 degrees of freedom
## Residual deviance: 24190 on 3871 degrees of freedom
## AIC: 31279
##
## Number of Fisher Scoring iterations: 6
Use the centered age and run the model again:
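In R, the centered age variable (cage) used below can be created as:
rwm1984$cage <- rwm1984$age - mean(rwm1984$age)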
pois_m1 <- glm(docvis ~ outwork + cage, family = poisson(link = "log"), data = rwm1984)
summary(pois_m1)
##
## Call:
## glm(formula = docvis ~ outwork + cage, family = poisson(link = "log"),
## data = rwm1984)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4573 -2.2475 -1.2366 0.2863 25.1320
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.9380965 0.0127571 73.53 <2e-16 ***
## outwork 0.4079314 0.0188447 21.65 <2e-16 ***
## cage 0.0220842 0.0008377 26.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 25791 on 3873 degrees of freedom
## Residual deviance: 24190 on 3871 degrees of freedom
## AIC: 31279
##
## Number of Fisher Scoring iterations: 6
## [1] 11.34322
## [1] 8.074027
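The fit statistics below can be obtained, for example, with the modelfit() function from the COUNT package (an assumption about how they were produced):
modelfit(pois_m1)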
## $AIC
## [1] 31278.78
##
## $AICn
## [1] 8.074027
##
## $BIC
## [1] 31297.57
##
## $BICqh
## [1] 8.07418
Interpreting Coefficients and rate ratio
In order to have a change in a predictor value reflect a change in actual visits to a physician, we must exponentiate the coefficients and the confidence intervals of the coefficients.
The standard error can be approximately calculated by the delta method.
Delta Method:
\[se(\exp(\beta)) = \exp(\beta) \times se(\beta)\]
In Stata:
. * calculating irr
. di exp(_b[outwork])
1.5037041
. di exp(_b[outwork]) * _se[outwork]
.02833683
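The same quantities in R, using the model pois_m1 fitted above and the delta method for the standard errors:
exp(coef(pois_m1))                                                    # incidence rate ratios
exp(coef(pois_m1)) * summary(pois_m1)$coefficients[, "Std. Error"]    # delta-method standard errors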
## (Intercept) outwork cage
## 2.555113 1.503704 1.022330
## (Intercept) outwork cage
## 0.0325958199 0.0283368154 0.0008564317
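Exponentiated confidence intervals for the rate ratios can be obtained, for example, with:
exp(confint(pois_m1))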
## 2.5 % 97.5 %
## (Intercept) 2.492018 2.619805
## outwork 1.449178 1.560282
## cage 1.020653 1.024010