Generalized Linear Regression Models

IDRE Statistical Consulting Group

Table of contents

Part 1 Introduction

1.1 Review of linear regression model

Regression Analysis is a statistical modeling tool that is used to explain a response (criterion or dependent) variable as a function of one or more predictor (independent) variables.

\(X\): The predictor. \(x_1, x_2, \dots, x_n\) are \(n\) observations from \(X\).

\(Y\): The response. \(y_1, y_2, \dots, y_n\) are \(n\) observations from \(Y\).

In a Linear Regression Model we assume this relationship is a linear function.

We can represent a simple linear model (a linear model with only one predictor) as:

\[y = \beta_0 + \beta_1 x + \epsilon\]

Where \(\beta_0\) is the intercept of the line, \(\beta_1\) is the slope of the line, and \(\epsilon\) is the error term that represents unexplained uncertainty.

Scatter plot

index  Performance  enroll
1      693          247
2      570          463
3      395          546
...    ...          ...
n      478          520
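The sketch below simulates stand-in values for the performance and enroll columns (made-up data for illustration, not the original data set) and draws the scatter plot in R:

set.seed(123)
# hypothetical enrollments and performance scores, for illustration only
enroll <- round(runif(30, min = 200, max = 600))
performance <- round(800 - 0.5 * enroll + rnorm(30, sd = 50))
plot(performance ~ enroll, main = "Scatter plot of performance vs. enroll")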

Ordinary Least Squares

\[y = \overbrace{\beta_0 + \beta_1 x }^\text{linear function}+ \overbrace{\epsilon}^\text{error}\]

The predicted mean is:

\[\text{E}(Y | x_1, x_2, \cdots, x_p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_p x_p\]

Ordinary Least Squares (cont.)

\[\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\]

\[ e_i = y_i - \hat{y_i}\]

*note: The predictor variable may be continuous, meaning that it may assume all values within a range, for example, age or height. Or it may be dichotomous, meaning that the variable may assume only one of two values, for example, 0 or 1, or it may be a categorical variable with more than two levels. There is only one response or dependent variable, and it is continuous.

Ordinary Least Squares in a graph

\[ \min \sum_{i = 1}^{n} (y_i - \hat{y_i}) ^ 2\]
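As an aside, the minimizing values have a familiar closed form. Below is a minimal sketch with simulated data (our own example, not from the seminar) showing that the closed-form estimates agree with lm():

# OLS slope and intercept from the closed-form solution
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)   # simulated data with beta0 = 2, beta1 = 3
b1 <- cov(x, y) / var(x)      # slope
b0 <- mean(y) - b1 * mean(x)  # intercept
c(b0, b1)
coef(lm(y ~ x))               # lm() returns the same estimates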

Assumptions in OLS

In regression analysis we want to:

1 - Estimate the model parameters (i.e., the intercept and slope of the fitted line, and a measure of uncertainty).

2 - Make inference about the model parameters.

To do this validly, OLS relies on the following assumptions:

1 - Errors have mean equal to zero.

2 - Errors are independent.

3 - Errors have equal or constant variance (Homoscedasticity or Homogeneity of variance).

4 - Errors follow a normal distribution.

5 - The relationship between the dependent variable and the independent variables is linear.

Briefly, the OLS assumptions are that the error terms in the model are independent, identically distributed random variables from a normal distribution with mean equal to zero, for each randomly sampled outcome \(y\) at any given value of \(x\).

R example for linear regression

Open R or RStudio and look at the help for the data set attitude.

These data are from the book Regression Analysis by Example by Chatterjee, S. and Price, B. (1977). The data are aggregated from the questionnaires of approximately 35 employees in each of 30 (randomly selected) departments.

#help("attitude")
#copy and paste example and run it!
data(attitude)
#summary statistics of variables.
summary(attitude)
##      rating        complaints     privileges       learning         raises     
##  Min.   :40.00   Min.   :37.0   Min.   :30.00   Min.   :34.00   Min.   :43.00  
##  1st Qu.:58.75   1st Qu.:58.5   1st Qu.:45.00   1st Qu.:47.00   1st Qu.:58.25  
##  Median :65.50   Median :65.0   Median :51.50   Median :56.50   Median :63.50  
##  Mean   :64.63   Mean   :66.6   Mean   :53.13   Mean   :56.37   Mean   :64.63  
##  3rd Qu.:71.75   3rd Qu.:77.0   3rd Qu.:62.50   3rd Qu.:66.75   3rd Qu.:71.00  
##  Max.   :85.00   Max.   :90.0   Max.   :83.00   Max.   :75.00   Max.   :88.00  
##     critical        advance     
##  Min.   :49.00   Min.   :25.00  
##  1st Qu.:69.25   1st Qu.:35.00  
##  Median :77.50   Median :41.00  
##  Mean   :74.77   Mean   :42.93  
##  3rd Qu.:80.00   3rd Qu.:47.75  
##  Max.   :92.00   Max.   :72.00

We use a graph to investigate the pairwise relationships between all variables.

pairs(attitude, main = "attitude data")

For the linear regression model with rating (overall rating) as the outcome and complaints (handling of employee complaints) as the predictor, we use the function lm().

fm2 <- lm(rating ~ complaints, data = attitude)
summary(fm2)
## 
## Call:
## lm(formula = rating ~ complaints, data = attitude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.8799  -5.9905   0.1783   6.2978   9.6294 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.37632    6.61999   2.172   0.0385 *  
## complaints   0.75461    0.09753   7.737 1.99e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.993 on 28 degrees of freedom
## Multiple R-squared:  0.6813, Adjusted R-squared:  0.6699 
## F-statistic: 59.86 on 1 and 28 DF,  p-value: 1.988e-08
  • The slope of the estimated line is 0.75461 and the intercept is 14.37632.
  • To interpret the slope we can say: for a one-unit increase in the score for handling of employee complaints, on average we expect the overall rating to increase by 0.75461.
  • The intercept is the expected overall rating when the score for handling of employee complaints is zero.
  • Linear regression model diagnostics for the model of rating vs. complaints:

    opar <- par(mfrow = c(2, 2), 
                mar = c(4.1, 4.1, 2.1, 1.1))
    plot(fm2)

    par(opar)
  • The plot in the upper left corner shows the residuals vs. fitted values.
  • This plot shows no strong evidence that the assumptions of zero mean and constant variance of the errors are violated.

  • If the errors have mean zero, the estimated parameters are unbiased, and as we increase the sample size the estimates get closer to the true parameter values (i.e., the \(\beta\)s).
  • The constant variance assumption is needed to ensure that OLS achieves the minimal variability among the estimated parameters.
  • The normal Q-Q plot shows whether the errors are distributed according to a normal distribution, using the estimated errors (residuals).
  • The normality assumption is needed for statistical tests of the parameters, especially for small samples. For large sample sizes, this assumption is not essential.
  • What should we do if we do not have a continuous outcome and the errors do not have equal variance? What if the predicted value from the linear model is not plausible because of the range of the outcome?
1.2 Discrete Random Variable and Probability Mass Function

  • A discrete random variable is a random variable that can only take values from a countable set.
  • For example, the number of times we flip a coin until we observe a head for the first time. This value is in the set \(\{1, 2, 3, \dots\}\).

    The sum of the outcomes of rolling two dice is a random variable that takes values in the set \(\{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12\}\).

  • A Bernoulli random variable, taking values 0 and 1, is a discrete random variable.
Probability Mass Function

  • A probability mass function (PMF) is a function that gives the probability of a discrete random variable at each and every value of that random variable.
  • For example, the PMF of the outcome of rolling a die, with possible outcomes \(\{1, 2, 3, 4, 5, 6\}\), is:

    \(P(X = x) = 1/6\)

    for \(x \in \{1, 2, 3, 4, 5, 6\}\).

    Bernoulli distribution

    \[f(x;p)=\begin{cases} 1 - p & {\text{if }}x=0,\\p & {\text{if }}x=1.\end{cases}\]

    The mean of a Bernoulli random variable is:

    \(\mu_X = E(X) = 1 \times p + 0 \times (1 - p) = p\)

    and the variance is

    \(\sigma^2_X = var(X) = E(X- \mu)^2 = (1 - p) ^ 2 \times p + (0 - p) ^ 2 \times (1 - p) \\ = p(1 - p) ( 1 - p + p) = p(1 - p)\)
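    We can verify these formulas by simulation in R (the choice \(p = 0.3\) is arbitrary, our own example):

    # 100,000 Bernoulli(0.3) draws; mean and variance should be near p and p(1 - p)
    x <- rbinom(1e5, size = 1, prob = 0.3)
    mean(x)  # close to 0.3
    var(x)   # close to 0.3 * 0.7 = 0.21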

    Binomial distribution

  • The Binomial is another popular discrete random variable, defined as the number of successes in \(n\) independent Bernoulli trials, each with probability of success \(p\).
  • The PMF of Binomial(n, p) is: (optional) \[\displaystyle f(x; n,p)=\Pr(x;n,p)=\Pr(X=x)=\frac{n!}{x!(n-x)!} p^{x}(1-p)^{n-x}\] For \(x = 0, 1, 2, 3, \cdots, n\).

  • The sum of \(n\) independent Bernoulli random variables with the same parameter \(p\) has a Binomial distribution with parameters \(n\) and \(p\).
  • If we know the PMF of a random variable, we can calculate the probability of observing any given value by plugging that value into the PMF.
  • For example for a Binomial random variable with \(p = 0.2\) and \(n = 10\) we can calculate probabilities for each value of the random variable:
  • \(P(X = 0) = [10! / (0!(10 - 0)!)]\, 0.2 ^ 0 (1 - 0.2) ^ {10} = 0.8 ^ {10} \approx 0.107\)

  • The graph below shows the plot of PMF of a Binomial random variable with \(p = 0.2\) and \(n = 10\).
  • The mean of the binomial is \(np\) and the variance is \(np(1 - p)\).
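    The probabilities and the plot just described can be reproduced with dbinom():

    # PMF of a Binomial random variable with n = 10 and p = 0.2
    x <- 0:10
    round(dbinom(x, size = 10, prob = 0.2), 3)  # P(X = 0) is about 0.107
    plot(x, dbinom(x, size = 10, prob = 0.2), type = "h",
         xlab = "x", ylab = "P(X = x)", main = "Binomial(10, 0.2) PMF")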
Natural logarithm and exponential function

  • Before we go over the Poisson distribution, we review the logarithm and exponential functions.
  • The logarithm is the inverse function of exponentiation.
  • \(\log_{10}(1000) = 3\)

    We say the logarithm base 10 of 1000 is 3, so

    \(10^3 = 1000\)

    \(\log_{2}(64) = 6\)

    \(2^6 = 64\)

    Euler’s number or the number \(e\) is a mathematical constant that is the base of the natural logarithm.

    \[\displaystyle e=\sum \limits _{n=0}^{\infty }{\frac {1}{n!}}={\frac {1}{1}}+{\frac {1}{1}}+{\frac {1}{1\times 2}}+{\frac {1}{1\times 2\times 3}}+\cdots \approx 2.718282.\]

    If \(e^6 = a\) Then \(\log_e(a) = ?\)

    The answer is 6.

    \(e^x\) can also be written as \(\exp(x)\), and \(\log_e(x)\) is also written as \(\ln(x)\).

Multiplication rule

Since \(\exp(x + y) = \exp(x)\exp(y)\), we have

    \(\log(xy) = \log(x) + \log(y)\) and,

    \(\log(x/y) = \log(x) - \log(y)\)

Change of base

\[\ln(x) = \log_{e}x= {\dfrac {\log_{10}x}{\log_{10}e}}\]

    Poisson Random variable and Poisson distribution

  • The discrete random variable \(Y\), defined as the number of events recorded during a unit of time, has a Poisson distribution with rate parameter \(\mu > 0\) if \(Y\) has PMF
  • \[Pr(Y = y) = \dfrac{e^{-\mu}(\mu)^y}{y!}, \text{ for } y = 0, 1, 2, \dots .\]

  • The mean of the Poisson distribution is \(\mu\), and its variance is also \(\mu\).

  • In the Poisson we have only one parameter: the mean and the variance of the Poisson distribution are equal.
  • For a large number of Bernoulli trials with a very small probability of success, the total number of successes can be approximated by a Poisson distribution. This is called the law of rare events.
  • It can be shown that the Binomial distribution converges to a Poisson distribution as \(n\) goes to infinity and \(p\) goes to zero in such a way that \(np \rightarrow \mu\). A quick numerical check is shown below.
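    A sketch of this convergence (the values \(n = 1000\) and \(p = 0.002\) are our own choice, so that \(np = 2\)):

    # compare Binomial(1000, 0.002) with its Poisson(2) approximation
    n <- 1000; p <- 0.002
    rbind(binomial = dbinom(0:5, size = n, prob = p),
          poisson  = dpois(0:5, lambda = n * p))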

    An Example of a Poisson Random Variable

    Suppose the number of accidents on a sunny day on freeway 405 between Santa Monica Blvd and Wilshire Blvd is Poisson with a rate of 2 accidents per day. Then the probabilities of 0, 1, 2, …, 8 accidents can be calculated using the PMF of the Poisson:

    \(P(Y = 0| \mu = 2) = \dfrac{e^{-2}(2)^0}{0!} = 1/e^2 \approx 0.1353\)

    \(P(Y = 1| \mu = 2) = \dfrac{e^{-2}(2)^1}{1!} = 2/e^2 \approx 0.2707\)

    \(P(Y = 2| \mu = 2) = \dfrac{e^{-2}(2)^2}{2!} = 2/e^2 \approx 0.2707\)

    \(P(Y = 3| \mu = 2) = \dfrac{e^{-2}(2)^3}{3!} = 8/(6e^2) \approx 0.1804\)

    \(P(Y = 4| \mu = 2) = \dfrac{e^{-2}(2)^4}{4!} = 16/(24e^2) \approx 0.0902\)

    \(P(Y = 5| \mu = 2) = \dfrac{e^{-2}(2)^5}{5!} = 32/(120e^2) \approx 0.0361\)

    \(P(Y = 6| \mu = 2) \approx 0.0120\)

    \(P(Y = 7| \mu = 2) \approx 0.0034\)

    \(P(Y = 8| \mu = 2) \approx 0.0009\)
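    These hand calculations can be checked in R with the dpois() function:

    round(dpois(0:8, lambda = 2), 4)
    ## [1] 0.1353 0.2707 0.2707 0.1804 0.0902 0.0361 0.0120 0.0034 0.0009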

    The following graph shows the plot of the PMF of a Poisson random variable with \(\mu = 2\).

    This looks like the binomial, but it is not exactly the same!

    Now assume the number of accidents on a rainy day on freeway 405 between Santa Monica Blvd and Wilshire Blvd is Poisson with a rate of 5.5 accidents per day. Then the probability of each \(x = 0, 1, 2, \cdots, 8\) accidents using R is:

     dpois(0:8, lambda = 5.5)
    ## [1] 0.004086771 0.022477243 0.061812418 0.113322766 0.155818804 0.171400684
    ## [7] 0.157117294 0.123449302 0.084871395
  • The ratio of the rates of accidents is 5.5 / 2 = 2.75.
  • We can also say that the mean number of accidents per day increases by a factor of 2.75 on a rainy day compared to a sunny day.

  • We can draw a sample of the numbers 0 to 8 with the probabilities that we calculated from Poisson(2).
  • A sample of 20 numbers will look like this:

    prob <- c(0.1353, 0.2707, 0.2707, 0.1804, 0.0902, 0.0361, 0.0120, 0.0034, 0.0009)
    domain <- seq(0,8,1)
    set.seed(1001)
    y1 <- sample(x = domain, size = 20, replace = TRUE, prob = prob)
    y1
    ##  [1] 6 2 2 2 2 4 1 1 2 0 2 1 4 2 1 4 1 2 1 0
    mean(y1)
    ## [1] 2
    var(y1)
    ## [1] 2.210526

    Now let's add some zeros and some larger values to our sample:

    y2 <- c(y1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                0, 0, 0, 0, 0, 0, 5, 6, 8, 5, 6, 7, 4, 3, 6, 10, 15, 20)
    mean(y2)
    ## [1] 2.213115
    var(y2)
    ## [1] 14.07049
    par(mfrow = c(1,2))
    hist(y1, breaks = 10)
    hist(y2, breaks = 10)
  • We can see that for sample y2 the variance is larger than the mean. This is called overdispersion.
  • The way we drew our sample is close to Poisson with \(\mu = 2\) but not exact. If we want to sample from a true Poisson we can use the rpois() function in R.
  • y <- rpois(1000, 2)
    mean(y)
    ## [1] 1.958
    var(y)
    ## [1] 2.038274
    hist(y, breaks = 15)

    Poisson distribution for unit time and time t

    The basic Poisson distribution assumes that each count occurs over one unit of time. However, we often count events over a longer period of time or over various areas. For example, if accidents occur on average 2 times per day, then over the course of a month (30 days) the expected number of accidents will be \(2 \times 30 = 60\), while the rate is still 2 accidents per day. The rate parameterization of the Poisson distribution is:

    \[Pr(Y = y) = \dfrac{e^{-t\mu}(t\mu)^y}{y!}, \text{ for } y = 0, 1, 2, \dots\]

    where \(t\) represents the length of time. We will return to this when we talk about exposure in Poisson regression.
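    For example, with a rate of 2 accidents per day and an exposure of \(t = 30\) days, the monthly count is Poisson with mean \(t\mu = 60\):

    # probability of exactly 60 accidents in a 30-day month at 2 per day
    dpois(60, lambda = 30 * 2)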

    Continuous random variable

  • A continuous random variable is a random variable which can take infinitely many values in an interval.
  • For example weight of newborn babies.

    The graph below shows the plot of the PDF of a normal distribution with \(\mu = 2\) and \(\sigma = 1\).

    Part 2 Logistic Regression

    2.1 An example of logistic regression

    We start with an example:

    A researcher is interested in how variables such as GRE (Graduate Record Exam scores), GPA (grade point average) and prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don't admit, is a binary variable.

    We can get basic descriptive statistics for the entire data set by using summary. To get the standard deviations, we use sapply to apply the sd function to each variable in the data set.

    #reading the data from idre stat website:
    mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
    #summary statistics of variables
    summary(mydata)
    ##      admit             gre             gpa             rank      
    ##  Min.   :0.0000   Min.   :220.0   Min.   :2.260   Min.   :1.000  
    ##  1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.130   1st Qu.:2.000  
    ##  Median :0.0000   Median :580.0   Median :3.395   Median :2.000  
    ##  Mean   :0.3175   Mean   :587.7   Mean   :3.390   Mean   :2.485  
    ##  3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:3.670   3rd Qu.:3.000  
    ##  Max.   :1.0000   Max.   :800.0   Max.   :4.000   Max.   :4.000
    #standard deviation of variables
    sapply(mydata, sd)
    ##       admit         gre         gpa        rank 
    ##   0.4660867 115.5165364   0.3805668   0.9444602

    Estimation of the mean of admit without any predictor:

    \[ \hat{p} = \dfrac{X}{n} = \dfrac{\sum_{i=1}^nx_i}{n} \]

    Where \(X\) is total number of people admitted and \(n\) is total number of students and \(x_i\) is value of admit for the \(i\)-th student.

    We can calculate the variance directly from the mean. Note that this variable is binary, and assuming the binary outcome is the result of a Bernoulli trial, \(\hat\mu_{admit} = \hat p = \sum_{i=1}^{n} x_i /n\).

    Since in the Bernoulli distribution the variance is \(var(x) = p(1-p)\), we can use this formula to calculate the variance.

    mean(mydata$admit)
    ## [1] 0.3175
    #the variance can be calculated using p hat.
    mean(mydata$admit) *(1-mean(mydata$admit))
    ## [1] 0.2166937
    #but this is the population variance and needs to be adjusted to get the sample variance
    (mean(mydata$admit) *(1-mean(mydata$admit))) * 400 / (400 - 1)
    ## [1] 0.2172368
    #which is exactly equal to the sample variance
    var(mydata$admit)
    ## [1] 0.2172368

    Why do we talk about this?

    Regression model of admit predicted by gre

    lm.admit <- lm(admit ~ gre, data = mydata)
    summary(lm.admit)        
    ## 
    ## Call:
    ## lm(formula = admit ~ gre, data = mydata)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -0.4755 -0.3415 -0.2522  0.5989  0.8966 
    ## 
    ## Coefficients:
    ##               Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept) -0.1198407  0.1190510  -1.007 0.314722    
    ## gre          0.0007442  0.0001988   3.744 0.000208 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.4587 on 398 degrees of freedom
    ## Multiple R-squared:  0.03402,    Adjusted R-squared:  0.03159 
    ## F-statistic: 14.02 on 1 and 398 DF,  p-value: 0.0002081
    opar <- par(mfrow = c(2, 2), 
                mar = c(4.1, 4.1, 2.1, 1.1))
    plot(lm.admit)

    par(opar)
    
    phat <- lm.admit$fitted.values
    
    plot(phat * (1 - phat) ~ phat)

    As we can see the variance of the predicted outcome is systematically related to the predicted outcome and thus is not constant.

    Another issue with this model is that the predicted values from a linear model are not restricted to the plausible range of the outcome, here between 0 and 1.

    We therefore need a transformation function that relates the outcome to a value between 0 and 1.

    2.2 Understanding the odds

    If the probability of being admitted is \(p = 0.3\), then:

    \[ \tt{odds(admit)} = \dfrac{p}{1 - p} = \dfrac{0.3}{1 - 0.3} = 0.4285714 \]

    \[ \tt{odds(not \ admit)} = \dfrac{p}{1 - p} = \dfrac{0.7}{1 - 0.7} = 2.333333 \]

    \(odds = \dfrac{p}{1 - p}\)

    Let's calculate the probability that the Lakers go to the playoffs, assuming the odds are 9 (i.e., 9 to 1). From the odds formula:

    \(p = odds \times (1 - p)\), so \(p + p \times odds = odds\) and

    \(p = \dfrac{odds} {1 + odds} = 9 / (9+1) = 0.9\)

    Then the probability of not going to the playoffs will be 0.1.
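    The same arithmetic in R (using the assumed odds of 9 from above):

    odds <- 9
    p <- odds / (1 + odds)  # 0.9
    1 - p                   # 0.1, probability of not making the playoffs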

    Another useful function is the log. If we have a value between zero and infinity, then the log of that value is between negative and positive infinity.

    The inverse of the log function is the exponential function. We usually use the natural log and the exponential function.

    In summary, we can use the log(odds) transformation to transform a value between zero and one to a value between negative infinity and positive infinity. This function, log(odds), is called the logit function.

    This suggests that the logit of the outcome can be modeled as a linear function of the predictors.
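    A small sketch of the logit and its inverse (base R also provides these as qlogis() and plogis()):

    logit <- function(p) log(p / (1 - p))      # maps (0, 1) to (-Inf, Inf)
    invlogit <- function(z) 1 / (1 + exp(-z))  # maps back to (0, 1)
    logit(0.3)            # -0.8472979
    invlogit(logit(0.3))  # recovers 0.3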

    Logistic regression model example

    glm.mod <- glm(admit ~ gre, family = binomial(link = "logit"), data = mydata)
    
    summary(glm.mod)
    ## 
    ## Call:
    ## glm(formula = admit ~ gre, family = binomial(link = "logit"), 
    ##     data = mydata)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -1.1623  -0.9052  -0.7547   1.3486   1.9879  
    ## 
    ## Coefficients:
    ##              Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -2.901344   0.606038  -4.787 1.69e-06 ***
    ## gre          0.003582   0.000986   3.633  0.00028 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for binomial family taken to be 1)
    ## 
    ##     Null deviance: 499.98  on 399  degrees of freedom
    ## Residual deviance: 486.06  on 398  degrees of freedom
    ## AIC: 490.06
    ## 
    ## Number of Fisher Scoring iterations: 4
    coef(summary(glm.mod))
    ##                 Estimate   Std. Error   z value     Pr(>|z|)
    ## (Intercept) -2.901344270 0.6060376103 -4.787400 1.689561e-06
    ## gre          0.003582212 0.0009860051  3.633056 2.800845e-04
    confint(glm.mod)
    ##                    2.5 %       97.5 %
    ## (Intercept) -4.119988259 -1.739756286
    ## gre          0.001679963  0.005552748
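    A common follow-up, not shown above, is to exponentiate the coefficients and their Wald confidence limits to obtain odds ratios:

    # odds ratios with Wald confidence intervals
    exp(cbind(OR = coef(glm.mod), confint.default(glm.mod)))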

    Part 3 Regression of count data

    3.1 What are count data?

  • Count data are observations that take only nonnegative integer values.
  • Counts can range from zero to some greater, undetermined value.
  • Theoretically counts can range from zero to infinity, but in practice they are always limited to some lesser value.
  • Some types of count data:

  • A count of items or events occurring within a period of time.
  • A count of items or events occurring in a given geographical or spatial area.
  • A count of the number of people having a particular disease, adjusted by the size of the population.
  • Examples of count data:

  • Number of accidents on a highway.
  • Number of patients who died at a hospital within 48 hours of having a myocardial infarction.
  • Count of the number of people having a particular disease, adjusted by the size of the population.

    3.2 Poisson Regression Model

  • A statistical model describes the relationship between one or more variables and another variable or variables.
  • Usually we are interested in studying the relationship between one (response, dependent, or outcome) variable and one or more (explanatory, independent, or predictor) variables.
Linear regression model

    The traditional linear regression model assumes this relationship is linear and that the randomness of the relationship comes from a Gaussian (normal) probability distribution. The simplest case can be formalized as follows: \[ Y = \beta_0 + \beta_1 X + \epsilon\]
  • By estimating the model we try to find the mean of the response for a given predictor as a linear relationship. This process is also known as the ordinary least squares method.
  • \[ \text{mean}(Y) = E(Y|X) = \hat\beta_0 + \hat\beta_1 X\]

    Regression model for count data

  • Regression models for count data are regression models in which the response variable is a non-negative integer.
  • Can we model the count response as a continuous random variable and estimate the parameters by ordinary least squares?
  • There are two problems with this approach:

  • In a linear OLS model we may get a negative value for the mean response. This is not valid for count data.
  • In a linear OLS model we assume the variance of the response for any given predictor is constant. This may not be correct for count data; in fact, it almost never is.
  • The Poisson regression model is the standard model for count data.
  • In the Poisson regression model we assume that the response variable for a given set of predictor variables has a Poisson distribution with parameter \(\mu\) that depends on the values of the predictors.
  • Since the parameter \(\mu\) should be greater than zero, for \(p\) predictors we use the exponential function to keep the mean positive (equivalently, the log link).
  • Recall that in the OLS model the response variable is assumed to have a normal distribution.
  • Nearly all of the count models have the basic structure of the linear model. The difference is that the left-hand side of the model equation is in log form.
    Formalization of the count model

    A count model with one predictor can be formalized as:

    \[\text{log of mean response} = \ln(E(Y|X)) = \ln(\mu) = \beta_0 + \beta_1 X\]

    From this model the expected count or mean response will be:

    \[\text{mean response} = \mu = E(Y|X) = \exp( \beta_0 + \beta_1 X)\]

    This ensures \(\mu\) will be positive for any value of \(\beta_0\), \(\beta_1\), or \(X\).

    The Poisson model

    The Poisson model assumes that the count \(Y|x\) has a Poisson distribution with the above mean. For an observation \(y_i\) with a single predictor \(x_i\) we have:

    \[ f(y_i | x_i; \beta_0, \beta_1) = \dfrac{e^{-\exp( \beta_0 + \beta_1 x_i)}(\exp( \beta_0 + \beta_1 x_i))^{y_i}}{y_i!}\]

    The plot below shows the distribution of the response for a given predictor under the Poisson regression model.

    Poisson Regression Model (cont.)

    Back to our toy example of the number of accidents on the 405. Let \(X\) be a binary variable: 0 for a sunny day and 1 for a rainy day.

    We have a sample of 40 days: 20 sunny days drawn from a Poisson with rate 2 and 20 rainy days drawn from a Poisson with rate 5.

    In fact we can artificially generate this sample using R:

    y <- c(rpois(20, lambda = 2), rpois(20, lambda = 5))
    x <- rep(c(0,1), each = 20)
    print(data.frame(y = y,x = x))
    ##     y x
    ## 1   1 0
    ## 2   3 0
    ## 3   3 0
    ## 4   2 0
    ## 5   5 0
    ## 6   0 0
    ## 7   5 0
    ## 8   4 0
    ## 9   4 0
    ## 10  4 0
    ## 11  1 0
    ## 12  3 0
    ## 13  0 0
    ## 14  0 0
    ## 15  1 0
    ## 16  1 0
    ## 17  0 0
    ## 18  1 0
    ## 19  1 0
    ## 20  4 0
    ## 21  7 1
    ## 22  3 1
    ## 23  4 1
    ## 24  4 1
    ## 25  4 1
    ## 26  7 1
    ## 27  5 1
    ## 28 10 1
    ## 29  6 1
    ## 30  9 1
    ## 31  8 1
    ## 32  2 1
    ## 33  5 1
    ## 34  0 1
    ## 35  5 1
    ## 36  3 1
    ## 37  3 1
    ## 38  5 1
    ## 39  6 1
    ## 40  4 1
    accidents_model <- glm(y ~ x, family = poisson(link = "log"))
    summary(accidents_model)
    ## 
    ## Call:
    ## glm(formula = y ~ x, family = poisson(link = "log"))
    ## 
    ## Deviance Residuals: 
    ##      Min        1Q    Median        3Q       Max  
    ## -3.16228  -0.87696  -0.05176   0.84298   1.96544  
    ## 
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept)   0.7655     0.1525   5.020 5.18e-07 ***
    ## x             0.8440     0.1824   4.628 3.69e-06 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for poisson family taken to be 1)
    ## 
    ##     Null deviance: 82.391  on 39  degrees of freedom
    ## Residual deviance: 59.027  on 38  degrees of freedom
    ## AIC: 172.29
    ## 
    ## Number of Fisher Scoring iterations: 5

    What is this?

    Let’s exponentiate the coefficients:

    exp(accidents_model$coeff)
    ## (Intercept)           x 
    ##    2.150000    2.325581

    Can we find the rate of accidents for each group (sunny or rainy)? How?

    mu0 <- mean(y[x==0])
    mu1 <- mean(y[x==1])
    c(mu0, mu1, mu1/mu0)
    ## [1] 2.150000 5.000000 2.325581
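    Equivalently, the fitted means from the model reproduce the two group means, since the model is saturated for the two groups:

    # predicted mean counts for a sunny (x = 0) and a rainy (x = 1) day
    predict(accidents_model, newdata = data.frame(x = c(0, 1)), type = "response")
    # returns 2.15 and 5.00, matching mu0 and mu1 above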

    Poisson model log-likelihood function

    This part is just for your information and is optional.

    If we have more than one predictor, then for the \(i\)-th observation we have:

    \[\mu_i = E(y_i | x_{i1} , x_{i2}, \dots, x_{ip};\beta) = \exp(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots+ \beta_p x_{ip})\]

    For simplicity we use \(\boldsymbol{x'\beta}_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots+ \beta_p x_{ip}\) and therefore we have

    \[P(Y = y_i|\boldsymbol{x}_i, \boldsymbol{\beta}) = \dfrac{e^{-\exp(\boldsymbol{x'\beta}_i)}(\exp(\boldsymbol{x'\beta}_i))^{y_i}}{y_i!}\] and the log-likelihood function will be:

    \[\cal L(\boldsymbol{\beta}) = \sum_{i=1}^n [y_i \boldsymbol{x'\beta}_i - \exp(\boldsymbol{x'\beta}_i) - \ln y_i!]\]
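    As a check, we can evaluate this log-likelihood at the fitted coefficients of the accidents model above and compare it with logLik(); loglik_pois below is our own helper, not part of any package:

    # Poisson log-likelihood at a coefficient vector beta
    loglik_pois <- function(beta, X, y) {
      eta <- X %*% beta                        # linear predictor x'beta
      sum(y * eta - exp(eta) - lgamma(y + 1))  # lgamma(y + 1) = log(y!)
    }
    X <- model.matrix(accidents_model)
    loglik_pois(coef(accidents_model), X, y)   # matches logLik(accidents_model)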

    Analysis of count data model

  • For this part we will use the German national health registry data set as an example of Poisson regression. The data set is also available in the package COUNT in R. We are going to use only year 1984.
  • Source: German Health Reform Registry, years pre-reform 1984-1988, from Hilbe and Greene (2008).
  • The response variable is the number of doctor visits (docvis).
  • We are using outwork and age as predictors. outwork is a binary variable with 1 = not working and 0 = working. age is a continuous variable ranging from 25 to 64.
  • Since ages range from 25 to 64 and age 0 is not a valid value, to interpret the intercept we center the variable age (see the R code below).


    R code for loading the data and descriptive statistics

    # read in the data from a csv file
    rwm1984 <- read.csv("rwm1984.csv")
    # alternatively, the data set is available in the package COUNT
    library(COUNT)
    # data("rwm1984")
    head(rwm1984)
    ##   docvis hospvis edlevel age outwork female married kids hhninc educ self
    ## 1      1       0       3  54       0      0       1    0  3.050 15.0    0
    ## 2      0       0       1  44       1      1       1    0  3.050  9.0    0
    ## 3      0       0       1  58       1      1       0    0  1.434 11.0    0
    ## 4      7       2       1  64       0      0       0    0  1.500 10.5    0
    ## 5      6       0       3  30       1      0       0    0  2.400 13.0    0
    ## 6      9       0       3  26       1      0       0    0  1.050 13.0    0
    ##   edlevel1 edlevel2 edlevel3 edlevel4
    ## 1        0        0        1        0
    ## 2        1        0        0        0
    ## 3        1        0        0        0
    ## 4        1        0        0        0
    ## 5        0        0        1        0
    ## 6        0        0        1        0
    summary(rwm1984)
    ##      docvis           hospvis           edlevel          age    
    ##  Min.   :  0.000   Min.   : 0.0000   Min.   :1.00   Min.   :25  
    ##  1st Qu.:  0.000   1st Qu.: 0.0000   1st Qu.:1.00   1st Qu.:35  
    ##  Median :  1.000   Median : 0.0000   Median :1.00   Median :44  
    ##  Mean   :  3.163   Mean   : 0.1213   Mean   :1.38   Mean   :44  
    ##  3rd Qu.:  4.000   3rd Qu.: 0.0000   3rd Qu.:1.00   3rd Qu.:54  
    ##  Max.   :121.000   Max.   :17.0000   Max.   :4.00   Max.   :64  
    ##     outwork           female          married            kids       
    ##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
    ##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
    ##  Median :0.0000   Median :0.0000   Median :1.0000   Median :0.0000  
    ##  Mean   :0.3665   Mean   :0.4793   Mean   :0.7891   Mean   :0.4491  
    ##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
    ##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
    ##      hhninc            educ            self            edlevel1     
    ##  Min.   : 0.015   Min.   : 7.00   Min.   :0.00000   Min.   :0.0000  
    ##  1st Qu.: 2.000   1st Qu.:10.00   1st Qu.:0.00000   1st Qu.:1.0000  
    ##  Median : 2.800   Median :10.50   Median :0.00000   Median :1.0000  
    ##  Mean   : 2.969   Mean   :11.09   Mean   :0.06118   Mean   :0.8136  
    ##  3rd Qu.: 3.590   3rd Qu.:11.50   3rd Qu.:0.00000   3rd Qu.:1.0000  
    ##  Max.   :25.000   Max.   :18.00   Max.   :1.00000   Max.   :1.0000  
    ##     edlevel2         edlevel3         edlevel4      
    ##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
    ##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
    ##  Median :0.0000   Median :0.0000   Median :0.00000  
    ##  Mean   :0.0524   Mean   :0.0746   Mean   :0.05937  
    ##  3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
    ##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000
    #centering the age predictor
    rwm1984$cage <- rwm1984$age - mean(rwm1984$age)

    Run a Poisson model with docvis as the dependent variable and outwork and age as predictors:

    pois_m <- glm(docvis ~ outwork + age, family = poisson(link = "log"), data = rwm1984)
    summary(pois_m)
    ## 
    ## Call:
    ## glm(formula = docvis ~ outwork + age, family = poisson(link = "log"), 
    ##     data = rwm1984)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -3.4573  -2.2475  -1.2366   0.2863  25.1320  
    ## 
    ## Coefficients:
    ##               Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) -0.0335170  0.0391815  -0.855    0.392    
    ## outwork      0.4079314  0.0188447  21.647   <2e-16 ***
    ## age          0.0220842  0.0008377  26.362   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for poisson family taken to be 1)
    ## 
    ##     Null deviance: 25791  on 3873  degrees of freedom
    ## Residual deviance: 24190  on 3871  degrees of freedom
    ## AIC: 31279
    ## 
    ## Number of Fisher Scoring iterations: 6

    Use the centered age (cage) and run the model again:

    pois_m1 <- glm(docvis ~ outwork + cage, family = poisson(link = "log"), data = rwm1984)
    summary(pois_m1)
    ## 
    ## Call:
    ## glm(formula = docvis ~ outwork + cage, family = poisson(link = "log"), 
    ##     data = rwm1984)
    ## 
    ## Deviance Residuals: 
    ##     Min       1Q   Median       3Q      Max  
    ## -3.4573  -2.2475  -1.2366   0.2863  25.1320  
    ## 
    ## Coefficients:
    ##              Estimate Std. Error z value Pr(>|z|)    
    ## (Intercept) 0.9380965  0.0127571   73.53   <2e-16 ***
    ## outwork     0.4079314  0.0188447   21.65   <2e-16 ***
    ## cage        0.0220842  0.0008377   26.36   <2e-16 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## (Dispersion parameter for poisson family taken to be 1)
    ## 
    ##     Null deviance: 25791  on 3873  degrees of freedom
    ## Residual deviance: 24190  on 3871  degrees of freedom
    ## AIC: 31279
    ## 
    ## Number of Fisher Scoring iterations: 6
    pr <- sum(residuals(pois_m1, type = "pearson") ^ 2)
    # Pearson dispersion statistic: Pearson chi-square over the residual df
    pr / pois_m1$df.residual
    ## [1] 11.34322
    pois_m1$aic / (pois_m1$df.null + 1)
    ## [1] 8.074027
    COUNT::modelfit(pois_m1)
    ## $AIC
    ## [1] 31278.78
    ## 
    ## $AICn
    ## [1] 8.074027
    ## 
    ## $BIC
    ## [1] 31297.57
    ## 
    ## $BICqh
    ## [1] 8.07418

    Interpreting coefficients and rate ratios

  • Coefficients are on the log scale and we can directly interpret them in terms of the log response. In general we say: the log mean of the response variable increases by \(\beta\) for a one-unit increase in the corresponding predictor variable, with the other predictors held constant.
  • In our results, the expected log number of visits for patients who are out of work is 0.4 more than the log number of visits for working patients of the same age.
  • For each one-year increase in age, there is an increase in the expected log count of visits to the doctor of 0.022, holding outwork constant.
  • In order to have a change in a predictor reflect a change in actual visits to a physician, we must exponentiate the coefficients and the confidence intervals of the coefficients.

    The standard error of an exponentiated coefficient can be approximately calculated by the delta method:

    Delta Method:

    \[se(\exp(\hat\beta)) \approx \exp(\hat\beta) \times se(\hat\beta)\]


    exp(coef(pois_m1))
    ## (Intercept)     outwork        cage 
    ##    2.555113    1.503704    1.022330
    exp(coef(pois_m1)) * sqrt(diag(vcov(pois_m1))) #delta method
    ##  (Intercept)      outwork         cage 
    ## 0.0325958199 0.0283368154 0.0008564317
    exp(confint.default(pois_m1))
    ##                2.5 %   97.5 %
    ## (Intercept) 2.492018 2.619805
    ## outwork     1.449178 1.560282
    ## cage        1.020653 1.024010
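    Exponentiated coefficients are incidence rate ratios (IRRs): holding age constant, patients who are out of work are expected to have about 1.50 times as many doctor visits as working patients, and each additional year of (centered) age multiplies the expected number of visits by about 1.02.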