| Command | Description |
|---|---|
| ttest | t-test |
| anova | Analysis of variance |
| regress | Regression |
| predict | Predicts after model estimation |
| test | Test linear hypotheses after model estimation |
| contrast | Contrasts and linear hypothesis tests after estimation |
| margins | Predicted means |
| marginsplot | Plot predicted means |
| kdensity | Kernel density estimates and graphs |
| qnorm | Graphs a quantile plot |
| logit | Logistic regression |
| tabulate | Crosstabs with chi-square test |
| signtest | Tests the equality of matched pairs of data |
| signrank | Wilcoxon matched-pairs signed rank test |
| ranksum | Mann-Whitney two-sample test |
| kwallis | Nonparametric analog to the one-way anova |

#### 2.0 Demonstration and explanation

We will begin by downloading the dataset for this unit over the internet.

use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear

#### A) Analysis of normally-distributed outcomes

Each of the following tests assumes that the outcome is normally distributed (more accurately, that the residuals are normally distributed).

*A1. t-tests*

The t-test is usually used to test the equality of 2 sample means, but can also test the equality of a sample mean to some hypothesized population mean.

A one-sample t-test, testing whether the sample of writing scores was drawn from a population with a mean of 50.

ttest write = 50
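To see what the one-sample test computes, the t statistic can be reproduced by hand from the summary statistics. This is a minimal sketch that assumes the hsbdemo data loaded above are still in memory; the scalar name `t_hand` is made up for illustration.

```stata
* Assumes hsbdemo is in memory (see the use command above)
quietly summarize write

* t = (sample mean - hypothesized mean) / (sd / sqrt(n))
scalar t_hand = (r(mean) - 50) / (r(sd) / sqrt(r(N)))
display "t by hand: " t_hand

* Two-sided p-value from the t distribution with n - 1 degrees of freedom
display "p-value:   " 2 * ttail(r(N) - 1, abs(t_hand))
```

The displayed value should agree with the t statistic reported by **ttest**.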

A paired t-test, testing whether or not the mean of **write** equals the mean of **read**.

ttest write = read

A two-sample independent t-test with pooled (equal) variances, testing equality of means of write between males and females.

ttest write, by(female)

This is the two-sample independent t-test but with separate (unequal) variances.

ttest write, by(female) unequal

*A2. Analysis of Variance*

ANOVA is used to test for the equality of means among two or more groups, and is most often applied when there are more than two. It is equivalent to a linear regression on the group indicators, and regression is more commonly used today.

Here is an example of a one-way analysis of variance, testing the equality of the mean of write among prog groups. The **i.** specification tells Stata that **prog** is a categorical variable, which Stata will then convert into dummy variables. Stata then enters all but one of those dummies (by default all but the first) into the model.

anova write i.prog
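The equivalence with regression can be checked directly. The sketch below, which assumes hsbdemo is still in memory, fits the same one-way model with **regress** and then jointly tests the **prog** indicators; the resulting F statistic should match the one **anova** reports for **prog**.

```stata
* Same model as anova write i.prog, fit as a regression on the prog dummies
regress write i.prog

* Joint test that all prog coefficients are zero (the one-way ANOVA F test)
testparm i.prog
```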

*A3. Linear Regression*

Linear regression is used to estimate the effect of multiple predictors, which can include both continuous and categorical variables, on a normally-distributed outcome. We use **c.** to indicate continuous predictors, and **i.** to indicate categorical predictors.

regress write c.read i.prog

We can specify the interaction of predictors using the **#** symbol. A single **#** between variables requests just the interaction, while the specification **##** requests both the main effects and the interaction. Below, we request the main effects of read and prog as well as their interaction:

regress write c.read##i.prog
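The **##** shorthand can also be written out term by term. The following sketch fits the model both ways and compares the coefficients; it assumes hsbdemo is in memory, and the stored-estimate names `shorthand` and `longhand` are arbitrary labels chosen for this illustration.

```stata
* Factorial shorthand: main effects plus interaction
regress write c.read##i.prog
estimates store shorthand

* The same model with each term listed explicitly
regress write c.read i.prog c.read#i.prog
estimates store longhand

* Side-by-side coefficients; the two columns should be identical
estimates table shorthand longhand
```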

#### B) Postestimation – analysis after running the model

Stata has a large suite of commands that can estimate and graph various statistics after a model has been run.

*B1. Custom hypothesis testing and contrasts with test and contrast*

We may be interested in performing additional tests that are not part of the specified regression model. The **test** command allows us to test linear combinations of the regression coefficients. For example, we may wish to test whether the coefficients are the same for prog=2 and prog=3.

test 2.prog = 3.prog
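**test** can also evaluate several coefficients jointly. The sketch below refits the interaction model quietly so that it does not depend on which estimates are currently active, then tests both **prog** coefficients at once. Note that because this model includes the read-by-prog interaction, the **prog** coefficients describe program differences at read = 0.

```stata
* Refit the model from above (assumes hsbdemo is in memory)
quietly regress write c.read##i.prog

* Joint test that both prog main-effect coefficients are zero
test 2.prog 3.prog

* testparm takes a variable list and tests all matching coefficients jointly
testparm i.prog
```

Both commands should report the same F statistic with 2 numerator degrees of freedom.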

The **contrast** command is powerful and flexible, and can perform several custom contrasts with a single command. Below we run a new model with an interaction of **prog** and **female**, and then use **contrast** to test for the significance of the female effect within each prog, and the significance of prog within each gender:

regress write i.female##i.prog

contrast female@prog

contrast prog@female

*B2. Marginal means and effects with margins and marginsplot*

The **margins** command is among Stata’s most flexible and powerful commands, and can estimate marginal (population averaged) means and effects. It is typically used to estimate cell means of an effect (often an interaction), averaged over the other covariates in the population. Additionally, the **marginsplot** command provides an easy way to plot the results of the **margins** command. Below, we estimate the marginal means of each cell of the female#prog interaction, and then plot the means.

regress write i.female##i.prog

margins female#prog

marginsplot
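Besides cell means, **margins** can estimate effects. As a sketch, the **dydx()** option below requests the difference in predicted **write** between the two levels of **female**, computed separately within each level of **prog**. The model is refit quietly so the example does not depend on the active estimates; hsbdemo is assumed to be in memory.

```stata
* Same interaction model as above
quietly regress write i.female##i.prog

* Effect of female (discrete change from the base level) within each prog
margins prog, dydx(female)
```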

*B3. Residual analysis with predict and graphing*

The **predict** command can be used to estimate predicted values, influence statistics, and residuals after fitting a model. Here we estimate predicted scores on the outcome of the previous linear regression and store them in a variable, **pred**.

predict pred

The **resid** option requests residuals, which we store in the variable **res**.

predict res, resid

Let’s look at the predicted values and residuals in the first 20 observations.

list write pred res in 1/20

We can graph the residuals to check the linear regression normality assumption. We use the **kdensity** command with the **normal** option to display a density graph of the residuals with a normal distribution superimposed on the graph.

kdensity res, normal

The **qnorm** command produces a normal quantile plot. It plots the observed distribution of the variable against a theoretical normal distribution with the same mean and variance. Deviation from a straight line along the diagonal indicates deviation from normality.

qnorm res

*B4. Influence analysis*

We can check if any observations seem to be having too much influence on the model using measures such as Cook’s D. We can use the **predict** command with the **cooksd** option to request Cook’s D scores. We then create a spike plot of Cook’s D for each ID number to check for overly influential observations.

predict cook, cooksd

twoway spike cook id
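The spike plot can be complemented by listing the observations with the largest values. The sketch below continues from the commands above (it uses the **cook** variable just created and the active regression results) and flags observations above 4/n, a common rule-of-thumb cutoff that is an assumption here rather than something taken from this tutorial.

```stata
* Continues from above: cook holds Cook's D, e(N) is the estimation sample size
* 4/n is a rule-of-thumb cutoff, not a formal test
list id write cook if cook > 4/e(N) & !missing(cook)
```

Flagged observations are candidates for closer inspection, not automatic removal.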

#### C) Analysis of categorical outcomes

Analyzing categorical outcomes in a linear regression model violates at least a few of the assumptions of linear regression (normality of residuals, homoskedasticity). Therefore, other tests and models are used; here we demonstrate the chi-square test and logistic regression.

*C1. Chi-square test of independence with tab*

The **tabulate** command will compute the chi-square test of independence and other measures of association with the option **all**.

tabulate prog ses, all

The chi-square test p-value is less trustworthy if any cell has an expected count less than 5. We can display expected frequencies with the **expected** option.

tabulate prog ses, all expected
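When some expected counts are small, one commonly used alternative is Fisher's exact test, which **tabulate** provides through the **exact** option. This is offered as a sketch rather than a recommendation from this tutorial; it assumes hsbdemo is in memory, and the exact computation can be slow for large tables.

```stata
* Fisher's exact test of independence for the prog by ses table
tabulate prog ses, exact
```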

*C2. Logistic regression with logit*

Logistic regression allows estimation of the effect of multiple predictors on a binary outcome. We demonstrate the logistic regression command with the binary outcome **honors** (representing membership in the honors program).

tab honors

The default output for the **logit** command is given as coefficients in the log odds metric. To obtain odds ratios, use the **or** option.

logit honors c.read i.female

logit, or
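The odds ratios are simply the exponentiated coefficients. The following sketch, assuming hsbdemo is in memory, computes them by hand and then fits the same model with **logistic**, which reports odds ratios by default.

```stata
* Fit the model quietly, then exponentiate the log-odds coefficients
quietly logit honors c.read i.female
display "Odds ratio for read:   " exp(_b[read])
display "Odds ratio for female: " exp(_b[1.female])

* logistic fits the same model and displays odds ratios directly
logistic honors c.read i.female
```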

The **predict** and **margins** commands can be used after **logit** models, as well as most other regression models. Here we estimate the predicted probabilities for each observation using **predict**, and the marginal predicted probabilities of honors for each gender using **margins**.

predict prob, pr

margins female
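**margins** can also evaluate predicted probabilities at chosen values of a continuous predictor. In this sketch the reading scores 40, 50 and 60 are arbitrary illustrative values, and the model is refit quietly so the example stands on its own once hsbdemo is in memory.

```stata
* Same logistic model as above
quietly logit honors c.read i.female

* Predicted probability of honors for each gender at three reading scores
margins female, at(read=(40 50 60))

* Plot the predicted probabilities
marginsplot
```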

#### D) Non-parametric Tests

Non-parametric tests do not assume a particular distribution (such as the normal) for the outcome, so they are useful when the generating distribution is unknown. However, no more than one predictor can be modeled at once.

The **signtest** command performs the sign test, the nonparametric analog of the single-sample t-test.

signtest write = 50

The **signrank** command computes a Wilcoxon signed-rank test, the nonparametric analog of the paired t-test.

signrank write = read

The **ranksum** test is the nonparametric analog of the independent two-sample t-test and is known as the Mann-Whitney or Wilcoxon rank-sum test.

ranksum write, by(female)

The **kwallis** command computes a Kruskal-Wallis test, the non-parametric analog of the one-way ANOVA.

kwallis write, by(prog)

Most of the postestimation commands like **predict**, **contrast** and **margins** are not available after nonparametric tests.

#### 3.0 For more information

**Statistics with Stata 12** - Chapters 4, 7-13

**Gentle Introduction to Stata, Revised Third Edition** - Chapters 6-11

**Data Analysis Using Stata, Third Edition** - Chapters 8-10

**An Introduction to Stata for Health Researchers, Third Edition** - Chapters 11-15

**Interpreting and Visualizing Regression Models Using Stata**

**Stata Web Books**

**Regression with Stata Webbook**

Includes such topics as diagnostics, categorical predictors, testing interactions and testing contrasts

**Choosing the Correct Statistical Test**

Includes guidelines for choosing the correct non-parametric test

**Data Analysis Examples**

Gives examples of common analysis and interpretation of the output

**Annotated Output**

Fully annotates the output from common statistical procedures

**Frequently Asked Questions**

Covers many topics, including ANOVA, linear regression, logistic regression and use of the **margins** command