**Version info:** Code for this page was tested in Stata 12.

This module gives a brief overview of some common statistical tests in Stata. We will use the **auto** data file for the examples.

```
sysuse auto
```

## t-tests

Let’s do a t-test comparing the miles per gallon (**mpg**)
of foreign and domestic cars.

```
ttest mpg, by(foreign)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      52    19.82692     .657777    4.743297    18.50638    21.14747
       1 |      22    24.77273     1.40951    6.611187    21.84149    27.70396
---------+--------------------------------------------------------------------
combined |      74     21.2973    .6725511    5.785503     19.9569    22.63769
---------+--------------------------------------------------------------------
    diff |           -4.945804    1.362162               -7.661225   -2.230384
------------------------------------------------------------------------------
Degrees of freedom: 72

                  Ho: mean(0) - mean(1) = diff = 0

     Ha: diff < 0              Ha: diff != 0             Ha: diff > 0
       t =  -3.6308              t =  -3.6308              t =  -3.6308
   P < t =   0.0003          P > |t| =   0.0005          P > t =   0.9997
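The test above assumes the two groups have equal variances. If you do not want to make that assumption, **ttest** can compute a Satterthwaite-adjusted test with the **unequal** option; this is a variation on the example above (output not shown here):

```
* two-sample t test allowing unequal group variances
* (Satterthwaite approximation for the degrees of freedom)
ttest mpg, by(foreign) unequal
```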

As you see in the output above, the domestic cars had significantly lower **mpg**
(19.8) than the foreign cars (24.8).

## Chi-square

Let’s compare the repair rating (**rep78**) of the foreign and domestic cars. We can make a crosstab of
**rep78** by **foreign**. We may want to ask whether these variables are independent. We can use the
**chi2** option to request a chi-square test of independence as well as the crosstab.

```
tabulate rep78 foreign, chi2

           |        foreign
     rep78 |         0          1 |     Total
-----------+----------------------+----------
         1 |         2          0 |         2
         2 |         8          0 |         8
         3 |        27          3 |        30
         4 |         9          9 |        18
         5 |         2          9 |        11
-----------+----------------------+----------
     Total |        48         21 |        69

          Pearson chi2(4) =  27.2640   Pr = 0.000
```

The chi-square test is not really valid when you have empty cells or cells with small expected frequencies. In such cases, you can request Fisher's exact test with the **exact** option.

```
tabulate rep78 foreign, chi2 exact

           |        foreign
     rep78 |         0          1 |     Total
-----------+----------------------+----------
         1 |         2          0 |         2
         2 |         8          0 |         8
         3 |        27          3 |        30
         4 |         9          9 |        18
         5 |         2          9 |        11
-----------+----------------------+----------
     Total |        48         21 |        69

          Pearson chi2(4) =  27.2640   Pr = 0.000
           Fisher's exact =                 0.000
```

## Correlation

We can use the **correlate** command to get the
correlations among variables. Let's look at the correlations among **price**,
**mpg**, **weight**, and **rep78**. (We use **rep78**
in the correlation even though it is not continuous to illustrate what happens
when you use **correlate** with variables that have missing data.)

```
correlate price mpg weight rep78
(obs=69)

         |    price      mpg   weight    rep78
---------+------------------------------------
   price |   1.0000
     mpg |  -0.4559   1.0000
  weight |   0.5478  -0.8055   1.0000
   rep78 |   0.0066   0.4023  -0.4003   1.0000
```

Note that the output above said (obs=69). The **correlate** command drops data on a
**listwise** basis,
meaning that if any of the variables are missing, then the entire observation
is omitted from the correlation analysis.

We can use **pwcorr** (pairwise correlations) if we want correlations that delete missing data on a
**pairwise** basis instead of a listwise basis. We will use the **obs**
option to show the number of observations used for calculating each
correlation.

```
pwcorr price mpg weight rep78, obs

          |    price      mpg   weight    rep78
----------+------------------------------------
    price |   1.0000
          |       74
          |
      mpg |  -0.4686   1.0000
          |       74       74
          |
   weight |   0.5386  -0.8072   1.0000
          |       74       74       74
          |
    rep78 |   0.0066   0.4023  -0.4003   1.0000
          |       69       69       69       69
          |
```

Note how the correlations that involve **rep78** have an N of 69 compared to the other correlations that have an N of 74. This is because
**rep78**
has five missing values, so it only had 69 valid observations, but the other
variables had no missing data so they had 74 valid observations.
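If you also want a significance level for each correlation, **pwcorr** can print a p-value beneath each coefficient with the **sig** option; this is a variation on the command above (output not shown here):

```
* pairwise correlations with the N and p-value shown for each pair
pwcorr price mpg weight rep78, obs sig
```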

## Regression

Let’s look at doing regression analysis in Stata. For this example,
let’s drop the cases where **rep78** is 1 or 2 or missing.

```
drop if (rep78 <= 2) | (rep78 == .)
(15 observations deleted)
```

Now, let’s predict **mpg** from **price** and **weight**. As you see below,
**weight** is a significant predictor of **mpg**, but
**price** is not.

```
regress mpg price weight

  Source |       SS       df       MS              Number of obs =      59
---------+------------------------------           F(  2,    56) =   47.87
   Model |  1375.62097     2  687.810483           Prob > F      =  0.0000
Residual |  804.616322    56  14.3681486           R-squared     =  0.6310
---------+------------------------------           Adj R-squared =  0.6178
   Total |  2180.23729    58  37.5902981           Root MSE      =  3.7905

------------------------------------------------------------------------------
     mpg |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
   price |  -.0000139   .0002108     -0.066   0.948     -.0004362     .0004084
  weight |   -.005828   .0007301     -7.982   0.000     -.0072906    -.0043654
   _cons |   39.08279   1.855011     21.069   0.000      35.36676     42.79882
------------------------------------------------------------------------------
```

What if we wanted to predict **mpg** from **rep78** as well?
**rep78** is really more of a categorical variable than a continuous variable, so to include it in the regression we should convert
**rep78** into dummy variables. Fortunately, Stata makes dummy variables easily with the **tabulate** command. The
**gen(rep)** option tells Stata that we want to generate dummy variables from
**rep78** and that we want the stem of the dummy variables to be **rep**.

```
tabulate rep78, gen(rep)

      rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
          3 |         30       50.85       50.85
          4 |         18       30.51       81.36
          5 |         11       18.64      100.00
------------+-----------------------------------
      Total |         59      100.00
```

Stata has created **rep1** (1 if **rep78** is 3), **rep2** (1 if
**rep78** is 4) and **rep3** (1 if **rep78** is 5). We can use the
**tabulate** command to verify that the dummy variables were created
properly.

```
tabulate rep78 rep1

           | rep78== 3.0000
     rep78 |         0          1 |     Total
-----------+----------------------+----------
         3 |         0         30 |        30
         4 |        18          0 |        18
         5 |        11          0 |        11
-----------+----------------------+----------
     Total |        29         30 |        59

tabulate rep78 rep2

           | rep78== 4.0000
     rep78 |         0          1 |     Total
-----------+----------------------+----------
         3 |        30          0 |        30
         4 |         0         18 |        18
         5 |        11          0 |        11
-----------+----------------------+----------
     Total |        41         18 |        59

tabulate rep78 rep3

           | rep78== 5.0000
     rep78 |         0          1 |     Total
-----------+----------------------+----------
         3 |        30          0 |        30
         4 |        18          0 |        18
         5 |         0         11 |        11
-----------+----------------------+----------
     Total |        48         11 |        59
```

Now we can include **rep1** and
**rep2** as dummy variables in the regression model, leaving the group where **rep78** is 5 as the omitted reference group.

```
regress mpg price weight rep1 rep2

      Source |       SS       df       MS              Number of obs =      59
-------------+------------------------------           F(  4,    54) =   26.04
       Model |  1435.91975     4  358.979938           Prob > F      =  0.0000
    Residual |  744.317536    54  13.7836581           R-squared     =  0.6586
-------------+------------------------------           Adj R-squared =  0.6333
       Total |  2180.23729    58  37.5902981           Root MSE      =  3.7126

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       price |  -.0001126   .0002133    -0.53   0.600    -.0005403    .0003151
      weight |   -.005107   .0008236    -6.20   0.000    -.0067584   -.0034557
        rep1 |  -2.886288   1.504639    -1.92   0.060    -5.902908    .1303314
        rep2 |   -2.88417   1.484817    -1.94   0.057    -5.861048    .0927086
       _cons |   39.89189   1.892188    21.08   0.000     36.09828     43.6855
------------------------------------------------------------------------------
```
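Since this page was tested in Stata 12, you could also skip the manual dummy variables entirely and use factor-variable notation, which was introduced in Stata 11. Note that **i.rep78** uses the lowest category (**rep78** of 3) as the base rather than 5, so the individual dummy coefficients are parameterized differently, though the overall model fit is the same:

```
* equivalent model using factor variables; i.rep78 generates the
* indicator variables automatically (base category: rep78 == 3)
regress mpg price weight i.rep78
```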

## Analysis of variance

If you wanted to do an analysis of variance looking at the differences in **mpg** among the
three repair groups, you can use the **oneway** command to do
this.

```
oneway mpg rep78

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      506.325167      2   253.162583      8.47     0.0006
 Within groups      1673.91212     56   29.8912879
------------------------------------------------------------------------
    Total           2180.23729     58   37.5902981

Bartlett's test for equal variances:  chi2(2) =   9.9384  Prob>chi2 = 0.007
```

If you include the **tabulate** option, you get the mean **mpg** for the three groups, which shows that the group with the best repair rating (**rep78** of 5) also has the highest
**mpg**
(27.4).

```
oneway mpg rep78, tabulate

            |            Summary of mpg
      rep78 |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          3 |   19.433333   4.1413252          30
          4 |   21.666667   4.9348699          18
          5 |   27.363636   8.7323849          11
------------+------------------------------------
      Total |    21.59322   6.1310927          59

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      506.325167      2   253.162583      8.47     0.0006
 Within groups      1673.91212     56   29.8912879
------------------------------------------------------------------------
    Total           2180.23729     58   37.5902981

Bartlett's test for equal variances:  chi2(2) =   9.9384  Prob>chi2 = 0.007
```
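The overall F test tells you the group means differ, but not which pairs of groups differ. **oneway** can follow up with multiple-comparison tests of the group means; for example, the **bonferroni** option adds Bonferroni-adjusted pairwise comparisons (output not shown here):

```
* one-way ANOVA with Bonferroni-adjusted pairwise comparisons
* of the group means (scheffe and sidak are also available)
oneway mpg rep78, tabulate bonferroni
```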

If you want to include covariates, you need to use the **anova** command. The
**c.** prefix (as in **c.price c.weight**) tells Stata that those variables are continuous
covariates.

```
anova mpg rep78 c.price c.weight

                       Number of obs =      59     R-squared     =  0.6586
                       Root MSE      = 3.71263     Adj R-squared =  0.6333

              Source |  Partial SS    df       MS           F     Prob > F
          -----------+----------------------------------------------------
               Model |  1435.91975     4   358.979938      26.04     0.0000
                     |
               rep78 |  60.2987853     2   30.1493926       2.19     0.1221
               price |   3.8421233     1    3.8421233       0.28     0.5997
              weight |  529.932889     1   529.932889      38.45     0.0000
                     |
            Residual |  744.317536    54   13.7836581
          -----------+----------------------------------------------------
               Total |  2180.23729    58   37.5902981
```