**Version info:** Code for this page was tested in Stata 12.

This module gives a brief overview of some common statistical tests in Stata. We will use the **auto** data file for the examples.

```
sysuse auto
```

## t-tests

Let’s do a t-test comparing the miles per gallon (**mpg**)
of foreign and domestic cars.

```
ttest mpg, by(foreign)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |      52    19.82692     .657777    4.743297    18.50638    21.14747
       1 |      22    24.77273     1.40951    6.611187    21.84149    27.70396
---------+--------------------------------------------------------------------
combined |      74     21.2973    .6725511    5.785503     19.9569    22.63769
---------+--------------------------------------------------------------------
    diff |           -4.945804    1.362162               -7.661225   -2.230384
------------------------------------------------------------------------------
Degrees of freedom: 72

                  Ho: mean(0) - mean(1) = diff = 0

     Ha: diff < 0              Ha: diff != 0             Ha: diff > 0
       t =  -3.6308              t =  -3.6308              t =  -3.6308
   P < t =   0.0003          P > |t| =   0.0005          P > t =   0.9997
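The test above assumes the two groups have equal variances. If you do not want to make that assumption, **ttest** can compute a Satterthwaite-adjusted test with the **unequal** option; this is a variation on the example above (output not shown here):

```
* two-sample t test allowing unequal group variances
* (Satterthwaite approximation for the degrees of freedom)
ttest mpg, by(foreign) unequal
```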

As you see in the output above, the domestic cars had significantly lower **mpg**
(19.8) than the foreign cars (24.8).

## Chi-square

Let’s compare the repair rating (**rep78**) of the foreign and domestic cars. We can make a crosstab of
**rep78** by **foreign**. We may want to ask whether these variables are independent. We can use the
**chi2** option to request a chi-square test of independence as well as the crosstab.

```
tabulate rep78 foreign, chi2

           |        foreign
     rep78 |         0          1 |     Total
-----------+----------------------+----------
         1 |         2          0 |         2
         2 |         8          0 |         8
         3 |        27          3 |        30
         4 |         9          9 |        18
         5 |         2          9 |        11
-----------+----------------------+----------
     Total |        48         21 |        69

          Pearson chi2(4) =  27.2640   Pr = 0.000
```

The chi-square test is not really valid when you have empty cells or cells with small expected frequencies. In such cases, you can request Fisher's exact test with the **exact** option.

```
tabulate rep78 foreign, chi2 exact

           |        foreign
     rep78 |         0          1 |     Total
-----------+----------------------+----------
         1 |         2          0 |         2
         2 |         8          0 |         8
         3 |        27          3 |        30
         4 |         9          9 |        18
         5 |         2          9 |        11
-----------+----------------------+----------
     Total |        48         21 |        69

          Pearson chi2(4) =  27.2640   Pr = 0.000
           Fisher's exact =                 0.000
```

## Correlation

We can use the **correlate** command to get the
correlations among variables. Let's look at the correlations among **price**,
**mpg**, **weight**, and **rep78**. (We use **rep78**
in the correlation even though it is not continuous to illustrate what happens
when you use **correlate** with variables that have missing data.)

```
correlate price mpg weight rep78
(obs=69)

         |    price      mpg   weight    rep78
---------+------------------------------------
   price |   1.0000
     mpg |  -0.4559   1.0000
  weight |   0.5478  -0.8055   1.0000
   rep78 |   0.0066   0.4023  -0.4003   1.0000
```

Note that the output above said (obs=69). The **correlate** command drops data on a
**listwise** basis,
meaning that if any of the variables are missing, then the entire observation
is omitted from the correlation analysis.

We can use **pwcorr** (pairwise correlations) if we want correlations that delete missing data on a
**pairwise** basis instead of a listwise basis. We will use the **obs**
option to show the number of observations used for calculating each
correlation.

```
pwcorr price mpg weight rep78, obs

          |    price      mpg   weight    rep78
----------+------------------------------------
    price |   1.0000
          |       74
          |
      mpg |  -0.4686   1.0000
          |       74       74
          |
   weight |   0.5386  -0.8072   1.0000
          |       74       74       74
          |
    rep78 |   0.0066   0.4023  -0.4003   1.0000
          |       69       69       69       69
          |
```

Note how the correlations that involve **rep78** have an N of 69 compared to the other correlations that have an N of 74. This is because
**rep78**
has five missing values, so it only had 69 valid observations, but the other
variables had no missing data so they had 74 valid observations.
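If you also want a significance level for each correlation, **pwcorr** can print a p-value beneath each coefficient with the **sig** option; this is a variation on the command above (output not shown here):

```
* pairwise correlations with the N and p-value shown for each pair
pwcorr price mpg weight rep78, obs sig
```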

## Regression

Let’s look at doing regression analysis in Stata. For this example,
let’s drop the cases where **rep78** is 1 or 2 or missing.

```
drop if (rep78 <= 2) | (rep78 == .)
(15 observations deleted)
```

Now, let’s predict **mpg** from **price** and **weight**. As you see below,
**weight** is a significant predictor of **mpg**, but
**price** is not.

```
regress mpg price weight

  Source |       SS       df       MS              Number of obs =      59
---------+------------------------------           F(  2,    56) =   47.87
   Model |  1375.62097     2  687.810483           Prob > F      =  0.0000
Residual |  804.616322    56  14.3681486           R-squared     =  0.6310
---------+------------------------------           Adj R-squared =  0.6178
   Total |  2180.23729    58  37.5902981           Root MSE      =  3.7905

------------------------------------------------------------------------------
     mpg |      Coef.   Std. Err.       t     P>|t|     [95% Conf. Interval]
---------+--------------------------------------------------------------------
   price |  -.0000139   .0002108     -0.066   0.948     -.0004362     .0004084
  weight |   -.005828   .0007301     -7.982   0.000     -.0072906    -.0043654
   _cons |   39.08279   1.855011     21.069   0.000      35.36676     42.79882
------------------------------------------------------------------------------
```

What if we wanted to predict **mpg** from **rep78** as well?
**rep78** is really more of a categorical variable than a continuous variable, so to include it in the regression we should convert
**rep78** into dummy variables. Fortunately, Stata makes dummy variables easily with the **tabulate** command. The
**gen(rep)** option tells Stata that we want to generate dummy variables from
**rep78** and that we want the stem of the dummy variables to be **rep**.

```
tabulate rep78, gen(rep)

      rep78 |      Freq.     Percent        Cum.
------------+-----------------------------------
          3 |         30       50.85       50.85
          4 |         18       30.51       81.36
          5 |         11       18.64      100.00
------------+-----------------------------------
      Total |         59      100.00
```

Stata has created **rep1** (1 if **rep78** is 3), **rep2** (1 if
**rep78** is 4) and **rep3** (1 if **rep78** is 5). We can use the
**tabulate** command to verify that the dummy variables were created
properly.

```
tabulate rep78 rep1

           | rep78== 3.0000
     rep78 |         0          1 |     Total
-----------+----------------------+----------
         3 |         0         30 |        30
         4 |        18          0 |        18
         5 |        11          0 |        11
-----------+----------------------+----------
     Total |        29         30 |        59

tabulate rep78 rep2

           | rep78== 4.0000
     rep78 |         0          1 |     Total
-----------+----------------------+----------
         3 |        30          0 |        30
         4 |         0         18 |        18
         5 |        11          0 |        11
-----------+----------------------+----------
     Total |        41         18 |        59

tabulate rep78 rep3

           | rep78== 5.0000
     rep78 |         0          1 |     Total
-----------+----------------------+----------
         3 |        30          0 |        30
         4 |        18          0 |        18
         5 |         0         11 |        11
-----------+----------------------+----------
     Total |        48         11 |        59
```

Now we can include **rep1** and
**rep2** as dummy variables in the regression model, leaving the group where **rep78** is 5 as the omitted reference group.

```
regress mpg price weight rep1 rep2

      Source |       SS       df       MS              Number of obs =      59
-------------+------------------------------           F(  4,    54) =   26.04
       Model |  1435.91975     4  358.979938           Prob > F      =  0.0000
    Residual |  744.317536    54  13.7836581           R-squared     =  0.6586
-------------+------------------------------           Adj R-squared =  0.6333
       Total |  2180.23729    58  37.5902981           Root MSE      =  3.7126

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       price |  -.0001126   .0002133    -0.53   0.600    -.0005403    .0003151
      weight |   -.005107   .0008236    -6.20   0.000    -.0067584   -.0034557
        rep1 |  -2.886288   1.504639    -1.92   0.060    -5.902908    .1303314
        rep2 |   -2.88417   1.484817    -1.94   0.057    -5.861048    .0927086
       _cons |   39.89189   1.892188    21.08   0.000     36.09828     43.6855
------------------------------------------------------------------------------
```
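Since this page was tested in Stata 12, you could also skip the manual dummy variables entirely and use factor-variable notation, which was introduced in Stata 11. Note that **i.rep78** uses the lowest category (**rep78** of 3) as the base rather than 5, so the individual dummy coefficients are parameterized differently, though the overall model fit is the same:

```
* equivalent model using factor variables; i.rep78 generates the
* indicator variables automatically (base category: rep78 == 3)
regress mpg price weight i.rep78
```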

## Analysis of variance

If you wanted to do an analysis of variance looking at the differences in **mpg** among the
three repair groups, you can use the **oneway** command to do
this.

```
oneway mpg rep78

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      506.325167      2   253.162583      8.47     0.0006
 Within groups      1673.91212     56   29.8912879
------------------------------------------------------------------------
    Total           2180.23729     58   37.5902981

Bartlett's test for equal variances:  chi2(2) =   9.9384  Prob>chi2 = 0.007
```

If you include the **tabulate** option, you get the mean **mpg** for the three groups, which shows that the group with the best repair rating (**rep78** of 5) also has the highest
**mpg**
(27.4).

```
oneway mpg rep78, tabulate

            |            Summary of mpg
      rep78 |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          3 |   19.433333   4.1413252          30
          4 |   21.666667   4.9348699          18
          5 |   27.363636   8.7323849          11
------------+------------------------------------
      Total |    21.59322   6.1310927          59

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      506.325167      2   253.162583      8.47     0.0006
 Within groups      1673.91212     56   29.8912879
------------------------------------------------------------------------
    Total           2180.23729     58   37.5902981

Bartlett's test for equal variances:  chi2(2) =   9.9384  Prob>chi2 = 0.007
```
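The overall F test tells you the group means differ, but not which pairs of groups differ. **oneway** can follow up with multiple-comparison tests of the group means; for example, the **bonferroni** option adds Bonferroni-adjusted pairwise comparisons (output not shown here):

```
* one-way ANOVA with Bonferroni-adjusted pairwise comparisons
* of the group means (scheffe and sidak are also available)
oneway mpg rep78, tabulate bonferroni
```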

If you want to include covariates, you need to use the **anova** command. The
**c.** prefix (as in **c.price c.weight**) tells Stata that those variables are continuous
covariates.

```
anova mpg rep78 c.price c.weight

                       Number of obs =      59     R-squared     =  0.6586
                       Root MSE      = 3.71263     Adj R-squared =  0.6333

              Source |  Partial SS    df       MS           F     Prob > F
          -----------+----------------------------------------------------
               Model |  1435.91975     4   358.979938      26.04     0.0000
                     |
               rep78 |  60.2987853     2   30.1493926       2.19     0.1221
               price |   3.8421233     1    3.8421233       0.28     0.5997
              weight |  529.932889     1   529.932889      38.45     0.0000
                     |
            Residual |  744.317536    54   13.7836581
          -----------+----------------------------------------------------
               Total |  2180.23729    58   37.5902981
```