How can I perform bootstrap estimation with multiply imputed data?

Little and Rubin (2002 p. 87) recommend a three step procedure for using multiple imputation with bootstrap standard errors:

Generate bootstrap samples from the unimputed data;
Impute missing values in each bootstrap sample;
Run MI analyses in each of the bootstrap samples.

This page will show you how to perform these steps in Stata, along with some practical advice for doing so. The steps for programming this in Stata are as follows:

Start with an analysis model (or models) in mind;
Design the imputation model and write the appropriate syntax;
Write the syntax for the analysis model;
Place the imputation and analysis models in a Stata program so that they can be run using bootstrap;
Run the program from step 3 using the bootstrap command

Lets start by opening our dataset, and taking a very basic look at the dataset. The dataset has demographic and academic achievement information on 200 students, with some missing values.

use https://stats.idre.ucla.edu/stat/stata/seminars/missing_data/hsb2_mar, clear
(highschool and beyond (200 cases))

sum

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
          id |       200       100.5    57.87918          1        200
      female |       182    .5549451    .4983428          0          1
        race |       200        3.43    1.039472          1          4
         ses |       200       2.055    .7242914          1          3
      schtyp |       200        1.16     .367526          1          2
-------------+--------------------------------------------------------
        prog |       182    2.027473    .6927511          1          3
        read |       191    52.28796    10.21072         28         76
       write |       183    52.95082    9.257773         31         67
        math |       185     52.8973    9.360837         33         75
     science |       184    51.30978    9.817833         26         74
-------------+--------------------------------------------------------
       socst |       200      52.405    10.73579         26         71

For this example, we will use students’ reading ability (read) and gender (female) to predict their social studies achievement score (socst). Without the MI or bootstrap, the model we want to test looks like this:

regress socst write female

Next we want to by set up an imputation model for your data. You can do this working with the original data. Based on the analysis model, correlations, patterns of missingness, and our substantive knowledge of the data, we might set up the following imputation model:

mi impute mvn read write math female = socst, add(5)

There are a few important things to note about this imputation model. First, we have used the mi impute mvn (introduced in Stata 11) to perform a multivariate normal imputation in this example, but we also could have used another command to impute, for example we could have used the user-written program ice or mi impute chained (introduced in Stata 12) to perform imputation by chained equations. Second, because this model is used only for illustrative purposes, it is very simple, it is unlikely that your imputation model with be this small. Finally, we have specified m(5) indicating that only 5 imputations will be used, you will probably want to use more imputations.

Based on the above, the steps for running the imputation model using mi impute mvn, and then running the analysis using mi estimate are shown below. Note that we have set the seed for the random number generator so that we can reproduce the imputation results

mi set wide
mi register imputed write read math female 
mi impute mvn read math female = socst, add(5)

Performing EM optimization:
  observed log likelihood = -903.61117 at iteration 7

Performing MCMC data augmentation ... 

Multivariate imputation                 Imputations =        5
Multivariate normal regression                added =        5
Imputed: m=1 through m=5                    updated =        0

Prior: uniform                           Iterations =      500
                                            burn-in =      100
                                            between =      100

               |              Observations per m              
               |----------------------------------------------
      Variable |   complete   incomplete   imputed |     total
---------------+-----------------------------------+----------
          read |        191            9         9 |       200
          math |        185           15        15 |       200
        female |        182           18        18 |       200
--------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled in observations.)


mi estimate: regress socst write female

Multiple-imputation estimates                     Imputations     =          5
Linear regression                                 Number of obs   =        183
                                                  Average RVI     =     0.0762
                                                  Complete DF     =        180
DF adjustment:   Small sample                     DF:     min     =      65.35
                                                          avg     =     139.04
                                                          max     =     177.51
Model F test:       Equal FMI                     F(   2,   99.3) =      46.37
Within VCE type:          OLS                     Prob > F        =     0.0000

------------------------------------------------------------------------------
       socst |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       write |   .7140314   .0709243    10.07   0.000     .5740502    .8540127
      female |  -1.877115    1.42335    -1.32   0.192    -4.719454    .9652245
       _cons |   15.79222   3.704673     4.26   0.000     8.481352    23.10309
------------------------------------------------------------------------------

After we run the regression using mi estimate: the regression coefficients are stored in the matrix e(b_mi). Below we show this matrix using the matrix list command, the values in the matrix will match the coefficient estimates in the regression table shown above. When we turn this analysis into a program, we will need to access this matrix in order to store the coefficient estimates for the bootstrap command.

matrix list e(b_mi)

e(b_mi)[1,3]
         write      female       _cons
r1   .71403142  -1.8771148   15.792222

Note that we could also have used the user-written program mim: to estimate this model, in which case the coefficients would be stored in the matrix e(MIM_Q).

Now that we have syntax for the imputation and analysis models we can turn them into a program. We start our program using the program define command, this is followed by the name of the program, in this case myboot. The rclass option following the comma is necessary because it allows us to store the coefficient estimates for use by another command, which is necessary if we want to use our program with the bootstrap command. The first two lines of the program should look familiar, as they contain the imputation and analysis models shown above. The following three lines all store the regression results for later use. The first return scalar command stores the value of el(e(b_mi),1,1) under the name b_w. The el(…) function takes a single element from a matrix, in this case, it takes the value from the first row, and first column of the matrix e(b_mi), which is the matrix that contains our coefficients. In other words, this stores the value of the coefficient for write under the name b_w. The following two lines do the same for the coefficient for female and the intercept, by taking values from the second and third columns of the matrix e(b_mi). Finally the end command tells Stata that this is the end of the program myboot.

program define myboot, rclass
   mi impute mvn read write math female = socst, add(5) 
   mi estimate: regress read write female

   return scalar b_w = el(e(b_mi),1,1)
   return scalar b_f = el(e(b_mi),1,2)
   return scalar b_c = el(e(b_mi),1,3)
end

To store the program in Stata’s working memory, we just run the above syntax as a single block. Next we can run our program with the bootstrap command to get bootstrapped standard errors. In the syntax below we open the dataset and mi set the data, followed by registering the imputed variables using mi register. Next we set the seed, so that the results can be reproduced. Finally, we use the bootstrap command to run our program (i.e. myboot). Following the name of the bootstrap command we see the portion of the command that tells bootstrap what information to collect across the samples and how to label them. For example, b_write=r(b_w) tells the bootstrap command that the program myboot returns a scalar called r(b_w) and that this information should be collected and output using the label b_write. In this case, b_write is the coefficient for write, b_female, and intercept are the coefficients for female and the intercept respectively. Following the instructions for what information is to be collected, and the comma, the reps(#) option controls the number of bootstrap samples that will be used to produce the estimates. For this example we have requested 5 replicates so that the example will run quickly, however, in practice you probably want far more replicates. The Stata manual suggests that 50-200 replicates may be sufficient for estimation of standard errors under certain assumptions, however, depending on the specific situation, 1000 or more replicates may be necessary to obtain good bootstrap estimates. Because missing values must be imputed in each bootstrap sample, which can itself be time consuming, there may be a trade off between the number of imputations in each bootstrap sample, and the number of bootstrap samples used.

use https://stats.idre.ucla.edu/stat/stata/seminars/missing_data/hsb2_mar, clear
(highschool and beyond (200 cases))

mi set wide
mi register imputed write read math female 

set seed 23543

bootstrap b_write=r(b_w) b_female=r(b_f) intercept=r(b_c), reps(5) : myboot 
(running myboot on estimation sample)

Warning:  Because myboot is not an estimation command or does not set e(sample),
          bootstrap has no way to determine which observations are used in calculating
          the statistics and so assumes that all observations are used.  This means that
          no observations will be excluded from the resampling because of missing values
          or other reasons.

          If the assumption is not true, press Break, save the data, and drop the
          observations that are to be excluded.  Be sure that the dataset in memory
          contains only the relevant data.

Bootstrap replications (5)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
.....

Bootstrap results                               Number of obs      =       200
                                                Replications       =         5

      command:  myboot
      b_write:  r(b_w)
     b_female:  r(b_f)
    intercept:  r(b_c)

------------------------------------------------------------------------------
             |   Observed   Bootstrap                         Normal-based
             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     b_write |   .7115871   .0530001    13.43   0.000     .6077088    .8154654
    b_female |  -1.273985   1.308135    -0.97   0.330    -3.837882    1.289911
   intercept |   15.45521   3.289257     4.70   0.000     9.008381    21.90203
------------------------------------------------------------------------------

As a final word of caution, Little and Rubin note that with moderate (as opposed to large) sample sizes, it may be necessary to modify the imputation model for some of the bootstrap samples. This automated approach will not work in this case.

References

Little, Roderick J. A., Donald B. Rubin. 2002. Statistical Analysis with Missing Data, Second Edition. Wiley-Interscience: Hoboken, New Jersey.