Little and Rubin (2002 p. 87) recommend a three step procedure for using multiple imputation with bootstrap standard errors:

- Generate bootstrap samples from the unimputed data;
- Impute missing values in each bootstrap sample;
- Run MI analyses in each of the bootstrap samples.

This page will show you how to perform these steps in Stata, along with some practical advice for doing so. The steps for programming this in Stata are as follows:

- Start with an analysis model (or models) in mind;
- Design the imputation model and write the appropriate syntax;
- Write the syntax for the analysis model;
- Place the imputation and analysis models in a Stata program so that they can be run using bootstrap;
- Run the program from step 3 using the bootstrap command

Lets start by opening our dataset, and taking a very basic look at the dataset. The dataset has demographic and academic achievement information on 200 students, with some missing values.

use https://stats.idre.ucla.edu/stat/stata/seminars/missing_data/hsb2_mar, clear(highschool and beyond (200 cases))sumVariable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- id | 200 100.5 57.87918 1 200 female | 182 .5549451 .4983428 0 1 race | 200 3.43 1.039472 1 4 ses | 200 2.055 .7242914 1 3 schtyp | 200 1.16 .367526 1 2 -------------+-------------------------------------------------------- prog | 182 2.027473 .6927511 1 3 read | 191 52.28796 10.21072 28 76 write | 183 52.95082 9.257773 31 67 math | 185 52.8973 9.360837 33 75 science | 184 51.30978 9.817833 26 74 -------------+-------------------------------------------------------- socst | 200 52.405 10.73579 26 71

For this example, we will use students’ reading ability (**read**) and gender (**female**) to predict
their social studies achievement score (**socst**). Without the MI or bootstrap, the model we want to test looks like this:

regress socst write female

Next we want to by set up an imputation model for your data. You can do this working with the original data. Based on the analysis model, correlations, patterns of missingness, and our substantive knowledge of the data, we might set up the following imputation model:

mi impute mvn read write math female = socst, add(5)

There are a few important things to note about this imputation model. First, we have used the **mi impute mvn** (introduced
in Stata 11) to perform a multivariate normal imputation in this example, but we
also could have used another
command to impute, for example we could have used the user-written
program **ice** or **mi impute chained** (introduced in Stata 12) to perform imputation by chained equations.
Second, because this model is used only for illustrative purposes, it is very
simple, it is unlikely that your imputation model with be this small. Finally,
we have specified **m(5)** indicating that only 5 imputations will be used,
you will probably want to use more imputations.

Based on the above, the steps for running the imputation model using **mi
impute mvn**, and then
running the analysis using **mi estimate** are shown below. Note that we have
set the seed for the random number generator so that we can reproduce the
imputation results

mi set wide mi register imputed write read math female mi impute mvn read math female = socst, add(5)Performing EM optimization: observed log likelihood = -903.61117 at iteration 7 Performing MCMC data augmentation ... Multivariate imputation Imputations = 5 Multivariate normal regression added = 5 Imputed: m=1 through m=5 updated = 0 Prior: uniform Iterations = 500 burn-in = 100 between = 100 | Observations per m |---------------------------------------------- Variable | complete incomplete imputed | total ---------------+-----------------------------------+---------- read | 191 9 9 | 200 math | 185 15 15 | 200 female | 182 18 18 | 200 -------------------------------------------------------------- (complete + incomplete = total; imputed is the minimum across m of the number of filled in observations.)mi estimate: regress socst write femaleMultiple-imputation estimates Imputations = 5 Linear regression Number of obs = 183 Average RVI = 0.0762 Complete DF = 180 DF adjustment: Small sample DF: min = 65.35 avg = 139.04 max = 177.51 Model F test: Equal FMI F( 2, 99.3) = 46.37 Within VCE type: OLS Prob > F = 0.0000 ------------------------------------------------------------------------------ socst | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- write | .7140314 .0709243 10.07 0.000 .5740502 .8540127 female | -1.877115 1.42335 -1.32 0.192 -4.719454 .9652245 _cons | 15.79222 3.704673 4.26 0.000 8.481352 23.10309 ------------------------------------------------------------------------------

After we run the regression using **mi estimate:** the regression
coefficients are stored in the matrix **e(b_mi)**. Below we show this matrix using the
**matrix list** command, the values in the matrix will match the coefficient
estimates in the regression table shown above. When we turn this analysis into a
program, we will need to access this matrix in order to store the coefficient
estimates for the bootstrap command.

matrix list e(b_mi)e(b_mi)[1,3] write female _cons r1 .71403142 -1.8771148 15.792222

Note that we could also have used the user-written program **mim:** to
estimate this model, in which case the coefficients would be stored in the
matrix **e(MIM_Q)**.

Now that we have syntax for the imputation and analysis models we can turn
them into a program. We start our program using the **program define**
command, this is followed by the name of the program, in this case **myboot**.
The **rclass** option following the comma is necessary because it allows us
to store the coefficient estimates for use by another command, which is
necessary if we want to use our program with the **bootstrap** command. The
first two lines of the program should look familiar, as they contain the
imputation and analysis models shown above. The following three lines all store
the regression results for later use. The first **return scalar** command
stores the value of **el(e(b_mi),1,1) **under the name **b_w**. The **
el(…)** function takes a single element from a matrix, in this case, it
takes the value from the first row, and first column of the matrix **e(b_mi)**,
which is the matrix that contains our coefficients. In other words, this stores
the value of the coefficient for **write** under the name **b_w**. The
following two lines do the same for the coefficient for **female** and the
intercept, by taking values from the second and third columns of the matrix **
e(b_mi)**. Finally the **end** command tells Stata that this is the end of
the program **myboot**.

program define myboot, rclass mi impute mvn read write math female = socst, add(5) mi estimate: regress read write female return scalar b_w = el(e(b_mi),1,1) return scalar b_f = el(e(b_mi),1,2) return scalar b_c = el(e(b_mi),1,3) end

To store the program in Stata’s working memory, we just run the above syntax
as a single block. Next we can run our program with the bootstrap command to get
bootstrapped standard errors. In the syntax below we open the dataset and **mi
set** the data, followed by registering the imputed variables using **mi
register**. Next we set the seed, so that the results can be reproduced. Finally,
we use the **bootstrap** command to run our program (i.e. **myboot**).
Following the name of the bootstrap command we see the portion of the command
that tells bootstrap what information to collect across the samples and how to
label them. For example, **b_write=r(b_w)** tells the bootstrap command that
the program myboot returns a scalar called **r(b_w)** and that this
information should be collected and output using the label **b_write**. In
this case, **b_write** is the coefficient for write, **b_female**, and **
intercept** are the coefficients for female and the intercept respectively.
Following the instructions for what information is to be collected, and the
comma, the **reps(#)** option controls the number of bootstrap samples that
will be used to produce the estimates. For this example we have requested 5
replicates so that the example will run quickly, however, in practice you
probably want far more replicates. The Stata manual suggests that 50-200
replicates may be sufficient for estimation of standard errors under certain
assumptions, however, depending on the specific situation, 1000 or more
replicates may be necessary to obtain good bootstrap estimates. Because missing
values must be imputed in each bootstrap sample, which can itself be time
consuming, there may be a trade off between the number of imputations in each
bootstrap sample, and the number of bootstrap samples used.

use https://stats.idre.ucla.edu/stat/stata/seminars/missing_data/hsb2_mar, clear(highschool and beyond (200 cases))mi set widemi register imputed write read math femaleset seed 23543bootstrap b_write=r(b_w) b_female=r(b_f) intercept=r(b_c), reps(5) : myboot(running myboot on estimation sample) Warning: Because myboot is not an estimation command or does not set e(sample), bootstrap has no way to determine which observations are used in calculating the statistics and so assumes that all observations are used. This means that no observations will be excluded from the resampling because of missing values or other reasons. If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded. Be sure that the dataset in memory contains only the relevant data. Bootstrap replications (5) ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 ..... Bootstrap results Number of obs = 200 Replications = 5 command: myboot b_write: r(b_w) b_female: r(b_f) intercept: r(b_c) ------------------------------------------------------------------------------ | Observed Bootstrap Normal-based | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- b_write | .7115871 .0530001 13.43 0.000 .6077088 .8154654 b_female | -1.273985 1.308135 -0.97 0.330 -3.837882 1.289911 intercept | 15.45521 3.289257 4.70 0.000 9.008381 21.90203 ------------------------------------------------------------------------------

As a final word of caution, Little and Rubin note that with moderate (as opposed to large) sample sizes, it may be necessary to modify the imputation model for some of the bootstrap samples. This automated approach will not work in this case.

**References**

Little, Roderick J. A., Donald B. Rubin. 2002. Statistical Analysis with Missing Data, Second Edition. Wiley-Interscience: Hoboken, New Jersey.