Little and Rubin (2002 p. 87) recommend a three step procedure for using multiple imputation with bootstrap standard errors:
- Generate bootstrap samples from the unimputed data;
- Impute missing values in each bootstrap sample;
- Run MI analyses in each of the bootstrap samples.
This page will show you how to perform these steps in Stata, along with some practical advice for doing so. The steps for programming this in Stata are as follows:
- Start with an analysis model (or models) in mind;
- Design the imputation model and write the appropriate syntax;
- Write the syntax for the analysis model;
- Place the imputation and analysis models in a Stata program so that they can be run using bootstrap;
- Run the program from step 3 using the bootstrap command
Lets start by opening our dataset, and taking a very basic look at the dataset. The dataset has demographic and academic achievement information on 200 students, with some missing values.
use https://stats.idre.ucla.edu/stat/stata/seminars/missing_data/hsb2_mar, clear (highschool and beyond (200 cases)) sum Variable | Obs Mean Std. Dev. Min Max -------------+-------------------------------------------------------- id | 200 100.5 57.87918 1 200 female | 182 .5549451 .4983428 0 1 race | 200 3.43 1.039472 1 4 ses | 200 2.055 .7242914 1 3 schtyp | 200 1.16 .367526 1 2 -------------+-------------------------------------------------------- prog | 182 2.027473 .6927511 1 3 read | 191 52.28796 10.21072 28 76 write | 183 52.95082 9.257773 31 67 math | 185 52.8973 9.360837 33 75 science | 184 51.30978 9.817833 26 74 -------------+-------------------------------------------------------- socst | 200 52.405 10.73579 26 71
For this example, we will use students’ reading ability (read) and gender (female) to predict their social studies achievement score (socst). Without the MI or bootstrap, the model we want to test looks like this:
regress socst write female
Next we want to by set up an imputation model for your data. You can do this working with the original data. Based on the analysis model, correlations, patterns of missingness, and our substantive knowledge of the data, we might set up the following imputation model:
mi impute mvn read write math female = socst, add(5)
There are a few important things to note about this imputation model. First, we have used the mi impute mvn (introduced in Stata 11) to perform a multivariate normal imputation in this example, but we also could have used another command to impute, for example we could have used the user-written program ice or mi impute chained (introduced in Stata 12) to perform imputation by chained equations. Second, because this model is used only for illustrative purposes, it is very simple, it is unlikely that your imputation model with be this small. Finally, we have specified m(5) indicating that only 5 imputations will be used, you will probably want to use more imputations.
Based on the above, the steps for running the imputation model using mi impute mvn, and then running the analysis using mi estimate are shown below. Note that we have set the seed for the random number generator so that we can reproduce the imputation results
mi set wide mi register imputed write read math female mi impute mvn read math female = socst, add(5) Performing EM optimization: observed log likelihood = -903.61117 at iteration 7 Performing MCMC data augmentation ... Multivariate imputation Imputations = 5 Multivariate normal regression added = 5 Imputed: m=1 through m=5 updated = 0 Prior: uniform Iterations = 500 burn-in = 100 between = 100 | Observations per m |---------------------------------------------- Variable | complete incomplete imputed | total ---------------+-----------------------------------+---------- read | 191 9 9 | 200 math | 185 15 15 | 200 female | 182 18 18 | 200 -------------------------------------------------------------- (complete + incomplete = total; imputed is the minimum across m of the number of filled in observations.) mi estimate: regress socst write female Multiple-imputation estimates Imputations = 5 Linear regression Number of obs = 183 Average RVI = 0.0762 Complete DF = 180 DF adjustment: Small sample DF: min = 65.35 avg = 139.04 max = 177.51 Model F test: Equal FMI F( 2, 99.3) = 46.37 Within VCE type: OLS Prob > F = 0.0000 ------------------------------------------------------------------------------ socst | Coef. Std. Err. t P>|t| [95% Conf. Interval] -------------+---------------------------------------------------------------- write | .7140314 .0709243 10.07 0.000 .5740502 .8540127 female | -1.877115 1.42335 -1.32 0.192 -4.719454 .9652245 _cons | 15.79222 3.704673 4.26 0.000 8.481352 23.10309 ------------------------------------------------------------------------------
After we run the regression using mi estimate: the regression coefficients are stored in the matrix e(b_mi). Below we show this matrix using the matrix list command, the values in the matrix will match the coefficient estimates in the regression table shown above. When we turn this analysis into a program, we will need to access this matrix in order to store the coefficient estimates for the bootstrap command.
matrix list e(b_mi) e(b_mi)[1,3] write female _cons r1 .71403142 -1.8771148 15.792222
Note that we could also have used the user-written program mim: to estimate this model, in which case the coefficients would be stored in the matrix e(MIM_Q).
Now that we have syntax for the imputation and analysis models we can turn them into a program. We start our program using the program define command, this is followed by the name of the program, in this case myboot. The rclass option following the comma is necessary because it allows us to store the coefficient estimates for use by another command, which is necessary if we want to use our program with the bootstrap command. The first two lines of the program should look familiar, as they contain the imputation and analysis models shown above. The following three lines all store the regression results for later use. The first return scalar command stores the value of el(e(b_mi),1,1) under the name b_w. The el(…) function takes a single element from a matrix, in this case, it takes the value from the first row, and first column of the matrix e(b_mi), which is the matrix that contains our coefficients. In other words, this stores the value of the coefficient for write under the name b_w. The following two lines do the same for the coefficient for female and the intercept, by taking values from the second and third columns of the matrix e(b_mi). Finally the end command tells Stata that this is the end of the program myboot.
program define myboot, rclass mi impute mvn read write math female = socst, add(5) mi estimate: regress read write female return scalar b_w = el(e(b_mi),1,1) return scalar b_f = el(e(b_mi),1,2) return scalar b_c = el(e(b_mi),1,3) end
To store the program in Stata’s working memory, we just run the above syntax as a single block. Next we can run our program with the bootstrap command to get bootstrapped standard errors. In the syntax below we open the dataset and mi set the data, followed by registering the imputed variables using mi register. Next we set the seed, so that the results can be reproduced. Finally, we use the bootstrap command to run our program (i.e. myboot). Following the name of the bootstrap command we see the portion of the command that tells bootstrap what information to collect across the samples and how to label them. For example, b_write=r(b_w) tells the bootstrap command that the program myboot returns a scalar called r(b_w) and that this information should be collected and output using the label b_write. In this case, b_write is the coefficient for write, b_female, and intercept are the coefficients for female and the intercept respectively. Following the instructions for what information is to be collected, and the comma, the reps(#) option controls the number of bootstrap samples that will be used to produce the estimates. For this example we have requested 5 replicates so that the example will run quickly, however, in practice you probably want far more replicates. The Stata manual suggests that 50-200 replicates may be sufficient for estimation of standard errors under certain assumptions, however, depending on the specific situation, 1000 or more replicates may be necessary to obtain good bootstrap estimates. Because missing values must be imputed in each bootstrap sample, which can itself be time consuming, there may be a trade off between the number of imputations in each bootstrap sample, and the number of bootstrap samples used.
use https://stats.idre.ucla.edu/stat/stata/seminars/missing_data/hsb2_mar, clear (highschool and beyond (200 cases)) mi set wide mi register imputed write read math female set seed 23543 bootstrap b_write=r(b_w) b_female=r(b_f) intercept=r(b_c), reps(5) : myboot (running myboot on estimation sample) Warning: Because myboot is not an estimation command or does not set e(sample), bootstrap has no way to determine which observations are used in calculating the statistics and so assumes that all observations are used. This means that no observations will be excluded from the resampling because of missing values or other reasons. If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded. Be sure that the dataset in memory contains only the relevant data. Bootstrap replications (5) ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 ..... Bootstrap results Number of obs = 200 Replications = 5 command: myboot b_write: r(b_w) b_female: r(b_f) intercept: r(b_c) ------------------------------------------------------------------------------ | Observed Bootstrap Normal-based | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- b_write | .7115871 .0530001 13.43 0.000 .6077088 .8154654 b_female | -1.273985 1.308135 -0.97 0.330 -3.837882 1.289911 intercept | 15.45521 3.289257 4.70 0.000 9.008381 21.90203 ------------------------------------------------------------------------------
As a final word of caution, Little and Rubin note that with moderate (as opposed to large) sample sizes, it may be necessary to modify the imputation model for some of the bootstrap samples. This automated approach will not work in this case.
References
Little, Roderick J. A., Donald B. Rubin. 2002. Statistical Analysis with Missing Data, Second Edition. Wiley-Interscience: Hoboken, New Jersey.