Note that Stata’s **mi** commands were introduced in version 11, so the code below
is not applicable to earlier versions.

One common storage method for multiply imputed (MI) datasets is to include the m (i.e. number of imputations) MI datasets in a single file. For example, if five imputations were created, there would be five copies of each case (i.e. five rows in the dataset for each case) in a single file. Some MI datasets also contain an additional copy of the data, the original (pre-imputation) data, so that there would be six rows for each case if there were five imputations. Either format provides the information necessary to carry out data analysis on the MI datasets. However, Stata’s **mi** commands expect the original (pre-imputation) data to be included in the MI dataset. If the original data is not included, the commands won’t work properly. Below we explain the problem and describe how to modify a dataset released without the original data so that the original data is included in the MI file.

Note that if you are working with a National Health and Nutrition Examination
Survey (NHANES) or similarly formatted MI datasets, you may want to use Stata’s **mi import nhanes1**
command instead of the procedure described below. For information on using **mi
import nhanes1**, type "help mi import nhanes1" (without the
quotes) in the Stata command window.

## Explanation of the problem

Below is a small example of an original (pre-imputation) dataset.
**id** is the case id variable, and **v1**–**v3** are variables with some missing values.

    id  v1  v2  v3
     1   9   3   4
     2   4   .   2
     3   .   2   .

If we created three imputations (i.e. m=3), the dataset might look like the
dataset shown below, where **m** is the imputation number. Since case 1 (**id**=1) had
complete data, all three of its rows are identical; in the other two cases, the
imputed values vary across the imputations.

    m  id  v1  v2  v3
    1   1   9   3   4
    1   2   4   2   2
    1   3   4   2   5
    2   1   9   3   4
    2   2   4   3   2
    2   3   2   2   3
    3   1   9   3   4
    3   2   4   3   2
    3   3   2   2   4

As we discussed above, there is nothing wrong with this format. However, Stata expects that the original (unimputed) dataset is included (denoted m=0). In this format, the example dataset from above would look like this:

    m  id  v1  v2  v3
    0   1   9   3   4
    0   2   4   .   2
    0   3   .   2   .
    1   1   9   3   4
    1   2   4   2   2
    1   3   4   2   5
    2   1   9   3   4
    2   2   4   3   2
    2   3   2   2   3
    3   1   9   3   4
    3   2   4   3   2
    3   3   2   2   4
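As a quick check of the layout Stata expects, the small m=0 example above can be typed in and imported directly. This is only a sketch: it assumes Stata 11 or later and uses the toy values from the table, not a real dataset. The **clear** option on **mi import flong** is needed here because the typed-in data has not been saved.

```stata
* Toy dataset in the layout Stata expects: m=0 holds the original data
clear
input m id v1 v2 v3
0 1 9 3 4
0 2 4 . 2
0 3 . 2 .
1 1 9 3 4
1 2 4 2 2
1 3 4 2 5
2 1 9 3 4
2 2 4 3 2
2 3 2 2 3
3 1 9 3 4
3 2 4 3 2
3 3 2 2 4
end

* Declare the data as multiply imputed in flong style
mi import flong, m(m) id(id) imputed(v1 v2 v3) clear

* Summarize what Stata now believes about the data
mi describe
```

Because m=0 contains the missing values, **mi describe** should count the imputations as M = 3 rather than treating one of them as the original data.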

Below we show what happens when one tries to use the **mi import** command
to import data without the original data (m=0). First we open a dataset and
tabulate the variable **m**, which takes on five values, one
for each imputation.

    use http://www.ats.ucla.edu/stat/stata/faq/hsb2_no_m_0
    (highschool and beyond (200 cases))

    tab m

     imputation |
         number |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              1 |        200       20.00       20.00
              2 |        200       20.00       40.00
              3 |        200       20.00       60.00
              4 |        200       20.00       80.00
              5 |        200       20.00      100.00
    ------------+-----------------------------------
          Total |      1,000      100.00

Below we use the **mi import** command to tell Stata that our data is multiply imputed.
The **m(**…**)** option identifies the variable that contains the imputation number,
**id(**…**)** gives
the individual id number for each case, and **imputed(**…**)** gives the
names of the variables that have
been imputed.

    mi import flong, m(m) id(id) imputed(female math write read science)
    (36 values of imputed variable female in m>0 updated to match values in m=0)
    (48 values of imputed variable math in m>0 updated to match values in m=0)
    (52 values of imputed variable write in m>0 updated to match values in m=0)
    (76 values of imputed variable read in m>0 updated to match values in m=0)
    (92 values of imputed variable science in m>0 updated to match values in m=0)

This produces five messages from Stata; each message informs us that Stata
has changed the values in the imputed datasets to match existing values in what
it assumes to be the original data (i.e. m=0).
But the dataset we started with didn’t contain m=0, so Stata assumed the lowest
value of **m** (m=1) was actually m=0. Below is a cross tab of the **m** variable
from our dataset and the system variable **_mi_m** (created by
the **mi import** command to index the imputations). The cross tab
shows how Stata renumbered the imputations. In and of itself, this isn’t a problem: **m** is just an identifier, so in many ways its value
is arbitrary. However, Stata makes the assumption that m=0 (i.e. _mi_m=0)
is the pre-imputation dataset. Since m=1 (which became _mi_m=0) contains no
missing data, Stata assumes that its values are all observed, and replaces the imputed values in the other four MI datasets
with the values in m=1. If m=1 were the original data, this would make perfect sense;
after all, we don’t need imputed values for cases where we have observed values.
However, because m=1 does not contain the original (unimputed) data, this creates
problems.

    tab m _mi_m

    imputation |                         _mi_m
        number |         0          1          2          3          4 |     Total
    -----------+-------------------------------------------------------+----------
             1 |       200          0          0          0          0 |       200
             2 |         0        200          0          0          0 |       200
             3 |         0          0        200          0          0 |       200
             4 |         0          0          0        200          0 |       200
             5 |         0          0          0          0        200 |       200
    -----------+-------------------------------------------------------+----------
         Total |       200        200        200        200        200 |     1,000

To further show what is going on, we use the **mi describe** command. It tells us that
there are 200 complete observations (since all observations in m=1 are complete), and that M=4 (i.e.
we have four imputed datasets), when there are actually five.

    mi describe

      Style:  flong
              last mi update approximately 1 minute ago

      Obs.:   complete          200
              incomplete          0  (M = 4 imputations)
              ---------------------
              total             200

      Vars.:  imputed:  5; female(0) math(0) write(0) read(0) science(0)

              passive: 0

              regular: 0

              system:  3; _mi_m _mi_id _mi_miss

             (there are 12 unregistered variables)
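The problem described above can be spotted before running **mi import**. The sketch below is one possible check, not part of the original procedure. It uses the small three-imputation example (without m=0) rather than the hsb2 file, and the variable name **nmiss** and the local macro names are made up for illustration. It is meant to be run as a do-file.

```stata
* Toy MI dataset released WITHOUT the original (m=0) data
clear
input m id v1 v2 v3
1 1 9 3 4
1 2 4 2 2
1 3 4 2 5
2 1 9 3 4
2 2 4 3 2
2 3 2 2 3
3 1 9 3 4
3 2 4 3 2
3 3 2 2 4
end

* Find the lowest imputation number; mi import treats it as m=0
quietly summarize m
local lowest = r(min)

* Count rows at that value of m that contain any missing values
egen nmiss = rowmiss(v1 v2 v3)
quietly count if m == `lowest' & nmiss > 0
local n_incomplete = r(N)
drop nmiss

display "lowest value of m: `lowest'"
display "rows with missing values at that m: `n_incomplete'"

* Warn if the file does not look like it contains the original data
if `lowest' != 0 | `n_incomplete' == 0 {
    display as error "no original (m=0) data detected; add it before mi import"
}
```

A lowest value other than 0, or a lowest-numbered dataset with no missing values at all, suggests that the file holds only imputed data and needs the fix described in the next section.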

## The solution

In order to have **mi import** properly import our data, we need to create a dataset of the form Stata expects,
that is, a dataset where m=0
contains the original (unimputed) data, and m>0 contains the multiply imputed datasets.

Below we start by loading the MI dataset. Next we keep only one of the imputations (**keep if m==1**)
and set the value of
m to 0 (using the **replace** command).

    use http://www.ats.ucla.edu/stat/stata/faq/hsb2_no_m_0, clear
    (highschool and beyond (200 cases))

    keep if m==1
    (800 observations deleted)

    replace m=0
    (200 real changes made)

For the next step, we need to know which variables have imputed values, and
for each imputed variable, we need a variable that indicates which observations were imputed. In our case each indicator variable is named after the
imputed variable,
prefixed by **i_**. For example, the variable **i_female** is equal to 1 when the value of
**female** has been imputed, and 0 when it has not. (If your dataset does not
have this type of imputation indicator, it can be created; see below.) We can use these
indicator variables to recreate the missing values in the original dataset. For
the variable **female**, the command to create missing values where **female** has
been imputed is **replace female = . if i_female==1**. However, if you had more than a few
imputed variables, writing out the command for each variable would be somewhat tedious, so instead we use a loop to
do the same thing. The **foreach** command tells Stata that for each variable
that follows the keyword **varlist**, it should perform the action in the braces,
filling in **`var'** with the name of the variable. The output from running
this command shows how many values were changed for each variable. For example,
for **female** (the first variable in the list), 9 values were changed to missing.

    foreach var of varlist female math write read science {
        replace `var' = . if i_`var'==1
    }
    (9 real changes made, 9 to missing)
    (12 real changes made, 12 to missing)
    (13 real changes made, 13 to missing)
    (19 real changes made, 19 to missing)
    (23 real changes made, 23 to missing)

Now we have a dataset with the same missing data structure as the original (unimputed) dataset.
Below we use **append** to add the cases from our starting dataset (**hsb2_no_m_0.dta**).
The output tells us that value labels in the dataset we’re appending already exist in the current
dataset; this is expected because they began as the same dataset. Next, we tabulate
**m**, the variable
for imputation number. Note that we now have values of **m** from 0 to 5.
Finally, we save the new dataset under a different name.

    append using hsb2_no_m_0
    (label rl already defined)
    (label sl already defined)
    (label scl already defined)
    (label sel already defined)
    (label fl already defined)

    tab m

     imputation |
         number |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        200       16.67       16.67
              1 |        200       16.67       33.33
              2 |        200       16.67       50.00
              3 |        200       16.67       66.67
              4 |        200       16.67       83.33
              5 |        200       16.67      100.00
    ------------+-----------------------------------
          Total |      1,200      100.00

    save hsb2_m
    file hsb2_m.dta saved

Now when we use the **mi import** command, instead of changing values as it
did above, Stata marks the 66 observations with missing values as incomplete.
The output from the command **mi describe** reports
66 incomplete observations and M = 5 imputations. Further
down, the output also lists the variables that have been imputed, along with what
we know to be the
correct number of imputed values for each.

    mi import flong, m(m) id(id) imputed(female math write read science)
    (66 m=0 obs. now marked as incomplete)

    mi describe

      Style:  flong
              last mi update 0 seconds ago

      Obs.:   complete          134
              incomplete         66  (M = 5 imputations)
              ---------------------
              total             200

      Vars.:  imputed:  5; female(9) math(12) write(13) read(19) science(23)

              passive: 0

              regular: 0

              system:  3; _mi_m _mi_id _mi_miss

             (there are 12 unregistered variables)
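For reference, the whole procedure can be run end to end on the small three-imputation example from the first section. This is a sketch under stated assumptions: the indicator variables **i_v1**–**i_v3** are invented for illustration (the toy example above does not include them), and a temporary file stands in for the saved MI dataset. Run it as a do-file so that the temporary file name stays defined.

```stata
* Toy MI data without m=0; i_v1-i_v3 are hypothetical imputation indicators
clear
input m id v1 v2 v3 i_v1 i_v2 i_v3
1 1 9 3 4 0 0 0
1 2 4 2 2 0 1 0
1 3 4 2 5 1 0 1
2 1 9 3 4 0 0 0
2 2 4 3 2 0 1 0
2 3 2 2 3 1 0 1
3 1 9 3 4 0 0 0
3 2 4 3 2 0 1 0
3 3 2 2 4 1 0 1
end

* Keep a copy of the imputed data to append later
tempfile imputed
save `imputed'

* Step 1: turn one imputation into a stand-in for the original data
keep if m == 1
replace m = 0

* Step 2: restore the missing values wherever a value was imputed
foreach var of varlist v1 v2 v3 {
    replace `var' = . if i_`var' == 1
}

* Step 3: add the imputed datasets back underneath the rebuilt m=0 data
append using `imputed'
sort m id

* Step 4: import; clear is needed because the data in memory is unsaved
mi import flong, m(m) id(id) imputed(v1 v2 v3) clear
mi describe
```

The steps mirror the hsb2 example: keep one imputation, relabel it m=0, blank out the imputed values, append the full set of imputations, and only then call **mi import**.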

## Creating indicators for imputed values

The example above assumes that there are variables that mark the imputed
observations. That is, for each variable that has been imputed, there is a
variable marking which cases have imputed values and which have observed values.
If you are working with a dataset that does not have these indicators, it is
possible to create them. The first two lines of
code below open the dataset
and sort by case id (**id**). The third line of code below creates a new variable
**i_female**
that is equal to the standard deviation of **female**, for each value of **id**. If
**female** was observed for a given case, the value of **female** will be the same across
the imputed datasets and the standard deviation within that case will be equal to zero.
If **female** was imputed, the value is likely to vary across the imputations (see note
below), and the standard deviation will be greater than 0. In the final line, we
replace values of **i_female** greater than zero with the value 1, so that
**i_female** is equal to 1 if **female** appears to have been imputed, and 0
otherwise.

    use http://www.ats.ucla.edu/stat/stata/faq/hsb2_no_m_0_ind, clear
    sort id
    by id: egen i_female = sd(female)
    replace i_female = 1 if i_female>0

For a single variable, this works fine. However, if the dataset
contains many imputed variables, this process would be labor intensive
and error prone. So instead of writing out the code for each variable, we can use a loop to do it.
Below we open the dataset and sort the cases by **id**. Next we use the **foreach** command to generate
the indicator variables for each of the variables listed after the keyword **varlist**.
Stata will run the commands in the braces once for each variable in the list, each
time replacing **`var'** with the name of the appropriate variable.

    use hsb2_no_m_0_ind, clear
    sort id
    foreach var of varlist female math write read science {
        by id: egen i_`var' = sd(`var')
        replace i_`var' = 1 if i_`var'>0
    }
    (45 real changes made)
    (60 real changes made)
    (65 real changes made)
    (95 real changes made)
    (115 real changes made)

Note that this technique assumes that the imputed value for a single case will vary across
imputations. This is likely to be true with continuous variables and with
categorical variables imputed using the multivariate normal approach. For
categorical variables imputed using the chained equations approach (implemented in the Stata package ice
as well as in other packages), this will often be true, but there may be exceptions (i.e. the same value was imputed across all imputations). In that case the standard deviation is zero and the value is treated as if it had been observed. That
said, even if the imputed value does not vary across imputations, the
approach outlined above will still work to set up the data for use with Stata’s
**mi** commands, at least mechanically.
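
The same idea can be tried on the small three-imputation example from the first section. The sketch below is a variant, not the method shown above: instead of the standard deviation, it compares the smallest and largest value of each variable within a case, which avoids relying on a computed standard deviation being exactly zero. The helper variable names **lo_** and **hi_** are made up for illustration.

```stata
* Toy MI data without m=0 and without imputation indicators
clear
input m id v1 v2 v3
1 1 9 3 4
1 2 4 2 2
1 3 4 2 5
2 1 9 3 4
2 2 4 3 2
2 3 2 2 3
3 1 9 3 4
3 2 4 3 2
3 3 2 2 4
end

foreach var of varlist v1 v2 v3 {
    * Smallest and largest value of the variable within each case
    bysort id: egen double lo_`var' = min(`var')
    bysort id: egen double hi_`var' = max(`var')

    * Flag the variable as imputed if its value differs across imputations
    generate byte i_`var' = (lo_`var' != hi_`var')
    drop lo_`var' hi_`var'
}

sort m id
list m id v1 v2 v3 i_v1 i_v2 i_v3, sepby(m)
```

Like the standard deviation approach, this variant cannot detect a value that was imputed identically in every imputation; such a value is flagged as observed.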