The purpose of this seminar is to explore some issues in the analysis of survey data in Stata 9. Before we begin looking at examples in Stata, we will quickly review some basic issues and concepts in survey data analysis.

## Why do we need survey data analysis software?

Regular statistical software (that is not designed for survey data) analyzes data as if the data were collected using simple random sampling. For experimental and quasi-experimental designs, this is exactly what we want. However, when surveys are conducted, a simple random sample is rarely collected. Not only is it nearly impossible to do so, but it is not as efficient (both financially and statistically) as other sampling methods. When any sampling method other than simple random sampling is used, we usually need to use survey data analysis software to take into account the differences between the design that was used and simple random sampling. This is because the sampling design affects both the calculation of the point estimates and the standard errors of the estimates.

If you ignore the sampling design, e.g., if you assume simple random sampling when another type of sampling design was used, the standard errors will likely be underestimated, possibly leading to results that seem to be statistically significant when, in fact, they are not. The difference in point estimates and standard errors obtained using non-survey software and survey software with the design properly specified will vary from data set to data set, and even between variables within the same data set. While it may be possible to get reasonably accurate results using non-survey software, there is no practical way to know beforehand how far off the results from non-survey software will be.

## Sampling designs

Most people do not conduct their own surveys. Rather, they use survey data that some agency or company collected and made available to the public. The documentation must be read carefully to find out what kind of sampling design was used to collect the data. This is very important because many of the estimates and standard errors are calculated differently for the different sampling designs. Hence, if you mis-specify the sampling design, the point estimates and standard errors will likely be wrong.

Below are some common features of many sampling designs.

__Sampling weights__: There are several types of weights that
can be associated with a survey. Perhaps the most common is the sampling weight, sometimes called a probability weight,
which is used to weight the sample back to the population from which the sample
was drawn. By definition, this weight is the inverse of the probability of being included in the
sample due to the sampling design (except for a certainty PSU, see below).
The probability weight, called a pweight in Stata, is calculated as N/n, where N = the number of elements in the
population and n = the number of elements in the sample. For example, if a population has 10
elements and 3 are sampled at random with replacement, then the probability weight would be
10/3 = 3.33. In a two-stage design, the probability weight is calculated as f_{1}f_{2},
which means that the inverse of the sampling fraction for the first stage is
multiplied by the inverse of the sampling fraction for the second stage.
Under many sampling plans, the sum of the probability weights will equal the population total.
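The weight arithmetic just described can be sketched in a few lines of Python. The single-stage numbers come from the text; the two-stage sampling fractions are hypothetical, invented only to illustrate the f1 × f2 rule:

```python
# Single-stage pweight: the inverse of the inclusion probability, N/n.
N, n = 10, 3                 # population of 10 elements, sample of 3 (from the text)
pweight = N / n
print(round(pweight, 2))     # 3.33

# Two-stage design: the overall weight is f1 * f2, the product of the
# inverse sampling fractions at each stage (hypothetical numbers).
f1 = 500 / 25                # stage 1: 25 of 500 districts sampled -> 20.0
f2 = 40 / 4                  # stage 2: 4 of 40 schools per sampled district -> 10.0
print(f1 * f2)               # 200.0: each sampled school stands in for 200 schools
```

Note that summing each sampled school's weight of 200.0 over the 100 sampled schools recovers the 20,000 schools in this hypothetical population, matching the statement above that the weights often sum to the population total.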

While many textbooks will end their discussion of probability weights here, this definition does not fully describe the probability weights that are included with actual survey data sets. Rather, the "final weight" usually starts with the inverse of the sampling fraction, but then incorporates several other values, such as corrections for unit non-response, errors in the sampling frame (sometimes called non-coverage), and poststratification. Because these other values are included in the probability weight that is included with the data set, it is often inadvisable to modify the probability weights, such as trying to standardize them for a particular variable, e.g., age.

__PSU__: This is the **p**rimary **s**ampling **u**nit.
This is the first unit that is sampled in the design. For example, school
districts from California may be sampled and then schools within districts may
be sampled. The school district would be the PSU. If states from the US were sampled, and then school districts from within each
state, and then schools from within each district, then states would be the PSU.
One does not need to use the same sampling method at all levels of sampling.
For example, probability-proportional-to-size sampling may be used at
level 1 (to select states), while cluster sampling is used at level 2
(to select school districts). In the case of a simple random sample, the
PSUs and the elementary units are the same. In general, accounting for the
clustering in the data (i.e., using the PSUs) will increase the standard errors
of the estimates. Conversely, ignoring the PSUs will tend to yield
standard errors that are too small, leading to false positives when doing
significance tests.

__Strata__: Stratification is a method of breaking up the
population into different groups, often by demographic variables such as gender,
race or SES. Each element in the population must belong to one, and only
one, stratum. Once the strata have been defined, one samples from each
stratum as if it were independent of all of the other strata. For example,
if a sample is to be stratified on gender, men and women would be sampled
independently of one another. This means that the probability weights for men will
likely be different from the probability weights for the women. In most cases, you
need to have two or more PSUs in each stratum. The purpose of
stratification is to reduce the standard error of the estimates, and stratification works most
effectively when the variance of the dependent variable is smaller within the
strata than in the sample as a whole.
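Because each stratum is sampled independently, the weights are computed stratum by stratum. A minimal sketch, with hypothetical population and sample counts:

```python
# Hypothetical stratified design: sample sizes are chosen independently
# within each stratum, so the weights differ across strata.
pop  = {"men": 4000, "women": 6000}   # population counts (hypothetical)
samp = {"men": 200,  "women": 150}    # per-stratum sample sizes (hypothetical)

# Per-stratum probability weight: N_h / n_h for stratum h.
weights = {g: pop[g] / samp[g] for g in pop}
print(weights)   # {'men': 20.0, 'women': 40.0}
```

Here women were deliberately undersampled relative to men, so each sampled woman carries twice the weight of each sampled man, which is exactly why the weights for men and women will usually differ.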

__FPC__: This is the **f**inite **p**opulation **c**orrection.
This is used when the sampling fraction (the number of elements or respondents
sampled relative to the population) becomes large. The FPC is used in the
calculation of the standard error of the estimate. If the value of the FPC
is close to 1, it will have little impact and can be safely ignored. In
some survey data analysis programs, such as SUDAAN, this information will be
needed if you specify that the data were collected without replacement
(see below for a definition of "without replacement"). The
formula for calculating the FPC is ((N-n)/(N-1))^{1/2}, where N is the
number of elements in the population and n is the number of elements in the
sample. To see the impact of the FPC for
samples of various proportions, suppose that you had
a population of 10,000 elements.

| Sample size (n) | FPC    |
|-----------------|--------|
| 1               | 1.0000 |
| 10              | .9995  |
| 100             | .9950  |
| 500             | .9747  |
| 1000            | .9487  |
| 5000            | .7071  |
| 9000            | .3162  |
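The table values can be reproduced directly from the formula; a minimal sketch:

```python
import math

def fpc(N, n):
    """Finite population correction: ((N - n) / (N - 1)) ** (1/2)."""
    return math.sqrt((N - n) / (N - 1))

# Reproduce the table above for a population of 10,000 elements.
for n in (1, 10, 100, 500, 1000, 5000, 9000):
    print(f"{n:5d}  {fpc(10_000, n):.4f}")
```

As the sampling fraction grows, the FPC shrinks toward zero, which is why standard errors computed with the FPC are smaller than those computed without it.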

__Replicate weights__: Replicate weights are a series of weight
variables that are used to correct the standard errors for the sampling plan.
They serve the same function as the PSU and strata (which use a Taylor series
linearization) to correct the standard errors of the estimates for the sampling
design. Many public use data sets are now being released with
replicate weights instead of PSUs and strata in an effort to more securely
protect the identity of the respondents. In theory, the same standard
errors will be obtained using either the PSU and strata or the replicate
weights. There are different ways of creating replicate weights; the
method used is determined by the sampling plan. The most common are
balanced repeated and jackknife replicate weights. You will need to read
the documentation for the survey data set carefully to learn what type of
replicate weight is included in the data set; specifying the wrong type of
replicate weight will likely lead to incorrect standard errors. For more
information on replicate weights, please see the Stata Library page on
replicate weights and Appendix D of the WesVar Manual by Westat, Inc.
Several statistical packages, including Stata 9, SUDAAN 9,
WesVar and R, allow the use of replicate weights.

## Consequences of not using the design elements

Sampling design elements include the sampling weights, post-stratification weights (if provided), PSUs, strata, and replicate weights. Rarely are all of these elements included in a particular public-use data set. However, ignoring the design elements that are included can often lead to inaccurate point estimates and/or inaccurate standard errors.

## Sampling with and without replacement

Most samples collected in the real world are collected "without replacement". This means that once a respondent has been selected to be in the sample and has participated in the survey, that particular respondent cannot be selected again to be in the sample. Many of the calculations change depending on whether a sample is collected with or without replacement. Hence, programs like SUDAAN request that you specify whether the survey sampling design was implemented with or without replacement, and an FPC is used if sampling was done without replacement, even if the value of the FPC is very close to one.

## Examples

For the examples in this seminar, we will use the Adult data set from NHANES
III. The data set, setup file and documentation can be downloaded from
the NHANES web site. An executable file, ADULT.exe (16 MB), is available
that contains the data (65 MB), the SAS code to set up the data (128 KB),
and the documentation (1 MB).

The NHANES III data sets were released with both pseudo-PSUs/pseudo-strata and replicate weights. We will show examples using both of these methods of variance estimation. For more information on the setup for NHANES III using other packages, and setups using other commonly used public-use survey data sets, please see our page on sample setups for commonly used survey data sets.

## Reading the documentation

The first step in analyzing any survey data set is to read the documentation. With many of the public use data sets, the documentation can be quite extensive and sometimes even intimidating. Instead of trying to read the documentation "cover to cover", there are some parts you will want to focus on. First, read the Introduction. This is usually an "easy read" and will orient you to the survey. There is usually a section or chapter called "Sample Design and Analysis Guidelines", "Variance Estimation", etc. This is the part that tells you about the design elements included with the survey and how to use them. Some even give example code (although usually for SUDAAN). If multiple sampling weights have been included in the data set, there will be some instruction about when to use which one. If there is a section or chapter on missing data or imputation, please read that. This will tell you how missing data were handled. You should also read any documentation regarding the specific variables that you intend to use. As we will see a little later on, we will need to look at the documentation to get the value labels for the variables. This is especially important because some of the values are actually missing data codes, and you need to do something so that Stata doesn’t treat those as valid values (or you will get some very "interesting" means, totals, etc.). Despite the length of the SAS code that comes with the data set, the value labels are not included.

## Getting the data into Stata

The first, and most obvious, thing that you need to do after downloading the data is to get the data into Stata. The data file itself is actually an ASCII file, and ASCII files can easily be read into Stata. However, the SAS code contains a lot (but not all) of the information that is needed to make the numbers in the ASCII file meaningful. For example, the SAS code contains the column numbers for each variable, the variable name, and the variable label. It does not, however, contain the value labels; you need to get those from the documentation. If you really had to, you could open the SAS code in any text editor and "copy and paste" the information into a Stata do-file, and modify it as needed to make a program that can read the ASCII file. Even if you are not a SAS user, this is probably much more work than making the few necessary modifications to the SAS code provided by NHANES. Let’s look at some of the SAS code to see what we need to modify to get it to run. The first few lines of the SAS code look like this:

```
FILENAME ADULT "D:\Questionnaire\DAT\ADULT.DAT" LRECL=3348;

*** LRECL includes 2 positions for CRLF, assuming use of PC SAS;

DATA WORK;
INFILE ADULT MISSOVER;
LENGTH
   SEQN 7
   DMPFSEQ 5
```

and the last few lines look like this:

```
HAZMNK5R = "Average K5 BP from household and MEC"
HAZNOK5R = "Number of BP's used for average K5";
```

There are two things that you will need to change. The first is the path specification in the first line; leave the quotes and the file name (ADULT.DAT) as they are. The second thing that needs to be changed is at the very end of the file: replace the box (an unprintable character) with

```
run;
```

Once you have made these changes, you can click on the running person at the top of the SAS screen (assuming that nothing is highlighted), and the whole program will run. You should glance through the log file looking for anything in red print that indicates an error (such as the .dat file isn’t in the location that you specified, which is a common mistake). To make sure that everything worked as planned, you can run the following command:

```
proc contents data = work; run;
```

This should give you some output telling you about the data set. Once
you know that the data set is in SAS correctly, we can now move it over into
Stata. First, we need to save the SAS data set to our hard drive. On
the **data** statement, specify the path where you want the SAS data set saved.

```
data "D:\data\working\nhanes_adult1"; set work; run;
```

Now that the SAS data set is saved to our hard drive, we can use a program like Stat/Transfer to transfer the data into Stata format.

## The svyset command

Before you do anything else, please make sure that your Stata is up-to-date. You can type

```
update all
```

and follow the instructions given. Stata has made many nice upgrades to
the **svy:** commands since Stata 9 was released, so updating is a really
good idea.

When you first try to open the NHANES data set in Stata, you will likely get an error message about being out of memory. Unlike SAS and SPSS, Stata holds a data set in memory, so if an insufficient amount of RAM is allocated to Stata, Stata won’t be able to read in the data set. You get around this by increasing the memory:

```
set memory 50m
```

Now you should be able to open the data set. (If 50m doesn’t do it on
your computer, try 70m, 90m, etc.) Now that the data are in Stata, we need
to do one more thing before starting our analyses: we need to issue the **
svyset** command. The **svyset** command tells Stata about the design
elements in the survey. Once this command has been issued, all you need to
do for your analyses is use the **svy:** prefix before each command.
Because the NHANES III data were released with both PSUs/strata and replicate
weights, we have a choice of how to specify the **svyset** command. We
will illustrate both ways, starting with the PSU/strata variables.
Now if you read the NHANES documentation on variance estimation, you would know that these
aren’t really the PSUs and strata used in the data collection. Rather,
they are pseudo-PSUs and pseudo-strata. Noise was added to the original
PSU and strata variables to help protect the identity of the survey respondents,
which is why they are called pseudo-PSUs and pseudo-strata. Stata doesn’t
care about these variables being "pseudo", and they are used in the **svyset**
command as regular PSU and strata variables. In Stata 9 (but not earlier
versions of Stata), the **svyset** command looks like this:

```
. use "D:\data\working\nhanes_adult1", clear

. svyset sdppsu6 [pweight = wtpfqx6], strata(sdpstra6)

      pweight: wtpfqx6
          VCE: linearized
     Strata 1: sdpstra6
         SU 1: sdppsu6
        FPC 1: <zero>

. svydes

Survey: Describing stage 1 sampling units

      pweight: wtpfqx6
          VCE: linearized
     Strata 1: sdpstra6
         SU 1: sdppsu6
        FPC 1: <zero>

                                   #Obs per Unit
                                ----------------------------
 Stratum   #Units     #Obs       min      mean       max
--------  --------  --------  --------  --------  --------
       1         2       418       194     209.0       224
       2         2       478       222     239.0       256
       3         2       476       216     238.0       260
       4         2       381       180     190.5       201
       5         2       388       182     194.0       206
       6         2       404       194     202.0       210
       7         2       444       210     222.0       234
       8         2       430       210     215.0       220
       9         2       419       198     209.5       221
      10         2       471       229     235.5       242
      11         2       439       201     219.5       238
      12         2       379       179     189.5       200
      13         2       414       203     207.0       211
      14         2       482       228     241.0       254
      15         2       461       227     230.5       234
      16         2       555       273     277.5       282
      17         2       509       249     254.5       260
      18         2       474       236     237.0       238
      19         2       517       231     258.5       286
      20         2       480       233     240.0       247
      21         2       410       193     205.0       217
      22         2       499       227     249.5       272
      23         2       433       186     216.5       247
      24         2       467       224     233.5       243
      25         2       423       203     211.5       220
      26         2       438       179     219.0       259
      27         2       478       203     239.0       275
      28         2       472       233     236.0       239
      29         2       505       252     252.5       253
      30         2       475       224     237.5       251
      31         2       525       246     262.5       279
      32         2       427       207     213.5       220
      33         2       539       262     269.5       277
      34         2       492       222     246.0       270
      35         2       264       130     132.0       134
      36         2       219       108     109.5       111
      37         2       167        77      83.5        90
      38         2       185        86      92.5        99
      39         2       303       135     151.5       168
      40         2       218        98     109.0       120
      41         2       236        86     118.0       150
      42         2       197        91      98.5       106
      43         2       423       190     211.5       233
      44         2       497       220     248.5       277
      45         2       345       166     172.5       179
      46         2       217       107     108.5       110
      47         2       226        97     113.0       129
      48         2       594       274     297.0       320
      49         2       357       172     178.5       185
--------  --------  --------  --------  --------  --------
      49        98     20050        77     204.6       320

. svy: tab hssex
(running tabulate on estimation sample)

Number of strata =        49          Number of obs      =     20050
Number of PSUs   =        98          Population size    = 1.876e+08
                                      Design df          =        49

-----------------------
sex       | proportions
----------+------------
        1 |       .4777
        2 |       .5223
          |
    Total |           1
-----------------------
  Key:  proportions = cell proportions

. svy: tab hssex, count missing
(running tabulate on estimation sample)

Number of strata =        49          Number of obs      =     20050
Number of PSUs   =        98          Population size    = 1.876e+08
                                      Design df          =        49

----------------------
sex       |      count
----------+-----------
        1 |    9.0e+07
        2 |    9.8e+07
          |
    Total |    1.9e+08
----------------------
  Key:  count = weighted counts

. svy: tab hssex, count cellwidth(10) format(%15.2g)
(running tabulate on estimation sample)

Number of strata =        49          Number of obs      =     20050
Number of PSUs   =        98          Population size    = 1.876e+08
                                      Design df          =        49

----------------------
sex       |      count
----------+-----------
        1 |   89637541
        2 |   98009665
          |
    Total |    1.9e+08
----------------------
  Key:  count = weighted counts

. label define sex 1 male 2 female
. label values hssex sex
. label define race 1 white 2 black 3 other 8 MAUR
. * MAUR = "Mexican-American of Unknown Race"
. label values dmaracer race
. label define mar 1 "married house" 2 "married not in house" ///
    3 "living as married" 4 widowed 5 divorced 6 separated ///
    7 "never married" 8 blank 9 DK
. label values hfa12 mar
. label define food 1 enough 2 sometimes 3 "often not enough" 8 blank
. label values hff4 food
. label define yn 1 yes 2 no 8 blank 9 DK
. foreach var of varlist hfa13 hfe7 hfe8a hfe8b hfe8c hfe8d hfe8e hff6a ///
    hff6b hff6c hff6d hff7 hff8 hff9 hff10 hff11 hah13-hah17 hav5 {
    label values `var' yn
  }
. label define hah 1 "no difficulty" 2 "some difficulty" 3 "much difficulty" ///
    4 "unable to do" 8 blank 9 DK
. foreach var of varlist hah1-hah12 {
    label values `var' hah
  }

. svy: tab hah1, count cellwidth(15) format(%15.2f) missing
(running tabulate on estimation sample)

Number of strata =        49          Number of obs      =     20050
Number of PSUs   =        98          Population size    = 1.876e+08
                                      Design df          =        49

---------------------------
difficult |
y walking |
a quarter |
of a mile |           count
----------+----------------
 no diffi |     97132055.68
 some dif |      8379187.23
 much dif |      3032776.77
 unable t |      4769359.13
    blank |       130435.68
       DK |      1236016.62
        . |     72967375.21
          |
    Total |    187647206.32
---------------------------
  Key:  count = weighted counts

. ereturn list

scalars:
              e(stages) = 1
              e(N_over) = 1
              e(census) = 0
           e(singleton) = 0
       e(N_strata_omit) = 0
                e(df_r) = 49
                   e(N) = 20050
            e(N_strata) = 49
               e(N_psu) = 98
               e(N_pop) = 187647206.319999
                   e(r) = 7
                   e(c) = 1
               e(total) = 187647206.319999
              e(mgdeff) = .
             e(cvgdeff) = .

. display %15.2g e(N_pop)
      187647206

. svy: tab dmaracer, count cellwidth(15) format(%15.2f) missing
(running tabulate on estimation sample)

Number of strata =        49          Number of obs      =     20050
Number of PSUs   =        98          Population size    = 1.876e+08
                                      Design df          =        49

---------------------------
race      |           count
----------+----------------
    white |    158131126.11
    black |     21728087.51
    other |      7773929.35
     MAUR |        14063.35
          |
    Total |    187647206.32
---------------------------
  Key:  count = weighted counts

. svy: tab hfa12, count cellwidth(15) format(%15.2f) missing
(running tabulate on estimation sample)

Number of strata =        49          Number of obs      =     20050
Number of PSUs   =        98          Population size    = 1.876e+08
                                      Design df          =        49

---------------------------
marital   |
status    |           count
----------+----------------
  married |    106397053.60
  married |      2367517.23
 living a |      7442555.25
  widowed |     13310800.60
 divorced |     14540696.43
 separate |      4570546.85
 never ma |     38181973.78
       88 |       826108.26
       99 |         9954.32
          |
    Total |    187647206.32
---------------------------
  Key:  count = weighted counts

. * "Highest grade or yr of school completed"
. svy: mean hfa8r
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =        49          Number of obs      =     20050
Number of PSUs   =        98          Population size    =   1.9e+08
                                      Design df          =        49

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       hfa8r |   12.90589    .1268616      12.65095    13.16082
--------------------------------------------------------------
```

## The single PSU per stratum problem

Sometimes when using Stata you will get output like the following:

```
. * "# days had no food/money in past month"
. svy: mean hff5
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =        49          Number of obs      =      1333
Number of PSUs   =        97          Population size    =   7.1e+06
                                      Design df          =        48

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
        hff5 |   15.65079           .             .           .
--------------------------------------------------------------
Note:  Missing standard error due to stratum with single
       sampling unit; see help svydes.

. search nmissing

. nmissing hff5
hff5       18717

. svydes if hff5 != .

Survey: Describing stage 1 sampling units

      pweight: wtpfqx6
          VCE: linearized
     Strata 1: sdpstra6
         SU 1: sdppsu6
        FPC 1: <zero>

                                   #Obs per Unit
                                ----------------------------
 Stratum   #Units     #Obs       min      mean       max
--------  --------  --------  --------  --------  --------
       1         2        16         5       8.0        11
       2         2        14         6       7.0         8
       3         2        42        15      21.0        27
       4         2        15         6       7.5         9
       5         2        24        10      12.0        14
       6         2        13         6       6.5         7
       7         2         9         2       4.5         7
       8         2         8         4       4.0         4
       9         2        12         5       6.0         7
      10         2        15         7       7.5         8
      11         2        27        12      13.5        15
      12         2        10         5       5.0         5
      13         2        24        10      12.0        14
      14         2        26        11      13.0        15
      15         2        20         3      10.0        17
      16         2        28        14      14.0        14
      17        1*         9         9       9.0         9
      18         2        21        10      10.5        11
      19         2        50         9      25.0        41
      20         2        29        11      14.5        18
      21         2        10         4       5.0         6
      22         2        33         2      16.5        31
      23         2        13         5       6.5         8
      24         2        22         8      11.0        14
      25         2        50        22      25.0        28
      26         2        44        14      22.0        30
      27         2        28         9      14.0        19
      28         2        82        30      41.0        52
      29         2        81        25      40.5        56
      30         2        45        16      22.5        29
      31         2        34        14      17.0        20
      32         2        51        22      25.5        29
      33         2        19         9       9.5        10
      34         2        24         9      12.0        15
      35         2        13         6       6.5         7
      36         2        23        10      11.5        13
      37         2         7         2       3.5         5
      38         2         6         2       3.0         4
      39         2        31         9      15.5        22
      40         2        16         5       8.0        11
      41         2        36        10      18.0        26
      42         2        21        10      10.5        11
      43         2        26        10      13.0        16
      44         2        35        13      17.5        22
      45         2        46        19      23.0        27
      46         2        18         5       9.0        13
      47         2        20         7      10.0        13
      48         2        38        19      19.0        19
      49         2        49        19      24.5        30
--------  --------  --------  --------  --------  --------
      49        97      1333         2      13.7        56

     18717 = #Obs with missing values in the
  --------   survey characteristics
     20050
```

As you can see from the output above, stratum 17 is the problem. There are a couple of possible solutions. Perhaps the
best solution is to use the replicate weights instead of the PSUs/strata, which
is what we will show below. However, that is rarely an option, as few data
sets come with both PSUs/strata and replicate weights. A second option is
to see if the data set was released with imputed values. As we will see
towards the end of this seminar, some of the NHANES data sets were released as
multiply imputed data sets. A third possibility is to reissue the **
svyset** command without specifying the strata. This will affect the
standard errors, but in general, they should be larger than when the
stratification is used. Hence, not using the stratification should lead to a
more conservative test, which isn’t the worst thing in the world. A fourth
option is to use a different program, such as SUDAAN. SUDAAN handles this
situation differently than Stata does, so you don’t run into the single PSU per
stratum problem.

Now we get to what may be the less desirable options. One is to try to put the singleton PSU into another stratum. However, rarely does the documentation give you enough detail to know which stratum the PSU should be put into. Also, this can be really tricky if there are multiple strata with a single PSU. Another problem with this option is that you want to be consistent across analyses. This means that you would have to identify all of the variables that you will use in the paper or presentation and then figure out which strata have a single PSU. This could just lead to a mess. Another unsavory option is to try to impute the missing values. I don’t like this option because it is unlikely that any two users of the data would impute the missing values in exactly the same way. This means that even if both users ran exactly the same analysis, they would likely obtain different point estimates (because they are in effect no longer using the same data set).

Let’s reissue the **svyset** command, this time using the replicate
weights. First, we will clear the current settings.

svyset, clear

Next, we need to do a little math to convert the value of the Fay's adjustment
into the value expected by Stata's **fay()** option. Here is the formula: Fay = 1 - 1/sqrt(adjfay).
We will use the **vce(brr)** and **mse** options to obtain the standard errors given by SUDAAN.

```
. display 1-(1/sqrt(1.7))
.23303501

. svyset [pweight = wtpfqx6], brrweight(wtpqrp1 - wtpqrp52) fay(.23303501) vce(brr) mse

      pweight: wtpfqx6
          VCE: brr
          MSE: on
    brrweight: wtpqrp1 wtpqrp2 wtpqrp3 wtpqrp4 wtpqrp5 wtpqrp6 wtpqrp7
               wtpqrp8 wtpqrp9 wtpqrp10 wtpqrp11 wtpqrp12 wtpqrp13 wtpqrp14
               wtpqrp15 wtpqrp16 wtpqrp17 wtpqrp18 wtpqrp19 wtpqrp20 wtpqrp21
               wtpqrp22 wtpqrp23 wtpqrp24 wtpqrp25 wtpqrp26 wtpqrp27 wtpqrp28
               wtpqrp29 wtpqrp30 wtpqrp31 wtpqrp32 wtpqrp33 wtpqrp34 wtpqrp35
               wtpqrp36 wtpqrp37 wtpqrp38 wtpqrp39 wtpqrp40 wtpqrp41 wtpqrp42
               wtpqrp43 wtpqrp44 wtpqrp45 wtpqrp46 wtpqrp47 wtpqrp48 wtpqrp49
               wtpqrp50 wtpqrp51 wtpqrp52
          fay: .23303501
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: <zero>

. svy: mean hff5
(running mean on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Mean estimation               Number of obs      =      1333
                                      Population size    =   7.1e+06
                                      Replications       =        52
                                      Design df          =        51

--------------------------------------------------------------
             |                BRR *
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
        hff5 |   15.65079    1.789433      12.05836    19.24323
--------------------------------------------------------------
```
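The Fay conversion can be double-checked outside of Stata with the same formula (adjfay = 1.7, the value used in the **display** command above):

```python
import math

adjfay = 1.7                        # Fay's adjustment, as used in the text
fay = 1 - 1 / math.sqrt(adjfay)     # the value Stata's fay() option expects
print(f"{fay:.8f}")                 # 0.23303501
```

This matches the value passed to **fay()** in the **svyset** command.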

As you can see, the mean for the variable **hff5** is the same whether we
used the PSU/strata or the replicate weights, which it must be, since these
things only adjust the standard errors. However, we now have the standard
error.

Let’s look at one more example to see what other trouble missing data can get us
into. We will run the **nmissing** command to confirm that only **hff5**
has missing data, and then we will run two **svy: mean** commands, the first with only the variable
**hfa8r**, and
the second with both variables. As you can see, the mean for **hfa8r**
is different when run with **hff5**, because Stata is doing a listwise
deletion of incomplete cases before calculating the means. In other words,
even though there are no missing data from **hfa8r**, only 1333 of the 20050
data points are used in the calculation of the mean of **hfa8r** when you use both variables
together.
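Listwise deletion itself is easy to demonstrate outside of Stata. In this small sketch (made-up numbers, not NHANES data), the mean of a complete variable changes once cases that are missing on a second variable are dropped:

```python
# Hypothetical data: x is complete, y has missing values (None).
x = [12, 14, 9, 16, 11]
y = [None, 3, None, 7, None]

# Mean of x using all cases:
mean_all = sum(x) / len(x)
print(mean_all)          # 12.4

# Listwise deletion: keep only cases observed on *every* analysis variable.
pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
mean_listwise = sum(xi for xi, _ in pairs) / len(pairs)
print(mean_listwise)     # 15.0 -- based on only 2 of the 5 cases
```

The same mechanism is what drives the difference between the 20050-case and 1333-case means of **hfa8r** in the Stata output.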

```
. nmissing hfa8r hff5
hff5       18717

. svy: mean hfa8r
(running mean on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Mean estimation               Number of obs      =     20050
                                      Population size    =   1.9e+08
                                      Replications       =        52
                                      Design df          =        51

--------------------------------------------------------------
             |                BRR *
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       hfa8r |   12.90589    .1109928      12.68306    13.12871
--------------------------------------------------------------

. svy: mean hfa8r hff5
(running mean on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Mean estimation               Number of obs      =      1333
                                      Population size    =   7.1e+06
                                      Replications       =        52
                                      Design df          =        51

--------------------------------------------------------------
             |                BRR *
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       hfa8r |    17.9756    1.739453      14.48351     21.4677
        hff5 |   15.65079    1.789433      12.05836    19.24323
--------------------------------------------------------------
```

## Comparing variance estimation techniques

Now that we have seen the use of both the PSU/strata and the replicate
weights to adjust the standard errors, let’s compare the two side-by-side.
We will start by reissuing the **svyset** command using the PSU/strata.

```
. svyset, clear
no survey characteristics are set

. svyset sdppsu6 [pweight = wtpfqx6], strata(sdpstra6)

      pweight: wtpfqx6
          VCE: linearized
     Strata 1: sdpstra6
         SU 1: sdppsu6
        FPC 1: <zero>

. svy: mean hfa8r
(running mean on estimation sample)

Survey: Mean estimation

Number of strata =        49          Number of obs      =     20050
Number of PSUs   =        98          Population size    =   1.9e+08
                                      Design df          =        49

--------------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       hfa8r |   12.90589    .1268616      12.65095    13.16082
--------------------------------------------------------------
```

We can also issue the **estat** command. The Deff and the Deft are types of design
effects, which tell you about the efficiency of your sample. The Deff is a
ratio of two variances. In the numerator we have the variance estimate
from the current sample (including all of its design elements), and in the
denominator we have the variance from a hypothetical sample of the same size
drawn as an SRS. In other words, the Deff tells you how efficient your
sample is compared to an SRS of equal size. If the Deff is less than 1,
your sample is more efficient than SRS; usually the Deff is greater than 1.
The Deft is the ratio of two standard error estimates. Again, the
numerator is the standard error estimate from the current sample. The
denominator is a hypothetical SRS (with replacement) standard error from a
sample of the same size as the current sample. You can also use the **meff**
and **meft** options to get the misspecification effects.
Misspecification effects are a ratio of the variance estimate from the current
analysis to a hypothetical variance estimated from a misspecified model.
Please see the Stata documentation for more details on how these are calculated.

```
. estat effects

----------------------------------------------------------
             |             Linearized
             |       Mean   Std. Err.      Deff      Deft
-------------+--------------------------------------------
       hfa8r |   12.90589    .1268616   5.19536   2.27933
----------------------------------------------------------
```
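Since the Deff is a ratio of variances and the Deft is the corresponding ratio of standard errors, the Deft is simply the square root of the Deff. A quick check against the numbers in the output above (a sketch using the reported values, not a recomputation from the data):

```python
import math

se_design = 0.1268616    # linearized standard error of hfa8r, from the output
deff = 5.19536           # Deff reported by estat effects

# Deft = sqrt(Deff): the SE-scale version of the same design effect.
deft = math.sqrt(deff)
print(round(deft, 5))    # ~2.27933, matching the reported Deft

# Implied standard error under a hypothetical SRS of the same size:
se_srs = se_design / deft
print(round(se_srs, 6))  # the design-based SE is about 2.3 times larger
```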

Now let’s use the replicate weights and run the same analyses.

```
. svyset, clear
no survey characteristics are set

. svyset [pweight = wtpfqx6], brrweight(wtpqrp1 - wtpqrp52) fay(.23303501) vce(brr) mse

      pweight: wtpfqx6
          VCE: brr
          MSE: on
    brrweight: wtpqrp1 wtpqrp2 wtpqrp3 wtpqrp4 wtpqrp5 wtpqrp6 wtpqrp7
               wtpqrp8 wtpqrp9 wtpqrp10 wtpqrp11 wtpqrp12 wtpqrp13 wtpqrp14
               wtpqrp15 wtpqrp16 wtpqrp17 wtpqrp18 wtpqrp19 wtpqrp20 wtpqrp21
               wtpqrp22 wtpqrp23 wtpqrp24 wtpqrp25 wtpqrp26 wtpqrp27 wtpqrp28
               wtpqrp29 wtpqrp30 wtpqrp31 wtpqrp32 wtpqrp33 wtpqrp34 wtpqrp35
               wtpqrp36 wtpqrp37 wtpqrp38 wtpqrp39 wtpqrp40 wtpqrp41 wtpqrp42
               wtpqrp43 wtpqrp44 wtpqrp45 wtpqrp46 wtpqrp47 wtpqrp48 wtpqrp49
               wtpqrp50 wtpqrp51 wtpqrp52
          fay: .23303501
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: <zero>

. svy: mean hfa8r
(running mean on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Mean estimation               Number of obs      =     20050
                                      Population size    =   1.9e+08
                                      Replications       =        52
                                      Design df          =        51

--------------------------------------------------------------
             |                BRR *
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       hfa8r |   12.90589    .1109928      12.68306    13.12871
--------------------------------------------------------------

. estat effects

----------------------------------------------------------
             |                BRR *
             |       Mean   Std. Err.      Deff      Deft
-------------+--------------------------------------------
       hfa8r |   12.90589    .1109928    3.9769   1.99422
----------------------------------------------------------
```

As you can see, every part of the output is exactly the same except the
denominator df, the design df and the standard errors. Notice that the
standard errors are always larger for the analyses with the PSU/strata set than
for those with the replicate weights. That is not because the replicate-weight
method of variance estimation is more efficient than the linearization method.
Rather, these are pseudo-PSUs and pseudo-strata, and the noise added to the PSUs
and strata to make them pseudo inflates the standard errors. If we had access to
the real PSUs and strata and used those in the analyses, I suspect that the
standard errors would be extremely close to those obtained with the replicate
weights.
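For reference, the replicate-weight comparison works like this: each replicate weight yields a replicate estimate, and the (MSE-version) BRR variance with Fay's adjustment is the scaled sum of squared deviations of the replicate estimates from the full-sample estimate. A minimal sketch in Python; the Fay coefficient and full-sample mean are the ones from the output above, but the replicate estimates are made up for illustration:

```python
# Sketch of BRR variance estimation with Fay's adjustment (mse version):
#   V = 1 / (R * (1 - k)^2) * sum_r (theta_r - theta_full)^2
import math

k = 0.23303501          # Fay coefficient, from svyset ... fay(.23303501)
theta_full = 12.90589   # full-sample estimate (mean of hfa8r)

# hypothetical replicate estimates, one per BRR replicate weight
replicates = [12.91, 12.88, 12.95, 12.84, 12.93, 12.87]
R = len(replicates)

v = sum((t - theta_full) ** 2 for t in replicates) / (R * (1 - k) ** 2)
se = math.sqrt(v)
print(se)
```

With the real data there would be 52 replicate estimates, one per `wtpqrp*` weight, rather than the six invented here.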

## Analyses of subpopulations

The analysis of subpopulations is one place where survey data and
experimental data are quite different. If you have data from an experiment
(or quasi-experiment), and you want to analyze the responses from, say, just the
women, or just people over age 50, you can just delete the unwanted cases from
the data set or use the **by:** prefix. Survey data are different.
With survey data, you (almost) never get to delete any cases from the data set,
even if you will never use them in any of your analyses. Because of the
way the **by:** prefix works, you usually don’t use it with survey data
either. Instead, Stata has provided two options that allow you to
correctly analyze subpopulations of your survey data. These options are **
subpop** and **over**. The **subpop** option is sort of like
deleting unwanted cases (without really deleting them, of course), and the **
over** option is very similar to **by:** processing. We will start
with some examples of the **subpop** option.

But first, let’s take a second to see why deleting cases from a survey data
set can be so problematic. If the data set is subset (meaning that
observations not to be included in the subpopulation are deleted from the data
set), the standard errors of the estimates cannot be calculated correctly. When
the subpopulation option(s) is used, only the cases defined by the subpopulation
are used in the calculation of the estimate, but all cases are used in the
calculation of the standard errors. For more information on this issue, please
see
Sampling Techniques, Third Edition by William G. Cochran (1977) and
Small Area Estimation by J. N. K. Rao (2003). Also, if you look in the
Stata 9 Survey manual, you will find an entire section (pages 38-43) dedicated
to the analysis of subpopulations. The formulas for using both **if**
and **subpop** are given, along with an explanation of how they are
different. If you look at the help for any **svy:** command, you will
see the same warning:

```
Warning: Use of if or in restrictions will not produce correct variance
estimates for subpopulations in many cases. To compute estimates for
subpopulations, use the subpop() option. The full specification for
subpop() is

        subpop([varname] [if])
```
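The distinction can be illustrated numerically: with a subpopulation option, the point estimate is computed by zeroing out the weights of out-of-domain cases, so it matches what subsetting would give, while the full sample remains available to the variance formulas. A toy sketch in Python (invented data; not Stata's actual code):

```python
# Sketch of the idea behind subpop(): the point estimate uses only the
# in-domain cases, but the full sample is retained so that variance
# estimation "sees" every observation. Toy data, made up for illustration.
y = [10, 12, 14, 9, 11, 15]        # outcome
w = [1.5, 2.0, 1.0, 2.5, 1.0, 2.0] # sampling weights
d = [1, 0, 1, 1, 0, 1]             # subpopulation indicator (1 = in domain)

# Domain estimate: weight each observation by w * d, keeping all cases.
num = sum(wi * di * yi for yi, wi, di in zip(y, w, d))
den = sum(wi * di for wi, di in zip(w, d))
domain_mean = num / den

# Same point estimate as simply dropping the out-of-domain cases ...
sub = [(yi, wi) for yi, wi, di in zip(y, w, d) if di == 1]
dropped_mean = sum(wi * yi for yi, wi in sub) / sum(wi for _, wi in sub)
assert abs(domain_mean - dropped_mean) < 1e-12

# ... but the full data set (all six cases) is still present, which is
# what the survey variance formulas need; subsetting throws that away.
print(domain_mean)
```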

Now that we know what not to do, let’s see how to do it right. We will
start with a simple mean and then use **hssex** as our subpopulation
variable.

```
svy: mean hfa8r
(running mean on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Mean estimation             Number of obs    =   20050
                                    Population size  = 1.9e+08
                                    Replications     =      52
                                    Design df        =      51

--------------------------------------------------------------
             |       BRR *
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       hfa8r |   12.90589    .1109928     12.68306    13.12871
--------------------------------------------------------------

svy, subpop(hssex): mean hfa8r
Note: subpop() takes on values other than 0 and 1
      subpop() != 0 indicates subpopulation
(running mean on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Mean estimation             Number of obs    =   20050
                                    Population size  = 1.9e+08
                                    Subpop. no. obs  =   20050
                                    Subpop. size     = 1.9e+08
                                    Replications     =      52
                                    Design df        =      51

--------------------------------------------------------------
             |       BRR *
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       hfa8r |   12.90589    .1109928     12.68306    13.12871
--------------------------------------------------------------
```

Clearly, something went wrong here. The note at the top of the output
tells us what happened: Stata wants the subpopulation variable to be coded 0/1.
Let’s look at the coding of **hssex**.

```
codebook hssex

----------------------------------------------------------------------
hssex                                                              sex
----------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  sex

                 range:  [1,2]                        units:  1
         unique values:  2                        missing .:  0/20050

            tabulation:  Freq.   Numeric  Label
                          9401         1  male
                         10649         2  female
```

Let’s recode **hssex** into a new variable, which we will call **hssex1**,
and rerun the analysis.

```
generate hssex1 = hssex

recode hssex1 (2 = 0)
(hssex1: 10649 changes made)

svy, subpop(hssex1): mean hfa8r
(running mean on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Mean estimation             Number of obs    =   20050
                                    Population size  = 1.9e+08
                                    Subpop. no. obs  =    9401
                                    Subpop. size     = 9.0e+07
                                    Replications     =      52
                                    Design df        =      51

--------------------------------------------------------------
             |       BRR *
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
       hfa8r |    13.0614    .1534762     12.75328    13.36951
--------------------------------------------------------------
```

We can also use the **over** option to get estimates for all categories of the
variable. In this case, we get the mean of the highest year of school
completed for men and women (1 = male and 2 = female). The **over**
option allows for variables coded 1/2 and for multicategory variables.

```
svy: mean hfa8r, over(hssex)
(running mean on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Mean estimation             Number of obs    =   20050
                                    Population size  = 1.9e+08
                                    Replications     =      52
                                    Design df        =      51

        male: hssex = male
      female: hssex = female

--------------------------------------------------------------
             |       BRR *
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
hfa8r        |
        male |    13.0614    .1534762     12.75328    13.36951
      female |   12.76366    .1046096     12.55365    12.97367
--------------------------------------------------------------

svy: mean hfa8r, over(hfa12)
(running mean on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Mean estimation             Number of obs    =   20050
                                    Population size  = 1.9e+08
                                    Replications     =      52
                                    Design df        =      51

   _subpop_1: hfa12 = married house
   _subpop_2: hfa12 = married not in house
   _subpop_3: hfa12 = living as married
     widowed: hfa12 = widowed
    divorced: hfa12 = divorced
   separated: hfa12 = separated
   _subpop_7: hfa12 = never married
          88: hfa12 = 88
          99: hfa12 = 99

--------------------------------------------------------------
             |       BRR *
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
hfa8r        |
   _subpop_1 |   12.81528    .1065852      12.6013    13.02926
   _subpop_2 |   11.49089    .3356867     10.81697    12.16481
   _subpop_3 |   12.41727     .244054     11.92731    12.90723
     widowed |   10.84711    .1758893       10.494    11.20022
    divorced |   12.79547    .1453802     12.50361    13.08733
   separated |   11.27132    .2968793     10.67531    11.86733
   _subpop_7 |   12.79864    .1511575     12.49518     13.1021
          88 |   81.54784    2.655634     76.21643    86.87925
          99 |   62.79067    24.52562     13.55343    112.0279
--------------------------------------------------------------
```
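Conceptually, the **over** option computes a separate weighted mean for each category of the grouping variable in a single pass. A toy sketch in Python (invented data; not Stata's implementation):

```python
# Sketch of what over() computes: one weighted mean per category of the
# over() variable. Toy data, made up for illustration.
from collections import defaultdict

y = [13.0, 12.5, 14.0, 12.0, 11.5, 13.5]   # outcome (e.g. years of school)
w = [2.0, 1.0, 1.5, 2.5, 1.0, 2.0]         # sampling weights
g = [1, 2, 1, 2, 2, 1]                     # group (e.g. 1 = male, 2 = female)

num = defaultdict(float)
den = defaultdict(float)
for yi, wi, gi in zip(y, w, g):
    num[gi] += wi * yi   # accumulate weighted sum per group
    den[gi] += wi        # accumulate weight total per group

means = {gi: num[gi] / den[gi] for gi in num}
print(means)
```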

The **over** option is available only for **svy: mean**, **svy: proportion**,
**svy: ratio** and **svy: total**. Unfortunately, there are no nice display
options for **svy: total**, as there are for **svy: tab**, to show the actual
values of the totals. We can use the **matrix list** command to list the
values stored in the matrix, although sometimes those are in scientific notation
as well. We can use the **estat size** command to get the unweighted
and weighted size (i.e., count) of each subpopulation. This is a good
thing to do, because you need to know how many observations are in each
subpopulation. If "99" were a subpopulation of interest to us, we would be
in trouble, because there are only three observations in that subpopulation.

```
svy: total hssex, over(hfa12)
(running total on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Total estimation            Number of obs    =   20050
                                    Population size  = 1.9e+08
                                    Replications     =      52
                                    Design df        =      51

   _subpop_1: hfa12 = married house
   _subpop_2: hfa12 = married not in house
   _subpop_3: hfa12 = living as married
     widowed: hfa12 = widowed
    divorced: hfa12 = divorced
   separated: hfa12 = separated
   _subpop_7: hfa12 = never married
          88: hfa12 = 88
          99: hfa12 = 99

--------------------------------------------------------------
             |       BRR *
        Over |      Total   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
hssex        |
   _subpop_1 |   1.59e+08     2200084     1.54e+08    1.63e+08
   _subpop_2 |    3573113    284678.1      3001597     4144628
   _subpop_3 |   1.11e+07    651988.5      9826916    1.24e+07
     widowed |   2.45e+07    506533.4     2.35e+07    2.55e+07
    divorced |   2.36e+07    962190.5     2.17e+07    2.56e+07
   separated |    7756844      456243      6840898     8672790
   _subpop_7 |   5.53e+07     1789277     5.17e+07    5.89e+07
          88 |    1203385    243289.5     714960.5     1691809
          99 |   15765.66    10722.36    -5760.372    37291.69
--------------------------------------------------------------

mat list e(b)

e(b)[1,9]
          hssex:      hssex:      hssex:      hssex:      hssex:
      _subpop_1   _subpop_2   _subpop_3     widowed    divorced
y1    1.585e+08   3573112.5    11135838    24474595    23634198

          hssex:      hssex:      hssex:      hssex:
      separated   _subpop_7          88          99
y1    7756844.2    55314851   1203384.5    15765.66

estat size

   _subpop_1: hfa12 = married house
   _subpop_2: hfa12 = married not in house
   _subpop_3: hfa12 = living as married
     widowed: hfa12 = widowed
    divorced: hfa12 = divorced
   separated: hfa12 = separated
   _subpop_7: hfa12 = never married
          88: hfa12 = 88
          99: hfa12 = 99

----------------------------------------------------------------------
             |       BRR *
        Over |      Total   Std. Err.        Obs          Size
-------------+--------------------------------------------------------
hssex        |
   _subpop_1 |   1.59e+08     2200084      10241   106397053.6
   _subpop_2 |    3573113    284678.1        364     2367517.23
   _subpop_3 |   1.11e+07    651988.5        781     7442555.25
     widowed |   2.45e+07    506533.4       2352    13310800.6
    divorced |   2.36e+07    962190.5       1388    14540696.43
   separated |    7756844      456243        686     4570546.85
   _subpop_7 |   5.53e+07     1789277       4135   38181973.78
          88 |    1203385    243289.5        100      826108.26
          99 |   15765.66    10722.36          3        9954.32
----------------------------------------------------------------------
```

The **subpop** option can be combined with the **over** option. This
is handy because **if** cannot be used with the **over** option. By
combining the options, you can have "the best of both worlds." In the
example below, our subpopulation includes only white males, and the mean of
education is given for each marital status. Notice that for category "88",
the mean is really high (83.44). This is because Stata considers 88 to be
a valid value, not the missing data code that it is (according to the
documentation).
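To see how much damage an unrecoded missing-data code can do, compare a mean computed with the code treated as a real value against one computed after recoding it to missing. A toy sketch in Python (invented data):

```python
# Sketch of why an unrecoded missing-data code distorts an estimate:
# treating 88 as a real value drags the mean far upward.
# Toy values, made up for illustration; equal weights for simplicity.
y = [12, 13, 11, 14, 88, 12]   # 88 is really "missing", per the codebook
w = [1.0] * len(y)

naive = sum(wi * yi for yi, wi in zip(y, w)) / sum(w)

# Recode 88 to missing (i.e., drop it) before estimating:
pairs = [(yi, wi) for yi, wi in zip(y, w) if yi != 88]
clean = sum(wi * yi for yi, wi in pairs) / sum(wi for _, wi in pairs)

print(naive, clean)  # the naive mean is inflated by the 88
```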

```
svy, subpop(hssex1 if dmaracer==1): mean hfa8r, over(hfa12)
(running mean on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Mean estimation             Number of obs    =   20050
                                    Population size  = 1.9e+08
                                    Subpop. no. obs  =    6498
                                    Subpop. size     = 7.6e+07
                                    Replications     =      52
                                    Design df        =      51

   _subpop_1: hfa12 = married house
   _subpop_2: hfa12 = married not in house
   _subpop_3: hfa12 = living as married
     widowed: hfa12 = widowed
    divorced: hfa12 = divorced
   separated: hfa12 = separated
   _subpop_7: hfa12 = never married
          88: hfa12 = 88

--------------------------------------------------------------
             |       BRR *
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
hfa8r        |
   _subpop_1 |   12.85045    .1013903     12.64691      13.054
   _subpop_2 |   11.15848    .6515755     9.850389    12.46657
   _subpop_3 |   11.90575    .2359017     11.43216    12.37934
     widowed |    10.8672    .3819635     10.10037    11.63402
    divorced |    12.8703    .1584319     12.55224    13.18837
   separated |    12.6802    1.564755      9.53882    15.82157
   _subpop_7 |   12.72527    .2070723     12.30955    13.14098
          88 |   83.43886    3.637654     76.13596    90.74175
--------------------------------------------------------------
```

For more information on analyzing subpopulations in Stata, please see our Stata FAQ: How can I analyze a subpopulation of my survey data in Stata 9?

## Regression analyses

Let’s look at some regression analyses. Stata 9 has a very nice suite
of regression commands that can be used with the **svy:** prefix. Type
**help survey** for a list of commands that can be used with the **svy:**
prefix.

```
svy: reg hav6s hav2s hav3s hff4 hfa13 hfe7 hfa8r
(running regress on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Linear regression           Number of obs    =     5866
                                    Population size  = 69288876
                                    Replications     =       52
                                    Design df        =       51
                                    F(   6,     46)  =    83.92
                                    Prob > F         =   0.0000
                                    R-squared        =   0.4285

------------------------------------------------------------------------------
             |       BRR *
       hav6s |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       hav2s |   .0335188   .0164996     2.03   0.047     .0003945    .0666432
       hav3s |   .0458665    .016843     2.72   0.009     .0120528    .0796802
        hff4 |   78.15901   47.06818     1.66   0.103    -16.33432    172.6523
       hfa13 |   14.97253   6.777161     2.21   0.032     1.366808    28.57825
        hfe7 |  -9.015854   11.81718    -0.76   0.449    -32.73983    14.70812
       hfa8r |   .6153294   .6913622     0.89   0.378    -.7726382    2.003297
       _cons |  -59.27143   62.78533    -0.94   0.350    -185.3182    66.77539
------------------------------------------------------------------------------
```

As we see in the example below, we can use the **xi:** prefix with the **
svy:** prefix. Please note that the order of the prefixes matters; you
need to use the **xi:** prefix in front of the **svy:** prefix. No
**svy:** prefix is needed before the **test** command.

```
xi: svy: reg hav2s hfe7 hfa13 hssex i.dmaracer hah15
i.dmaracer        _Idmaracer_1-8      (naturally coded; _Idmaracer_1 omitted)
(running regress on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..

Survey: Linear regression           Number of obs    =     13398
                                    Population size  = 1.147e+08
                                    Replications     =        52
                                    Design df        =        51
                                    F(   7,     45)  =      4.49
                                    Prob > F         =    0.0007
                                    R-squared        =    0.1056

------------------------------------------------------------------------------
             |       BRR *
       hav2s |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        hfe7 |   1105.945   220.8457     5.01   0.000     662.5787    1549.311
       hfa13 |   1217.864    268.925     4.53   0.000     677.9742    1757.753
       hssex |  -401.0778   128.2177    -3.13   0.003    -658.4855   -143.6701
_Idmaracer_2 |  -291.8996   94.52307    -3.09   0.003    -481.6625   -102.1366
_Idmaracer_3 |  -591.7752   120.6021    -4.91   0.000     -833.894   -349.6565
_Idmaracer_8 |  -5325.696   2523.105    -2.11   0.040    -10391.04   -260.3501
       hah15 |   926.9516   514.7204     1.80   0.078    -106.3927    1960.296
       _cons |  -4935.265   1166.619    -4.23   0.000    -7277.351   -2593.179
------------------------------------------------------------------------------

test _Idmaracer_2 _Idmaracer_3 _Idmaracer_8

Adjusted Wald test

 ( 1)  _Idmaracer_2 = 0
 ( 2)  _Idmaracer_3 = 0
 ( 3)  _Idmaracer_8 = 0

       F(  3,    49) =     8.27
            Prob > F =   0.0001
```

Let’s create a 0/1 variable and run a logistic regression.

```
generate clubs = hav5

recode clubs (2 = 0)
(clubs: 14184 changes made)

xi: svy: logit clubs hfe7 hfa13 hssex i.dmaracer hah15
i.dmaracer        _Idmaracer_1-8      (naturally coded; _Idmaracer_1 omitted)
(running logit on estimation sample)

BRR replications (52)
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
..................................................    50
..
note: _Idmaracer_8 != 0 predicts failure perfectly
      _Idmaracer_8 dropped and 8 obs not used

Survey: Logistic regression         Number of obs    =     13398
                                    Population size  = 1.147e+08
                                    Replications     =        52
                                    Design df        =        51
                                    F(   6,     46)  =     22.95
                                    Prob > F         =    0.0000

------------------------------------------------------------------------------
             |       BRR *
       clubs |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        hfe7 |   .1769599   .0450544     3.93   0.000     .0865093    .2674104
       hfa13 |  -.3795177   .0567269    -6.69   0.000    -.4934017   -.2656338
       hssex |  -.0559143    .048677    -1.15   0.256    -.1536376    .0418089
_Idmaracer_2 |  -.5907863   .0583408   -10.13   0.000    -.7079103   -.4736623
_Idmaracer_3 |  -.8231924   .1895007    -4.34   0.000    -1.203631   -.4427538
       hah15 |   .3510693   .0936615     3.75   0.000      .163036    .5391026
       _cons |  -.6049762   .1758361    -3.44   0.001     -.957982   -.2519705
------------------------------------------------------------------------------

test _Idmaracer_2 _Idmaracer_3

Adjusted Wald test

 ( 1)  _Idmaracer_2 = 0
 ( 2)  _Idmaracer_3 = 0

       F(  2,    50) =    50.75
            Prob > F =   0.0000
```

We don’t have much time to talk about regression diagnostics here, although
that is a common question among researchers who use survey data. Most of
the assumptions still apply when using survey data, but they can be more
difficult to check. As of Stata 9.2, most of the diagnostic commands that
you would use after **regress**, **logistic**, etc., don’t work after
**svy: regress**, **svy: logit**, etc. Some of the assumptions don’t
really apply, though, because of the extremely large sample size involved.
Checking assumptions when doing a subpopulation analysis is even trickier.

## Using multiply imputed data

Some of the NHANES III data sets were released as imputed data sets.
This means that some of the variables contained in the data set were multiply
imputed. For information on which variables were imputed, the imputation
method, etc., please see
http://www.cdc.gov/nchs/nhanes.htm and
ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NHANES/NHANESIII/7A/doc/main.pdf.
We suggest using the prefix **mim:** for analyzing multiply imputed
data sets, although there are some other prefixes available in Stata. The
prefix **mim:** is not part of Stata and needs to be downloaded (**search
mim**). The NHANES data were released with five imputed data sets.
Unlike SUDAAN, **mim** wants the data sets stacked into a single data set.
We can use the **mimstack** command to do this (**search mimstack**). We need to specify
the unique case identifier in each data set (**seqn**) and the "stub" name (**nh3mi**)
of the imputed data sets. We strongly encourage everyone to read the help
file for **mim** before using the **mim:** prefix or the **mimstack**
command. Also, before you start using the multiply imputed data, you
should look at the help file for **mim** to see if the procedure
that you want to use is supported by **mim**. For example, **svy: tab**
is not supported by **mim**. You may also want to check periodically to
see if there are any updates to **mim** or similar procedures. We should also note that **mimstack** can take some time to
run, especially if you don’t have a lot of RAM on your computer. After
stacking the data sets, I keep only a few variables, then I use the **compress**
command to make the data set as small as possible. Next I add some value
labels, and then I arrange the variables in alphabetical order with the **
aorder** command. We can use either version of the **svyset** command
(with the PSU/strata or the replicate weights), so I will stick with the
replicate weights.
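Behind the scenes, a multiple-imputation prefix such as **mim** combines the per-imputation results with Rubin's rules: average the point estimates, then combine the within- and between-imputation variances. A sketch in Python with made-up per-imputation estimates (m = 5, matching the NHANES release):

```python
# Sketch of Rubin's rules for combining results across m imputed data
# sets. The per-imputation estimates and standard errors below are
# invented for illustration.
import math

est = [84.70, 84.75, 84.71, 84.74, 84.72]   # point estimate per imputation
se  = [0.150, 0.152, 0.149, 0.151, 0.150]   # its standard error

m = len(est)
qbar = sum(est) / m                              # combined point estimate
W = sum(s ** 2 for s in se) / m                  # within-imputation variance
B = sum((q - qbar) ** 2 for q in est) / (m - 1)  # between-imputation variance
T = W + (1 + 1 / m) * B                          # total variance
se_mi = math.sqrt(T)

print(qbar, se_mi)
```

Note that the combined standard error is a bit larger than the average within-imputation standard error; the extra piece reflects the uncertainty due to imputation itself.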

```
clear

cd "D:\Data\nhanes III\mi\Stata_data"
D:\Data\nhanes III\mi\Stata_data

ls
  <dir>   4/16/07 12:59  .
  <dir>   4/16/07 12:59  ..
  24.8M   4/06/07 13:44  nh3mi1.dta
  24.8M   4/06/07 13:44  nh3mi2.dta
  24.8M   4/06/07 13:44  nh3mi3.dta
  24.8M   4/06/07 13:45  nh3mi4.dta
  24.8M   4/06/07 13:45  nh3mi5.dta

set mem 200m
(204800k)

mimstack, m(5) sortorder(seqn) istub("nh3mi") nomj0

keep _mj _mi seqn dmaracer hssex hsfsizer sdppsu6 sdpstra6 wtpfqx6 - wtpqrp52 hac1k hfa8 ///
     dmppirif - pep6i3mi

compress

label define imfl 0 "item not applicable" 1 "data value observed" 2 "value multiply imputed"

foreach var of varlist dmppirif - pep6i3if {
    label values `var' imfl
}

aorder

svyset [pweight = wtpfqx6], brrweight(wtpqrp1 - wtpqrp52) fay(.23303501) vce(brr) mse

      pweight: wtpfqx6
          VCE: brr
          MSE: on
    brrweight: wtpqrp1 wtpqrp2 wtpqrp3 wtpqrp4 wtpqrp5 wtpqrp6 wtpqrp7
               wtpqrp8 wtpqrp9 wtpqrp10 wtpqrp11 wtpqrp12 wtpqrp13
               wtpqrp14 wtpqrp15 wtpqrp16 wtpqrp17 wtpqrp18 wtpqrp19
               wtpqrp20 wtpqrp21 wtpqrp22 wtpqrp23 wtpqrp24 wtpqrp25
               wtpqrp26 wtpqrp27 wtpqrp28 wtpqrp29 wtpqrp30 wtpqrp31
               wtpqrp32 wtpqrp33 wtpqrp34 wtpqrp35 wtpqrp36 wtpqrp37
               wtpqrp38 wtpqrp39 wtpqrp40 wtpqrp41 wtpqrp42 wtpqrp43
               wtpqrp44 wtpqrp45 wtpqrp46 wtpqrp47 wtpqrp48 wtpqrp49
               wtpqrp50 wtpqrp51 wtpqrp52
          fay: .23303501
     Strata 1: <one>
         SU 1: <observations>
        FPC 1: <zero>
```

Like most data sets with imputed values, the NHANES data sets include
imputation flags. Imputation flags are variables that are added to the
imputed data sets to tell the user which cases have imputed values. Of
course, there is an imputation flag variable for each imputed variable.
The **codebook** command can be used to inspect the imputation flags to see
how many cases were imputed. You will want to know this so that you can
assess how reliable estimates involving that variable are. For example, if
there were very few cases imputed, the estimate of, say, the mean of that
variable, may be much more reliable than if 60% of the cases were imputed.

```
codebook bdpfndif bdpfndmi

----------------------------------------------------------------------
bdpfndif                                  imputation flag for bdpfndmi
----------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  imfl

                 range:  [0,2]                        units:  1
         unique values:  3                       missing .:  0/169970

            tabulation:  Freq.   Numeric  Label
                         75845         0  item not applicable
                         73230         1  data value observed
                         20895         2  value multiply imputed

----------------------------------------------------------------------
bdpfndmi                        bone minrl density femur neck-gm/cm sq
----------------------------------------------------------------------
                  type:  numeric (float)

                 range:  [.202,1.841]                 units:  .001
         unique values:  1150             missing .:  75845/169970

                  mean:   .823623
              std. dev:    .17804

           percentiles:     10%       25%       50%       75%       90%
                           .597      .703      .819      .937     1.049

codebook bdpfndif bdpfndmi if _mj == 1

----------------------------------------------------------------------
bdpfndif                                  imputation flag for bdpfndmi
----------------------------------------------------------------------
                  type:  numeric (byte)
                 label:  imfl

                 range:  [0,2]                        units:  1
         unique values:  3                        missing .:  0/33994

            tabulation:  Freq.   Numeric  Label
                         15169         0  item not applicable
                         14646         1  data value observed
                          4179         2  value multiply imputed

----------------------------------------------------------------------
bdpfndmi                        bone minrl density femur neck-gm/cm sq
----------------------------------------------------------------------
                  type:  numeric (float)

                 range:  [.231,1.841]                 units:  .001
         unique values:  1048             missing .:  15169/33994

                  mean:   .823834
              std. dev:    .17869

           percentiles:     10%       25%       50%       75%       90%
                           .595      .704      .818      .938     1.049

mim: svy: mean bmpwstmi bmphtmi bmpbutmi

Multiple-imputation estimates (svy: mean)        Imputations =       5
Survey: Mean estimation                          Minimum obs =   30548
                                                 Minimum dof =    45.8

------------------------------------------------------------------------------
             |     Coef.  Std. Err.        t   P>|t|   [95% Conf. Int.]   MI.df
-------------+----------------------------------------------------------------
    bmpwstmi |   84.7249    .157001   539.65   0.000   84.4092   85.0405   48.2
     bmphtmi |   160.601    .079979  2008.05   0.000    160.44   160.762   45.8
    bmpbutmi |   94.1049    .125258   751.29   0.000    93.853   94.3567   48.0
------------------------------------------------------------------------------
```

As mentioned before, there are still variables with missing data in the
multiply imputed data sets. We can use either the **codebook** command
or the **nmissing** command to see the number of missing values.

```
codebook bmphtmi if _mj == 1

----------------------------------------------------------------------
bmphtmi                                           standing height (cm)
----------------------------------------------------------------------
                  type:  numeric (float)

                 range:  [73.6,206.5]                 units:  .1
         unique values:  1158              missing .:  3446/33994

                  mean:   152.127
              std. dev:   26.1233

           percentiles:     10%       25%       50%       75%       90%
                          105.2     143.6     160.5       170     177.2

nmissing bmpwstmi bmphtmi bmpbutmi if _mj == 1

bmpwstmi       3446
bmphtmi        3446
bmpbutmi       3446
```

If you look at the output of the **mean** command above, you will see that
there is no estimate of the population total. You can run the command on a
single imputed data set (without the **mim:** prefix, of course) to get that
estimate if you need it.

## A quick note on merging data sets

The NHANES documentation provides clear instructions on how to merge various
data sets from the NHANES III collection. You need to read that very
carefully. Depending on what data sets you merge, you may need to adjust
the sampling weights. Again, when and how to do this is spelled out in the
documentation, so please read that carefully. Also, be aware that you will likely have to modify your **
svyset** command to work with the merged data set. In fact, the **svyset**
command may need to be different for different models that you run from the
merged data set, depending on which variables are included in the model.

Another point to remember is that what you do to merge data sets or modify sampling weights with the NHANES data will likely NOT generalize to other survey data sets. You will need to read the documentation for each data set and probably have to follow different procedures to accomplish the same tasks with other data sets. In other words, many ways of doing things are specific to that particular data set; be careful not to over generalize.

## For more information on using the NHANES data sets

There are several helpful resources for learning how to analyze the NHANES III data sets correctly. One is a listserv at http://www.cdc.gov/nchs/nhanes/nhanes_listserv.htm. There are also online tutorials at http://www.cdc.gov/nchs/tutorials/index.htm.

## References

The Stata Journal, Vol. 7, No. 1, 1st Quarter 2007

Analysis of Health Surveys by Edward L. Korn and Barry I. Graubard

Sampling of Populations: Methods and Applications, Third Edition by Paul Levy and Stanley Lemeshow

Analysis of Survey Data Edited by R. L. Chambers and C. J. Skinner

Sampling Techniques, Third Edition by William G. Cochran

Stata 9 Manual: Survey Data