The purpose of this seminar is to explore how to analyze survey data collected under different sampling plans using Stata 9. Other examples, including those using other survey data analysis packages, can be found at Choosing the Correct Analysis for Various Survey Designs. Before we begin looking at examples in Stata, we will quickly review some basic issues and concepts in survey data analysis.

## Why do we need survey data analysis software?

Regular statistical software (that is not designed for survey data) analyzes data as if the data were collected using simple random sampling. For experimental and quasi-experimental designs, this is exactly what we want. However, when surveys are conducted, a simple random sample is rarely collected. Not only is it nearly impossible to do so, but it is not as efficient (both financially and statistically) as other sampling methods. When any sampling method other than simple random sampling is used, we usually need to use survey data analysis software to take into account the differences between the design that was used and simple random sampling. This is because the sampling design affects the calculation of the standard errors of the estimates. If you ignore the sampling design, e.g., if you assume simple random sampling when another type of sampling design was used, the standard errors will likely be underestimated, possibly leading to results that seem to be statistically significant, when in fact, they are not. The difference in point estimates and standard errors obtained using non-survey software and survey software with the design properly specified will vary from data set to data set, and even between variables within the same data set. While it may be possible to get reasonably accurate results using non-survey software, there is no practical way to know beforehand how far off the results from non-survey software will be.

## Sampling designs

Most people do not conduct their own surveys. Rather, they use survey data that some agency or company collected and made available to the public. The documentation must be read carefully to find out what kind of sampling design was used to collect the data. This is very important because many of the estimates and standard errors are calculated differently for the different sampling designs. Hence, if you mis-specify the sampling design, the point estimates and standard errors will likely be wrong.

Below are some common features of many sampling designs.

__Weights__: There are many types of weights that
can be associated with a survey. Perhaps the most common is the sampling weight, sometimes called a pweight,
which is the inverse of the probability of being included in the
sample due to the sampling design (except for a certainty PSU, see below).
Under simple random sampling, the pweight is calculated as N/n, where N = the number of elements in the
population and n = the number of elements in the sample. For example, if a population has 10
elements and 3 are sampled at random with replacement, then the pweight would be
10/3 = 3.33. In a two-stage design, the pweight is calculated as (1/f_{1})(1/f_{2}),
where f_{1} and f_{2} are the sampling fractions for the first and second stages; that is, the
inverse of the sampling fraction for the first stage is
multiplied by the inverse of the sampling fraction for the second stage.
Under many sampling plans, the sum of the pweights will equal the population size (N).
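
The weight arithmetic above can be sketched with Stata's **display** command. The one-stage numbers come from the text; the two-stage numbers (5 of 20 clusters, then 3 of 12 elements) are invented purely for illustration.

```stata
* One-stage example from the text: 3 of 10 elements are sampled
display 10/3

* Hypothetical two-stage design (numbers invented for illustration):
* stage 1 samples 5 of 20 clusters, stage 2 samples 3 of 12 elements
local f1 = 5/20
local f2 = 3/12
* pweight = inverse of stage-1 fraction times inverse of stage-2 fraction
display (1/`f1') * (1/`f2')
```

In the hypothetical design each sampled element would stand for 16 population elements.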

__PSU__: This is the **p**rimary **s**ampling **u**nit.
This is the first unit that is sampled in the design. For example, school
districts from California may be sampled and then schools within districts may
be sampled. The school district would be the PSU. If states from the US were sampled, and then school districts from within each
state, and then schools from within each district, then states would be the PSU.
One does not need to use the same sampling method at all levels of sampling.
For example, probability-proportional-to-size sampling may be used at
level 1 (to select states), while cluster sampling is used at level 2
(to select school districts). In the case of a simple random sample, the
PSUs and the elementary units are the same.

__Strata__: Stratification is a method of breaking up the
population into different groups, often by demographic variables such as gender,
race or SES. Once these groups have been defined, one samples from each
group as if it were independent of all of the other groups. For example,
if a sample is to be stratified on gender, men and women would be sampled
independently of one another. This means that the pweights for men will
likely be different from the pweights for the women. In most cases, you
need to have two or more PSUs in each stratum. The purpose of
stratification is to improve the precision of the estimates, and stratification works most
effectively when the variance of the dependent variable is smaller within the
strata than in the sample as a whole.

__FPC__: This is the **f**inite **p**opulation **c**orrection.
This is used when the sampling fraction (the number of elements or respondents
sampled relative to the population) becomes large. The FPC is used in the
calculation of the standard error of the estimate. If the value of the FPC
is close to 1, it will have little impact and can be safely ignored. In
some survey data analysis programs, such as SUDAAN, this information will be
needed if you specify that the data were collected without replacement
(see below for a definition of “without replacement”). The
formula for calculating the FPC is ((N-n)/(N-1))^{1/2}, where N is the
number of elements in the population and n is the number of elements in the
sample. To see the impact of the FPC for
samples of various proportions, suppose that you had
a population of 10,000 elements.

| Sample size (n) | FPC |
|---|---|
| 1 | 1.0000 |
| 10 | .9995 |
| 100 | .9950 |
| 500 | .9747 |
| 1000 | .9487 |
| 5000 | .7071 |
| 9000 | .3162 |
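
The FPC values above can be recomputed from the formula ((N-n)/(N-1))^{1/2}. The loop below is a minimal sketch, assuming N = 10,000 as in the text; the local macro names are arbitrary.

```stata
* Recompute the FPC for each sample size, with N = 10,000 as in the text
local N = 10000
foreach n of numlist 1 10 100 500 1000 5000 9000 {
    * FPC = sqrt((N - n)/(N - 1))
    display %5.0f `n' "   " %6.4f sqrt((`N' - `n')/(`N' - 1))
}
```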

## Sampling with and without replacement

Most samples collected in the real world are collected “without replacement”. This means that once a respondent has been selected to be in the sample and has participated in the survey, that particular respondent cannot be selected again to be in the sample. Many of the calculations change depending on whether a sample is collected with or without replacement. Hence, programs like SUDAAN request that you specify whether a survey sampling design was implemented with or without replacement, and an FPC is used if sampling without replacement is specified, even if the value of the FPC is very close to one.

## Examples

In the examples that follow, we have data that represent a population, and we will discuss the analysis of
these survey data
as if they had been collected under five sampling plans: simple random sampling, stratified
random sampling, systematic sampling, one-stage cluster sampling and two-stage
cluster sampling with stratification. The Stata code necessary to generate
the samples using each of these sampling plans is shown
here. The
variables from the data set with which we will be working include **api00**
and **api99**, which are aggregates of student test scores for each school
for the years 2000 and 1999, respectively; **yr_rnd**, which is a 0/1
variable indicating if the school is on a year-round calendar; **awards**,
which indicates whether or not the school met its target; **meals**, which
indicates the percentage of children receiving free or reduced-price meals at
school; **both**, which indicates that the school met both targets; and
**growth**, which is the difference between the api scores in the current year
and those of the last year.

One of the most important points to remember is that all **svy** commands can be
used with any sampling plan. To help illustrate this, we will use the
**svy: mean** and the **svy: total** commands with each sampling plan. Another important point is that the
interpretation of the results from the **svy** commands is usually no
different than the interpretation that you would have if you had used the
equivalent non-survey command. For example, there is no special
interpretation of regression coefficients just because you obtained them using
**svy: reg** instead of **regress**.

## Simple random sample

We will start by showing how you can take a simple random sample (SRS) from your data file. While we will not go through the commands necessary for obtaining any other type of sample, we will go over how to draw an SRS. Simple random samples are very rare in actual practice; however, researchers will often draw an SRS of their data set so that they can work out their data analysis programs on a relatively small data file. This saves computing time and resources, as the analysis program may have to be run many times before it is satisfactory.

    set mem 5m
    use http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear
    count
      6194
    set seed 1003002849
    sample 5
    count
      310

Because we have eliminated elements of our population to create our sample, we need to create pweights (probability weights). We selected 5% of the elements in our population into our sample, so our sampling fraction is 310/6194. The pweight is the inverse of the sampling fraction, or N/n, where N is the population total (6194) and n is the number of elements selected into the sample (310). Another way to think of this is: “How many elements (schools, people, whatever) in the population should each element in the sample represent?” Each school in our current sample should represent about twenty schools in the population, so all of the pweights will be the same: 6194/310, or approximately 20.

    gen pw = 6194/310
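
As a quick sanity check, the pweights should sum to the population size. The sketch below assumes the 5% sample drawn above is still in memory and that **pw** has just been generated.

```stata
* Sum the pweights over the 310 sampled schools
quietly summarize pw
display r(sum)

* The same quantity computed directly: n * (N/n)
display 310 * (6194/310)
```

The first result should be very close to 6194. It may differ slightly in the last decimals because **generate** stores new variables as floats by default, which would be consistent with the population size of 6193.9997 reported in the regression output further down.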

Next, we need to consider how large our sample is relative to our population to
determine if we need to use a finite population correction. (For a quick review of FPCs, please see the summary at the beginning of this handout.) In Stata, we only need
to give the population total, and Stata will make the
necessary calculations to obtain the correct FPC. Note that the **svyset**
command is very different in Stata 8 than it was in Stata 7.

    gen fpc = 6194

We use the **svyset** command to tell Stata about the features of the sampling design that we have.
In this case, we only need to specify the pweight and the FPC.

    svyset [pweight=pw], fpc(fpc)

          pweight: pw
              VCE: linearized
         Strata 1: <one>
             SU 1: <observations>
            FPC 1: fpc

Next, we
will use the **svydes** command to display the information Stata has
regarding our sampling plan. As you can see, the number of PSUs and
observations is the same, which reassures us that Stata understands that we have
a simple random sample. We also see that there is only one stratum, which
is correct for this type of sampling plan. Note that once you have used the **svyset** command, Stata will remember this information for your entire session; you do not need to reissue this command (unless you want to change something). Also, if you save your data, Stata will save the survey information with the data set, so that when you open the data in your next session, the survey information will be used when you issue **svy** commands.

    svydes

    Survey: Describing stage 1 sampling units

          pweight: pw
              VCE: linearized
         Strata 1: <one>
             SU 1: <observations>
            FPC 1: fpc

                                          #Obs per Unit
                                  ----------------------------
    Stratum    #Units     #Obs      min       mean      max
    --------  --------  --------  --------  --------  --------
           1       310       310         1       1.0         1
    --------  --------  --------  --------  --------  --------
           1       310       310         1       1.0         1

We will start our analysis of these data with some basic descriptive
statistics. We will use the **svy: mean** and **svy: total** commands. The **svy: mean** command is used to estimate the mean of a variable in the population. In our example, we will estimate the mean for **api00** and
**growth**. Please note that **svy: mean** is an estimation command,
and Stata will do a listwise deletion of missing data. For example, if we
had missing data on **api00**, the command below would probably give a different mean for **growth**
than the command **svy: mean growth** would, because there would be a
different number of cases used in the calculation of the mean.

    svy: mean api00 growth
    (running mean on estimation sample)

    Survey: Mean estimation

    Number of strata =       1          Number of obs    =     310
    Number of PSUs   =     310          Population size  =    6194
                                        Design df        =     309

    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
           api00 |   663.2645    7.21478      649.0682    677.4608
          growth |   33.84516   1.667394      30.56428    37.12604
    --------------------------------------------------------------

The **svy: total** command is used to get estimates of population totals. In our example, we will get an estimate of how many schools are on a year-round calendar. From the output of the **svy: total** command, we can see that approximately
719 schools are on a year-round calendar.

    svy: total yr_rnd
    (running total on estimation sample)

    Survey: Total estimation

    Number of strata =       1          Number of obs    =     310
    Number of PSUs   =     310          Population size  =    6194
                                        Design df        =     309

    --------------------------------------------------------------
                 |             Linearized
                 |      Total   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
          yr_rnd |   719.3032   110.0291      502.8022    935.8042
    --------------------------------------------------------------

We will now do a multiple regression. We will use **api00** as the
dependent variable and **awards** and **meals** as independent variables.
We can see from the output that the model is statistically significant (F = 464.21, p
< .001), and that each of the predictors is also statistically
significant. You can interpret the output from the **svy** commands in
the same way that you would the non-svy command. In this example, you
interpret the output from the **svy: reg** command in the same way that you
would the output from the **regress** command. Remember that the
difference between the **svy: reg** and the **regress** commands is how the
standard errors are calculated. The **svy: reg** command takes into
account the survey sampling plan, while the **regress** command does not.

    svy: reg api00 awards meals
    (running regress on estimation sample)

    Survey: Linear regression

    Number of strata   =         1                  Number of obs      =       310
    Number of PSUs     =       310                  Population size    = 6193.9997
                                                    Design df          =       309
                                                    F(   2,    308)    =    464.21
                                                    Prob > F           =    0.0000
                                                    R-squared          =    0.7124

    ------------------------------------------------------------------------------
                 |             Linearized
           api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          awards |   53.37164   9.047051     5.90   0.000     35.57002    71.17326
           meals |  -3.329605   .1285952   -25.89   0.000    -3.582638   -3.076571
           _cons |   738.9207   19.47419    37.94   0.000     700.6019    777.2395
    ------------------------------------------------------------------------------
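
One way to see the difference in standard errors is to run the two commands back to back. This is only a sketch, assuming the SRS data and the **svyset** specification from this section are still in memory. Because every pweight is the same in this sample, the coefficients from the two commands should agree, while the standard errors may differ.

```stata
* Non-survey regression: ignores the pweights and the FPC
regress api00 awards meals

* Survey regression: uses the design declared with svyset
svy: regress api00 awards meals
```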

## Stratified random sampling

The difference between the example above and the next example is that
stratification has been added to the sampling design. For this example, we have
calculated the mean of **api99** and stratified schools based on this.
Schools that were above the mean were placed into one stratum, and schools that
were below the mean were placed in the other stratum. Simple random samples of schools were then drawn from each stratum.
Although we have created only two strata, in many public-use data sets, you can
have dozens of strata.

We have used the **svyset, clear(all)** command here to show how it is
used. After issuing the **svyset** command, we again use the **svydes** command to ensure that Stata is handling the survey design appropriately. Next, we use the **svy: mean** command to obtain the estimated means of **api00** and **growth**. We can
compare these estimates to those obtained from the SRS above. (Please see the table at the end of this handout.)

    use http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/strsrs, clear
    svyset, clear(all)
    svyset [pweight = pw], strata(strat) fpc(fpc)

          pweight: pw
              VCE: linearized
         Strata 1: strat
             SU 1: <observations>
            FPC 1: fpc

    svydes

    Survey: Describing stage 1 sampling units

          pweight: pw
              VCE: linearized
         Strata 1: strat
             SU 1: <observations>
            FPC 1: fpc

                                          #Obs per Unit
                                  ----------------------------
    Stratum    #Units     #Obs      min       mean      max
    --------  --------  --------  --------  --------  --------
           1       310       310         1       1.0         1
           2       310       310         1       1.0         1
    --------  --------  --------  --------  --------  --------
           2       620       620         1       1.0         1

Below we use the **svy: mean** command to get the population estimate of the
mean of **api00**. We can use the **estat effects** command to
get the design effect. Notice the value of the design effect, labeled Deff
in the output. The design effect compares
the current sampling design (in this case, stratified random sampling) with
simple random sampling. Design effects of 1 (or close to 1) indicate that
the current sampling design is about as efficient as a simple random sample.
Design effects that are smaller than 1 indicate that the current design is more
efficient than simple random sampling, while design effects that are larger than
1 indicate that the current sampling design is less efficient than simple random
sampling. Here, we can see the benefit of the stratification: the
design effect for **api00** is .35, well below 1. However, you will remember that we
stratified on the mean of **api99**, which is closely related to **api00**,
the variable for which we are getting an estimate.

    svy: mean api00 growth
    (running mean on estimation sample)

    Survey: Mean estimation

    Number of strata =       2          Number of obs    =     620
    Number of PSUs   =     620          Population size  =    6194
                                        Design df        =     618

    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
           api00 |   665.6216   2.957053      659.8145    671.4287
          growth |   33.26666    1.10437      31.09788    35.43543
    --------------------------------------------------------------

    estat effects

    ----------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.       Deff      Deft
    -------------+--------------------------------------------
           api00 |   665.6216   2.957053    .346875   .558707
          growth |   33.26666    1.10437    .962983   .930909
    ----------------------------------------------------------
    Note: Weights must represent population totals for deff to
          be correct when using an FPC; however, deft is
          invariant to the scale of weights.

The results of the **svy: total** command are shown below. For **yr_rnd**, the
design effect (which you can obtain by running **estat effects** after the command)
is not much smaller than 1; in other words, we get relatively
little benefit from the stratification. That is because there is not much
of a relationship between **api99** and **yr_rnd**. The point here
is that for stratification to be genuinely useful, you need to stratify on variable(s) closely
related to the variable of interest. In many cases, this will mean that
while stratification will make some estimates more efficient, it will not do so
for others.

    svy: total yr_rnd
    (running total on estimation sample)

    Survey: Total estimation

    Number of strata =       2          Number of obs    =     620
    Number of PSUs   =     620          Population size  =    6194
                                        Design df        =     618

    --------------------------------------------------------------
                 |             Linearized
                 |      Total   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
          yr_rnd |   789.5516   76.59607      639.1314    939.9717
    --------------------------------------------------------------

When estimates are made for each stratum, they are made independently of all
other strata. In other words, the estimate of **yr_rnd** for stratum 1
was made independently of the estimate for stratum 2. Also note that the
sum of the estimates for stratum 1 and stratum 2 equals the value shown above.

    svy: total yr_rnd, over(strat)
    (running total on estimation sample)

    Survey: Total estimation

    Number of strata =       2          Number of obs    =     620
    Number of PSUs   =     620          Population size  =    6194
                                        Design df        =     618

                1: strat = 1
                2: strat = 2

    --------------------------------------------------------------
                 |             Linearized
            Over |      Total   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
    yr_rnd       |
               1 |   639.7935   67.69423      506.8549    772.7321
               2 |   149.7581   35.83923      79.37663    220.1395
    --------------------------------------------------------------
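
The independence of the strata can be checked with a little arithmetic on the reported output. If the stratum estimates are independent, the totals add and so do their variances. The sketch below uses only numbers printed in the output above.

```stata
* Sum of the two stratum totals
display 639.7935 + 149.7581

* Standard error of the sum: square root of the summed variances
display sqrt(67.69423^2 + 35.83923^2)
```

Both results should agree, up to rounding, with the overall total (789.5516) and its standard error (76.59607) reported by **svy: total yr_rnd**.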

## Systematic sampling

Systematic sampling is just that: drawing a sample from elements that
are ordered in a systematic way. For example, you might take a systematic
sample of library books by selecting every k-th book from the books on the
shelf. (Remember that librarians hate when people actually do this!)
Of course, first you need to determine how large a sample you want to select. There are 6194 schools in our population, and we would like to use systematic sampling to select a sample of about 500.
First, we need to determine the “rate” at which schools should be selected.
We do this by dividing the number of elements (e.g., schools) in the population by the number
desired in the sample. Therefore, k = 6194/500 = 12.39, which we will
round up to 13. Hence, we will select every 13th school.
We will also randomly select a number from 1 to 13 and start counting from
there. In our example, we selected the number 4. Hence, we ordered
the schools from lowest id number to highest id number, started with school
number 4, and then selected into our sample every 13th school. Because we rounded k up, this yields 477 schools rather than 500. After creating our sample, we follow the same procedure as before: open the correct data file, issue the **svyset** command, check to see that everything is OK with the **svydes** command, and then begin our analysis with descriptive statistics.
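
The selection steps described above can be sketched as follows. This is not the code used to create the data file for this section. It assumes the population file is the **apipop** file used earlier and that the school id variable is named **snum** (an assumed name; check your data set). The pweight shown is one plausible choice, not necessarily the one stored in the authors' file.

```stata
* Sketch only: draw every 13th school, starting with school number 4
use http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/apipop, clear
* snum is an assumed name for the school id variable
sort snum
* sampling interval: 6194/500 rounded up
local k = ceil(6194/500)
* random start between 1 and k, as chosen in the text
local start = 4
keep if mod(_n - `start', `k') == 0
count
* each sampled school represents N/n schools in the population
gen pw = 6194/_N
gen fpc = 6194
```

With 6194 schools, an interval of 13 and a start of 4, **count** should report 477, which matches the number of observations in the **svydes** output below.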

    use http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/systematic.dta, clear
    svyset [pweight = pw], fpc(fpc)

          pweight: pw
              VCE: linearized
         Strata 1: <one>
             SU 1: <observations>
            FPC 1: fpc

    svydes

    Survey: Describing stage 1 sampling units

          pweight: pw
              VCE: linearized
         Strata 1: <one>
             SU 1: <observations>
            FPC 1: fpc

                                          #Obs per Unit
                                  ----------------------------
    Stratum    #Units     #Obs      min       mean      max
    --------  --------  --------  --------  --------  --------
           1       477       477         1       1.0         1
    --------  --------  --------  --------  --------  --------
           1       477       477         1       1.0         1

Below we get the population estimates for the mean of **api00** and **growth**.

    svy: mean api00 growth
    (running mean on estimation sample)

    Survey: Mean estimation

    Number of strata =       1          Number of obs    =     477
    Number of PSUs   =     477          Population size  =    6194
                                        Design df        =     476

    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
           api00 |   656.3061   5.655353      645.1935    667.4186
          growth |   33.08595   1.226588      30.67576    35.49615
    --------------------------------------------------------------

    estat effects

    ----------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.       Deff      Deft
    -------------+--------------------------------------------
           api00 |   656.3061   5.655353          1   .960724
          growth |   33.08595   1.226588          1   .960724
    ----------------------------------------------------------
    Note: Weights must represent population totals for deff to
          be correct when using an FPC; however, deft is
          invariant to the scale of weights.

Notice that the design effect for all variables is 1. This is not
necessarily because systematic sampling is always just as efficient as simple
random sampling. Rather, it has to do with the information that you have
given to Stata. The design effect is influenced by setting the strata and
PSU. In both simple random sampling and systematic sampling, we set
neither the strata nor the PSU. Hence, Stata “can’t tell the two sampling
plans apart.”
Because the specification of the sampling design is exactly the same as with
simple random sampling, the design effect is 1. However, you can calculate
the design effect by hand by dividing the variance of the variable of interest
under the current sampling design by the variance of the same variable under
simple random sampling. We did this and found that the design effects were
very close to 1. We found them to be .96 for **api00**, .93 for **growth** and 1.2 for **yr_rnd**.
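
The Deff and Deft columns reported by **estat effects** appear to be linked by the sampling fraction. The sketch below is only a consistency check on the printed numbers. It assumes, and the handout does not state this, that Stata's Deft compares the design with simple random sampling *with* replacement, so that Deft squared equals Deff times (1 - n/N). Check the **estat effects** documentation before relying on this.

```stata
* Systematic sample: n = 477, N = 6194, reported Deff = 1, Deft = .960724
display sqrt(1 * (1 - 477/6194))

* Stratified sample, api00: n = 620, N = 6194,
* reported Deff = .346875, Deft = .558707
display sqrt(.346875 * (1 - 620/6194))
```

Both values should land close to the reported Deft figures. Under that assumption, this is why Deft can be below 1 even when Deff is exactly 1.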

    svy: total yr_rnd
    (running total on estimation sample)

    Survey: Total estimation

    Number of strata =       1          Number of obs    =     477
    Number of PSUs   =     477          Population size  =    6194
                                        Design df        =     476

    --------------------------------------------------------------
                 |             Linearized
                 |      Total   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
          yr_rnd |   779.1195   90.44644      601.3958    956.8432
    --------------------------------------------------------------

Below we show the use of the **svy: tab** command. This can be used to
make one- and two-way crosstabulations. Here we will make a crosstab of
**both** and **awards**. The values in the cells are proportions.
You can use the **count** option (as shown below) to obtain the counts in
each cell. The **svy: tab** command also gives us the chi-square test for these two
variables. We can see that the relationship between them is statistically
significant.

    svy: tab both awards
    (running tabulate on estimation sample)

    Number of strata   =         1                  Number of obs      =       477
    Number of PSUs     =       477                  Population size    =      6194
                                                    Design df          =       476

    -------------------------------
    met both  | eligible for awards
    targets   |    no    yes  Total
    ----------+--------------------
           No | .3019      0  .3019
          Yes | .0503  .6478  .6981
              |
        Total | .3522  .6478      1
    -------------------------------
      Key:  cell proportions

      Pearson:
        Uncorrected   chi2(1)         =  379.3900
        Design-based  F(1, 476)       =  427.4673     P = 0.0000

    svy: tab both awards, count
    (running tabulate on estimation sample)

    Number of strata   =         1                  Number of obs      =       477
    Number of PSUs     =       477                  Population size    =      6194
                                                    Design df          =       476

    -------------------------------
    met both  | eligible for awards
    targets   |    no    yes  Total
    ----------+--------------------
           No |  1870      0   1870
          Yes | 311.6   4012   4324
              |
        Total |  2182   4012   6194
    -------------------------------
      Key:  weighted counts

      Pearson:
        Uncorrected   chi2(1)         =  379.3900
        Design-based  F(1, 476)       =  427.4673     P = 0.0000

    svy: reg api00 award meals
    (running regress on estimation sample)

    Survey: Linear regression

    Number of strata   =         1                  Number of obs      =       477
    Number of PSUs     =       477                  Population size    =      6194
                                                    Design df          =       476
                                                    F(   2,    475)    =    679.67
                                                    Prob > F           =    0.0000
                                                    R-squared          =    0.6967

    ------------------------------------------------------------------------------
                 |             Linearized
           api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          awards |   46.30969   7.237096     6.40   0.000     32.08908    60.53029
           meals |  -3.406531   .1056495   -32.24   0.000    -3.614128   -3.198934
           _cons |   791.0985   9.321325    84.87   0.000     772.7825    809.4146
    ------------------------------------------------------------------------------

## One-stage cluster sampling in Stata

In a one-stage cluster sample, the data are divided into two “levels”, one “nested” in the other. At the first level, the data are grouped into clusters. In a one-stage cluster sample, clusters are selected first and are called primary sampling units, or PSUs. All of the elements in each selected cluster are selected into the sample. These elements represent the second “level” of the data. In our one-stage cluster sample, the districts will be the clusters and the schools will be the elementary or sampling units. Hence, we randomly select school districts and then select all schools within each selected district. You can use any sampling plan to select the clusters; we have used SRS only for the sake of simplicity.

Typically, data values within a cluster are more similar to one another than they are to data values in other clusters. For example, if we surveyed people in households (e.g., people nested within households), we would expect that people in one household would be more similar to one another than they would be to people in another household. Unfortunately, this feature makes our estimates less efficient; that is, it inflates the standard errors. However, because of financial and/or logistical considerations, most surveys employ some sort of cluster sampling.

    use http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/oscs1, clear
    svyset dnum [pweight = pw], fpc(fpc)

          pweight: pw
              VCE: linearized
         Strata 1: <one>
             SU 1: dnum
            FPC 1: fpc

    svydes

    Survey: Describing stage 1 sampling units

          pweight: pw
              VCE: linearized
         Strata 1: <one>
             SU 1: dnum
            FPC 1: fpc

                                          #Obs per Unit
                                  ----------------------------
    Stratum    #Units     #Obs      min       mean      max
    --------  --------  --------  --------  --------  --------
           1       189      1463         1       7.7       100
    --------  --------  --------  --------  --------  --------
           1       189      1463         1       7.7       100

    svy: mean api00 growth
    (running mean on estimation sample)

    Survey: Mean estimation

    Number of strata =       1          Number of obs    =    1463
    Number of PSUs   =     189          Population size  = 5859.74
                                        Design df        =     188

    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
           api00 |   670.5202   11.09702      648.6295    692.4108
          growth |   32.85783   1.440905      30.01541    35.70025
    --------------------------------------------------------------

    svy: total yr_rnd
    (running total on estimation sample)

    Survey: Total estimation

    Number of strata =       1          Number of obs    =    1463
    Number of PSUs   =     189          Population size  = 5859.74
                                        Design df        =     188

    --------------------------------------------------------------
                 |             Linearized
                 |      Total   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
          yr_rnd |   797.0529   176.0585      449.7489    1144.357
    --------------------------------------------------------------

As you can see, the standard errors for **api00** and **yr_rnd** are much larger than they were for any of the previous sampling plans, even though this sample contains far more schools. (The standard error for **growth** is an exception: it is smaller than it was under the SRS.) Although we don’t show an example here, you can easily combine stratification with cluster sampling, and this will help to reduce the standard errors.

## Two-stage cluster sampling with stratification

In this last example, we will take a stratified two-stage cluster sample.
As with the stratified random sample illustrated above, the sampling for each
stratum will be done independently of every other stratum. A two-stage cluster
sample means that clusters will be sampled (using whatever sampling plan the
researcher chooses), and then elements within each of the selected clusters will
also be sampled. This is different from what we did above in that, in a
one-stage cluster sample, all of the elements in each selected cluster are
selected into the sample. In a two-stage cluster sample, (usually) only
some of the elements are selected into the sample. In our example, we will
take an SRS of school districts (clusters), and then we will take an SRS of
schools (elements). In the same way that you can use pretty much any
sampling plan to select clusters, you can use pretty much any sampling plan to
select elements from within the selected clusters; the sampling plan for
selecting the clusters does not have to be the same as the one for selecting the elements. Also, you do not have to use the same sampling plan from one stratum to the next, as the sampling between strata is independent. To obtain the sample used below, we first used the stratification that we used before, stratifying schools on whether their
**api99** score was above or below the mean. Next, we randomly selected 25% of the school districts from each stratum. Finally, we randomly selected three schools from each selected district. The choice to select three schools, as opposed to selecting two or four schools, was rather arbitrary. However, when deciding how many elements to select from a cluster, remember that you need to
have a sufficient number to get stable estimates; however, because data values
within each cluster are likely correlated, taking lots of them is often a waste of
resources: 200 elements probably won’t be any more informative than 100.
(This, of course, depends on how strong the correlation is.)

    use http://www.ats.ucla.edu/stat/stata/seminars/svy_stata_intro/strataboth, clear
    svyset dnum [pweight = pwt], fpc(fpc) strata(strata)

          pweight: pwt
              VCE: linearized
         Strata 1: strata
             SU 1: dnum
            FPC 1: fpc

    svydes

    Survey: Describing stage 1 sampling units

          pweight: pwt
              VCE: linearized
         Strata 1: strata
             SU 1: dnum
            FPC 1: fpc

                                          #Obs per Unit
                                  ----------------------------
    Stratum    #Units     #Obs      min       mean      max
    --------  --------  --------  --------  --------  --------
           1        94       227         1       2.4         3
           2        95       239         1       2.5         3
    --------  --------  --------  --------  --------  --------
           2       189       466         1       2.5         3

    svy: mean api00 growth
    (running mean on estimation sample)

    Survey: Mean estimation

    Number of strata =       2          Number of obs    =     466
    Number of PSUs   =     189          Population size  =  6032.9
                                        Design df        =     187

    --------------------------------------------------------------
                 |             Linearized
                 |       Mean   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
           api00 |     681.84   10.44856      661.2278    702.4522
          growth |   30.71763    2.22572      26.32688    35.10838
    --------------------------------------------------------------

    svy: total yr_rnd
    (running total on estimation sample)

    Survey: Total estimation

    Number of strata =       2          Number of obs    =     466
    Number of PSUs   =     189          Population size  =  6032.9
                                        Design df        =     187

    --------------------------------------------------------------
                 |             Linearized
                 |      Total   Std. Err.     [95% Conf. Interval]
    -------------+------------------------------------------------
          yr_rnd |   718.9149   214.9205      294.9345    1142.895
    --------------------------------------------------------------

    svy: reg api00 awards meals
    (running regress on estimation sample)

    Survey: Linear regression

    Number of strata   =         2                  Number of obs      =       466
    Number of PSUs     =       189                  Population size    = 6032.9042
                                                    Design df          =       187
                                                    F(   2,    186)    =    556.68
                                                    Prob > F           =    0.0000
                                                    R-squared          =    0.7114

    ------------------------------------------------------------------------------
                 |             Linearized
           api00 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          awards |   66.19885   5.867421    11.28   0.000       54.624    77.77369
           meals |  -3.192264   .1135934   -28.10   0.000    -3.416353   -2.968175
           _cons |   772.7654    6.72774   114.86   0.000     759.4934    786.0374
    ------------------------------------------------------------------------------

We have seen examples of how to do OLS regression with survey data, so now let’s do a logistic regression. First, we need to recode our dependent variable so that it is 0/1. Next, we issue the **svy: logit** command. If you want odds ratios, you can use the **or** option with **svy: logit**. In this example, we use some new variables. The variable **comp_imp1** is coded 0/1 and indicates whether the school met a comparable improvement target; **growth** is the difference between the current year’s API score and last year’s API score; **ell** is the percent of English language learners; and **mobility** is the percent of students for whom this is their first year at the school.

    svy: logit comp_imp1 growth ell mobility
    (running logit on estimation sample)

    Survey: Logistic regression

    Number of strata   =         2                  Number of obs      =       466
    Number of PSUs     =       189                  Population size    = 6032.9042
                                                    Design df          =       187
                                                    F(   3,    185)    =     20.80
                                                    Prob > F           =    0.0000

    ------------------------------------------------------------------------------
                 |             Linearized
       comp_imp1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          growth |   .1213203   .0159442     7.61   0.000     .0898667    .1527739
             ell |  -.0702944   .0119777    -5.87   0.000    -.0939231   -.0466657
        mobility |  -.0781154   .0202496    -3.86   0.000    -.1180624   -.0381684
           _cons |   .6391637   .3169899     2.02   0.045      .013828    1.264499
    ------------------------------------------------------------------------------
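The **or** option mentioned above can be sketched as follows. This is a minimal example, assuming the data are still loaded and the design has been declared with **svyset** as in the earlier commands; no output is shown because it was not part of the original session. An odds ratio is simply the exponentiated coefficient, so for **growth** it would be roughly exp(.1213203) ≈ 1.13.

```stata
* Assumes strataboth is loaded and svyset has been run as shown earlier.
* Same model as above, reported as odds ratios instead of log-odds coefficients.
svy: logit comp_imp1 growth ell mobility, or

* svy: logistic fits the same model and displays odds ratios by default.
svy: logistic comp_imp1 growth ell mobility
```

Either form fits the same model; only the way the coefficients are displayed changes.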

Now we will use a three-level variable, **meals3**, to show the use of the **test** command. Please note that “svytest” is an out-of-date command. As the example below shows, the **xi** prefix works with the **svy** commands (and so does **xi3**). However, you need to use the prefixes in the correct order: “xi: svy: logit” works, but “svy: xi: logit” does not.

    xi: svy: logit comp_imp1 growth ell mobility i.meals3
    i.meals3          _Imeals3_1-3        (naturally coded; _Imeals3_1 omitted)
    (running logit on estimation sample)

    Survey: Logistic regression

    Number of strata   =         2                  Number of obs      =       466
    Number of PSUs     =       189                  Population size    = 6032.9042
                                                    Design df          =       187
                                                    F(   5,    183)    =     14.28
                                                    Prob > F           =    0.0000

    ------------------------------------------------------------------------------
                 |             Linearized
       comp_imp1 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          growth |   .1333139   .0177526     7.51   0.000     .0982928     .168335
             ell |  -.0335437   .0134298    -2.50   0.013    -.0600371   -.0070503
        mobility |  -.0528434   .0194839    -2.71   0.007      -.09128   -.0144068
      _Imeals3_2 |  -1.976366   .3789415    -5.22   0.000    -2.723916   -1.228817
      _Imeals3_3 |   -2.54474   .9051281    -2.81   0.005    -4.330314   -.7591659
           _cons |   .5236906   .2685344     1.95   0.053    -.0060555    1.053437
    ------------------------------------------------------------------------------

    test _Imeals3_2 _Imeals3_3

    Adjusted Wald test

     ( 1)  _Imeals3_2 = 0
     ( 2)  _Imeals3_3 = 0

           F(  2,   186) =   15.94
                Prob > F =    0.0000
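The degrees of freedom reported by the adjusted Wald test can be checked by hand. The sketch below uses notation that is not in the original seminar and is introduced only for illustration: $d$ is the design degrees of freedom (number of PSUs minus number of strata), $k$ is the number of coefficients being tested, and $W$ is the unadjusted Wald statistic.

$$
d = 189 - 2 = 187, \qquad
F = \frac{d - k + 1}{k\,d}\, W \;\sim\; F(k,\; d - k + 1)
$$

With $k = 2$ dummy variables, the denominator degrees of freedom are $187 - 2 + 1 = 186$, which matches the F(2, 186) shown above. The same arithmetic gives the F(3, 185) and F(5, 183) in the headers of the two logistic regressions.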

## Summary of population values, estimates, standard errors, design effects and estimated population totals for each sampling plan

The table below summarizes the values obtained from the descriptive statistics that we ran under each of the sampling plans, as well as the estimated population size. It also contains the population values, which, of course, are not estimates, and hence do not have standard errors or design effects associated with them. (To obtain the design effects, you will need to issue the **estat effects** command after the analysis command.) A design effect is the ratio of the variance of an estimate under the current sampling design to its estimated variance under simple random sampling. In other words, it is an estimate of the efficiency of the current sampling design relative to simple random sampling. As you can see, the standard errors and the design effects for the stratified simple random sample are the smallest. The design effects become much larger when cluster sampling is used. In this table, the largest design effects are those for one-stage cluster sampling; those for the stratified two-stage cluster sample are smaller but still well above 1. Also notice that cluster sampling yields estimates of the population size that are considerably different from those obtained using other types of sampling plans. You should not assume that this pattern of results will be obtained every time these sampling plans are compared. Some plans that look relatively inefficient in this example may appear to be more efficient with other samples and/or other data.

| Sampling plan | mean api00: estimate | mean api00: standard error | mean api00: design effect | mean growth: estimate | mean growth: standard error | mean growth: design effect | total yr_rnd: estimate | total yr_rnd: standard error | total yr_rnd: design effect | estimated population size |
|---|---|---|---|---|---|---|---|---|---|---|
| population values | 664.71 | N/A | N/A | 32.80 | N/A | N/A | 874 | N/A | N/A | 6194 |
| SRS | 663.26 | 7.21 | 1 | 33.85 | 1.67 | 1 | 719.30 | 110.03 | 1 | 6194 |
| Stratified SRS | 665.62 | 2.96 | .35 | 33.27 | 1.10 | .96 | 789.55 | 76.60 | .95 | 6194 |
| Systematic | 656.31 | 5.66 | 1 | 33.09 | 1.23 | 1 | 779.12 | 90.45 | 1 | 6194 |
| One-stage cluster | 670.52 | 11.10 | 15.28 | 32.86 | 1.44 | 4.55 | 797.05 | 176.06 | 14.97 | 5860 |
| Stratified two-stage cluster | 681.84 | 10.45 | 3.90 | 30.72 | 2.23 | 3.01 | 718.91 | 214.92 | 6.09 | 6033 |
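The design effects in the table come from **estat effects**, run immediately after each estimation command. A minimal sketch, assuming the data set for the sampling plan of interest is already loaded and its design has been declared with **svyset**:

```stata
* Assumes the data are loaded and svyset for the sampling plan being summarized.
* Means of api00 and growth, followed by their design effects.
svy: mean api00 growth
estat effects

* Estimated total of yr_rnd, followed by its design effect.
svy: total yr_rnd
estat effects
```

Because **estat effects** is a postestimation command, it reports on whichever **svy** estimation was run most recently, so it has to be repeated after each analysis.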