Code for this seminar
All code for this seminar is available here — copy and paste into the Stata do-file editor or right-click to download.
Purpose of the seminar
This seminar covers Stata commands and methods to prepare data for statistical analysis. Topics include:
- Inputting data into Stata
- Viewing the dataset
- Characterizing and summarizing variables
- Selecting Observations
- Appending files
- Merging data
- Handling duplicates
- Missing data
- Creating and transforming variables
- Renaming variables
- Managing string variables
- Labels
- Processing data by groups
- Macros
- Looping
What makes a well-managed dataset in Stata?
Well-managed data prevents processing errors when running commands that expect the data to be structured in a specific way and estimation errors when improper or wrong values are recorded in the data.
Well-managed data in Stata generally have the following properties:
- rectangular, consistent number of rows and columns across dataset (so don’t import any graphs, column totals, or long text notes from Excel)
- no duplicated observations
- user-declared missing values indicators (e.g. -99) are known to Stata to mean missing
- numeric variables have been checked for erroneous values
- string variables are checked for errors and extra spaces
- categorical variables used in statistical analysis are defined as numeric rather than string
- all variables needed for analysis have been created and cleaned
Preliminary advice
Help files
Precede a command name (and certain topic names) with help
to access its help file.
Let’s take a look at the help file for the describe
command.
help describe
In the Title section:
- command name and a brief description
- the blue command name describe is a clickable link to a .pdf of the Stata manual entry for
describe
- manual entries include details about methods and formulas used for estimation commands, and detailed, thoroughly explained examples.
In the Syntax section:
- various uses of command and how to specify it
- bolded words are required: only the word
describe
is required to describe data in memory - italicized words are to be substituted by the user: varlist by variable names, memory_options by one or more of the options in the memory_options section below
- [Bracketed] words are optional: only the word
describe
is required (for dataset in memory); specifying a varlist or memory_options are optional. - a comma
,
is almost always used to initiate the list of options
Under the Syntax section will typically be a listing and description of options available for the command.
Below that will typically be examples of using the command, including video examples for some commonly used commands.
Commands and options can be abbreviated
In the Syntax section of the help file for describe
, the first letter d is underlined like so: describe. The underlined part of the command name is the minimal abbreviation of the command required for Stata to understand it.
Thus, we can use the describe
command by simply issuing d
.
We see that options in the memory_options can also be abbreviated.
We will use abbreviations, though not usually the minimal abbreviation, throughout this seminar. Efficiency of coding is one of Stata’s strengths.
Using the do-file editor
We highly recommend writing and running code in do-files, text files where Stata commands can be saved.
The do-file editor can be opened by issuing the command doedit
, or clicking on the pencil-and-paper icon.
Some tips for using do-files:
- To run code from the do-file, highlight the code and then hit Ctrl-d or click on the right-most icon,the “Execute(do)” icon, on the toolbar at the top of the do-file editor
- To make comments that won’t be run, precede text with
*
or enclose text within/*
and*/
. - To continue a long command specification across multiple lines, place <code>///</code> at the end of each line, except for the last (make sure to put a space between the command text and the continuation lines).
*a comment won't be run by Stata describe /// age using /// https://stats.idre.ucla.edu/stat/data/patient_pt2_stata_dm.dta storage display value variable name type format label variable label ------------------------------------------------------------------------------------- age float %9.0g Age
Inputting data into Stata
Clear memory for new datasets
Stata allows only one dataset to be loaded at a time. Because of this, Stata requires that no dataset be already loaded when loading in a dataset from file. We can clear any dataset from memory with the clear
command:
clear
As a convenience, Stata usually allows the data to be cleared in commands that load in data through a clear
option (e.g. import excel filename, clear
)
Excel and .csv files
Note: Throughout the seminar, we load datasets over the internet. Typically, a Stata user would load data from a storage device, such as a hard drive, so, most specifications would include filename (with path specification, such as “C:/path/to/file/filename”) instead of a http web address.
Excel files and comma-separated values (.csv) files are among the most common ways to store raw data. Both storage types are read in using a variant of the import
command.
We read in Excel files using import excel
. The minimum specification is just import excel
and then the Excel file name (with path if file not in current directory).
Below, we load in an Excel dataset pulled from our website. We add the option clear
to clear any data in memory first before importing.
import excel "https://stats.idre.ucla.edu/stat/data/hsb2.xls", firstrow clear
firstrow
tells Stata that the first row of the Excel files have variable namesclear
clears memory of any datasets before importing
To read in .csv files (as well as most text files in general), we use import delimited
instead.
import delimited "https://stats.idre.ucla.edu/stat/data/hsb2.csv", clear
- Stata assumes that variable names are the first row of data in text files, so no
firstrow
option forimport delimited
Inputting from the keyboard
Stata allows data to be entered directly through the keyboard with the input
command, even when another dataset is already in memory. This can be useful to add data that may not be used in the ensuing statistical analysis, such as graphing data.
To use input
:
- variable names follow
input
- the keyword
end
terminates data entry - number of rows does not need to be the same as data in memory
input age weight 8 11 9 12 8 10 9 11 10 15 end
If we are inputting string (character) variables, precede the string variable name with strn
, where n is the maximum length of any string for that variable.
input str10 words "These" "are" "words." end
Stata data files
Datasets stored in the native Stata .dta format are loaded with the use
command. The .dta extension does not need to be specified:
use "https://stats.idre.ucla.edu/stat/data/hsb2", clear
Dataset for this seminar
We load in the first dataset used for this seminar:
use https://stats.idre.ucla.edu/stat/data/patient_pt1_stata_dm, clear
This dataset contains fake cancer patient data. Each patient is also linked to a doctor in the dataset. Another dataset containing doctor variables will be merged into this dataset later in the seminar.
Viewing the dataset
Specifying list
by itself prints the entire dataset to screen. We do not recommend this for very large datasets.
Instead, list
followed by variable names will display only those variables. Ranges of variables are allowed.
Here we list the variables hospital through pain:
list hospital-pain +--------------------------------------------------------------------+ | hospital hospid docid dis_date tumors~e co2 pain | |--------------------------------------------------------------------| 1. | UCLA 1 1-1 06sep2009 67.98119 1.534333 4 | 2. | UCLA 1 1-1 07jan2011 64.70246 1.676132 2 | 3. | UCLA 1 1-1 04sep2010 86.43799 1.4533 3 | 4. | UCLA 1 1-1 25jun2010 53.40018 1.566348 3 | 5. | UCLA 1 1-1 01jul2009 51.65727 1.417868 4 | |--------------------------------------------------------------------| 6. | UCLA 1 1-1 06mar2009 78.91708 1.706632 3 | 7. | UCLA 1 1-1 15apr2010 62.85259 1.542961 4 | 8. | UCLA 1 1-11 25jul2010 73.18581 1.451655 4 | . . . [some output omitted]
Data can also be viewed in a spreadsheet-style window by either issuing the command browse
or clicking on the Browse icon in the toolbar, the spreadsheet and magnifying glass.
browse
Characterize and summarize variables with describe
and codebook
describe
The describe
command provides the following information about how variables are stored:
- storage type (e.g. byte (integer), float (decimal), str10 (character string variable of length 10))
- display format, or how the values appear in Stata
- value label
- variable label
describe Contains data from https://stats.idre.ucla.edu/stat/data/patient_pt1_stata_dm.dta obs: 120 vars: 25 6 May 2017 19:05 size: 11,400 ------------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------------- hospital str14 %14s hospid byte %8.0g hospid docid str5 %9s dis_date float %td tumorsize float %9.0g co2 float %9.0g pain byte %8.0g . . . [some output omitted]
In the table above we also see that we have 120 observations and 25 variables.
codebook
For more detailed information about the values of each variable, use codebook
, which provides the following:
- For all variables
- Number of unique values
- Number of mising values
- For numeric variables
- Range
- quantiles, means and standard deviation for continuous variables
- frequencies for discrete variables
- For string variables
- frequencies
- warnings about leading and trailing blanks
The codebook
command can be followed by specific variable names, or specified by itself to process all variables.
codebook ------------------------------------------------------------------------------------- hospital (unlabeled) ------------------------------------------------------------------------------------- type: string (str14), but longest is str6 unique values: 6 missing "": 0/120 tabulation: Freq. Value 2 "UCLA " 2 "UCSF " 59 "UCLA" 1 "UCLA " 55 "UCSF" 1 "UCSF " warning: variable has leading and trailing blanks ------------------------------------------------------------------------------------- hospid hospid ------------------------------------------------------------------------------------- type: numeric (byte) range: [1,2] units: 1 unique values: 2 missing .: 0/120 tabulation: Freq. Value 62 1 58 2 . . . [some output omitted]
Selecting observations
Selecting by observation number with in
The simplest method of selection is by observation number, such as the first 10 observations, or observations 30 through 100.
In Stata, the in
operator can be used to specify a range of consecutive observations to select.
After in
, specify the number of the first observation, followed by a slash /
, and then the number of the last observation.
Here we list
the age variable for observations 1 through 10.
list age in 1/10 +----------+ | age | |----------| 1. | 62.59215 | 2. | 47.63313 | 3. | 55.92456 | 4. | 50.91338 | 5. | 51.74344 | |----------| 6. | 54.42569 | 7. | 46.80042 | 8. | 58.43031 | 9. | 41.02624 | 10. | 40.33957 | +----------+
The in
operator accepts negative numbers, which specify the number of observations from the end (-10 is 10 observations from the end), and also accepts the characters “l” or “L” to mean the last observation.
The following lists the age variable for the last 10 observations in the dataset:
list age in -10/L +----------+ | age | |----------| 111. | 40.90655 | 112. | 50.38913 | 113. | 49.10889 | 114. | 54.06199 | 115. | 52.07264 | |----------| 116. | 51.95874 | 117. | 51.08696 | 118. | 47.10477 | 119. | 48.44899 | 120. | 58.73587 | +----------+
Selecting by condition with if
We can also select observations that meet a certain condition, such as observations where sex=”female”.
Conditional selection is handled in Stata by the if
operator, which is almost always placed after the command specification, but before the comma that marks the beginning of the list of options.
Below we list those observations where sex = “female” and pain greater than 8, with a clean
formatting that removes the table lines.
li age if sex == "female" & pain > 8, clean age 103. 41.51811 116. 51.95874
Appending files — adding more rows of observations
Often datasets are split into multiple files, perhaps because data are collected in several waves or by different researchers.
In Stata we append
data files when we need to add more rows of observations of the same variables.
To use append
:
- The datasets should have most or all of the same variables
- Variables not common to all datasets can be appended, but will have missing values wherever absent
- Variables with the same name should have the same type (string, float, etc.)
- The dataset in memory is the master dataset
- Datasets stored elsewhere to be appended follow the keyword
using
in theappend
statement
Now we append another patient dataset from a different hospital, with all the same variables except that the new hospital did not assess the variable nmorphine. Using describe
on the appended datasets shows that we now have 231 observations, 111 more than before. We see 25 variables, so no new variables were added .
append using https://stats.idre.ucla.edu/stat/data/patient_pt2_stata_dm describe Contains data from https://stats.idre.ucla.edu/stat/data/patient_pt1_stata_dm.dta obs: 231 vars: 25 6 May 2017 19:05 size: 21,945 ------------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------------- hospital str14 %14s hospid byte %8.0g hospid docid str5 %9s dis_date float %td tumorsize float %9.0g co2 float %9.0g pain byte %8.0g . . . [some output omitted]
A tabulate
of nmorphine with the missing
option shows that it is missing the new 111 observations:
tab nmorphine, miss nmorphine | Freq. Percent Cum. ------------+----------------------------------- 0 | 20 8.66 8.66 1 | 12 5.19 13.85 2 | 25 10.82 24.68 3 | 14 6.06 30.74 4 | 13 5.63 36.36 5 | 15 6.49 42.86 6 | 5 2.16 45.02 7 | 5 2.16 47.19 8 | 7 3.03 50.22 9 | 1 0.43 50.65 11 | 1 0.43 51.08 12 | 1 0.43 51.52 13 | 1 0.43 51.95 . | 111 48.05 100.00 ------------+----------------------------------- Total | 231 100.00
Merging data — adding columns of variables
When we append
datasets we add more rows of observations, whereas when we merge
datasets, we add more columns of variables. Datasets to be merged should generally be matched on some id variable, so that the correct variable values are grouped together.
To use merge
:
- Only one dataset can be merged into the dataset in memory in a single
merge
command - The dataset in memory is the master dataset
- The dataset stored elsewhere to be merged follows the keyword
using
in themerge
statement and is the using dataset - One more identification (id) variables should be used to match observations between datasets
- The id variables should uniquely identify observations in at least one dataset
- If the id variables uniquely identify observations in both datasets, this is a 1-to-1 merge
- If the id variables uniquely identify observations in only the master dataset, this is a 1-to-many merge
- If the id variables uniquely identify observations in only the using dataset, this is a many-to-1 merge
- If there are variables common to both datasets and one should be used to overwirte the other, consult the
update
andreplace
options
The syntax of merge
is:
merge merge_type id_varlist using filename
where:
- merge_type is one of
1:1
,1:m
, orm:1
for a 1-to-1, 1-to-many, or many-to-1 merge, respectively - id_varlist is one or more id variables to match observations between datasets
- filename is the path and filename of the using file
Let’s merge
in a dataset with variables describing doctors attending patients in our patient dataset. The doctor dataset will be our using dataset.
We can take a quick look at the doctor dataset without loading into memory using describe
:
describe using https://stats.idre.ucla.edu/stat/data/doctor_stata_dm Contains data obs: 40 6 May 2017 19:05 vars: 5 size: 880 ------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------------- docid str5 %9s experience byte %8.0g Experience school str7 %9s School lawsuits byte %8.0g Lawsuits medicaid float %9.0g Medicaid -------------------------------------------------------------------------------
We see that there are 40 doctors in the dataset. We also see the variable “docid”, the doctor id variable that we will use to merge the datasets.
Each doctor sees multiple patients in the patient dataset. We see this with a tabulate
of docid:
tab docid docid | Freq. Percent Cum. ------------+----------------------------------- 1-1 | 7 3.03 3.03 1-100 | 9 3.90 6.93 1-11 | 8 3.46 10.39 1-21 | 1 0.43 10.82 1-22 | 4 1.73 12.55
With each docid repeated in the master (patient) dataset, and each docid unique in the using (doctor) dataset, we will be doing a many-to-1 merge on the merge variable docid:
merge m:1 docid using https://stats.idre.ucla.edu/stat/data/doctor_stata_dm Result # of obs. ----------------------------------------- not matched 8 from master 7 (_merge==1) from using 1 (_merge==2) matched 224 (_merge==3)
In output we see that 224 of our original 231 patient observations were successfully matched to doctors. However, seven patients in the master data were not matched, and one doctor in the using data was also not matched.
Upon merging files, Stata generates a new variable, _merge, which equals:
- 1 if the observation’s merge id was unique to the master file (a docid found only in patient file)
- 2 if the observation’s merge id was unique to the using file (a docid found only in doctor file)
- 3 if the observation’s merge id was matched
We only want patients in our data, so we will drop the unmatched doctor by selecting on _merge==2.
drop if _merge == 2 (1 observation deleted)
Handling duplicated data
Inadvertently duplicated observations can be hard to spot in a visual inspection of the data, particularly if there is no unique ID for each observation or the dataset is large. Fortunately, Stata provides a suite of commands to identify and remove duplicates — see help duplicates
.
Count the number of duplicates
To display a count of the number of duplicated observations use duplicates report
.
duplicates report Duplicates in terms of all variables -------------------------------------- copies | observations surplus ----------+--------------------------- 1 | 217 0 2 | 14 7 --------------------------------------
Seven observations are duplicated (once each), creating a total of 14 observations, 7 of which are surplus. The remaining 217 observations are unduplicated.
Notice that these are counts of duplicates “in terms of all variables”, the default behavior if no variables are specified. The same counts of duplicates can be achieved by using the keyword _all
.
duplicates report _all Duplicates in terms of hospital hospid docid dis_date tumorsize co2 pain wound mobility ntumors nmorphine remission lungcapacity age married familyhx smokinghx sex cancerstage lengthofstay wbc rbc bmi test1 test2 experience school lawsuits medicaid _merge -------------------------------------- copies | observations surplus ----------+--------------------------- 1 | 217 0 2 | 14 7 --------------------------------------
We can also check for duplicates along a limited set of variables, rather than all variables. Here we count how many times each of the three “hospid” (hospital id) values are duplicated.
duplicates report hospid Duplicates in terms of hospid -------------------------------------- copies | observations surplus ----------+--------------------------- 58 | 58 57 62 | 62 61 111 | 111 110 --------------------------------------
We see the number of copies of each of our 3 hospital ids.
Tag duplicates for inspecton
Duplicated observations are not always errors, so inspecting them would be prudent.
duplicates tag
generates a variable that codes 0 for unique observations, 1 for observations with 2 copies, 2 for observations with 3 copies, and so on. Here we create such a variable called “dup”, and then tabulate dup to check the results.
duplicates tag, gen(dup) Duplicates in terms of all variables tab dup dup | Freq. Percent Cum. ------------+----------------------------------- 0 | 217 93.94 93.94 1 | 14 6.06 100.00 ------------+----------------------------------- Total | 231 100.00
We see the expected 14 values of 1, the 7 duplicated observations with 2 copies.
Dropping duplicates
To drop all copies of an observation but the first, use duplicates drop
.
duplicates drop Duplicates in terms of all variables (7 observations deleted)
The seven surplus observations have been dropped.
Missing Data
Missing data can be a vexing problem, particularly when data are not self-collected and missing data codes (e.g. -99) are not documented well. Stata provides a number of commands to count and report missing values, and to convert missing data codes to true Stata missing values. See help missing
for an overview of missing values in Stata.
Missing values
Stata represents missing values for numeric variables with a dot, .
(also called sysmiss), and for string variables an empty string, ""
(also called blank).
Additional missing data values are available, starting with .a
and ending with .z
. which can be used to represent different types of missing (refusal, don’t know, etc.).
Missing values are very large numbers in Stata, with all non-missing numbers < .
< .a
< .b
< … < .z
.
When reading in data from a text or Excel file, missing data for both numeric and string variables can be represented by an empty field.
Detecting missing data codes
Often we work with datasets where certain extreme numeric values are used to represent missing values, such as -99 or 999. Undetected, these missing data codes can be included as real data in statistical analysis, greatly distorting results.
We can use numeric reports and graphs to detect these missing data codes if we are not sure where they are used. The summarize
command estimates means, variances, min and max values for variables. Missing data values are often found in the min or max columns:
summarize Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- hospital | 0 hospid | 224 2.209821 .8394382 1 3 docid | 0 dis_date | 224 18306.74 192.6187 17743 18809 tumorsize | 224 69.78098 11.76117 38.67265 109.0096 -------------+--------------------------------------------------------- co2 | 224 -2.832827 20.61865 -98 2.047362 pain | 224 5.379464 1.568675 1 9 wound | 224 5.65625 1.551119 2 9 mobility | 224 5.9375 1.745108 2 9 ntumors | 224 3.357143 2.688148 0 9 -------------+--------------------------------------------------------- nmorphine | 117 3.376068 2.744042 0 13 remission | 224 .28125 .4506162 0 1 lungcapacity | 224 -18.33162 39.29377 -99 .9982018 age | 224 52.95722 21.45249 34.19229 357.89 married | 224 .6071429 .4894793 0 1 -------------+--------------------------------------------------------- familyhx | 0 smokinghx | 0 sex | 0 cancerstage | 0 lengthofstay | 224 5.419643 1.121645 3 9 -------------+--------------------------------------------------------- wbc | 0 rbc | 224 4.993808 .266146 4.177896 5.660488 bmi | 224 28.60466 6.43512 18.44991 58 test1 | 224 -3.683989 27.53701 -99 23.72776 test2 | 224 -2.795368 27.8277 -99 19.94562 -------------+--------------------------------------------------------- experience | 217 19.11521 4.611841 9 27 school | 0 lawsuits | 217 1.852535 1.486484 0 5 medicaid | 217 .523119 .2085848 .1415814 .8187299 _merge | 224 2.9375 .3487646 1 3 -------------+--------------------------------------------------------- dup | 224 .03125 .1743823 0 1
We see suspicious codes -98 and -99 in the variables co2, lungcapacity, test1, and test2. We also see the suspicious value 357.89 for age.
Notice that we do not get summaries for string variables. We will need to use another command to detect missing data codes for string variables.
Boxplots highlight outliers, which missing data codes tend to be. Here we use graph box
to detect missing data codes in the variables co2, lungcapacity, test1, and test2:
graph box co2 lungcapacity test1 test2
We see obvious outliers, the missing data codes, including 2 for lungcapacity.
For discrete variables, we can use tabulate
(abbreviated as tab
) to print tables of unique values where missing data codes can be easily spotted. The missing
option will print any true missing values to the table as well.
Let’s inspect the string variables smokinghx and familyhx:
tab smokinghx, miss SmokingHx | Freq. Percent Cum. ------------+----------------------------------- -99 | 16 7.14 7.14 current | 41 18.30 25.45 former | 42 18.75 44.20 never | 125 55.80 100.00 ------------+----------------------------------- Total | 224 100.00 tab familyhx, miss FamilyHx | Freq. Percent Cum. ------------+----------------------------------- -99 | 16 7.14 7.14 no | 171 76.34 83.48 yes | 37 16.52 100.00 ------------+----------------------------------- Total | 224 100.00
Use mvdecode
to convert user-defined missing data codes to missing values
We can quickly convert all user-defined missing codes to true missing values for numeric variables with mvdecode
.
Below, we specify that the code -99 be treated as missing for all variables:
mvdecode _all, mv(-99) hospital: string variable ignored docid: string variable ignored lungcapacity: 33 missing values generated familyhx: string variable ignored smokinghx: string variable ignored sex: string variable ignored cancerstage: string variable ignored wbc: string variable ignored test1: 17 missing values generated test2: 17 missing values generated school: string variable ignored
Notice that although we specified all variables, the string variables were ignored. Unfortunately, mvdecode
will not work at all on string variables.
Instead, we use replace
to convert “-99” to “” for string variables familyhx and smokinghx:
replace familyhx = "" if familyhx == "-99" (16 real changes made) replace smokinghx = "" if smokinghx == "-99" (16 real changes made)
Remember to add the option missing
to tabulate
to report missing values in the table. Let’s check that familyhx and smokinghx missing values have been correctly converted:
tab familyhx, miss FamilyHx | Freq. Percent Cum. ------------+----------------------------------- | 16 7.14 7.14 no | 171 76.34 83.48 yes | 37 16.52 100.00 ------------+----------------------------------- Total | 224 100.00 tab smokinghx, miss SmokingHx | Freq. Percent Cum. ------------+----------------------------------- | 16 7.14 7.14 current | 41 18.30 25.45 former | 42 18.75 44.20 never | 125 55.80 100.00 ------------+----------------------------------- Total | 224 100.00
The missing values look good.
By default mvdecode
will convert the missing data codes to .
. We can also specify other missing values, such as .a
, to distinguish among different kinds of missing (e.g. was not assessed, refused to answer, data error, etc.).
Here we convert the code -98 to missing values .a
, which we will label to mean “refused to answer” later in the seminar.
mvdecode lungcapacity co2, mv(-98=.a) lungcapacity: 10 missing values generated co2: 10 missing values generated
Finally, we noticed a likely data error for age, 357.89. Imagine we know that all patients in this dataset are adults, such that any ages below 18 or above a realistic upper age, say 120, should be declared as errors.
Let’s replace all ages outside of the range [18,120] with missing value .b
, which we will label as “data error” later in the seminar.
summ age Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- age | 224 52.95722 21.45249 34.19229 357.89 replace age = .b if age > 120 | age < 18 (1 real change made, 1 to missing)
Report counts of missing values with misstable summarize
Now that all of our missing data codes have been converted to missing values, we can count the number of missing (and non-missing) values with misstable summarize
. Let’s summarize lungcapacity, test1, and test2:
misstable summarize lungcapacity test1 test2 Obs<. +------------------------------ | | Unique Variable | Obs=. Obs>. Obs<. | values Min Max -------------+--------------------------------+------------------------------ lungcapacity | 33 10 181 | 181 .1677225 .9982018 test1 | 17 207 | 207 .1048958 23.72776 test2 | 17 207 | 207 .1860638 19.94562 -----------------------------------------------------------------------------
- column
Obs=.
counts the number of missing values equal to.
- column
Obs>.
counts the number of missing values other than.
, such as.a
and.b
- column
Obs<.
and the entire right-hand sectionObs<.
address non-missing values
Profile how variables are missing together (missing data patterns) with misstable patterns
Often, examining how variables are missing together can help the researcher understand the reasons for missing. For example, variables assessed at the same time are likely to be missing together.
With no variables specified, misstable patterns
will report patterns across all variables with missing values:
misstable patterns Missing-value patterns (1 means complete) | Pattern Percent | 1 2 3 4 5 6 7 8 9 ------------+-------------------------------- 37% | 1 1 1 1 1 1 1 1 1 | 31 | 1 1 1 1 1 1 1 1 0 9 | 1 1 1 1 1 1 1 0 1 5 | 1 1 1 1 1 1 1 0 0 3 | 1 1 1 1 0 1 1 0 0 2 | 1 1 1 1 1 0 1 1 0 2 | 1 1 1 1 1 0 0 1 0 2 | 1 1 1 1 1 0 1 1 1 2 | 1 1 1 1 1 1 0 1 0 2 | 1 1 1 1 1 1 0 1 1 1 | 1 1 1 1 0 1 1 0 1 1 | 1 1 1 1 1 0 0 1 1 . . . [some output omitted]
- 9 variables with missing listed in the columns under “Pattern”
- variables are identified at the bottom as 1=age, 2=co2, etc (not shown)
- each row of the table represents a distinct pattern of missingness across those 9 variables, with a 0 for missing and 1 for present
- The first row (all 1 values) represents complete data on all 9 variables — 37% of observations
- The second row represents missing on just variable 9 (nmorphine) — 31% of observations
Creating and transforming variables
The original dataset variables may not necessarily be the variable we need for analysis. Instead, we often need variables that are transformations or combinations of other variables.
Use generate
and replace
for simple arithmetic or logical operations
The basic variable generation commands generate
(abbreviated gen
or even g
) and replace
can be used to create variables that are formed by performing arithmetic or logical operations on existing variables.
Below we create a variable that is the average of test1 and test2. When performing arithmetic operations with generate
, if any input variable is missing, the resulting value will be missing as well. We use misstable patterns
to check missing values on all three variables.
gen average = (test1 + test2)/2 (26 missing values generated) misstable patterns average test1 test2 Missing-value patterns (1 means complete) | Pattern Percent | 1 2 3 ------------+------------- 88% | 1 1 1 | 4 | 0 1 0 4 | 1 0 0 4 | 0 0 0 ------------+------------- 100% | Variables are (1) test1 (2) test2 (3) average
We see above that average (variable 3) is always missing if either or both of test1 and test2 are missing as well.
Be careful with logical operations when working with missing values
Remember that in Stata, missing values are very large numbers, with all non-missing numbers < .
< .a
< .b
< … < .z
. Thus, the comparison 100 < .
results in TRUE, even if the desired result is a missing value. Care must be taken when using logical operations on variables with missing.
For example, the following set of commands incorrectly create a dummy indicator variable coding for age over 50.
gen above50_wrong = 0 replace above50_wrong = 1 if age > 50 (142 real changes made) misstable summarize age above50_wrong Obs<. +------------------------------ | | Unique Variable | Obs=. Obs>. Obs<. | values Min Max -------------+--------------------------------+------------------------------ age | 1 223 | 223 34.19229 70.26617 -----------------------------------------------------------------------------
The variable above50_wrong does not even appear in the misstable summarize
output, which means it has no missing, even though we see that age has one missing value (a .b
value coding for a data error). For the value where age=.b, the comparison .b > 50 results in TRUE and a value of 1 on the indicator.
Here we show how to create the dummy indicator for age over 50 to account for missing on the original age variable correctly. The 0/1 indicator variable is created in one generate
step, and then the indicator above50 is replaced with the missing value itself if a missing value is detected:
gen above50 = age > 50 replace above50 = age if age >= . (1 real change made, 1 to missing) misstable summarize age above50 Obs<. +------------------------------ | | Unique Variable | Obs=. Obs>. Obs<. | values Min Max -------------+--------------------------------+------------------------------ age | 1 223 | 223 34.19229 70.26617 above50 | 1 223 | 2 0 1 -----------------------------------------------------------------------------
Functions to transform variables
Functions accept an input and return some sort of output, so naturally can be used to transform variables with generate
and replace
. Consult help functions
for links to several help pages for functions split by category. Categories include:
- datetime: functions to convert string representations of dates and times to numeric representations, and vice-versa
- mathematical: e.g. log, abs, round, floor, sqrt, max
- random number: generate random numbers according to various distributions specified by various parameters
- statistical: density (pdf), distribution (cdf), and quantile (inverse cdf) functions
- string: manipulation of character string variables (see section “Managing string variables” below)
Almost all of the functions that work with generate
accept only one variable (or none) as an argument.
Below we use functions with generate
to create a variable representing the running sum of married and to create a random number variable based on the standard uniform distribution, which could later be used a random selection variable:
gen marsum=sum(married) gen random=runiform() list married marsum random in 1/10 +-----------------------------+ | married marsum random | |-----------------------------| 1. | 0 0 .7118431 | 2. | 0 0 .1796715 | 3. | 1 1 .4506519 | 4. | 0 1 .1946068 | 5. | 1 2 .7135741 | |-----------------------------| 6. | 0 2 .2453114 | 7. | 1 3 .7672456 | 8. | 1 4 .3653557 | 9. | 0 4 .2706914 | 10. | 0 4 .9911318 | +-----------------------------+
The egen
command extends variable generation with even more functions
The egen
command, short for “extended generate”, creates variables with its own, exclusive (cannot be used outside of egen
) set of functions, which include:
- functions that accept multiple variables as arguments: e.g. means across several variables
- functions that accept a single variable, but have complex options: e.g. cutting a continuous variable into several categories
- statistical functions that work with
by
: e.g. standard deviations by group (see “Processing data by group”)
We will explore several commonly used egen
functions.
First we create variables representing the mean and total of test1 and test2, using the egen
-specific functions, rowtotal
and rowmean
. Unlike sum and mean variables created with generate
, these variables will use any available data, such that missing values are returned for the mean only if all input variables are missing, and a zero is returned for totals if all inputs are missing.
egen mean = rowmean(test1 test2) (8 missing values generated) egen total = rowtotal(test1 test2)
Notice that no missing values at all were generated for total.
A look at misstable patterns
shows that “mean”, created by egen
is missing only if both test1 and test2 are missing (whereas “average” created by generate
is missing if either or both are missing). Also, there is no missing at all for total because it does not appear in the table (zero is returned if all inputs missing):
misstable patterns test1 test2 average mean total Missing-value patterns (1 means complete) | Pattern Percent | 1 2 3 4 ------------+------------- 88% | 1 1 1 1 | 4 | 1 0 1 0 4 | 1 1 0 0 4 | 0 0 0 0 ------------+------------- 100% | Variables are (1) mean (2) test1 (3) test2 (4) average
Another couple of egen
functions, rowmiss
and rownonmiss
, count the number of missing and non-missing values across a specified set of variables, respectively. Here we create such a variable that counts the number of missing values for each observation across a set of 5 variables.
egen nummiss = rowmiss(lungcapacity test1 test2 familyhx smokinghx) . li lungcapacity test1 test2 familyhx smokinghx nummiss in 1/10 +----------------------------------------------------------------+ | lungca~y test1 test2 familyhx smokin~x nummiss | |----------------------------------------------------------------| 1. | .886491 3.890698 1.349324 no never 0 | 2. | .326444 2.627481 .8034876 no former 0 | 3. | . 1.418219 2.194694 no never 1 | 4. | .8010882 3.698981 8.086417 no former 0 | 5. | .9088714 8.030203 7.226128 yes former 0 | |----------------------------------------------------------------| 6. | .8484109 . 2.125863 no former 1 | 7. | .8908539 2.29322 8.607957 no current 0 | 8. | .8193308 5.846408 8.962217 no never 0 | 9. | .8070509 4.806459 8.02538 no never 0 | 10. | .8111662 1.913784 3.507287 2 | +----------------------------------------------------------------+
Now we divide a continuous variable into categories with egen
and cut
. We can either specify cutpoints that define the categories with the at
option, or the desired number of equally-sized groups with group
option.
Below we cut the continuous bmi variable into 8 suggested intervals, where the numbers specified in the at
list corresponds to the left-hand limits of each interval, except for the last number which should be a maximum. The label
option applies value labels to the categories, which we will see with tabulate
.
egen bmi_cat = cut(bmi), at(0, 15, 16, 18.5, 25, 30, 35, 40, 500) label tab bmi_cat, miss bmi_cat | Freq. Percent Cum. ------------+----------------------------------- 16- | 1 0.45 0.45 18.5- | 77 34.38 34.82 25- | 72 32.14 66.96 30- | 38 16.96 83.93 35- | 26 11.61 95.54 40- | 10 4.46 100.00 ------------+----------------------------------- Total | 224 100.00
We see 6 of 8 of the intervals represented, implying no patients were in the 2 lowest categories.
Note that the first group in the categorical variable is represented by 0, the second 1, the third 2, etc. We can see this with tab
with the value labels removed with the nolabel
option.
tab bmi_cat, nolabel bmi_cat | Freq. Percent Cum. ------------+----------------------------------- 2 | 1 0.45 0.45 3 | 77 34.38 34.82 4 | 72 32.14 66.96 5 | 38 16.96 83.93 6 | 26 11.61 95.54 7 | 10 4.46 100.00 ------------+----------------------------------- Total | 224 100.00
Another common task for egen
is to create a variable that crosses the groups of multiple input variables. For example, we can cross a 2-group sex variable with a 2-group remission variable to create a 4-group variable (male remssion, male no remission, female remission, female no remission).
To demonstrate, we will use the egen
function group
to create a variable that crosses the groups in familyhx and smokinhx. We apply value labels for the groups with the label
option.
egen family_smoking = group(familyhx smokinghx), label (16 missing values generated) tab family_smoking, miss group(famil | yhx | smokinghx) | Freq. Percent Cum. ------------+----------------------------------- no current | 30 13.39 13.39 no former | 30 13.39 26.79 no never | 111 49.55 76.34 yes current | 11 4.91 81.25 yes former | 12 5.36 86.61 yes never | 14 6.25 92.86 . | 16 7.14 100.00 ------------+----------------------------------- Total | 224 100.00
The first word in each label is the familyhx group, and the second word is the smokinghx group.
Change variable coding with recode
Variables are often not coded the way we want, often with too many categories or with values out of order. With recode
, we can handle all recodes for a variable in a single step.
Here we recode
the original 8 bmi categories of bmi_cat into 3 categories, by collapsing categories (0,1,2) into category 3 and categories (6,7) into 5. We tab
before and after to show the changes:
tab bmi_cat, nolabel bmi_cat | Freq. Percent Cum. ------------+----------------------------------- 2 | 1 0.45 0.45 3 | 77 34.38 34.82 4 | 72 32.14 66.96 5 | 38 16.96 83.93 6 | 26 11.61 95.54 7 | 10 4.46 100.00 ------------+----------------------------------- Total | 224 100.00 recode bmi_cat (0 1 2=3) (6 7 = 5) (bmi_cat: 37 changes made) tab bmi_cat, nolabel bmi_cat | Freq. Percent Cum. ------------+----------------------------------- 3 | 78 34.82 34.82 4 | 72 32.14 66.96 5 | 74 33.04 100.00 ------------+----------------------------------- Total | 224 100.00
Renaming variables and groups of variables with rename
Change variables names with rename
, first specifying the old variable name and then the new name. Let’s rename test1 and test2 to what they are measuring, the biomarkers il6 and crp:
rename test1 il6 rename test2 crp
For situations where many variables need to be renamed in systematic ways (e.g applying a prefix to several variable name), Stata allows the use of many kinds of “wildcards”, symbols that can represent many characters, with rename
. Consult help rename group
for a detailed guide of how to use these wildcards.
In the code below, we rename variables experience and school to mark them as doctor variables, by applying the prefix “d_” with the =
operator. This will result in variable names d_experience and d_school.
Then we convert the “d_” prefix to “doc_” for any variable name with the “d_” prefix, resulting in variables doc_experience and doc_school. The *
wildcard stands for one or more of any character (besides space characters). We examine renamed variables with describe
after each change.
rename (experience school) d_= desc d_* storage display value variable name type format label variable label ------------------------------------------------------------------------------------- d_experience byte %8.0g Experience d_school str7 %9s School rename d_* doc_* desc doc_* storage display value variable name type format label variable label ------------------------------------------------------------------------------------- doc_experience byte %8.0g Experience doc_school str7 %9s School
Managing string variables
Some variables cannot be stored efficiently numerically, so must be represented as string (character) variables. Stata has many functions to manipulate string variables. Keep in mind that, generally, string variables cannot be used in estimation commands as outcomes or predictors.
Values for string variables are always enclosed in quotes, ""
A common error when working with string variables is omitting the quotation marks when assigning a value or checking equality to a value. Remember that this applies to missing values as well, which are represented by empty quotes, ""
.
Below, we use tab
to check for possibly erroneous values in the variable sex. We do find an erroneous value, “12.2”, so then use replace
to replace this value with a missing value. We then tab
sex again to check the results:
tab sex, miss Sex | Freq. Percent Cum. ------------+----------------------------------- 12.2 | 1 0.45 0.45 female | 133 59.38 59.82 male | 90 40.18 100.00 ------------+----------------------------------- Total | 224 100.00 replace sex = "" if sex == "12.2" (1 real change made) tab sex, miss Sex | Freq. Percent Cum. ------------+----------------------------------- | 1 0.45 0.45 female | 133 59.38 59.82 male | 90 40.18 100.00 ------------+----------------------------------- Total | 224 100.00
String functions
Stata has many functions for transforming strings. Consult help string functions
to see a full list of functions with syntax guides. Here is a list of some of the useful functions, the first two of which we will demonstrate:
- strtrim: removing leading and trailing spaces
- substr: extract substring
- strlen: length of strings
- strpos: the position in a string where a substring is found
- strofreal: convert number to string with a specified format
Values on string variables are often padded with unnecessary blank spaces often at the beginning of the string (leading) or at the end (trailing). The function strtrim
removes both leading and trailing spaces. Let’s trim the variable hospital, which suffers from both leading and trailing spaces:
tab hospital hospital | Freq. Percent Cum. ---------------+----------------------------------- Cedars-Sinai | 1 0.45 0.45 UCLA | 2 0.89 1.34 UCSF | 2 0.89 2.23 Cedars-Sinai | 103 45.98 48.21 Cedars-Sinai | 3 1.34 49.55 UCLA | 57 25.45 75.00 UCLA | 1 0.45 75.45 UCSF | 54 24.11 99.55 UCSF | 1 0.45 100.00 ---------------+----------------------------------- Total | 224 100.00 replace hospital = strtrim(hospital) (10 real changes made) tab hospital hospital | Freq. Percent Cum. ---------------+----------------------------------- Cedars-Sinai | 107 47.77 47.77 UCLA | 60 26.79 74.55 UCSF | 57 25.45 100.00 ---------------+----------------------------------- Total | 224 100.00
The hospital variable has been cleaned of the unnecessary spaces.
Another common transformation for string variables is to extract a portion of the string, or a “substring”. For example, we might extract the area code from a phone number, or the two-letter state code from an address.
In the current dataset, the doctor id variable, docid, is composed of the hospital code (1,2,3), then a hyphen, and then the individual doctor’s code (1- to 3-digit code).
We use the Stata function substr
in a gen
command to extract the individual doctor codes. In the substr
call, we specify that we would like to extract a substring from the variable docid starting at position 3 and having a length up to 3:
tab docid docid | Freq. Percent Cum. ------------+----------------------------------- 1-1 | 7 3.13 3.13 1-100 | 9 4.02 7.14 1-11 | 7 3.13 10.27 1-21 | 1 0.45 10.71 1-22 | 3 1.34 12.05 . . . [some output omitted] gen doc_id = substr(docid, 3, 3) tab doc_id doc_id | Freq. Percent Cum. ------------+----------------------------------- 1 | 7 3.13 3.13 100 | 9 4.02 7.14 11 | 7 3.13 10.27 113 | 3 1.34 11.61 121 | 6 2.68 14.29 . . . [some output omitted]
Concatenate strings with +
Strings can be joined or concatenated together in a generate
statement with a +
. Here, we recreate the old docid variable, formed by concatenating hospital id, then “-“, then doctor individual’s id. Because hospid is numeric, we must use the string
function to convert it to a string before concatenating:
gen newdocid = string(hospid) + "-" + doc_id li newdocid hospid doc_id in 1/10 +----------------------------+ | newdocid hospid doc_id | |----------------------------| 1. | 1-1 1 1 | 2. | 1-1 1 1 | 3. | 1-1 1 1 | 4. | 1-1 1 1 | 5. | 1-1 1 1 | |----------------------------| 6. | 1-1 1 1 | 7. | 1-1 1 1 | 8. | 1-100 1 100 | 9. | 1-100 1 100 | 10. | 1-100 1 100 | +----------------------------+
Regular expressions
Stata provides supports regular expression matching with regexm
and subexpression extraction (through the use of capture groups), regexs
. We will not go into the details of regular expression syntax in this seminar.
Here, we show another method to extract the individual doctor’s id from the original docid variable uisng regular expressions.
In the gen
statement, regexm
is checking for matches to a number “[0-9]”, followed by a “-“, followed by 1 or more numbers “([0-9]+)”. The parentheses in “([0-9]+)” serve to delineate capture group 1, which “captures” the one or more numbers following “-“. We then set the variable regexdocid equal to the contents of capture group 1, which are accessible with regexs(1)
.
gen regxdocid = regexs(1) if regexm(docid, "[0-9]-([0-9]+)") list docid regxdocid in 1/10 +------------------+ | docid regxdo~d | |------------------| 1. | 1-1 1 | 2. | 1-1 1 | 3. | 1-1 1 | 4. | 1-1 1 | 5. | 1-1 1 | |------------------| 6. | 1-1 1 | 7. | 1-1 1 | 8. | 1-100 100 | 9. | 1-100 100 | 10. | 1-100 100 | +------------------+
Encoding strings into numeric variables
Categorical variables are often represented by string variables, with the names of the categories as the values. However, in Stata, variables used as either independent or dependent variables typically cannot be strings and should be made numeric.
The encode
command converts string variables to numeric, by assigning a numeric value to each distinct string value (category) and then applying value labels of the original string values. The numeric version of the variable can either be a new variable, using option gen()
, or a replacement of the string variable, using option replace
. The ordering of the categories is alphabetical.
Here we use encode
with the gen()
option to create a numeric version of the the string variable cancerstage. The new, numeric variable will be called stage. A two-way tab of the two variables shows the correspondence.
encode cancerstage, gen(stage) tab cancerstage stage CancerStag | CancerStage e | I II III IV | Total -----------+--------------------------------------------+---------- I | 60 0 0 0 | 60 II | 0 99 0 0 | 99 III | 0 0 38 0 | 38 IV | 0 0 0 27 | 27 -----------+--------------------------------------------+---------- Total | 60 99 38 27 | 224
The table above looks a bit odd, since it seems that we have tabulated the same variable, CancerStage, twice. However, the command executed without error — so what happened? (The true cancerstage variable appears along the rows.)
When we view a description of the new numeric variable stage, we see that Stata has applied a value label called stage, which makes the values along the columns appear as “I”, “II”, “III”, and “IV”, while the variable label CancerStage makes the variable name appear as “CancerStage” atop the columns.
desc stage storage display value variable name type format label variable label ------------------------------------------------------------------------------------- stage long %8.0g stage CancerStage
The same twoway tab
with the nolabel
option reveals the numbering behind the value labels:
tab cancerstage stage, nolab CancerStag | CancerStage e | 1 2 3 4 | Total -----------+--------------------------------------------+---------- I | 60 0 0 0 | 60 II | 0 99 0 0 | 99 III | 0 0 38 0 | 38 IV | 0 0 0 27 | 27 -----------+--------------------------------------------+---------- Total | 60 99 38 27 | 224
Note: When viewing variables in the data browser (command browse
), numeric variables without value labels appear black, numeric variables with value labels appear blue, and string variables appear red.
Convert number variables stored as strings to numeric with destring
Sometimes, variables with number values are loaded as strings into Stata. This can happen can when one of the numbers was mistakenly entered as a letter, or a non-numeric code is used for missing.
We do not want to use encode
here, because the variable values are truly numbers rather than categories — we would not want the string “1.25” converted to a category.
Instead, we can use destring
, which directly translates numbers stored as strings into the numbers themselves.
The wbc (white blood cell) variable should be a numeric variable, but contains the values “not assessed”, which caused wbc to be read in as string into Stata. We can see the “not assessed” values at the bottom of the tab
output for wbc.
tab wbc [some output omitted] . . . 7999.091309 | 1 0.45 96.87 8340.71582 | 1 0.45 97.32 8415.605469 | 1 0.45 97.77 8567.246094 | 1 0.45 98.21 8697.995117 | 1 0.45 98.66 not assessed | 3 1.34 100.00 -------------+----------------------------------- Total | 224 100.00
Stata would not know how to convert the value “not assessed” to a number, so we first convert those values to string missing, “”. Then we can use destring
with replace
to replace the string version of wbc with a numeric version. Running describe
on wbc shows that it is now stored as double, a numeric type.
replace wbc = "" if wbc == "not assessed" (3 real changes made) destring wbc, replace wbc: all characters numeric; replaced as double (3 missing values generated) describe wbc storage display value variable name type format label variable label ------------------------------------------------------------------------------------- wbc double %10.0g WBC
Labels
Labels are used to provide additional information about variables, either the meaning of the variable itself (variable label) or the meaning of the values of the variable (value labels). Stata provides a suite of commands to handle labels — consult help label
to see a listing and descriptions.
Variable labels
Variable labels expand on the meaning of a variable beyond its name. These labels are sometimes used in Stata output and graphs in place of variable names so that anyone can understand the meaning of variables used.
We use label variable
to label variables.
Here we label the variable “il6” with a helpful description, and observe how the label is used in a histogram
.
label var il6 "Concentration of interleukin 6" hist il6
Value labels
Value labels give text descriptions to the numerical values of a variable. We have already seen how the encode
command automatically produces and applies value labels to the numeric variable converted from a string variable. Several Stata commands are used to process value labels.
We can inspect existing value labels, with label list
, which lists the names and contents of existing value labels. Here we see we currently have 3 sets of value labels, named stage, family_smoking, and bmi_cat, as well as their associated text labels and corresponding number values.
label list stage: 1 I 2 II 3 III 4 IV family_smoking: 1 no current 2 no former 3 no never 4 yes current 5 yes former 6 yes never bmi_cat: 0 0- 1 15- 2 16- 3 18.5- 4 25- 5 30- 6 35- 7 40- 8 500-
Now we would like to apply value labels to our special missing values .a, which was supposed to code for “refused to answer” and .b, which was supposed to code for “data error”.
To create a new set of value labels, specify label define
, the name of the set of value labels, then each number code followed by its corresponding text label in quotes. Then, to apply the labels to variables, use label values
.
Here we label define
a new value label set, “other_miss”, to label .a and .b. Then we use label values
to apply the label set “other_miss” to the variables lungcapacity, co2, and age. A tab
of lungcapacity with the miss
option shows the newly labeled .a values.
label define other_miss .a "refused" .b "error" label values lungcapacity co2 age other_miss tab lungcapacity, miss [some output omitted] . . . .9924814 | 1 0.45 79.91 .9940955 | 1 0.45 80.36 .9982018 | 1 0.45 80.80 . | 33 14.73 95.54 refused | 10 4.46 100.00 ------------+----------------------------------- Total | 224 100.00
Processing data by group
Many datasets consist of grouped observations, whether because observations were sampled in clusters (e.g. students sampled from schools) or subjects were repeatedly measured.
With grouped data, we often want to generate variables and process data by groups.
For processing by groups, we will use the Stata prefix by
, which precedes other Stata commands so that they will be executed by groups specified by the variable following the prefix by
.
In general the syntax will be: by groupvar: stata_cmd
, where groupvar is the grouping variable by which the data are to be processed, and stata_cmd is the Stata command to run on each group of data.
Data must be sorted before processing by group
Stata requires that data be sorted by the grouping variable before processing by
that variable.
This can be accomplished several ways:
- use the
sort
command on the grouping variable before running any commands withby
- use the
sort
option forby
, like so:by groupvar, sort: stata_cmd
- use the prefix
bysort
, which is the actually just theby
command withsort
option activated:bysort groupvar: stata_cmd
Multiple variables can follow the by
prefix, and data will be processed by groups formed by crossing all variables specified. If the sort
option is also specified (or bysort
is used), then Stata will also sort by all of those variables, in the order they are listed.
When sorting by several variables, the first option, using the sort
command independently before using by
, is perhaps the “safe” option. This is because in some cases we want to sort by more variables than we want to use to group the data. For example, if we have repeated measures data, we might want to sort by subject and then date, but then only process by subject. If we were to specify: bysort subject date: stata_cmd
, Stata will sort by subject and then date, but will also process by subject and date, when we only want it to process by subject.
Below, we first sort by docid, the doctor identification variable, which will then serve as the grouping variable for by-processing.
sort docid
Generating statistics by group
When we introduced the egen
command earlier above, we mentioned that some of the egen
functions can be used with by
to generate statistics by group. These functions include mean
, max
, and std
, which generate the mean, max, and standard deviation of variables, respectively, for each group specified by the grouping variable following by
.
Here we create variables representing the mean, max, and standard deviation of the variable age for groups defined by doctor (docid). We specify the prefix by
, the grouping variable docid, and then egen
and the statistical function. A look at the first 10 observations shows that these statistics vary with docid:
by docid: egen mean_age = mean(age) by docid: egen max_age = max(age) by docid: egen sd_age = sd(age) li docid age mean_age max_age sd_age in 1/10 +---------------------------------------------------+ | docid age mean_age max_age sd_age | |---------------------------------------------------| 1. | 1-1 46.80042 52.45741 64.96824 7.297493 | 2. | 1-1 53.91714 52.45741 64.96824 7.297493 | 3. | 1-1 51.92936 52.45741 64.96824 7.297493 | 4. | 1-1 64.96824 52.45741 64.96824 7.297493 | 5. | 1-1 54.38936 52.45741 64.96824 7.297493 | |---------------------------------------------------| 6. | 1-1 41.36804 52.45741 64.96824 7.297493 | 7. | 1-1 53.82926 52.45741 64.96824 7.297493 | 8. | 1-100 58.17057 52.23144 58.17057 5.879859 | 9. | 1-100 49.87008 52.23144 58.17057 5.879859 | 10. | 1-100 39.61641 52.23144 58.17057 5.879859 | +---------------------------------------------------+
System variables _n
and _N
When processing data, it can be useful to know the number of the current observation or the number of the last observation. Stata reserves the system variables _n
to represent the number of the current observation, and the system variable _N
to represent the number of the last observation. These system variables can be useful, for example, to check if an observation is first (_n==1
) or last (_n==_N
).
Importantly, these system variables adapt to group processing, such that when used in a statement prefixed with by
, _n
means the current observation in a group, where as _N
means the last observation in a group. This information can then be used to create count and lag variables.
Creating variables that count within group
A common group-level operation is to count how many members of a group belong to a certain category (e.g. females, people with diabetes, first-graders). When the category is represented by a 0/1 indicator variable, counting the number of people with category=1 is equivalent to summing the variable.
Recall that the sum
function, when used with generate
, produces a variable that is the running sum of the variable specified inside sum
. The last value of a running sum is the final sum.
When we precede such a generate
and sum
with a by
-group specification to get running sums by group. We can then replace each value in a group with the last value in the group, which will be the final sum for that group.
Let’s count the number of female patients for each doctor.
First we create a variable female, that is 1 if sex==”female”, 0 if sex==”male”, and missing otherwise. Then we create a variable, num_female, that is the running sum of the number of females per group (defined by docid). Finally, we set num_females to be the last value in each group (num_female[_N]
, observation number can be specified in brackets), the final sum.
gen female = sex == "female" replace female = . if sex == "" (1 real change made, 1 to missing) by docid: gen num_female = sum(female) li docid sex female num_female in 1/10 +------------------------------------+ | docid sex female num_fe~e | |------------------------------------| 1. | 1-1 female 1 1 | 2. | 1-1 male 0 1 | 3. | 1-1 male 0 1 | 4. | 1-1 female 1 2 | 5. | 1-1 male 0 2 | |------------------------------------| 6. | 1-1 male 0 2 | 7. | 1-1 male 0 2 | 8. | 1-100 male 0 0 | 9. | 1-100 female 1 1 | 10. | 1-100 female 1 2 | +------------------------------------+ by docid: replace num_female = num_female[_N] li docid sex female num_female in 1/10 +------------------------------------+ | docid sex female num_fe~e | |------------------------------------| 1. | 1-1 male 0 2 | 2. | 1-1 female 1 2 | 3. | 1-1 male 0 2 | 4. | 1-1 male 0 2 | 5. | 1-1 male 0 2 | |------------------------------------| 6. | 1-1 male 0 2 | 7. | 1-1 female 1 2 | 8. | 1-100 female 1 5 | 9. | 1-100 male 0 5 | 10. | 1-100 female 1 5 | +------------------------------------+
Creating lagged variables
Commonly we want to create lagged versions of existing variables, in which values from earlier points in time are placed in later rows of data. Such variables permit modeling effects of variables from the past.
Although the dataset here is not a repeated measures datasets, where lagged variables make sense, we will create a lagged variable regardless to demonstrate how.
Values on a variable from the previous observation can be specified with variable[_n-1]
.
Here we create a lagged version of the date variable, dis_date within each doctor group. The lag_date variable is assigned to the value of dis_date from the previous observation (within docid group). The format
statement causes the lag_date variable to be displayed as a date rather than a number. We also create a variable, time_lag, that calculates the amount of time between dates.
sort docid dis_date by docid: gen lag_date = dis_date[_n-1] (40 missing values generated) format lag_date %td gen time_lag = dis_date - lag_date (40 missing values generated) li docid dis_date lag_date time_lag in 1/10 +------------------------------------------+ | docid dis_date lag_date time_lag | |------------------------------------------| 1. | 1-1 06mar2009 . . | 2. | 1-1 01jul2009 06mar2009 117 | 3. | 1-1 06sep2009 01jul2009 67 | 4. | 1-1 15apr2010 06sep2009 221 | 5. | 1-1 25jun2010 15apr2010 71 | |------------------------------------------| 6. | 1-1 04sep2010 25jun2010 71 | 7. | 1-1 07jan2011 04sep2010 125 | 8. | 1-100 28oct2009 . . | 9. | 1-100 12feb2010 28oct2009 107 | 10. | 1-100 17mar2010 12feb2010 33 | +------------------------------------------+
Notice that the first value for lag_date (and thus time_lag) is missing in each group, which shows that Stata is indeed lagging within docid.
Macros are temporary variables
Macros allow the user to store string information in temporary variables separate from the dataset. Macros are cleared from memory once the Stata session ends. Stata provides 2 types of macros, local
and global
, which differ in their scope (where they are available, see below).
To use macros, we assign a string to the macro name. Then, wherever the macro name is used elsewhere (in the do-file or program) with special substitution operators, Stata will directly substitute the string stored assigned to the macro.
See help macro
for extensive documentation on macros.
Global macros
Like their name suggests, global macros can be used anywhere. To substitute the contents of the global macro, precede the macro name with $
.
Here we store the string Hello world! in the global macro greeting. We then use display
to print the string to screen.
In place of $greeting
below, Stata substitutes Hello world! (without quotes) — resulting in display "Hello world!"
.
global greeting Hello world! display "$greeting" Hello world!
Omitting the quotes in the display
statement results in the substituted expression display Hello world!
. Stata then interprets Hello as a variable name, which does not exist and causes an error.
display $greeting Hello not found r(111);
If we could only store literal strings like “Hello World!” in macros, they would not be very useful. However, the error message from the last example above reveals that macros can be used to store variable names.
One common usage of macros is to group variables together that are alike in some way. Here we group together a set of demographic variables whose names we store in a global macro. We then access the contents of the macro for the summary
command.
global demographics age married female summ $demographics Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- age | 223 51.58981 6.446488 34.19229 70.26617 married | 224 .6071429 .4894793 0 1 female | 223 .5964126 .4917203 0 1
Local macros
Local macros have much more limited scope than globals. If declared inside a do-file or macro, the local macro is deleted after the do-file code has finished executed. For example, if declared in a do-file, the contents of the local macro cannot be accessed through typing commands in the Command window after the code has been executed.
The contents of local macros are accessed with the two single quotes `
and '
. Here we group together a set of outcome variables and store them in a local
macro, which we then use for summarize
again.
local outcomes tumorsize pain lungcapacity summ `outcomes' Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- tumorsize | 224 69.78098 11.76117 38.67265 109.0096 pain | 224 5.379464 1.568675 1 9 lungcapacity | 181 .7774426 .1585167 .1677225 .9982018
If the macro declaration code (local outcomes tumorsize pain lungcapacity
) was run from a do-file, the macro would be deleted at this point.
Macros and expressions
Macros can be assigned expressions rather than strings with the =
operator. The expression following the =
is first evaluated and the result is then stored in the macro.
Here we show how the expression 2+3
is evaluated and the result 5 is stored in the macro if =
is used. Without the =
, the string “2+3” (without quotes) is stored in the macro.
local add_us = 2 + 3 display "sum = `add_us'" sum = 5 local print_us 2 + 3 display "sum = `print_us'" sum = 2 + 3
Looping
Loops are a programmer’s tool to perform some task repeatedly over a set of items.
The “for” loop is found in most programming languages, and is distinguished from other types of loops in that the programmer explicitly defines the items over which the for loop iterates (other types of loops run until some condition is met).
For loops come in 2 flavors in Stata: the forvalues
loop, which iterates over set of numbers, and the foreach
loop, which iterates over a general set of items (e.g. variable names).
Use forvalues
to loop over numbers
The general syntax of a forvalues
loop is this:
forvalues lname = range { commands }
- lname: the name of the loop control variable, a local macro, that takes on the values in range
- range: the range of values to loop over (e.g.
1/5
for 1 through 5,2(2)10
for 2 through 10 in increments of 2 - commands: Stata commands to be run each time the loop iterates
- the opening
{
must be on the first line - the closing
}
must be by itself on the last line
Here is a simple example that displays “Hello world!” 5 times. The loop control variable i, takes on the integer values 1 through 5, and for each iteration displays the message:
forvalues i = 1/5 { display "Hello world!" } Hello world! Hello world! Hello world! Hello world! Hello world!
As we mentioned in the syntax description above, the loop control variable is a local macro, so its contents can be accessed with the usual local substitution operators `
and '
. Here we display the contents of the loop control variable i for each iteration.
forvalues i = 1/5 { display "i = `i'" } i = 1 i = 2 i = 3 i = 4 i = 5
One common usage of forvalues
loops is to loop over variables that have systematically numbered names (x1, x2, x3, etc.).
Below, we create indicator (dummy) variables that code whether observations are over the ages 50, 60, and 70.
The code gen age`i'
creates variables named age50, age60, and age70, as the values 50, 60, and 70 are substituted for `i'
. We remember that the age variable has a data error that has been set to missing, so we directly copy any missing values to the age indicators if the age variables is missing (checked with the missing
function). We then apply the other_miss value label we created earlier to label the .b values “error”. Finally, we tabulate
each newly created variable:
forvalues i = 50(10)70 { gen age`i' = age > `i' replace age`i' = age if missing(age) label values age`i' other_miss tab age`i', miss } (1 real change made, 1 to missing) age50 | Freq. Percent Cum. ------------+----------------------------------- 0 | 82 36.61 36.61 1 | 141 62.95 99.55 error | 1 0.45 100.00 ------------+----------------------------------- Total | 224 100.00 (1 real change made, 1 to missing) age60 | Freq. Percent Cum. ------------+----------------------------------- 0 | 205 91.52 91.52 1 | 18 8.04 99.55 error | 1 0.45 100.00 ------------+----------------------------------- Total | 224 100.00 (1 real change made, 1 to missing) age70 | Freq. Percent Cum. ------------+----------------------------------- 0 | 222 99.11 99.11 1 | 1 0.45 99.55 error | 1 0.45 100.00 ------------+----------------------------------- Total | 224 100.00
Use foreach
to loop over lists of items
The foreach
loop expands on the types of lists over which the loop can iterate, including lists of strings, variable names, items in a global macro, etc. The syntax is similar, but slightly different from forvalues
:
foreach lname (in|of listtype) list { commands }
- lname: the name of the loop control variable, a local macro, that takes on the values in list
in
: the keyword in is used if a generic list of items is specifiedof listtype
: if using one of Stata’s specific types of lists, use the keyword of and one of the following listtypes (more are available):varlist
: a list of variable nameslocal
: the contents of a local macroglobal
: the contents of a globalmacro
- list: the list of values to loop over (e.g.
1/5
for 1 through 5,2(2)10
for 2 through 10 in increments of 2 - commands: Stata commands to be run each time the loop iterates
- the opening
{
must be on the first line - the closing
}
must be by itself on the last line
First, we will use foreach
to loop over variables.
We will use the keyword of
and the listtype varlist
to let Stata know that the text that follows of
is a list of variables. Here we create standardized versions of several variables, each named “std_” concatenated with the original variable name. We then summarize
all of the newly created standardized variables (notice the use of the wildcard symbol *
to access all variables starting with “std_”):
foreach var of varlist wbc rbc il6 crp { egen std_`var' = std(`var') } summ std_* Variable | Obs Mean Std. Dev. Min Max -------------+--------------------------------------------------------- std_wbc | 221 -1.39e-09 1 -2.515318 2.575009 std_rbc | 224 -3.00e-09 1 -3.065658 2.504939 std_il6 | 207 2.25e-09 1 -1.307275 6.338573 std_crp | 207 -1.26e-09 1 -1.433928 4.325634
Next we show that we can loop over the contents of a macro.
Previously we grouped together several variables in a global named demographics. Here we get means by Cancerstage for each of the variables stored in global demographics:
foreach var of global demographics { tabstat `var', by(stage) } Summary for variables: age by categories of: stage (CancerStage) stage | mean -------+---------- I | 47.73823 II | 51.41961 III | 53.67933 IV | 57.90951 -------+---------- Total | 51.58981 ------------------ Summary for variables: married by categories of: stage (CancerStage) stage | mean -------+---------- I | .6666667 II | .6262626 III | .6315789 IV | .3703704 -------+---------- Total | .6071429 ------------------ Summary for variables: female by categories of: stage (CancerStage) stage | mean -------+---------- I | .6 II | .5714286 III | .5526316 IV | .7407407 -------+---------- Total | .5964126 ------------------
Much of the power of foreach
lies in its ability to loop over a generic list of strings.
Imagine we knew that the following doctor id codes, “1-11”, “1-21”, and “1-57” were entered incorrectly, such that a final zero at the end was truncated. We want these ids to be “1-110”, “1-210”, and “1-570”. Here we use a generic list of strings with the keyword in
to correct these doctor id codes in a foreach
loop:
gen fixed_docid = docid foreach id in "1-11" "1-21" "1-57" { replace fixed_docid = docid + "0" if docid == "`id'" list fixed_docid docid if docid == "`id'" } (7 real changes made) +------------------+ | fixed_~d docid | |------------------| 17. | 1-110 1-11 | 18. | 1-110 1-11 | 19. | 1-110 1-11 | 20. | 1-110 1-11 | 21. | 1-110 1-11 | |------------------| 22. | 1-110 1-11 | 23. | 1-110 1-11 | +------------------+ (1 real change made) +------------------+ | fixed_~d docid | |------------------| 24. | 1-210 1-21 | +------------------+ (3 real changes made) +------------------+ | fixed_~d docid | |------------------| 41. | 1-570 1-57 | 42. | 1-570 1-57 | 43. | 1-570 1-57 | +------------------+
The new doctor ids look good!
References and Further Learning
Mitchell, Michael N. 2010. Data Management Using Stata: A Practical Handbook. Stata Press.
Stata YouTube channel – videos for both data management and data analysis made by Stata, and a list of links to their videos on their home site
Data management FAQ on Stata home site
UCLA IDRE Stata pages – our own pages on data management and data analysis