## 1.0 Stata commands in this unit

| Command | Description |
|---|---|
| codebook | Show codebook information for file |
| order | Order the variables in a data set |
| label data | Apply a label to a data set |
| label variable | Apply a label to a variable |
| label define | Define value labels for a categorical variable |
| label values | Apply value labels to a variable |
| encode | Create numeric version of a string variable |
| list | Lists the observations |
| rename | Rename a variable |
| recode | Recode the values of a variable |
| notes | Apply notes to the data file |
| generate | Creates a new variable |
| replace | Replaces values for an existing variable |
| egen | Extended generate – has special functions that can be used when creating a new variable |

## 2.0 Demonstration and explanation

In this section we will use Stata commands to label and transform variables, and to create new variables that are functions of existing variables. We first load the data and use **codebook** to look at all variables, including labeling information.

    use http://www.ats.ucla.edu/stat/data/hs0, clear
    codebook

## A) Use **order** to control the ordering of variables as columns in the dataset

While several orderings of the variables would be logical, we will put the id variable first, followed by the variables describing the students (**gender**, **race**, **ses** and **prgtype**). The remaining columns will then contain the test scores.

    order id gender race ses prgtype, first
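One way to confirm the new column order is to print the variable names afterwards. The sketch below uses a tiny made-up dataset (the values are hypothetical stand-ins, not the hs0 file) so that it runs on its own.

```stata
* Toy data standing in for hs0; the values are made up for illustration
clear
input id read gender race
1 57 1 4
2 68 2 4
end

* Move id and the demographic variables to the front
order id gender race, first

* Print the variable names in their current order
describe, simple
```

After the **order** command, **read** should appear last in the list of names.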

## B) Use **label data** to describe the dataset and **label variable** to give variable names more meaning.

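To remember the contents of a dataset, we can apply a label to it with **label data** and attach notes with the **notes** command. Here the note is attached to the **id** variable; typing **notes** by itself lists all notes in the data.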

    label data "High School and Beyond"
    notes id: anonymous id
    notes
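Notes can be attached either to a single variable or to the dataset as a whole. The following is a minimal sketch on a one-row toy dataset; the label and note texts are hypothetical and only illustrate the two forms.

```stata
* One-observation toy dataset, made up for illustration
clear
set obs 1
generate id = 1

label data "Example dataset"

* With no variable name before the colon, the note attaches to the dataset (_dta)
notes: hypothetical note about the dataset as a whole

* With a variable name before the colon, the note attaches to that variable
notes id: hypothetical note about the id variable

* List every note stored in the data
notes
```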

Short variable names are desirable to keep coding clean, but may obscure what the variable represents. Variable labels allow us to provide a longer description of the variable’s contents.

    label variable schtyp "type of school"

## C) Use **label define** to create a set of value labels, and then use **label values** to apply the value labels to a variable

All variables that undergo any kind of numerical calculation must have a numerical representation in Stata. Categorical variables should thus use numbers to define the categories and value labels to give meaning to the numbers. Below we create a set of value labels called “scl”, which gives meaning to the values 1 and 2. We then apply the “scl” label to the **schtyp** variable.

    codebook schtyp /*check for value labels first */
    label define scl 1 public 2 private
    label values schtyp scl
    codebook schtyp

Labels will typically be used in the output. Labels can be suppressed in many commands using the **nolabel** option.

    list schtyp in 1/10
    list schtyp in 1/10, nolabel
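To check what a set of value labels contains and how it changes the output, **label list** and **tabulate** can be used together. This is a small sketch on made-up data; only the label name “scl” and the variable name **schtyp** come from the text above.

```stata
* Toy data, made up for illustration
clear
input schtyp
1
2
1
end

label define scl 1 "public" 2 "private"
label values schtyp scl

* Show the mapping from numbers to labels
label list scl

* Same table twice: first with labels, then with the underlying numbers
tabulate schtyp
tabulate schtyp, nolabel
```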

## D) The **encode** command will convert a string variable to numeric and will label its values automatically

Because the variable **prgtype** is a string variable, it cannot be used in any commands requiring numerical calculations. We use **encode** to create a numeric version of this variable, called **prog**, which will use the string values of **prgtype** as the value labels.

    encode prgtype, gen(prog)
    label variable prog "type of program"
    codebook prog
    list prog in 1/10
    list prog in 1/10, nolabel
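By default **encode** numbers the categories in alphabetical order of the strings and stores the mapping in a value label named after the new variable. The sketch below illustrates this on made-up strings, which are assumptions and may not match the actual values of **prgtype** in the data file.

```stata
* Toy string variable; the category names are hypothetical
clear
input str12 prgtype
"vocational"
"general"
"academic"
"general"
end

encode prgtype, generate(prog)

* The value label is named prog, like the new variable
label list prog

* Compare the original strings with the numeric codes
list prgtype prog, nolabel
```

Here "academic" should receive code 1, "general" code 2 and "vocational" code 3.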

## E) Renaming and recoding variables

The variable **gender** may give us trouble in the future because it is difficult to know what the 1s and 2s mean. Consider giving dummy (indicator) variables the name of the category signified by the value 1. Below we use **rename** to rename **gender** to **female**, because **female**=1 will indicate a female student. We then use **recode** to change the values of the renamed variable from 1,2 to 0,1. Dummy variables should always be valued 0,1 rather than 1,2.

    rename gender female
    recode female (1=0)(2=1)
    label define fm 1 female 0 male
    label values female fm
    codebook female
    list female in 1/10
    list female in 1/10, nolabel
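Recoding a variable in place overwrites the original values, so it can be useful to check the mapping. The sketch below is an alternative to the rename-then-recode approach above, not a replacement for it: it recodes into a new variable on made-up data and cross-tabulates old against new.

```stata
* Toy data, made up for illustration
clear
input id gender
1 1
2 2
3 2
end

* Keep the original and write the 0/1 version to a new variable
recode gender (1=0) (2=1), generate(female)
label define fm 0 "male" 1 "female"
label values female fm

* Every gender==1 should land in male, every gender==2 in female
tabulate gender female

* Stops with an error if any observation was recoded incorrectly
assert female == gender - 1
```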

## F) Creating variables from other variables, **generate** and **egen**

The **generate** command creates new variables from existing variables through simple arithmetic or logical operations. Here we create a variable representing the total of four test score variables: **read**, **write**, **math** and **science**.

    generate total = read + write + math + science
    summarize total

Note that there are five missing values of **total** because there are five
missing values of **science**.
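The reason is that arithmetic involving a missing value returns a missing value. A minimal sketch on made-up scores shows the effect and one way to count the affected observations.

```stata
* Toy scores, made up for illustration; the second row lacks science
clear
input read write math science
57 52 41 47
68 59 53 .
end

generate total = read + write + math + science

* Count observations where the sum could not be computed
count if missing(total)
assert r(N) == 1

list
```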

Let’s now use **recode** to assign letter grades to ranges of the **total** score. For example, the rule **(0/140=0 F)** tells Stata to recode all values of **total** between 0 and 140 to 0, and then give the label “F” to the value 0. The recoded values will be stored in a new variable called **grade**.

    recode total (0/140=0 F) (141/180=1 D) (181/210=2 C) (211/234=3 B) (235/300=4 A), gen(grade)
    label variable grade "combined grades of read, write, math, science"
    codebook grade
    list read write math science total grade in 1/10
    list read write math science total grade in 1/10, nolabel
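A quick way to check that the ranges were applied as intended is to look at the smallest and largest **total** within each grade. The sketch below does this on a handful of made-up totals, one per grade band.

```stata
* Toy totals, made up for illustration
clear
input total
120
150
200
220
250
end

recode total (0/140=0 F) (141/180=1 D) (181/210=2 C) (211/234=3 B) (235/300=4 A), generate(grade)

* Minimum and maximum total within each grade category
tabstat total, by(grade) statistics(min max)
```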

The Stata command **egen**, which stands for extensions to generate, creates variables that require some additional function in order to be generated. Examples of these functions include taking the mean, discretizing a continuous variable, and counting how many of a set of variables have missing values.

In our first example, we will use **egen** to create standard scores for the variable **read**.

    egen zread = std(read)
    summarize zread
    list read zread in 1/10
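A standard score is the value minus the mean, divided by the standard deviation. As a rough check, the sketch below computes the same quantity by hand from the results that **summarize** leaves behind, using made-up scores.

```stata
* Toy reading scores, made up for illustration
clear
input read
57
68
44
63
47
end

egen zread = std(read)

* summarize stores the mean in r(mean) and the standard deviation in r(sd)
summarize read
generate zread_manual = (read - r(mean)) / r(sd)

* The two versions should agree up to rounding error
assert reldif(zread, zread_manual) < 1e-5

list
```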

Next we will use **egen** to create a variable that contains the mean of **read** for each level of **ses**.

    egen readmean = mean(read), by(ses)
    list read ses readmean in 1/10
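The **by()** option can also be written as a **bysort** prefix. The following sketch, on made-up data, shows both forms and checks that they give the same result; **readmean2** is a hypothetical name used only for the comparison.

```stata
* Toy data, made up for illustration
clear
input read ses
57 1
68 2
44 1
63 2
end

* Group mean using the by() option
egen readmean = mean(read), by(ses)

* Same computation using the bysort prefix (this also sorts the data by ses)
bysort ses: egen readmean2 = mean(read)

assert readmean == readmean2

* Every observation in a group carries that group's mean
list, sepby(ses)
```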

Finally, we will compute the average of several variables for each observation. Please note that there will be a mean for observation 9 even though it has a missing value for **science**, because **rowmean()** averages only the nonmissing values.

    egen row_mean = rowmean(read write math science)
    list read write math science row_mean in 1/10
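To see how this differs from averaging with **generate**, the sketch below computes both on made-up scores; **naive_mean** is a hypothetical name for the hand-computed version.

```stata
* Toy scores, made up for illustration; the second row lacks science
clear
input read write math science
57 52 41 47
60 50 40 .
end

* rowmean() averages whatever values are present in each row
egen row_mean = rowmean(read write math science)

* Plain arithmetic returns missing as soon as one value is missing
generate naive_mean = (read + write + math + science) / 4

* Row 2: the mean of 60, 50 and 40 is 50, while the naive version is missing
assert row_mean == 50 in 2
assert missing(naive_mean) in 2

list
```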

See **help egen** for a full list of the available functions.

Finally, we will save our data and continue on to the next unit.

    save hs1

## 3.0 For more information

**Data Management Using Stata: A Practical Handbook**, Chapters 4-5

**Statistics with Stata 12**, Chapter 2

**Gentle Introduction to Stata, Revised Third Edition**, Chapter 3

**Data Analysis Using Stata, Third Edition**, Chapter 5

**An Introduction to Stata for Health Researchers, Third Edition**, Chapters 7-8

**Stata Learning Modules**: Labeling data; Creating and recoding variables

**Stata Frequently Asked Questions**: How can I quickly recode continuous variables into groups?; How do I standardize variables in Stata?