Stata FAQ: How do I specify types of missing values?
When a data file has missing values, we sometimes want to distinguish between different types of missing values. For example, a value can be missing because of non-response or because of invalid data entry. The examples here show how to record that distinction.
Example 1: Specifying types of missing values in a data set
In Stata, we can use the extended missing values .a through .z, in addition to the system missing value ".", to indicate the type of missing value.
In the example below, the variable female uses the value -999 to indicate that the subject refused to answer the question and -99 to indicate a data entry error. The variable ses is coded the same way. The first code fragment hard-codes the changes with the replace command; the second loops over both variables with the foreach command.
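As a quick illustration of how these codes behave, the sketch below uses only the display command and the missing() function. It relies on Stata's documented ordering, in which every number sorts below ".", and "." sorts below .a, .b, and so on up to .z.

```stata
* Each expression evaluates to 1 (true)
display 1000000 < .
display . < .a
display .a < .z
display missing(.b)
```

Because all missing values are larger than any number, a condition such as "if female >= 1" also selects observations where female is missing unless they are excluded explicitly.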
    input score female ses
    56 1 1
    62 1 2
    73 0 3
    67 -999 1
    57 0 1
    56 -99 2
    57 1 -999
    end
    save test1, replace

    *using the replace command
    replace female=.a if female == -999
    replace female=.b if female == -99
    replace ses=.a if ses == -999
    list, clean noobs

        score   female   ses
           56        1     1
           62        1     2
           73        0     3
           67       .a     1
           57        0     1
           56       .b     2
           57        1    .a

    *using the foreach command
    use test1, clear
    foreach var of varlist female ses {
        replace `var' = .a if `var' == -999
        replace `var' = .b if `var' == -99
    }
    list, clean noobs

        score   female   ses
           56        1     1
           62        1     2
           73        0     3
           67       .a     1
           57        0     1
           56       .b     2
           57        1    .a

Note that when Stata lists an extended missing value, it prints the dot followed by the letter.
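As an alternative to writing the replace statements by hand, Stata's mvdecode command can map several numeric codes to missing values in one step. The sketch below assumes the file test1.dta saved in Example 1, which still contains the original -999 and -99 codes, and then checks the result with the missing() function.

```stata
* Sketch: recode -999 and -99 with mvdecode instead of replace
use test1, clear
mvdecode female ses, mv(-999=.a \ -99=.b)

* missing() is true for every missing value, whatever its letter
count if missing(female)

* Count refusals only
count if female == .a

* Show .a and .b as separate rows in the frequency table
tabulate female, missing
```

The companion command mvencode performs the reverse mapping, which can be useful when exporting the data to software that does not understand extended missing values.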
Example 2: Specifying types of missing values in variables with mixed numeric and character values
We have a tiny raw data file called tiny.txt, shown below, with three variables: score, female and ses. All three are meant to be numeric, except that letters are used to mark missing values. Here, “a” means that the subject refused to give the information and “b” means a data entry error.
    56 1 1
    62 1 2
    73 0 3
    67 a 1
    57 0 1
    56 1 2
    57 1 b
We want to read the variables as numeric, and we also want to keep the information on the nature of the missing values. However, if you read variables with mixed numeric and character values as numeric, Stata will drop any observations that contain non-numeric characters. To prevent this, first read female and ses as strings by putting str1 (a string of length 1) in front of each variable name in the input command. Then convert the “a” values to “-999” and the “b” values to “-99” (note that these two numbers are still stored as strings). Next, convert female and ses to numeric variables with the destring command, which keeps every observation. The rest of the procedure is the same as in Example 1.
    clear all
    input score str1 female str1 ses
    56 1 1
    62 1 2
    73 0 3
    67 a 1
    57 0 1
    56 1 2
    57 1 b
    end
    save test2, replace

    *using the replace command
    replace female="-999" if female == "a"
    replace ses="-99" if ses == "b"
    list, clean noobs

    *convert variables to numeric
    destring female ses, replace

    *using the foreach command
    foreach var of varlist female ses {
        replace `var' = .a if `var' == -999
        replace `var' = .b if `var' == -99
    }
    list, clean noobs

        score   female   ses
           56        1     1
           62        1     2
           73        0     3
           67       .a     1
           57        0     1
           56        1     2
           57        1    .b
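The same result can be reached by reading tiny.txt directly and skipping the intermediate -999 and -99 codes. The following is a minimal sketch, not the only way to do it. It assumes tiny.txt is in the current working directory, and the new variable names (female_num, ses_num) and the label name misscode are hypothetical choices made for illustration. It uses the real() function, which returns "." for any string that is not a number, and then attaches value labels so the meaning of each code travels with the data.

```stata
* Sketch: read the raw file with female and ses as 1-character strings
clear all
infile score str1 female str1 ses using tiny.txt

* Build numeric copies, then mark the letter codes as .a and .b
foreach var of varlist female ses {
    generate `var'_num = real(`var')
    replace `var'_num = .a if `var' == "a"
    replace `var'_num = .b if `var' == "b"
}

* Label the extended missing values (label name is hypothetical)
label define misscode .a "refused" .b "data entry error"
label values female_num ses_num misscode

list score female_num ses_num, clean noobs
```

Keeping the original string variables alongside the numeric copies makes it easy to verify the recoding before dropping them.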