Sometimes, a data set may have “holes” in it, that is, missing values. Some statistical procedures such as regression analysis will not work as well, or at all, on a data set with missing values. The observations with missing values have to be deleted or the missing values have to be substituted in order for a statistical procedure to produce meaningful results. Most statistical programs (including SAS, SPSS, and Stata) will automatically remove such cases from any analysis you run (without deleting the cases from the dataset). This is why the “n” often varies from analysis to analysis, even if the dataset is the same. Different variables have different amounts of missing data and hence, changing the variables in a model changes the number of cases with complete data on all the variables in the model. Because the software drops cases with missing values for us, it is very easy to “forget” about missing data entirely. However, the presence of missing data can influence our results, especially when a dataset or even a single variable, has a high percentage of values missing. Thus it is always a good idea to check a dataset for missing data, and to think about how the missing data may influence our analyses. This page shows a few methods of looking at missing values in a dataset, this information can be used to make better informed decisions about how to handle the missing values.

Before we begin, we need some data with missing values, the code below inputs a small dataset into Stata, and then displays that data. In a small dataset, like the one below, it is very easy to look at the raw data and see where values are missing. However, when datasets are large, we need a more systematic way to examine our dataset for missing values. Below we show you some ways to do that, using the data below as an example.

clear input landval improval totval salepric saltoapr 30000 64831 94831 118500 1.25 30000 50765 80765 93900 . 46651 18573 65224 . 1.16 45990 91402 . 184000 1.34 42394 . 40575 168000 1.43 . 3351 51102 169000 1.12 63596 2182 65778 . 1.26 56658 53806 10464 255000 1.21 51428 72451 . . 1.18 93200 . 4321 422000 1.04 76125 78172 54297 290000 1.14 . 61934 16294 237000 1.10 65376 34458 . 286500 1.43 42400 . 57446 . . 40800 92606 33406 168000 1.26 end list+---------------------------------------------------+ | landval improval totval salepric saltoapr | |---------------------------------------------------| 1. | 30000 64831 94831 118500 1.25 | 2. | 30000 50765 80765 93900 . | 3. | 46651 18573 65224 . 1.16 | 4. | 45990 91402 . 184000 1.34 | 5. | 42394 . 40575 168000 1.43 | |---------------------------------------------------| 6. | . 3351 51102 169000 1.12 | 7. | 63596 2182 65778 . 1.26 | 8. | 56658 53806 10464 255000 1.21 | 9. | 51428 72451 . . 1.18 | 10. | 93200 . 4321 422000 1.04 | |---------------------------------------------------| 11. | 76125 78172 54297 290000 1.14 | 12. | . 61934 16294 237000 1.1 | 13. | 65376 34458 . 286500 1.43 | 14. | 42400 . 57446 . . | 15. | 40800 92606 33406 168000 1.26 | +---------------------------------------------------+

## 1. Number of missing values vs. number of non missing values

The first thing we are going to do is determine which variables have a lot
of missing values. We have created a small Stata program called **mdesc**
that counts the number of missing values in both numeric and character variables. You can download
**mdesc**
from within Stata by
typing **search mdesc** (see
How can I use the search command to search for programs and get additional
help? for more information about using **search**).

Then you can run
**mdesc** for one or more variables as illustrated below.

mdescVariable Missing Total Missing/Total ------------------------------------------------------------ landval 2 15 .133333 improval 3 15 .2 totval 3 15 .2 salepric 4 15 .266667 saltoapr 2 15 .133333

Now we know the number of missing values in each variable. For instance, variable
**salepric**
has four missing values and **saltoapr** has two missing values.

## 2. Obtaining the number of missing values per observation

We can also look at the distribution of missing values across observations.
The code below creates a variable called **nmis** that gives the number of missing values
for each observation. The
function **rmiss2()** used here is an extension to the egen function **rmiss()**. It counts the
number of missing values in the varlist. **rmiss2()** accepts both string and numeric variables.
(Stata’s **rmiss()** only accepts numeric variables.) You can download **rmiss2()
**over the internet
from within Stata by typing **search rmiss2** (see
How can I use the search command to search for programs and get additional
help? for more information about using **search**).

egen nmis=rmiss2(landval improval totval salepric saltoapr)

Below we tabulate the variable we just created. Looking at the frequency table we know that there are four observations with no missing values, nine observations with one missing values, one observation with two missing values and one observation with three missing values.

tab nmisnmis | Freq. Percent Cum. ------------+----------------------------------- 0 | 4 26.67 26.67 1 | 9 60.00 86.67 2 | 1 6.67 93.33 3 | 1 6.67 100.00 ------------+----------------------------------- Total | 15 100.00

## 3. Patterns of missing values

We can also look at the patterns of missing values. You can download
**mvpatterns **over the internet from within Stata by typing **search
mvpatterns** (see How can I use the search command
to search for programs and get additional help? for more information
about using **search**). The command **mvpatterns** produces output
for all variables in the dataset, for missing data patterns
across a subset of variables, a variable list can be included, for example,
**mvpatterns landval improval totval**.

The output produced by **mvpatterns** is shown below. The first
table lists the variables, their storage type (type), the number of
observations (obs), the number of missing values (mv), and the variable
label if the variables has one. The second table contains the information
on the pattern of missing values. The first block of columns in the output shows
the patterns of missing data. Within the block, each variable is represented by
a column, a “+” indicates that values of that variable are present in a given
missing data pattern, a “.” indicates that they are missing. The columns follow
the same order as the variable list in the first table, so that the first column
in the output below represents **landval**, the second **improval**, and
so on. The missing data patterns are listed in descending frequency, here the
most common missing data pattern is complete data (“+++++”). The table also shows the
number of missing values in that pattern (_mv), and the number of cases with
that missing data pattern (_freq). Based on the information in the second table we know that there are four observations
with no missing values, two cases missing on just the variable **salepric**, and one
observation with missing values on **improval**, **salepric** and **saltoapr**.

mvpatternsVariable | type obs mv variable label -------------+----------------------------------- landval | float 13 2 improval | float 12 3 totval | float 12 3 salepric | float 11 4 saltoapr | float 13 2 ------------------------------------------------- Patterns of missing values +------------------------+ | _pattern _mv _freq | |------------------------| | +++++ 0 4 | | +++.+ 1 2 | | ++.++ 1 2 | | +.+++ 1 2 | | .++++ 1 2 | |------------------------| | ++++. 1 1 | | ++..+ 2 1 | | +.+.. 3 1 | +------------------------+

## 4. When all the variables of interest are numeric

The examples above all work regardless of whether the variables of interest
(i.e. the variables whose missing data patterns you want to examine)
are numeric or string. When all of the variables you wish to check for missing
values are numeric we
can use a program called **misschk** to simplify the steps of examining the
missing data in our dataset. (Note: numeric variables include those with value
labels that are strings, as long as the actual values of the
variables are stored as numbers.) You can
download **misschk** from within Stata by typing **search misschk **(see
How can I use the search command to search for programs and
get additional help? for more information about using **search**).

Below is the command for **misschk**. We have listed
all five of the variables in our dataset in the variable list after the **misschk**
command. However, we could have just left the list of variables blank (i.e. used only **misschk , gen(miss)**
instead), if we had, **misschk** would have run using all the variables in our dataset. The variable
list is only necessary if we want to run **misschk** on only some of the
variables in our dataset. The **gen(miss)** option tells **misschk** that we want it to
create two new variables, both of which start with “miss”. These two variables
will be named **misspattern** and **missnumber**. The variable **misspattern** indicates
which of the missing data patterns each case follows. The variable **missnumber** indicates
the number of missing values for each case.

misschk landval improval totval salepric saltoapr, gen(miss)

The output for **misschk** consists of three tables. The first table lists the number of missing
values, as well as percent missing for each variable, this is similar to the table
produced by **mdesc** in part 1 above. This table also contains a column labeled “#” which assigns each
variable a number that is used to identify the variable later on in the output. The second table shows
the distribution of missing values. The pattern of missingness is described using the variable
numbers from the first table, and underscores (“_”). The numbers indicate which variables are
missing in that pattern, the underscores represent non-missing observations. For example, from
the second table we see that two cases have missing values on variable 1 (**landval**), but
complete data on all other variables, and that one case is missing data on variables 2, 4, and 5.
The bottom row shows that four cases are not missing any values at all (all underscores). This
table shows the same information generated in part three above, but in a slightly different format.
The missing data pattern for each case is described in the variable **misspattern**.
Finally, the third table shows the distribution of the number of missing values per case. This is
the same information discussed above in part 2. The number of variables each
case is missing is also contained in the variable **missnumber**.

Variables examined for missing values # Variable # Missing % Missing -------------------------------------------- 1 landval 2 13.3 2 improval 3 20.0 3 totval 3 20.0 4 salepric 4 26.7 5 saltoapr 2 13.3 Missing for | which | variables? | Freq. Percent Cum. ------------+----------------------------------- 1____ | 2 13.33 13.33 _2_45 | 1 6.67 20.00 _2___ | 2 13.33 33.33 __34_ | 1 6.67 40.00 __3__ | 2 13.33 53.33 ___4_ | 2 13.33 66.67 ____5 | 1 6.67 73.33 _____ | 4 26.67 100.00 ------------+----------------------------------- Total | 15 100.00 Missing for | how many | variables? | Freq. Percent Cum. ------------+----------------------------------- 0 | 4 26.67 26.67 1 | 9 60.00 86.67 2 | 1 6.67 93.33 3 | 1 6.67 100.00 ------------+----------------------------------- Total | 15 100.00