How do we detect if our data set has duplicate observations in Stata? Let’s look at the following data set with some duplicate observations in it.
list id female race read 1. 18 0 1 50 2. 22 0 1 42 3. 26 1 2 60 4. 147 1 4 47 5. 171 0 4 60 6. 191 1 4 47 7. 18 0 1 50 8. 18 0 1 50
In Stata, several programs are available to detect the duplicates and can also optionally drop the duplicates. One of the programs is called dups. Program dups is not a built-in program in Stata, but can be installed over the internet using search dups (see How can I used the search command to search for programs and get additional help? for more information about using search). Once dups is installed we can use it right away. Here is a list of the different features that dups has.
Without any argument, dups returns information on the number of groups of observations that have duplicates and the number of duplicates in each group. In our example, we have one group of observations with duplicates consisting of observation number 1, 7 and 8. That is what we see below.
dups group by: id female race read groups formed: 1 groups of duplicate observations: _group _count 1 3
We can add a variable list after dups. For example, in the following example, we add variable race after dups. Now dups counts how many duplicate observations in variable race only. We can see from the list of the data set, that there are three groups of observations of race (1, 2 and 4) and two of them have duplicates. That is shown by dups below.
dups race group by: race groups formed: 2 groups of duplicate observations: _group _count 1 3 2 4
By adding an option unique, we also request information on groups that have a unique observation. For example,
dups id, unique group by: id groups formed: 1 groups of duplicate observations: _group _count 1 3 unique observations: _group _count 1 1 2 1 3 1 4 1 5 1
With option key(varlist), we can request to list the observations. For example, in the following example, we see the values of id in each group.
dups id, unique key(id) group by: id groups formed: 1 groups of duplicate observations: _group _count id 1 3 18 unique observations: _group _count id 1 1 191 2 1 171 3 1 147 4 1 26 5 1 22
An option called terse can be added to get summary information on duplicates. For example,
dups id, unique key(id) terse group by: id groups formed: 1 total observations: 8 in duplicates 3 in unique 5
Now what if we want to drop the duplicates? We can do it by adding an option called drop. We do want to warn you that it is always dangerous to delete observations since you may lose your data. So always do it with caution. What is nice about dups is that it creates a new variable which has enough information to recover the deleted observations if ever we want to change our mind on what we just did. The default name of the variable is _expand (you can change the name by using the option expand after dups). Using the variable _expand we can get the deleted observations back by using a command called expand. See the example below.
dups, drop key(id) group by: id female race read groups formed: 1 groups of duplicate observations: _group _count id 1 3 18 (2 observations deleted) observations remaining: 6 list id female race read _expand 1. 22 0 1 42 1 2. 26 1 2 60 1 3. 147 1 4 47 1 4. 171 0 4 60 1 5. 191 1 4 47 1 6. 18 0 1 50 3 expand _expand (2 observations created) list id female race read _expand 1. 22 0 1 42 1 2. 26 1 2 60 1 3. 147 1 4 47 1 4. 171 0 4 60 1 5. 191 1 4 47 1 6. 18 0 1 50 3 7. 18 0 1 50 3 8. 18 0 1 50 3
Now we have seen how to detect and drop duplicate observations by using dups.