Introduction
Stata has two built-in variables called _n and _N. _n is Stata notation for the current observation number. _n is 1 in the first observation, 2 in the second, 3 in the third, and so on.
_N is Stata notation for the total number of observations. Let’s see how _n and _N work.
    input score group
    72 1
    84 2
    76 1
    89 3
    82 2
    90 1
    85 1
    end

    generate id = _n
    generate nt = _N
    list

         score   group   id   nt
      1.    72       1    1    7
      2.    84       2    2    7
      3.    76       1    3    7
      4.    89       3    4    7
      5.    82       2    5    7
      6.    90       1    6    7
      7.    85       1    7    7

As you can see, the variable id contains the observation number, running from 1 to 7, and nt contains the total number of observations, which is 7.
Counting with by
Using _n and _N in conjunction with the by command can produce some very useful results. Of course, to use the by command we must first sort our data on the by variable.
    sort group score
    by group: generate n1 = _n
    by group: generate n2 = _N
    list

         score   group   id   nt   n1   n2
      1.    72       1    1    7    1    4
      2.    76       1    3    7    2    4
      3.    85       1    7    7    3    4
      4.    90       1    6    7    4    4
      5.    82       2    5    7    1    2
      6.    84       2    2    7    2    2
      7.    89       3    4    7    1    1
Now n1 is the observation number within each group and n2 is the total number of observations for each group.
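Because the data are sorted by score within each group, the first observation in each group (n1 equal to 1) holds the lowest score. To list the lowest score for each group use the following: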
    list if n1==1

         score   group   id   nt   n1   n2
      1.    72       1    1    7    1    4
      5.    82       2    5    7    1    2
      7.    89       3    4    7    1    1
To list the highest score for each group, select the last observation in each group, where n1 equals n2:
    list if n1==n2

         score   group   id   nt   n1   n2
      4.    90       1    6    7    4    4
      6.    84       2    2    7    2    2
      7.    89       3    4    7    1    1
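The same idea can be written with explicit subscripts, which avoids creating the counters first. The following is a minimal sketch, assuming the score and group variables from the example above; the variable names low and high are made up for illustration.

    * sort by group and, within group, by score
    * (score is in parentheses, so it only controls the sort order)
    bysort group (score): generate low = score[1]
    * within each by-group, _N indexes the last observation
    bysort group (score): generate high = score[_N]
    list score group low high

Within a by-group, score[1] refers to the first observation of that group and score[_N] to the last, so every observation carries its group's lowest and highest score.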
Another use of _n
Let’s use _n to find out if there are duplicate id numbers in the following data:
    input id score
    117 72
    204 84
    311 76
    289 89
    141 82
    277 90
    465 85
    289 88
    182 84
    end

    sort id
    list if id == id[_n + 1]

            id   score
      6.   289      88

    list in 6/7

            id   score
      6.   289      88
      7.   289      89

As it turns out, observations 6 and 7 have the same id number but different score values.
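Comparing each observation with the next one flags only the first observation of a duplicated pair. As an alternative sketch, _N can be used with by to flag every observation that shares its id with another one; the variable name dup is an assumption chosen for illustration.

    * dup holds the number of observations that share each id
    bysort id: generate dup = _N
    * list every observation whose id occurs more than once
    list if dup > 1

With the data above, this lists both observations that have id 289.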
Finding Duplicates
Now let’s use _N to find duplicate observations.
    input id score x1 x2 y1 y2 z1 z2
    117 72 3 16 42 7 59 61
    204 84 6 12 44 9 51 66
    141 82 2 17 41 5 56 61
    311 76 9 14 46 1 58 62
    289 89 4 13 48 3 55 68
    141 82 2 17 41 5 56 61
    277 90 3 12 44 6 52 65
    465 85 5 19 43 2 54 64
    289 88 7 18 45 4 58 69
    182 84 1 11 47 7 52 61
    141 90 4 13 43 4 51 65
    end

    sort id score x1 x2 y1 y2 z1 z2
    by id score x1 x2 y1 y2 z1 z2: generate n = _N
    list if n>1

    Observation 2

        id 141    score 82    x1 2    x2 17    y1 41
        y2 5      z1 56       z2 61   n 2

    Observation 3

        id 141    score 82    x1 2    x2 17    y1 41
        y2 5      z1 56       z2 61   n 2

In this example we sort the observations by all of the variables. Then we use all of the variables in the by statement and set n equal to the total number of observations that are identical. Finally, we list the observations for which n is greater than 1, thereby identifying the duplicate observations.
If you have a lot of variables in the dataset, it could take a long time to type them all out twice. We can make use of the “*” wildcard to indicate that we wish to use all the variables. Further, in recent versions of Stata we can combine sort and by into a single statement. Below is a simplified version of the code that will yield the exact same results as above.

    bysort * : generate n = _N
    list if n>1
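Stata also provides a duplicates command that performs this kind of check directly. The lines below are a sketch, assuming your version of Stata includes duplicates; the variable name dup is made up for illustration. With no variable list, each subcommand compares observations on all variables.

    * summarize how many observations are exact copies of another
    duplicates report
    * list the duplicated observations
    duplicates list
    * dup = number of additional copies of each observation (0 if unique)
    duplicates tag, generate(dup)
    list if dup > 0

Note that duplicates tag counts the surplus copies, so an observation that occurs twice gets dup equal to 1, whereas the _N approach above gives n equal to 2.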