When two people enter the same data (double data entry), the question is whether the two datasets disagree anywhere (detecting such disagreement is the rationale for double data entry) and, if so, where. We start by reading in the two datasets, one entered by person1 and the other by person2. After reading in each dataset, we sort it by the identifier variable, id, and save it.
    clear
    input id str8 name age ht wt income
    11 john 23 68 145 23000
    12 charlie 25 72 178 45000
    13 sally 21 64 135 12000
    4 mike 34 70 156 5600
    43 paul 30 73 189 15600
    end
    sort id
    save person1, replace

    clear
    input id str8 name age ht wt income
    11 john 23.5 68 145 23000
    12 charles 25 52 178 45000
    13 sally 21 64 . 12000
    4 michael 34 70 156 5600
    43 Paul 30 73 189 5600
    end
    sort id
    save person2, replace
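Because the later steps match records on id, it can be worth confirming that id uniquely identifies the observations in each file. The following is a minimal sketch, assuming the person1 and person2 files saved above; isid exits with an error if id contains duplicate (or missing) values and is silent otherwise.

```stata
* Check that id uniquely identifies the observations in each file.
use person1, clear
isid id
use person2, clear
isid id
```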
We compare the two datasets with the cf command to see whether any discrepancies exist.
    use person1, clear
    cf _all using person2, verbose

              id:  match
            name:  3 mismatches
             age:  1 mismatches
              ht:  1 mismatches
              wt:  1 mismatches
          income:  1 mismatches
    r(9);
The cf command shows that differences exist, but it does not tell us in which observations the mismatches occur, which is our main objective. To find where the errors occurred, we create one large dataset that combines the two. In the combined dataset we must be able to distinguish the data entered by person1 from the data entered by person2, so we rename every variable from person1 except id (which is needed for matching) by adding the suffix "_person1" with the rename command. A foreach loop makes the renaming more efficient. Once the variables are renamed, person2 is merged with person1 on id, and the merged dataset is listed.
    use person1, clear
    foreach var of varlist name-income {
        rename `var' `var'_person1
    }
    merge id using person2
    list

         +---------------------------------------------------------------------------------------------------------+
         | id   name_p~1   age_pe~1   ht_per~1   wt_per~1   income~1      name    age   ht    wt   income   _merge |
         |---------------------------------------------------------------------------------------------------------|
      1. |  4       mike         34         70        156       5600   michael     34   70   156     5600        3 |
      2. | 11       john         23         68        145      23000      john   23.5   68   145    23000        3 |
      3. | 12    charlie         25         72        178      45000   charles     25   52   178    45000        3 |
      4. | 13      sally         21         64        135      12000     sally     21   64     .    12000        3 |
      5. | 43       paul         30         73        189      15600      Paul     30   73   189     5600        3 |
         +---------------------------------------------------------------------------------------------------------+
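The merge id using person2 form used above is the older merge syntax. The following is a minimal sketch of the same step in the newer syntax, assuming Stata 11 or later and the person1 and person2 files saved earlier. The final list is an added check that is not part of the original steps: it shows any id that appears in only one of the two files.

```stata
* Sketch: same renaming, then a one-to-one merge on id (Stata 11+ syntax).
use person1, clear
foreach var of varlist name-income {
    rename `var' `var'_person1
}
merge 1:1 id using person2

* _merge == 3 marks ids found in both files; list any that are not.
list id _merge if _merge != 3
```

With the five ids entered here, every record is matched, so the final list displays nothing.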
In exploring the discrepancies, we can display them either by variable or by observation. We start by variable. The foreach command loops over the variables from person2 (those without the suffix), name-income. The if clause, `var' != `var'_person1, restricts the listing for each variable (referenced by `var' inside the foreach loop) to the observations where the value entered by person2 (`var') differs from the value entered by person1 (`var'_person1). For those observations, we list id, the value entered by person2 (`var'), and the value entered by person1 (`var'_person1).
Note that when we list the variables, the variables with no suffix correspond to the entries made by person2.
    *Discrepancies listed by variables.
    foreach var of varlist name-income {
        list id `var' `var'_person1 if `var' != `var'_person1, abbreviate(15)
    }
         +-----------------------------+
         | id      name   name_person1 |
         |-----------------------------|
      1. |  4   michael           mike |
      3. | 12   charles        charlie |
      5. | 43      Paul           paul |
         +-----------------------------+

         +-------------------------+
         | id    age   age_person1 |
         |-------------------------|
      2. | 11   23.5            23 |
         +-------------------------+

         +----------------------+
         | id   ht   ht_person1 |
         |----------------------|
      3. | 12   52           72 |
         +----------------------+

         +----------------------+
         | id   wt   wt_person1 |
         |----------------------|
      4. | 13    .          135 |
         +----------------------+

         +------------------------------+
         | id   income   income_person1 |
         |------------------------------|
      5. | 43     5600            15600 |
         +------------------------------+
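One of the name mismatches (id 43, Paul versus paul) differs only in capitalization. If such differences are not of interest, one option is to compare the names after converting both to lowercase. This is a sketch, assuming the merged dataset is still in memory; it is an optional refinement and not part of the procedure described here.

```stata
* List name mismatches that remain after ignoring letter case.
list id name name_person1 if lower(name) != lower(name_person1), abbreviate(15)
```

With the data entered here, this lists ids 4 and 12 and no longer lists id 43.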
To list discrepancies by observation, we modify the prior program to evaluate the variables one case at a time; i.e., for observation 1, we check the entries across all variables given in the foreach. Once observation 1 is checked and its discrepancies are listed, we move to observation 2, and so on until the last observation. First, we find how many observations are in the data with the count command and then insert that value in the forvalues loop, which lets us step through the observations one at a time. We add _n == `i' to the if clause of the list command so that the variables in the foreach loop are evaluated for a given observation before moving to the next.
    *Discrepancies listed by observation.
    count
        5

    forvalues i = 1/5 {
        foreach var of varlist name-income {
            list id `var' `var'_person1 if (`var' != `var'_person1) & _n == `i', abbreviate(15)
        }
    }

         +-----------------------------+
         | id      name   name_person1 |
         |-----------------------------|
      1. |  4   michael           mike |
         +-----------------------------+

         +-------------------------+
         | id    age   age_person1 |
         |-------------------------|
      2. | 11   23.5            23 |
         +-------------------------+

         +-----------------------------+
         | id      name   name_person1 |
         |-----------------------------|
      3. | 12   charles        charlie |
         +-----------------------------+

         +----------------------+
         | id   ht   ht_person1 |
         |----------------------|
      3. | 12   52           72 |
         +----------------------+

         +----------------------+
         | id   wt   wt_person1 |
         |----------------------|
      4. | 13    .          135 |
         +----------------------+

         +--------------------------+
         | id   name   name_person1 |
         |--------------------------|
      5. | 43   Paul           paul |
         +--------------------------+

         +------------------------------+
         | id   income   income_person1 |
         |------------------------------|
      5. | 43     5600            15600 |
         +------------------------------+
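The loop above hard-codes the number of observations found by count. The following sketch takes the number from the built-in _N instead, so the same code works for a dataset of any size. It assumes the merged dataset is still in memory; the local macro name n is chosen only for illustration.

```stata
* Loop over every observation without hard-coding the count.
local n = _N
forvalues i = 1/`n' {
    foreach var of varlist name-income {
        list id `var' `var'_person1 if (`var' != `var'_person1) & _n == `i', abbreviate(15)
    }
}
```

Because count also leaves its result in r(N), local n = r(N) placed immediately after count would serve the same purpose.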