I don’t know if there is a standard definition of doubly wide data, but we will use it to mean data collected by subject on multiple time points on multiple factors. For example, say a researcher collects data for three trials on each of three days and records the data for each subject in a single row. We can use the data below for two subjects to illustrate this. For these data the variable d1t1 is the value of the response variable for day 1 and trial 1. The data for day 2 trial 1 is found in d2t1 and so on for each day and trial.
clear input d1t1 d1t2 d1t3 d2t1 d2t2 d2t3 d3t1 d3t2 d3t3 4 5 6 5 6 5 6 7 8 3 4 5 4 5 4 3 4 5 end
We are beginning with one row of data for each subject, and we want to end up with nine rows of data for each subject with variables to indicate day and trial. Many Stata users would reshape the data in two stages using two reshape commands. We will do using a single reshape command and several recodes.
We will begin by creating a subject identifier variable called, cleverly, id. Next, we will rename all the response variables to have a common stem followed by a sequence value, such that d1t1 is renamed y1, d1t2 is renamed y2 and so on. Here is the code to do these two operations.
generate id = _n foreach v of varlist d1t1 d1t2 d1t3 d2t1 d2t2 d2t3 d3t1 d3t2 d3t3 { rename `v' y`i' local i = `i' + 1 } describe Contains data obs: 2 vars: 10 size: 88 (99.9% of memory free) ------------------------------------------------------------------------------------------------------------- storage display value variable name type format label variable label ------------------------------------------------------------------------------------------------------------- y1 float %9.0g y2 float %9.0g y3 float %9.0g y4 float %9.0g y5 float %9.0g y6 float %9.0g y7 float %9.0g y8 float %9.0g y9 float %9.0g id float %9.0g
Now we can go ahead and reshape the data from wide to long with id as the subject identifier. As part of the reshape command we create a variable called seq which will be the sequence identifier for the nine observations within each subject.
reshape long y, i(id) j(seq) clist if i==1 id seq y 1. 1 1 4 2. 1 2 5 3. 1 3 6 4. 1 4 5 5. 1 5 6 6. 1 6 5 7. 1 7 6 8. 1 8 7 9. 1 9 8
Next, we will create variables for day and trial using the recode command. The trick here is to align the sequence number with the day or trial number. For example, sequence numbers 1, 4, and 7 are all trial 1 and 2, 5 and 8 are trial 2. The same thing is done with day, such that sequence numbers 1, 2 and 3 are day 1 and so on.
recode seq (1 4 7=1) (2 5 8=2) (3 6 9=3), gen(trial) recode seq (1/3=1) (4/6=2) (7/9=3), gen(day) clist id seq y trial day 1. 1 1 4 1 1 2. 1 2 5 2 1 3. 1 3 6 3 1 4. 1 4 5 1 2 5. 1 5 6 2 2 6. 1 6 5 3 2 7. 1 7 6 1 3 8. 1 8 7 2 3 9. 1 9 8 3 3 10. 2 1 3 1 1 11. 2 2 4 2 1 12. 2 3 5 3 1 13. 2 4 4 1 2 14. 2 5 5 2 2 15. 2 6 4 3 2 16. 2 7 3 1 3 17. 2 8 4 2 3 18. 2 9 5 3 3
It is also possible to reshape without renaming the variables. To do this we make use of the string option in the reshape command. Here is the code for everything up through reshape.
clear input d1t1 d1t2 d1t3 d2t1 d2t2 d2t3 d3t1 d3t2 d3t3 4 5 6 5 6 5 6 7 8 3 4 5 4 5 4 3 4 5 end gen id = _n reshape long d, i(id) j(s) string clist id s d 1. 1 1t1 4 2. 1 1t2 5 3. 1 1t3 6 4. 1 2t1 5 5. 1 2t2 6 6. 1 2t3 5 7. 1 3t1 6 8. 1 3t2 7 9. 1 3t3 8 10. 2 1t1 3 11. 2 1t2 4 12. 2 1t3 5 13. 2 2t1 4 14. 2 2t2 5 15. 2 2t3 4 16. 2 3t1 3 17. 2 3t2 4 18. 2 3t3 5
Now we have d as our dependent variable and a string variable s that has the information on day and trial. So when you see s = 2t3, it means day 2 and trial 3. We can create new variables for day and trial using regular expressions (see FAQ: How can I extract a portion of a string variable using regular expressions?) built into Stata. Here’s the code.
generate day = real(regexs(0) )if regexm(s, "^[0-9]+") generate trial = real(regexs(0)) if regexm(s, "[0-9]+$") clist id s d day trial 1. 1 1t1 4 1 1 2. 1 1t2 5 1 2 3. 1 1t3 6 1 3 4. 1 2t1 5 2 1 5. 1 2t2 6 2 2 6. 1 2t3 5 2 3 7. 1 3t1 6 3 1 8. 1 3t2 7 3 2 9. 1 3t3 8 3 3 10. 2 1t1 3 1 1 11. 2 1t2 4 1 2 12. 2 1t3 5 1 3 13. 2 2t1 4 2 1 14. 2 2t2 5 2 2 15. 2 2t3 4 2 3 16. 2 3t1 3 3 1 17. 2 3t2 4 3 2 18. 2 3t3 5 3 3
We now have the same day and trial variables as with our first method.
These two methods worked pretty well for doubly wide data, what about triply wide data? Consider the example in which data are collected for two trials per day on two days in each of two different weeks. Here is some sample data on two subjects in which w1d1t1 stands for week 1, day 1, trial 1 and so on.
clear input w1d1t1 w1d1t2 w1d2t1 w1d2t2 w2d1t1 w2d1t2 w2d2t1 w2d2t2 4 5 6 5 6 5 6 7 3 4 5 4 5 4 3 4 end
The process is exactly the same as our first method above except that there is one additional recode command. This time we will run all of the code in a single block.
local i = 1 foreach v of varlist w1d1t1 w1d1t2 w1d2t1 w1d2t2 w2d1t1 w2d1t2 w2d2t1 w2d2t2 { rename `v' y`i' local i = `i' + 1 } generate id = _n reshape long y, i(id) j(seq) recode seq (1 3 5 7=1) (2 4 6 8=2), gen(trial) recode seq (1 2 5 6=1) (3 4 7 8=2), gen(day) recode seq (1/4=1) (5/8=2), gen(week) clist id seq y trial day week 1. 1 1 4 1 1 1 2. 1 2 5 2 1 1 3. 1 3 6 1 2 1 4. 1 4 5 2 2 1 5. 1 5 6 1 1 2 6. 1 6 5 2 1 2 7. 1 7 6 1 2 2 8. 1 8 7 2 2 2 9. 2 1 3 1 1 1 10. 2 2 4 2 1 1 11. 2 3 5 1 2 1 12. 2 4 4 2 2 1 13. 2 5 5 1 1 2 14. 2 6 4 2 1 2 15. 2 7 3 1 2 2 16. 2 8 4 2 2 2
Now we can do the triply wide using reshape with the string option and with regular expressions to decode the week, day and trial.
clear input w1d1t1 w1d1t2 w1d2t1 w1d2t2 w2d1t1 w2d1t2 w2d2t1 w2d2t2 4 5 6 5 6 5 6 7 3 4 5 4 5 4 3 4 end generate id=_n reshape long w, i(id) j(s) string generate week = real(regexs(0)) if regexm(s, "^[0-9]+") generate day = real(regexs(1)) if regexm(s, "[d]([0-9]*)") generate trial = real(regexs(0)) if regexm(s, "[0-9]+$") clist id s w week day trial 1. 1 1d1t1 4 1 1 1 2. 1 1d1t2 5 1 1 2 3. 1 1d2t1 6 1 2 1 4. 1 1d2t2 5 1 2 2 5. 1 2d1t1 6 2 1 1 6. 1 2d1t2 5 2 1 2 7. 1 2d2t1 6 2 2 1 8. 1 2d2t2 7 2 2 2 9. 2 1d1t1 3 1 1 1 10. 2 1d1t2 4 1 1 2 11. 2 1d2t1 5 1 2 1 12. 2 1d2t2 4 1 2 2 13. 2 2d1t1 5 2 1 1 14. 2 2d1t2 4 2 1 2 15. 2 2d2t1 3 2 2 1 16. 2 2d2t2 4 2 2 2
And that’s how easy it is to reshape doubly and triply wide data to long.