The easiest way to convert string variables to numeric form is to use the encode command. If the variable is actually a numeric value that just happens to be stored as a string, see our FAQ: How can I quickly convert many string variables to numeric variables?
Let’s say that you have the following data:
region units East 800 South 600 South 500 West 850 East 300 Central 500 East 600 West 390 Central 480 Central 400 South 650 East 350 Central 380 South 470 West 550 South 720 South 460 East 520 West 400 East 590 |
Let’s begin by reading in the dataset from the Internet and using the describe command.
use https://stats.idre.ucla.edu/stat/stata/faq/region, clear describe Contains data from region.dta obs: 20 vars: 2 size: 260 (99.8% of memory free) -------------------------------------------- 1. region str7 %9s 2. units int %8.0g -------------------------------------------- Sorted by:
The variable region is a string variable. Next, let’s do a tabulate on region.
Now let's use the encode command and another describe.tabulate region region | Freq. Percent Cum. ------------+----------------------------------- Central | 4 20.00 20.00 East | 6 30.00 50.00 South | 6 30.00 80.00 West | 4 20.00 100.00 ------------+----------------------------------- Total | 20 100.00
encode region, generate(region2) describe Contains data from region.dta obs: 20 vars: 3 size: 340 (99.8% of memory free) -------------------------------------------- 1. region str7 %9s 2. units int %8.0g 3. region2 long %8.0g region2 -------------------------------------------- Sorted by: Note: dataset has changed since last saved
The encode command has generated a new variable called region2 that is of type long integer and has value labels which are defined and are also called region2. Now let's do two tabulates, one with labels and one without labels.
tabulate region2 region2 | Freq. Percent Cum. ------------+----------------------------------- Central | 4 20.00 20.00 East | 6 30.00 50.00 South | 6 30.00 80.00 West | 4 20.00 100.00 ------------+----------------------------------- Total | 20 100.00 tabulate region2, nolabel region2 | Freq. Percent Cum. ------------+----------------------------------- 1 | 4 20.00 20.00 2 | 6 30.00 50.00 3 | 6 30.00 80.00 4 | 4 20.00 100.00 ------------+----------------------------------- Total | 20 100.00
The first tabulate looks identical to the tabulate done on the string variable region above. As you can see from the second tabulate, region2 has the numeric values 1 through 4 but retains the strings Central, East, South and West as value labels.