Many researchers use Stata without ever writing a program, even though programming could make them more efficient in their data analysis projects. Stata programming is not difficult, since it mainly involves the Stata commands that you already use. The trick to Stata programming is to use the appropriate commands in the right sequence. Of course, this is the trick to any kind of programming.

There are two kinds of files that are used in Stata programming, do-files and ado-files.
Do-files are run from the command line using the **do** command, for example,

**do hsbcheck hsberr**

Ado-files, on the other hand, work like ordinary Stata commands by just using the file name in the command line, for example,

**median3 read write math science**

In fact, many of the built-in Stata commands are just ado-files. You can look at the source
code for the ado commands using the **viewsource** command, for example,

**viewsource regress.ado**

Do-files can be placed in the same folder as the data, but ado-files need to go where Stata can find them. The best place for user-written ado-files is the /ado/personal/ directory. The location of this directory varies from system to system.
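Because the location differs between machines, it can help to ask Stata where it looks for ado-files. The following is a minimal sketch using the built-in **sysdir**, **personal** and **adopath** commands; the directories reported depend on your installation, so no output is shown.

```stata
* List Stata's system directories, including PERSONAL
sysdir

* Display the path of the PERSONAL directory
personal

* Show the order in which Stata searches for ado-files
adopath
```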

We will try to give users a feel for Stata programming by covering the following topics:

- Creating and using do-files for checking and cleaning data.
- Using do-files for analyzing data.
- Programming tools and tips.
- Writing an ado program to create a statistical command.
- Creating an ado-file that uses Stata’s built-in matrix commands.

## Part 1: Creating and using do-files for checking and cleaning data

We will create a file, **hsbcheck.do**, that contains commands that will display
observations with incorrect or impossible values.

    /* begin hsbcheck.do */
    version 9.2
    clear
    use `1'
    set more off

    nmissing                     /* nmissing is a user-written command */
    describe
    summarize

    list id if id<1 | id>200
    list id gender if ~inlist(gender,1,2,.)
    list id race if ~inlist(race,1,2,3,4,.)
    list id ses if ~inlist(ses,1,2,3,.)
    list id schtyp if ~inlist(schtyp,1,2,.)
    list id prog if ~inlist(prog,1,2,3,.)
    list id read if (read<1 | read>99) & read~=.
    list id write if (write<1 | write>99) & write~=.
    list id math if (math<1 | math>99) & math~=.
    list id science if (science<1 | science>99) & science~=.
    list id socst if (socst<1 | socst>99) & socst~=.

    /* female and region are created by hsbfix.do; capture noisily   */
    /* reports the error on the raw data without stopping the do-file */
    capture noisily list id female if (female<0 | female>1) & female~=.
    capture noisily list id region if (region<1 | region>3) & region~=.

    set more on
    /* end hsbcheck.do */

Here is how our do-file is used with the dataset **hsberr**.

    do hsbcheck hsberr

    [output omitted]

So how did the **hsbcheck** program “know” which file to use? This was done using a macro variable, in this
case `1', which takes the first word typed after the name of the program and treats it
as a file name. Macro variables have many uses, including holding variable names or numeric values.
We will see additional uses of macro variables in other programs.
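As an illustration of macro variables holding a file name, a variable name and a numeric value, here is a small sketch. The do-file name **hsbsum.do**, the macro names **datafile**, **varname** and **cutoff**, and the cutoff of 50 are all hypothetical and are not part of the seminar files. The **args** command copies the positional arguments into named local macros, which are easier to read than numbered ones.

```stata
/* begin hsbsum.do (hypothetical example) */
version 9.2
args datafile varname        /* name the first two arguments */
use `datafile', clear
summarize `varname'

local cutoff = 50            /* a local macro holding a numeric value */
quietly count if `varname' > `cutoff' & `varname' < .
display "`varname' has " r(N) " values above `cutoff'"
/* end hsbsum.do */
```

It would be run as **do hsbsum hsbclean write**, so that the first argument supplies the dataset and the second the variable.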

Now that we know what errors there are in the data we can write a do-file
that will fix the errors. When we know the correct value of an observation, we will replace the
incorrect value with the correct one. When we do not know the correct value for an observation, we
will replace the incorrect value with missing. The do-file **hsbfix.do** will read in
**hsberr**, correct the errors and save the corrected file as **hsbclean**.
Here is what **hsbfix.do** looks like.

    /* begin hsbfix.do */
    version 9.2
    clear
    use hsberr

    replace id=193 if id==1193
    replace read=47 if read==147
    replace science=61 if science==-61
    replace gender=. if gender==5
    replace race=. if race<1 | race>4
    replace ses=. if ses<1 | ses>3
    replace schtyp=. if schtyp<1 | schtyp>2
    replace prog=. if prog<1 | prog>3

    tab gender
    tab reg

    /* create female from gender */
    generate female = gender
    recode female 1=0 2=1
    label define fem 0 "male" 1 "female"
    label value female fem
    tab female gender

    /* create numeric region from string reg */
    encode reg, gen(region)
    recode region (1/3=1)(7=1)(4/5=2)(6=3)
    label define region 1 "Los Angeles" 2 "Orange" 3 "Riverside", modify
    tab reg region

    label data "hsb clean data using hsberr.do"
    save hsbclean, replace
    /* end hsbfix.do */

Notes on hsbfix.do:

1) One important thing to note is that after we fix the incorrect values, we will save the
data file with a new name. We will never change any of the values in the original
data file, **hsberr**.

2) We created two variables, **female** and **region**, that did not exist in the original dataset **hsberr**.
The name **gender** is ambiguous, while **female** is very straightforward and hardly needs value labels.
The variable **reg** has many problems common to string variables. The region “Los Angeles” was coded at least four
different ways. The solution used here (one of many possible solutions) was to convert the variable into a numeric variable,
recode observations into the correct regions and create value labels for the regions.

First, we will run **hsbfix** on the original file **hsberr** then, as a check, we will run **hsbcheck** on the new file
**hsbclean**.

    do hsbfix
    do hsbcheck hsbclean

    [output omitted]
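Reading through listings works for a small file, but the same checks can also be automated with Stata's **assert** command, which stops with an error if the condition is false for any observation. The sketch below is one possible set of checks on a few of the variables; it is not part of the seminar files, and whether every assertion passes depends on the data.

```stata
/* a sketch of automated checks; run after hsbfix.do */
use hsbclean, clear
assert inrange(id, 1, 200)
assert inlist(gender, 1, 2, .)
assert inlist(female, 0, 1, .)
assert inrange(read, 1, 99) | missing(read)
assert inrange(region, 1, 3) | missing(region)
```

If all of the assertions hold, the commands run silently, which makes them convenient at the top of an analysis do-file.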

## Part 2: Using do-files for analyzing data

Next, we will create a do-file that contains all of the commands that we need to
run our data analysis. This do-file will be called **hsbanalyze.do**.

    /* begin hsbanalyze.do */
    version 9.2
    log using hsb10_14_08.txt, text replace

    summarize read write math science
    univar read write math science              /* univar is user-written */
    tabstat write, stat(n mean sd p25 p50 p75) by(female)
    tabstat write, stat(n mean sd p25 p50 p75) by(prog)
    ttest write, by(gender)
    histogram write, normal start(30) width(5)
    more
    kdensity write, normal
    more
    tab1 female ses prog
    tab prog ses, all
    correlate write read science female
    more
    regress write read female
    rvfplot
    more
    ologit ses read write female
    gologit2 ses read write female, autofit     /* gologit2 is user-written */
    mlogit prog read write female

    log close
    /* end hsbanalyze.do */

Now, let’s use **hsbanalyze** with our data file **hsbclean**.

    use hsbclean, clear
    do hsbanalyze

    [output omitted]

This may not seem all that useful; after all, you could just as easily type each of the
commands into the command window, but what if your coauthor comes to you and says, “we need to
redo the whole analysis using only **schtyp** equal to one.” Here’s all you have
to do.

    keep if schtyp==1
    do hsbanalyze

    [output omitted]

## Part 3: Writing an ado program to create a statistical command

Now, let’s try our hand at writing a statistical command. Ado programs are very much
like do-file programs with the advantage that you just have to type the name of the
command. You will need to include two additional commands to create an ado program. You
begin with **program define** and the name of the command and you end with an **end**
command. Also, you need to save the file with the **.ado** extension, using the same name for the
file as the name of the new command.

We will illustrate the ado program by writing a command that computes the median. Of course, Stata already has commands that compute medians, but we are doing this to illustrate the process of creating an ado program.

The basic logic of computing the median is to sort the variable of interest and then, if
there is an odd number of values, take the middle one, and if there is an even number
of values, take the mean of the middle two. Below is the first version
of our program, which is saved in the file named **median1.ado**.

    /* Median Program -- Version #1 */
    /* basic program with one variable */
    program define median1
      version 9.2
      sort `1'
      quietly count if `1' ~= .
      local n = r(N)
      local mid = int(`n'/2)
      local odd = mod(`n',2)
      if `odd' {
        local median = `1'[`mid'+1]
      }
      else {
        local median = (`1'[`mid'] + `1'[`mid'+1])/2
      }
      display
      display as text "Median of `1' = " as result `median'
    end
    /* end median1.ado */

Let’s try **median1** on the **hsbclean** dataset.

    use hsbclean
    median1 write

    Median of write = 54
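One way to check a hand-written statistic is to compare it with a built-in command. The sketch below uses **summarize** with the **detail** option, which stores the median in r(p50); the value displayed should agree with the result from **median1**.

```stata
use hsbclean, clear
quietly summarize write, detail
display as text "Median of write from summarize = " as result r(p50)
```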

This program worked just fine but it could be improved. We will modify the program to
improve the output format and to allow for multiple variables. We will call this
new program **median2** which will be saved in the file **median2.ado**.

    /* Median Program -- Version #2 */
    /* multiple variables; saves results in return list */
    program define median2, rclass
      version 9.2
      syntax varlist(numeric)
      preserve                     /* preserve data */
      display
      display as text " Variable        N    Median"
      display as text "----------------------------"
      foreach var of local varlist {
        quietly count if `var' ~= .
        local n = r(N)
        local mid = int(`n'/2)
        local odd = mod(`n',2)
        sort `var'
        if `odd' {
          local median = `var'[`mid'+1]
        }
        else {
          local median = (`var'[`mid'] + `var'[`mid'+1]) / 2
        }
        display as result %9s "`var'" %9.0f `n' %10.2f `median'
      }
      return scalar Mdn = `median'
      return scalar N = `n'
      restore                      /* restore data */
    end
    /* end median2.ado */

Here is how **median2** works.

    median2 read write math science

     Variable        N    Median
    ----------------------------
         read      200     50.00
        write      200     54.00
         math      200     52.00
      science      195     53.00

Again, everything seems to be working fine, but it would be better if the
program allowed the use of **if** or **in** to subset the data. For example,
what if we wanted the medians just for males? The program **median3** will
allow the user to use **if** and **in**.

    /* Median Program -- Version #3 */
    /* Allows for if and in */
    program define median3, rclass
      version 9.2
      syntax varlist(numeric) [if] [in]
      display
      display as text " Variable        N    Median"
      display as text "----------------------------"
      foreach var of local varlist {
        preserve
        if "`if'"~="" | "`in'"~="" {
          quietly keep `if' `in'
        }
        quietly keep if ~missing(`var')
        quietly count
        local n = r(N)
        local mid = int(`n'/2)
        local odd = mod(`n',2)
        sort `var'
        if `odd' {
          local median = `var'[`mid'+1]
        }
        else {
          local median = (`var'[`mid'] + `var'[`mid'+1])/2
        }
        display as result %9s "`var'" %9.0f `n' %10.2f `median'
        restore
      }
      return scalar Mdn = `median'
      return scalar N = `n'
    end
    /* end median3.ado */

Here is how this version of the program works.

    median3 read write math science if female==0

     Variable        N    Median
    ----------------------------
         read       90     52.00
        write       90     50.50
         math       90     51.50
      science       85     55.00

    median3 read write math science in 50/150

     Variable        N    Median
    ----------------------------
         read      101     50.00
        write      101     54.00
         math      101     52.00
      science       98     53.00

The **rclass** option in **median2** and **median3** allows a program to store its results
temporarily in r(). Our programs keep the number of observations and the median
for the last variable used in the command. Here is how you can view the stored values.

    return list

    scalars:
                      r(N) =  99
                    r(Mdn) =  50
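Stored results are useful mainly because other commands can pick them up. The sketch below assumes **median3.ado** is installed where Stata can find it and that **hsbclean** is in memory; the local macro names **mdn** and **n** are chosen here for illustration. The values are copied right away because the next r-class command will overwrite r().

```stata
median3 write if female==1
local mdn = r(Mdn)      /* copy results before another command replaces r() */
local n   = r(N)
display as text "Median of write for females = " as result `mdn' ///
        as text " (N = " as result `n' as text ")"
```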

Stata has included some tools to make creating and debugging programs a bit easier. One
of them is the **set trace** command, which shows you, command by command, what is happening inside
your program. Let’s run **median3** with **trace** turned on and see what it
looks like.

    set trace on
    median3 science if female==0

    [output omitted]

    set trace off
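Trace output can be long because it also follows every ado-file that your program calls. A possible refinement, sketched here, is to limit the nesting depth with **set tracedepth** so that only the lines of **median3** itself are shown. The local macro **olddepth** is introduced here just to restore the previous setting afterwards.

```stata
local olddepth = c(tracedepth)   /* remember the current setting */
set tracedepth 1                 /* trace median3 but not the programs it calls */
set trace on
median3 science if female==0
set trace off
set tracedepth `olddepth'        /* put the setting back */
```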

## Part 4: Creating an ado-file that uses the Stata matrix operations for performing an analysis

Many statistical procedures are done more easily using matrix commands. Stata
has two different ways that you can use matrix commands in your programs.
First, Stata has a fairly complete set of matrix commands built right into
Stata itself. Second, Stata has a complete matrix programming language called
**Mata**. **Mata** is faster and more powerful than Stata’s built-in
matrix commands, but the built-in commands are easier to program for small to
medium programming projects. We will illustrate Stata’s built-in matrix commands
by writing an OLS regression program.

We will begin with **matreg1.ado**, which makes use of the famous matrix
equation for the regression coefficients, $b = (X'X)^{-1}X'y$.

    /* begin matreg1.ado */
    program define matreg1
      version 9.0
      syntax varlist(min=2 numeric) [if] [in] [, Level(integer $S_level)]
      marksample touse                        /* mark cases in the sample */
      tokenize "`varlist'"

      quietly matrix accum sscp = `varlist' if `touse'
      matrix XX = sscp[2...,2...]             /* X'X */
      matrix Xy = sscp[1,2...]                /* X'y */
      matrix b = Xy * syminv(XX)              /* (X'X)-1X'y */
      local k = colsof(b)                     /* number of coefs */
      local nobs = r(N)
      local df = `nobs' - (rowsof(sscp) - 1)  /* df residual */
      matrix hat = Xy * b'
      matrix V = syminv(XX) * (sscp[1,1] - hat[1,1])/`df'
      matrix C = corr(V)
      matrix seb = vecdiag(V)
      matrix seb = seb[1, 1...]
      matrix t = J(1,`k',0)
      matrix p = t
      local i = 1
      while `i' <= `k' {
        matrix seb[1,`i'] = sqrt(seb[1,`i'])
        matrix t[1,`i'] = b[1,`i']/seb[1,`i']
        matrix p[1,`i'] = tprob(`df',t[1,`i'])
        local i = `i' + 1
      }

      display
      display "Dependent variable: `1'"
      display
      display "Regression coefficients"
      matrix list b
      display
      display "Standard error of coefficients"
      matrix list seb
      display
      display "Values of t"
      matrix list t
      display
      display "P values for t"
      matrix list p
      display
      display "Covariance of the regression coefficients"
      matrix list V
      display
      display "Correlation of the regression coefficients"
      matrix list C

      matrix drop sscp XX Xy b hat V seb t p C
    end
    /* end matreg1.ado */
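For readers who want to connect the matrix commands to the usual textbook formulas, the quantities computed in **matreg1** can be written as follows. The symbols $n$ (number of observations), $k$ (number of coefficients, including the constant) and $s^2$ (residual variance) are notation introduced here, not names used in the program.

$$
\begin{aligned}
b &= (X'X)^{-1}X'y \\
s^2 &= \frac{y'y - b'X'y}{n - k} \\
V &= s^2\,(X'X)^{-1} \\
se(b_j) &= \sqrt{V_{jj}}, \qquad t_j = \frac{b_j}{se(b_j)}
\end{aligned}
$$

In the program, $y'y$ is `sscp[1,1]`, $b'X'y$ is `hat[1,1]`, and $n - k$ is the local macro `df`.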

Here is an example of how to use **matreg1**.

    use https://stats.idre.ucla.edu/stat/data/hsb2, clear
    regress write read female

          Source |       SS       df       MS              Number of obs =     200
    -------------+------------------------------           F(  2,   197) =   77.21
           Model |  7856.32118     2  3928.16059           Prob > F      =  0.0000
        Residual |  10022.5538   197  50.8759077           R-squared     =  0.4394
    -------------+------------------------------           Adj R-squared =  0.4337
           Total |   17878.875   199   89.843593           Root MSE      =  7.1327

    ------------------------------------------------------------------------------
           write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            read |   .5658869   .0493849    11.46   0.000      .468496    .6632778
          female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
           _cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
    ------------------------------------------------------------------------------

    matreg1 write read female

    Dependent variable: write

    Regression coefficients

    b[1,3]
                read     female      _cons
    write  .56588693   5.486894  20.228368

    Standard error of coefficients

    seb[1,3]
             read     female      _cons
    r1  .04938488  1.0142614  2.7137564

    Values of t

    t[1,3]
               c1         c2         c3
    r1  11.458708  5.4097434  7.4540105

    P values for t

    p[1,3]
               c1         c2         c3
    r1  1.265e-23  1.818e-07  2.800e-12

    Covariance of the regression coefficients

    symmetric V[3,3]
                  read      female       _cons
      read   .00243887
    female   .00265893   1.0287262
     _cons  -.12883112  -.69953157   7.3644737

    Correlation of the regression coefficients

    symmetric C[3,3]
                  read      female       _cons
      read           1
    female   .05308387           1
     _cons  -.96129327  -.25414792           1

In **matreg1** we computed t-tests and p-values manually. We also managed
all of the output display. We can let Stata do all of that for us by using
the **ereturn post** and **ereturn display** commands, which will greatly simplify
the program. We will do this in **matreg2** as
shown below.

    /* begin matreg2.ado */
    program define matreg2, eclass
      version 9.0
      syntax varlist(min=2 numeric) [if] [in] [, Level(integer $S_level)]
      marksample touse                        /* mark cases in the sample */
      tokenize "`varlist'"

      quietly matrix accum sscp = `varlist' if `touse'
      matrix XX = sscp[2...,2...]             /* X'X */
      matrix Xy = sscp[1,2...]                /* X'y */
      matrix b = Xy * syminv(XX)              /* (X'X)-1X'y */
      local k = colsof(b)                     /* number of coefs */
      local nobs = r(N)
      local df = `nobs' - (rowsof(sscp) - 1)  /* df residual */
      matrix hat = Xy * b'
      matrix V = syminv(XX) * (sscp[1,1] - hat[1,1])/`df'

      ereturn post b V, dof(`df') obs(`nobs') depname(`1') ///
              esample(`touse')
      ereturn local depvar "`1'"
      ereturn local cmd "matreg"
      display
      ereturn display, level(`level')

      matrix drop sscp XX Xy hat
    end
    /* end matreg2.ado */

Now, check out **matreg2**.

    matreg2 write read female

    ------------------------------------------------------------------------------
           write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            read |   .5658869   .0493849    11.46   0.000      .468496    .6632778
          female |   5.486894   1.014261     5.41   0.000      3.48669    7.487098
           _cons |   20.22837   2.713756     7.45   0.000     14.87663    25.58011
    ------------------------------------------------------------------------------

The **eclass** option does for estimation commands what the **rclass** option
does for regular statistical commands. To view the **eclass** values we need
to use the **ereturn list** command.

    ereturn list

    scalars:
                      e(N) =  200
                   e(df_m) =  2
                   e(df_r) =  197
                      e(F) =  77.21062421518363
                     e(r2) =  .4394192130387506
                   e(rmse) =  7.132734938503835
                    e(mss) =  7856.321182518186
                    e(rss) =  10022.5538174818
                   e(r2_a) =  .4337280375366059
                     e(ll) =  -675.2152914029984
                   e(ll_0) =  -733.0934827146212

    macros:
                e(cmdline) : "regress write read female, nohead"
                  e(title) : "Linear regression"
                    e(vce) : "ols"
                 e(depvar) : "write"
                    e(cmd) : "regress"
             e(properties) : "b V"
                e(predict) : "regres_p"
                  e(model) : "ols"
              e(estat_cmd) : "regress_estat"

    matrices:
                      e(b) :  1 x 3
                      e(V) :  3 x 3

    functions:
                 e(sample)

    matrix list e(b)

    e(b)[1,3]
             read     female      _cons
    y1  .56588693   5.486894  20.228368

    matrix list e(V)

    symmetric e(V)[3,3]
                  read      female       _cons
      read   .00243887
    female   .00265893   1.0287262
     _cons  -.12883112  -.69953157   7.3644737
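Posting e(b) and e(V) also makes standard postestimation tools available. The sketch below assumes **matreg2.ado** is installed where Stata can find it; the matrix name **bcopy** is chosen here for illustration, and no output is shown.

```stata
use https://stats.idre.ucla.edu/stat/data/hsb2, clear
matreg2 write read female

display _b[read]        /* coefficient on read, taken from e(b)       */
display _se[read]       /* its standard error, taken from e(V)        */
test read female        /* joint Wald test that both slopes are zero  */

matrix bcopy = e(b)     /* copy the coefficient vector for later use  */
matrix list bcopy
```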

## Conclusion

Spending some time learning to program in Stata can increase your productivity and efficiency in your data analysis projects.