*****SECTION 1 ENTERING DATA*****

*A) Preparing the workspace
cd w:\ /* note: directory and path may differ on your computer */
dir
clear

*B) Use import delimited to read in delimited data from other sources
*B1. Comma-separated file with variable names
import delimited using hs0.csv, clear
describe

*B2. Comma-separated file without variable names
import delimited gender id race ses schtyp prgtype read write math science socst using hs0_noname.csv, clear
describe

*B3. Delimited files in general
import delimited gender id race ses schtyp prgtype read write math science socst using hs0.raw, delimiter("\t", collapse) clear

*C) Use infix to read in fixed format files
clear
infix id 1-2 a1 3-4 t1 5-6 gender 7 a2 8-9 t2 10-11 tgender 12 using schdat.fix

*D) Use import excel to read in Excel files
import excel using hsbdemo.xlsx, sheet("hsbdemo") firstrow clear

*E) Use input to enter data from the keyboard or a do-file
clear
input id female race ses str3 schtype prog read write math science socst
147 1 1 3 pub 1 47 62 53 53 61
108 0 1 2 pub 2 34 33 41 36 36
18 0 3 2 pub 3 50 33 49 44 36
153 0 1 2 pub 3 39 31 40 39 51
50 0 2 2 pub 2 50 59 42 53 61
51 1 2 1 pub 2 42 36 42 31 39
102 0 1 1 pub 1 52 41 51 53 56
57 1 1 2 pub 1 71 65 72 66 56
160 1 1 2 pub 1 55 65 55 50 61
136 0 1 2 pub 1 65 59 70 63 51
end
describe

*F) The save command stores data as Stata data (.dta) files, and the use command loads Stata data files
save hsb10
clear
use hsb10
use "W:\data\hsb10", clear

*G) The use command can load files over the internet
use https://stats.idre.ucla.edu/stat/data/hs0, clear


*****SECTION 2 EXPLORING DATA*****
use https://stats.idre.ucla.edu/stat/data/hs0, clear

*A) Keep a record of your work with log
log using unit2.txt, text replace

*B) Use list and browse to display your data
list
list gender-read in 1/20
browse

*C) Use describe and codebook to characterize your variables, and labelbook to characterize your value labels
describe
codebook
lookfor s
labelbook

*D) Calculate descriptive statistics of continuous variables with summarize
summarize
summarize read math science write /* summarize just these variables */
display 9.48^2 /* variance is the sd (9.48) squared */
summarize write, detail /* more stats */
sum write if read >= 60 /* sum is an abbreviation of summarize */
sum write if prgtype == "academic"
sum write in 1/40

*E) Graphs for exploring continuous variables
histogram write, normal
histogram write, normal start(30) width(5) /* wider bins for a smoother plot */
kdensity write, normal
kdensity write, normal width(5) /* a smoother kdensity plot */
graph box write

*F) Exploring continuous variables by group
tabstat read write math, by(prgtype) stat(n mean sd)
tabstat write, by(prgtype) stat(n mean sd p25 p50 p75)
histogram math, normal by(prgtype) /* densities by prgtype */
graph box write, over(prgtype) /* box plots by prgtype */

*G) Create frequency tables of categorical variables using tabulate
tabulate ses
tab1 gender schtyp prgtype
tab prgtype ses
tab prgtype ses, row col

*H) Exploring relationships between continuous variables
correlate write read science
pwcorr write read science, obs
twoway (scatter write read)
twoway (scatter write read, jitter(2))
graph matrix read science write, half

*I) Closing and examining the log file
log close


*****SECTION 3 MODIFYING DATA*****
use https://stats.idre.ucla.edu/stat/data/hs0, clear
codebook

*A) Use order to control the ordering of variables as columns in the dataset
order id gender race ses prgtype, first

*B) Use label data to describe the dataset and label variable to give variable names more meaning
label data "High School and Beyond"
notes id: anonymous id
notes
label variable schtyp "type of school"

*C) Use label define to create a set of value labels, and then use label values to apply the value labels to a variable
codebook schtyp /* check for value labels first */
label define scl 1 public 2 private
label values schtyp scl
codebook schtyp
list schtyp in 1/10
list schtyp in 1/10, nolabel

*D) The encode command will convert a string variable to numeric and will label its values automatically
encode prgtype, gen(prog)
label variable prog "type of program"
codebook prog
list prog in 1/10
list prog in 1/10, nolabel

*E) Renaming and recoding variables
rename gender female
recode female (1=0)(2=1)
label define fm 1 female 0 male
label values female fm
codebook female
list female in 1/10
list female in 1/10, nolabel

*F) Creating variables from other variables, generate and egen
generate total = read + write + math + science
summarize total
recode total (0/140=0 F) (141/180=1 D) (181/210=2 C) (211/234=3 B) (235/300=4 A), gen(grade)
label variable grade "combined grades of read, write, math, science"
codebook grade
list read write math science total grade in 1/10
list read write math science total grade in 1/10, nolabel
egen zread = std(read)
summarize zread
list read zread in 1/10
egen readmean = mean(read), by(ses)
list read ses readmean in 1/10
egen row_mean = rowmean(read write math science)
list read write math science row_mean in 1/10
save hs1


*****SECTION 4 MANAGING DATA*****
use hs1, clear

*A) Use keep and drop with an if statement to subset observations
keep if female == 0
count
save hsmale, replace
use hs1, clear
keep if female == 1
count
save hsfemale, replace

*B) Use keep and drop with variable names to remove variables from the dataset
use hs1, clear
keep id female read write
save hskept, replace
describe
list in 1/20
use hs1, clear
drop female read write
save hsdropped, replace
describe
list in 1/10

*C) Adding observations with append
use hsmale
tabulate female
append using hsfemale
tabulate female
save hsmasters, replace

*D) Adding variables with merge
use hskept, clear
list
merge 1:1 id using hsdropped
tab _merge
list
save hsmerged


*****SECTION 5 ANALYZING DATA*****
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear

*A) Analysis of normally-distributed outcomes
*A1. t-tests
ttest write = 50
ttest write = read
ttest write, by(female)
ttest write, by(female) unequal

*A2. Analysis of Variance
anova write i.prog

*A3. Linear Regression
regress write c.read i.prog
regress write c.read##i.prog

*B) Postestimation - analysis after running the model
*B1. Custom hypothesis testing and contrasts with test and contrast
test 2.prog = 3.prog
regress write i.female##i.prog
contrast female@prog
contrast prog@female

*B2. Marginal means and effects with margins and marginsplot
regress write i.female##i.prog
margins female#prog
marginsplot

*B3. Residual analysis with predict and graphing
predict pred
predict res, resid
list write pred res in 1/20
kdensity res, normal
qnorm res

*B4. Influence analysis
predict cook, cooksd
twoway spike cook id

*C) Analysis of categorical outcomes
*C1. Chi-square test of independence with tab
tabulate prog ses, all
tabulate prog ses, all expected

*C2. Logistic regression with logit
tab honors
logit honors c.read i.female
logit, or
predict prob, pr
margins female

*D) Non-parametric Tests
signtest write = 50
signrank write = read
ranksum write, by(female)
kwallis write, by(prog)
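
*E) Supplementary sketch (an addition, not part of the original seminar code)
* This relates the odds ratios displayed by "logit, or" in C2 to the model
* coefficients: an odds ratio is the exponentiated logit coefficient.
* Assumes hsbdemo is still in memory with the variables honors, read, and
* female used in C2; the model is refit quietly so the block stands alone.
quietly logit honors c.read i.female
display "odds ratio for read:   " exp(_b[read])     /* should match the read row of logit, or */
display "odds ratio for female: " exp(_b[1.female]) /* should match the 1.female row of logit, or */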