***** NOTE: THERE ARE MORE COMMANDS HERE THAN DISPLAYED IN THE SEMINAR

*************** STATA *******************

* load a Stata dataset over the internet
webuse auto, clear

* change directory (not run)
* cd "C:/path/to/directory"

* histogram command
histogram weight

* comments are not executed

/* this kind of comment
   can span multiple lines */

* use three forward slashes to continue a command over multiple lines
summarize weight ///
    length

***** IMPORTING DATA

* loading Stata data files
* read from hard drive; uncomment and change path below before executing
* use "C:/path/to/myfile.dta"

* load data over internet
* notice .dta extension is omitted
use https://stats.idre.ucla.edu/stat/data/hsbdemo

* save data, replace if it exists
save hsbdemo, replace

* load data but clear memory first
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear

* import excel file; uncomment and change path below before executing
* import excel using "C:\path\myfile.xlsx", sheet("mysheet") firstrow clear

* import csv file; uncomment and change path below before executing
* import delimited using "C:\path\myfile.csv", clear

***** HELP FILES

* help files
help describe

* describe can be abbreviated to d
d

*************** GETTING TO KNOW YOUR DATA *******************

***** VIEWING DATA

* seminar dataset
use https://stats.idre.ucla.edu/stat/data/hs0, clear

* browse dataset
browse

* list all observations and all variables
list

* list read and write for first 5 observations
li read write in 1/5

***** SELECTING OBSERVATIONS

* list science for last 3 observations
li science in -3/L

* list gender, ses, and math if math > 70
* with clean output
li gender ses math if math > 70, clean

* browse gender, ses, and read
* for females (gender=2) who have read > 70
browse gender ses read if gender == 2 & read > 70

***** EXPLORING DATA

* get variable properties
describe

* inspect values of variables read gender and prgtype
codebook read gender prgtype

* summarize continuous variables
summarize read math

* detailed summary of read for females
summarize read if gender == 2, detail

* tabulate frequencies of ses
tabulate ses

* remove labels
tab ses, nolabel

* two-way tab of race and ses
tab race ses

* with row percentages
tab race ses, row

** DATA VISUALIZATION

* histogram of write
histogram write

* histogram of write with normal density
* and intervals of length 5
hist write, normal width(5)

* boxplot of all test scores
graph box read write math science socst

* scatter plot of write vs read
scatter write read

* layered graph of scatter plot and lowess curve
twoway (scatter write read) (lowess write read)

* layered scatter plots of write and read
* colored by gender
twoway (scatter write read if gender == 1, mcolor(blue)) ///
       (scatter write read if gender == 2, mcolor(red))

* bar graphs
* bar graph of count of ses
graph bar (count), over(ses) asyvars

* frequencies of gender by ses
* asyvars colors bars by ses
graph bar (count), over(ses) over(gender) asyvars

*************** DATA MANAGEMENT *******************

***** CREATING AND TRANSFORMING VARIABLES

* generating variables
* generate a sum of 3 variables
generate total = math + science + socst

* it seems 5 missing values were generated
* let's look at variables
summarize total math science socst

* replace total with just (math+socst)
* if science is missing
replace total = math + socst if science == .
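
* (added sketch, not part of the seminar code) the missing totals above came
* from missing values in an input variable; one way to confirm this is to
* count the missing values directly. missing() is a built-in function and
* misstable summarize reports missing counts for the listed variables.
count if missing(science)
misstable summarize math science socst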
* no missing totals now
summarize total

* create a variable that equals 1 if prgtype
* equals academic, 0 otherwise
gen academic = 0
replace academic = 1 if prgtype == "academic"
tab prgtype academic

* egen to generate variables with functions
* rowmean will take the mean of all non-missing values
egen meantest = rowmean(read math science socst)
summarize meantest read math science socst

* standardize read
egen zread = std(read)
summarize zread

* renaming variables
rename gender female

* recode values to 0,1
recode female (1=0)(2=1)
tab female

* labeling variables (description)
label variable math "9th grade math score"
label variable schtyp "public/private school"

* the variable label will be used in some output
histogram math
tab schtyp

* schtyp before applying value labels
tab schtyp

* create and apply labels for schtyp
label define pubpri 1 public 2 private
label values schtyp pubpri
tab schtyp

* list all value label sets
label list

* describe shows which value labels
* have been applied to which variables
describe

* encoding string prgtype into
* numeric variable prog
encode prgtype, gen(prog)

* we see that a value label has been applied to prog
describe prog

* we see labels by default in prog
tab prog

* use option nolabel to remove the labels
tab prog, nolabel

***** DATASET OPERATIONS

* variable list shortcuts
* summarize all consecutive variables
* from read to socst
summ read-socst

* summarize all variables that begin with r
summ r*

* summarize all variables that begin with r
* and end with e
summ r*e

* put id and demographic variables first
order id female race ses schtyp prog

* put old prgtype variable last
order prgtype, last

* describe with simple option just lists variable names
describe, simple

* save dataset, overwrite existing file
save hs1, replace

* drop prgtype from dataset
drop prgtype
describe, simple

* keep just id read and math
keep id read math
describe, simple

* keep observation if reading > 40
keep if read > 40
summ read

* now drop if math outside range [30,70]
drop if math < 30 | math > 70
summ math

* sorting
* first look at unsorted
li in 1/5

* now sort by read and then math
sort read math
li in 1/5

* sort descending read then ascending math
gsort -read +math
li in 1/5

**** EXERCISES

* let's use what we have learned to make a few
* small datasets

** males with math and reading scores above 70
* first load the hs1 dataset
* restrict to males with math and reading > 70
* keep only id female math and read
* print to screen
* save dataset

** females with math and social studies scores above 70
* this time keep id female, math, socst

** id, female, race, and ses for students with math > 70
** and either read > 70 or socst > 70
* keep variables id female race ses read math socst

**** SOLUTIONS TO EXERCISES

* first let's make a dataset of males with math and reading scores above 70
* for this dataset we only want the variables id, female, math, and read

* first load the hs1 dataset
use hs1, clear

* restrict to males with math and reading > 70
keep if female == 0 & math > 70 & read > 70

* keep only id female math and read
keep id female math read

* print to screen
li

* save dataset
save males70, replace

* now for females with math and socst above 70
use hs1, clear
keep if female == 1 & math > 70 & socst > 70

* this time keep id female, math, socst
keep id female math socst
li
save females70, replace

* id female race ses for students with
* math > 70 and either read > 70 or socst > 70
use hs1, clear
keep if math > 70 & (read > 70 | socst > 70)

* keep id female race ses read math socst
keep id-ses read math socst
li
save raceses70, replace

**** APPENDING AND MERGING

* append males70 and females70
* first load males70
use males70, clear

* append females70 and look at results
append using females70
li

* merge in race and ses of all students
* with math > 70, and read or socst > 70
merge 1:1 id using raceses70

* look at _merge variable
tab _merge

* drop unmatched observations
drop if _merge != 3
li

* reload dataset
use hs1, clear

**** EXAMPLES OF BY-GROUP PROCESSING

* summarizing a
* variable by gender
bysort female: summarize write

* 2-way frequencies by ses
bysort ses: tab prog female

* mean of math by program
bysort prog: egen meanmath = mean(math)
tab prog meanmath

* rank based on math by program
bysort prog: egen mathrank = rank(math)

* get lowest math score in each program
li prog math if mathrank == 1

* histograms of write by gender
* notice that by is now an option
hist write, by(female)

*************** BASIC STATISTICAL ANALYSIS *******************

** ANALYSIS OF CONTINUOUS OUTCOMES

* many commands provide 95% CI
mean read

* 99% CI for read
ci means read, level(99)

* testing if means are different
* across groups
* independent samples t-test
ttest read, by(female)

* paired samples t-test
ttest read == write

* 2-way ANOVA of write by female and prog
anova write female prog

* correlation of write and math
correlate write math

* correlation matrix of 5 variables
corr read write math science socst

* linear regression of write on continuous
* predictor math and categorical predictor prog
regress write c.math i.prog

* postestimation examples
* add variable of predicted values of write
* for each observation
predict predwrite

* look at first 5 predicted values
li predwrite write math female in 1/5

* test of whether 2 prog coefficients are
* jointly significant
test 2.prog 3.prog

* test whether 2 prog coefs are different
test 2.prog-3.prog = 0

** ANALYSIS OF CATEGORICAL OUTCOMES

* chi square test of independence
tab prog ses, chi2

* logistic regression of being in academic program
* on female and math score
* coefficients as odds ratios
logit academic i.female c.math, or

****************** IN CLASS EXERCISE *****************************
**************** SCROLL DOWN FOR SOLUTION ************************

* 1. load the nhanes2 data, clear memory first

* 2. Imagine we are only interested in people in the West
*    The variable "region" codes where people live.
*    How many observations are from region=West?

* 3.
*    Now keep only observations from the West and drop all others
*    You'll need to know the numeric code for West first
*    How can you see that?

* 4. Now use keep or drop to filter out everything but observations from the West

* 5. We want the variables which code for the following:
*    region, sex, race, age, height, weight, systolic blood pressure, diastolic blood pressure, diabetes
*    The variable names are abbreviated, so how can you quickly determine which variables we want?
*    (Hint: How can we see the variable labels of all variables?)

* 6. Now drop all variables but
*    region, sex, race, age, height, weight, systolic blood pressure, diastolic blood pressure, diabetes

* 7. What are the counts of race by sex in this Western sample?

* 8. What percentage of males and females are represented by each race?

* 9. Create a blood pressure category variable called "bpcat"
*    Use the following numeric codes:
*      1 = systolic bp < 120 AND diastolic bp < 80
*      2 = (120 <= systolic bp < 130) AND diastolic bp < 80
*      3 = systolic bp >= 130 OR diastolic bp >= 80
*    Start by creating a bpcat variable and setting it to missing
*    then use replace to code for the category codes above

* 10. label the variable bpcat with the description "blood pressure category"

* 11. Create a set of value labels called bp for blood pressure category with the following
*     codes:
*       1 = "normal"
*       2 = "elevated"
*       3 = "high"

* 12. Apply the bp value label to the bpcat variable

* 13. Create a bar plot of the counts of each blood pressure category

* 14. Create a bar plot of the counts of each bp category for each
*     race, with bars colored by blood pressure category

* 15. Create a scatter plot of height and weight, with weight on the y-axis

* 16. Create a scatter plot of height and weight, but split the plot by
*     blood pressure category

* 17. get the mean and standard deviation of age and weight of all observations

* 18.
*     now get the mean and standard deviation of weight by blood pressure category

* 19. create a variable called "highbp" that equals 1 if blood pressure category is high, 0 otherwise

* 20. Visually inspect whether weight has a normal distribution

* 21. Run a t-test to see if mean weight is different between those with and without
*     high blood pressure

****************** SOLUTION TO IN-CLASS EXERCISE *****************************

* 1. load the nhanes2 data, clear memory first
webuse nhanes2, clear

* 2. Imagine we are only interested in people in the West
*    The variable "region" codes where people live.
*    How many observations are from region=West?
tab region

* 3. Now keep only observations from the West and drop all others
*    You'll need to know the numeric code for West first
*    How can you see that?
tab region, nolabel

* 4. Now use keep or drop to filter out everything but observations from the West
*    (either command alone is sufficient; after the keep, the drop removes nothing)
keep if region == 4
drop if region != 4

* 5. We want the variables which code for the following:
*    region, sex, race, age, height, weight, systolic blood pressure, diastolic blood pressure, diabetes
*    The variable names are abbreviated, so how can you quickly determine which variables we want?
*    (Hint: How can we see the variable labels of all variables?)
describe
codebook

* 6. Now drop all variables but
*    region, sex, race, age, height, weight, systolic blood pressure, diastolic blood pressure, diabetes
keep region sex race age height weight bpsystol bpdiast diabetes

* 7. What are the counts of race by sex in this Western sample?
tab race sex

* 8. What percentage of males and females are represented by each race?
tab race sex, col

* 9. Create a blood pressure category variable called "bpcat"
*    Use the following numeric codes:
*      1 = systolic bp < 120 AND diastolic bp < 80
*      2 = (120 <= systolic bp < 130) AND diastolic bp < 80
*      3 = systolic bp >= 130 OR diastolic bp >= 80
*    Start by creating a bpcat variable and setting it to missing
*    then use replace to code for the category codes above
gen bpcat = .
replace bpcat = 1 if bpsystol < 120 & bpdiast < 80
replace bpcat = 2 if bpsystol >= 120 & bpsystol < 130 & bpdiast < 80
replace bpcat = 3 if bpsystol >= 130 | bpdiast >= 80

* 10. label the variable bpcat with the description "blood pressure category"
label var bpcat "blood pressure category"

* 11. Create a set of value labels called bp for blood pressure category with the following
*     codes:
*       1 = "normal"
*       2 = "elevated"
*       3 = "high"
label define bp 1 "normal" 2 "elevated" 3 "high"

* 12. Apply the bp value label to the bpcat variable
label values bpcat bp

* 13. Create a bar plot of the counts of each blood pressure category
graph bar (count), over(bpcat)

* 14. Create a bar plot of the counts of each bp category for each
*     race, with bars colored by blood pressure category
graph bar (count), over(bpcat) over(race) asyvars

* 15. Create a scatter plot of height and weight, with weight on the y-axis
scatter weight height

* 16. Create a scatter plot of height and weight, but split the plot by
*     blood pressure category
scatter weight height, by(bpcat)

* 17. get the mean and standard deviation of age and weight of all observations
summ age weight

* 18. now get the mean and standard deviation of weight by blood pressure category
bysort bpcat: summ age weight

* 19. create a variable called "highbp" that equals 1 if blood pressure category is high, 0 otherwise
gen highbp = 0
replace highbp = 1 if bpcat == 3

* 20. Visually inspect whether weight has a normal distribution
histogram weight

* 21. Run a t-test to see if mean weight is different between those with and without
*     high blood pressure
ttest weight, by(highbp)
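
* (added sketch, not part of the seminar solution) the two-sample t-test
* above assumes equal variances in the two highbp groups by default.
* two possible follow-up checks, offered as suggestions only:
* sdtest compares the group variances, and the unequal option of ttest
* requests an unequal-variances test (Satterthwaite's degrees of freedom)
sdtest weight, by(highbp)
ttest weight, by(highbp) unequal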