*****NOTE: THERE ARE MORE COMMANDS HERE THAN DISPLYED IN THE SEMINAR *************** STATA ******************* * load a Stata dataset over the internet webuse auto, clear * change directory (not run) * cd "C:/path/to/directory" * histogram command histogram weight * comments are not executed /* this kind of comment can span multiple lines */ * use /// to continue a command over multiple lines summarize weight /// length ***** IMPORTING DATA * loading Stata data files * read from hard drive; uncomment and change path below before executing * use "C:/path/to/myfile.dta" * load data over internet * notice .dta extension can be omitted use https://stats.idre.ucla.edu/stat/data/hs0 * save data, replace if it exists save hs0, replace * clear data from memory clear * load data but clear memory first use https://stats.idre.ucla.edu/stat/data/hs0, clear * import excel file; uncomment and change path below before executing * import excel using "C:\path\myfile.xlsx", sheet("mysheet") firstrow clear * import csv file; uncomment and change path below before executing * import delimited using "C:\path\myfile.csv", clear ***** HELP FILES *open help file for command summarize help summarize * summary statistics for all variables summarize * summary statistics for just variables read and write (using abbreviated command) summ read write * provide additional statistics for variable read summ read, detail *************** GETTING TO KNOW YOUR DATA ******************* ***** VIEWING DATA * seminar dataset use https://stats.idre.ucla.edu/stat/data/hs0, clear * browse dataset browse * list all observations and all variables list * list read and write for first 5 observations li read write in 1/5 ***** SELECTING OBSERVATIONS * list science for last 3 observations li science in -3/L * list gender, ses, and math if math > 70 * with clean output li gender ses math if math > 70, clean * browse gender, ses, and read * for females (gender=2) who have read > 70 browse gender ses read if gender == 2 & read > 70 *** EXERCISE 1 *** * 1. Use the browse command to examine the ses values for students with write * score greater than 65. * Then, use the help file for the browse command rewrite the command to * examine the ses values without labels. ***** EXPLORING DATA * inspect values of variables read gender and prgtype codebook read gender prgtype * summarize continuous variables summarize read math * summarize read and math for females summarize read math if gender == 2 * detailed summary of read for females summarize read if gender == 2, detail * tabulate frequencies of ses tabulate ses * remove labels tab ses, nolabel * two-way tab of race and ses tab race ses * with row percentages tab race ses, row *** EXERCISE 2 *** * 2. Use the tab command to determine the numeric code for “Asians” in the race variable * Then use summarize to estimate the mean of the variable science for Asians ***** DATA VISUALIZATION * histogram of write histogram write * histogram of write with normal density * and intervals of length 5 hist write, normal width(5) * boxplot of all test scores graph box read write math science socst * scatter plot of write vs read scatter write read * bar graphs * bar graph of count of ses graph bar (count), over(ses) * frequencies of gender by ses * asyvars colors bars by ses graph bar (count), over(ses) over(gender) asyvars * layered graph of scatter plot and lowess curve twoway (scatter write read) (lowess write read) * layered scatter plots of write and read * colored by gender twoway (scatter write read if gender == 1, mcolor(blue)) /// (scatter write read if gender == 2, mcolor(red)) *** EXERCISE 3 *** * 3. Use the scatter command to create a scatter plot of math on the x-axis vs write on the y-axis * Use the help file for scatter to change the shape of the markers to triangles *************** DATA MANAGEMENT ******************* ***** CREATING AND TRANSFORMING VARIABLES * generating variables * generate a sum of 3 variables generate total = math + science + socst * it seems 5 missing values were generated * let's look at variables summarize total math science socst * list variables when science is missing li math science socst if science == . * same as above, using missing() function li math science socst if missing(science) * replace total with just (math+socst) * if science is missing replace total = math + socst if science == . * no missing totals now summarize total * egen with function rowmean generates variable that * is mean of all non-missing values of those variables egen meantest = rowmean(read math science socst) summarize meantest read math science socst * standardize read egen zread = std(read) summarize zread * renaming variables rename gender female * recode values to 0,1 recode female (1=0)(2=1) tab female * labeling variables (description) label variable math "9th grade math score" label variable schtyp "public/private school" * the variable label will be used in some output histogram math tab schtyp * schtyp before labeling tab schtyp * create and apply labels for schtyp label define pubpri 1 public 2 private label values schtyp pubpri tab schtyp * encoding string prgtype into * numeric variable prog encode prgtype, gen(prog) * we see that prog is numeric with labels (blue) * while the old variable prog is string (red) browse prog prgtype * we see labels by default in prog tab prog * use option nolabel to remove the labels tab prog, nolabel *** EXERCISE 4 *** * 4. Use the generate and replace commands to create a variable called “highmath” * that takes on the value 1 if math is greater than 60, and 0 otherwise. * Then use the label define command to create a set of value labels called “mathlabel”, * which labels the value 1 “high” and the value 0 “low” * Finally, use the label values command to apply the “mathlabel” labels to the * newly generated variable “highmath. * Use the tab command on highmath to check your results. ***** DATASET OPERATIONS * save dataset, overwrite existing file save hs1, replace * drop prgtype from dataset drop prgtype * keep just id read and math keep id read math * keep observation if reading > 30 keep if read > 40 summ read * now drop if write outside range [30,70] drop if math < 30 | math > 70 summ math * sorting * first look at unsorted li in 1/5 * now sort by read and then math sort read math li in 1/5 * sort descending read then ascending math gsort -read +math li in 1/5 *** EXERCISE 5 *** * 5. Reload the hs0 data set fresh using the following command: use https://stats.idre.ucla.edu/stat/data/hs0, clear * Subset the dataset to observations with write score greater than or equal to 60. * Then remove all variables except for id and write. * Save this as a Stata dataset called “highwrite” * Reload the hs0 dataset, subset to observations with write score less than 60, * remove all variables except id and write, and save this dataset as “lowwrite” use https://stats.idre.ucla.edu/stat/data/hs0, clear * Reload the hs0 dataset. Drop the write variable. Save this dataset as “nowrite”. use https://stats.idre.ucla.edu/stat/data/hs0, clear ***** APPENDING AND MERGING * append highwrite and lowwrite datasets * first load highwrite use highwrite, clear * append lowwrite append using lowwrite * summarize write shows 200 observations and write scores above and below 70 summ write * merge in nowrite dataset using id to link merge 1:1 id using nowrite *************** BASIC STATISTICAL ANALYSIS ******************* * load new dataset use https://stats.idre.ucla.edu/stat/data/hs1, clear ** ANALYSIS OF CONTINUOUS OUTCOMES * many commands provide 95% CI mean read * testing if means are different * across groups * independent samples t-test ttest read, by(female) * correlation matrix of 5 variables corr read write math science socst * linear regression of write on continuous * predictor math and categorical predictor ses regress write c.math i.ses * look at what postestimation commands are available after regress help regress postestimation * postestimation examples * predicted dependent variable predict pred * get residuals predict res, residuals * first 5 predicted values and residuals with observed write li pred res write in 1/5 *** EXERCISE 6 *** * 6. Use the regress command to determine if the variables female (categorical) and * science (continuous) are predictive of the dependent variable math. * One of the assumptions of linear regression is that the errors * (estimated by residuals) are normally distributed. Use the predict command * and the histogram command to assess this assumption. ** ANALYSIS OF CATEGORICAL OUTCOMES * chi square test of independence tab prog ses, chi2 * uncomment and run these lines if highmath is not in your dataset * gen highmath = 0 * replace highmath = 1 if math > 60 * logistic regression of binary outcome highmath predicted by * by continuous(write) and female (categorical) logit highmath c.write i.female, or *** EXERCISE 7 *** * 7. Use the tab command to run a chi-square test of independence to test for association between ses * and race. * Fisher's exact test is often used in place of the chi-square test of * independence when the (expected) cell sizes are small. Use the help file * for "tabulate twoway" (which is just the tabulate command for 2 variables) * to run a Fisher's exact test to test the association * between ses and race. How does the p-value compare to the result of the * chi-square test? ****************** SOLUTION TO IN-CLASS EXERCISE ***************************** * 1. Use the browse command to examine the ses values and write scores * for students with write score greater than 65. * Then, use the help file for the browse command rewrite the command to * examine the ses values without labels. browse ses write if write > 65 help browse browse ses write if write > 65, nolabel * 2. Use the tab command to determine the numeric code for “Asians” in the race variable * Then use summarize to estimate the mean of the variable science for Asians tab race tab race, nolabel summ science if race == 2 * 3. Use the scatter command to create a scatter plot of math on the x-axis vs write on the y-axis * Use the help file for scatter to change the shape of the markers to triangles scatter write math scatter write math, msymbol(triangle) *** EXERCISE 4 *** * 4. Use the generate and replace commands to create a variable called “highmath” * that takes on the value 1 if math is greater than 60, and 0 otherwise. * Then use the label define commands create a set of value labels, which you can * call “mathlabel”, which labels the value 1 as “high” and the value 0 “low”. * Finally, use the label values command to apply the “mathlabel” labels to the * newly generated variable “highmath. * Use the tab command on highmath to check your results. gen highmath = 0 replace highmath = 1 if math > 60 replace highmath = . if math == . label define mathlabel 0 "low" 1 "high" label values highmath mathlabel tab highmath *** EXERCISE 5 *** * 5. Reload the hs0 data set fresh using the following command: use https://stats.idre.ucla.edu/stat/data/hs0, clear * Subset the dataset to observations with write score greater than or equal to 60. * Then remove all variables except for id and write. * Save this as a Stata dataset called “highwrite” keep if write >= 60 keep id write save highwrite, replace * Reload the hs0 dataset, subset to observations with write score less than 60, * remove all variables except id and write, and save this dataset as “lowwrite” use https://stats.idre.ucla.edu/stat/data/hs0, clear keep if write < 60 keep id write save lowwrite, replace * Reload the hs0 dataset. Drop the write variable. Save this dataset as “nowrite”. use https://stats.idre.ucla.edu/stat/data/hs0, clear drop write save nowrite, replace * 6. Use the regress command to determine if the variables female (categorical) and * science (continuous) are predictive of the dependent variable math. * One of the assumptions of linear regression is that the errors * (estimated by residuals) are normally distributed. Use the predict command * and the histogram command to assess this assumption. regress math i.female c.science predict mathres, residuals histogram mathres, normal *** EXERCISE 7 *** * 7. Use the tab command to run a chi-square test of independent to test for association between ses * and race. * Fisher's exact test is often used in place of the chi-square test of * independence when the (expected) cell sizes are small. Use the help file * for tabulate twoway to run a fisher's exact test to test the association * between ses and race. How does the p-value compare to the result of the * chi-square test? tab ses race, chi2 help tabulate twoway tab ses race, exact