***** NOTE: THERE ARE MORE COMMANDS HERE THAN DISPLAYED IN THE SEMINAR

*************** STATA *******************

* load a Stata dataset over the internet
webuse auto, clear

* change directory (not run)
* cd "C:/path/to/directory"

* histogram command
histogram weight

* comments are not executed

/* this kind of comment 
   can span
   multiple lines */
   
* use /// to continue a command over multiple lines
summarize weight ///
  length

  
***** IMPORTING DATA

* loading Stata data files
* read from hard drive; uncomment and change path below before executing
* use "C:/path/to/myfile.dta"

* load data over internet
* notice .dta extension is omitted
use https://stats.idre.ucla.edu/stat/data/hsbdemo

* save data, replace if it exists
save hsbdemo, replace
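
* a sketch (not part of the seminar): load only some variables with "using"
*  this assumes hsbdemo contains variables named id, read, and write
use id read write using https://stats.idre.ucla.edu/stat/data/hsbdemo, clear
describe, simple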

* load data but clear memory first
use https://stats.idre.ucla.edu/stat/data/hsbdemo, clear

* import excel file; uncomment and change path below before executing
* import excel using "C:/path/myfile.xlsx", sheet("mysheet") firstrow clear

* import csv file; uncomment and change path below before executing
* import delimited using "C:/path/myfile.csv", clear

***** HELP FILES

* open the help file for the describe command
help describe

* describe can be abbreviated to d
d


*************** GETTING TO KNOW YOUR DATA *******************

***** VIEWING DATA

* seminar dataset
use https://stats.idre.ucla.edu/stat/data/hs0, clear

* browse dataset
browse

* list all observations and all variables
list

* list read and write for first 5 observations
li read write in 1/5

***** SELECTING OBSERVATIONS

* list science for last 3 observations
li science in -3/L

* list gender, ses, and math if math > 70 
* with clean output
li gender ses math if math > 70, clean

* browse gender, ses, and read 
*  for females (gender=2) who have read > 70
browse gender ses read if gender == 2 & read > 70
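
* a sketch (not part of the seminar): count reports how many observations
*  satisfy a condition without listing them
count if gender == 2 & read > 70
* Stata treats missing (.) as larger than any number, so a condition
*  like science > 70 is also true when science is missing;
*  add !missing() to exclude those observations
count if science > 70
count if science > 70 & !missing(science)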

***** EXPLORING DATA
* get variable properties
describe

* inspect values of variables read gender and prgtype 
codebook read gender prgtype

* summarize continuous variables
summarize read math

* detailed summary of read for females
summarize read if gender == 2, detail
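
* a sketch (not part of the seminar): tabstat puts chosen statistics for
*  several variables in one table, optionally split by a grouping variable
tabstat read write math, by(ses) statistics(mean sd n)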

* tabulate frequencies of ses
tabulate ses

* remove labels
tab ses, nolabel

* two-way tab of race and ses
tab race ses 

* with row percentages
tab race ses, row


***** DATA VISUALIZATION

* histogram of write
histogram write

* histogram of write with normal density 
*  and intervals of length 5
hist write, normal width(5)

* boxplot of all test scores
graph box read write math science socst

* scatter plot of write vs read
scatter write read

* layered graph of scatter plot and lowess curve
twoway (scatter write read) (lowess write read)

* layered scatter plots of write and read
*   colored by gender
twoway (scatter write read if gender == 1, mcolor(blue)) ///
(scatter write read if gender == 2, mcolor(red))
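
* a sketch (not part of the seminar): the same layered plot with a
*  labeled legend and a title; the legend text assumes gender == 1
*  codes males, since females are gender == 2
twoway (scatter write read if gender == 1, mcolor(blue)) ///
       (scatter write read if gender == 2, mcolor(red)), ///
       legend(order(1 "male" 2 "female")) ///
       title("Writing score vs. reading score")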


* bar graphs
* bar graph of count of ses
graph bar (count), over(ses) asyvars

* frequencies of gender by ses
*   asyvars colors bars by ses
graph bar (count), over(ses) over(gender) asyvars 




*************** DATA MANAGEMENT *******************

***** CREATING AND TRANSFORMING VARIABLES

* generating variables
* generate a sum of 3 variables
generate total = math + science + socst

* it seems 5 missing values were generated
*  let's look at variables
summarize total math science socst

* replace total with just (math+socst)
*  if science is missing
replace total = math + socst if science == .

* no missing totals now
summarize total

* create a variable that equals 1 if prgtype 
*  equals academic, 0 otherwise
gen academic = 0
replace academic = 1 if prgtype == "academic"
tab prgtype academic
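
* a sketch of a one-line alternative: a logical expression evaluates
*  to 1 if true and 0 if false (academic2 is a throwaway name used here)
gen academic2 = (prgtype == "academic")
* check that both versions agree, then drop the extra variable
assert academic == academic2
drop academic2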

* egen to generate variables with functions
*  rowmean will take the mean of all non-missing values
egen meantest = rowmean(read math science socst)
summarize meantest read math science socst

* standardize read
egen zread = std(read)
summarize zread

* renaming variables
rename gender female
* recode values to 0,1
recode female (1=0)(2=1)
tab female

* labeling variables (description)
label variable math "9th grade math score"
label variable schtyp "public/private school"

* the variable label will be used in some output
histogram math
tab schtyp

* schtyp before value labels are applied (values display as numbers)
tab schtyp

* create and apply labels for schtyp
label define pubpri 1 public 2 private
label values schtyp pubpri
tab schtyp

* list all value label sets
label list

* describe shows which value labels
*  have been applied to which variables
describe

* encoding string prgtype into
*  numeric variable prog
encode prgtype, gen(prog)

* we see that a value label has been applied to prog
describe prog

* we see labels by default in prog
tab prog

* use option nolabel to remove the labels
tab prog, nolabel


***** DATASET OPERATIONS

* variable list shortcuts

* summarize all consecutive variables
*  from read to socst
summ read-socst

* summarize all variables that begin with r
summ r*

* summarize all variables that begin with r 
*   and end with e
summ r*e

* put id and demographic variables first
order id female race ses schtyp prog
* put old prgtype variable last
order prgtype, last
* describe with simple option just lists variable names
describe, simple

* save dataset, overwrite existing file
save hs1, replace

* drop prgtype from dataset
drop prgtype
describe, simple

* keep just id read and math
keep id read math
describe, simple

* keep observation if reading > 40
keep if read > 40
summ read

* now drop if math outside range [30,70]
drop if math < 30 | math > 70
summ math
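
* a sketch (not part of the seminar): assert stops with an error if a
*  condition is false for any observation, so it can confirm the filters
*  above did what we intended
assert read > 40
assert inrange(math, 30, 70)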

* sorting
* first look at unsorted
li in 1/5

* now sort by read and then math
sort read math
li in 1/5

* sort descending read then ascending math
gsort -read +math
li in 1/5

**** EXERCISES

* let's use what we have learned to make a few
*  small datasets 

** males with math and reading scores above 70
* first load the hs1 dataset

* restrict to males with math and reading > 70

* keep only id female math and read

* print to screen

* save dataset


** females with math and social studies scores above 70
* this time keep id female, math, socst


** race and ses for all students with math > 70
*  and either read > 70 or socst > 70
* keep variables id female race ses read math socst



**** SOLUTIONS TO EXERCISES
* first let's make a dataset of males with math and reading scores above 70
*  for this dataset we only want the variables id, female, math, and read
* first load the hs1 dataset
use hs1, clear
* restrict to males with math and reading > 70
keep if female == 0 & math > 70 & read > 70
* keep only id female math and read
keep id female math read
* print to screen
li
* save dataset
save males70, replace

* now for females with math and socst above 70
use hs1, clear
keep if female == 1 & math > 70 & socst > 70
* this time keep id female, math, socst
keep id female math socst
li
save females70, replace

* id female race ses for students with
*  math > 70 and either read > 70 or socst > 70
use hs1, clear
keep if math > 70 & (read > 70 | socst > 70)
* keep id female race ses read math socst
keep id-ses read math socst
li
save raceses70, replace




**** APPENDING AND MERGING
* append males70 and females70
* first load males70 
use males70, clear

* append females70 and look at results
append using females70
li

* merge in race and ses of all students
*  with math > 70, and read or socst > 70
merge 1:1 id using raceses70

* look at _merge variable
tab _merge

* drop unmatched observations
drop if _merge != 3
li
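
* a sketch of a one-step alternative (not part of the seminar):
*  keep(match) keeps only matched observations and
*  nogenerate skips creating the _merge variable
use males70, clear
append using females70
merge 1:1 id using raceses70, keep(match) nogenerate
li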


* reload dataset
use hs1, clear

**** BY-GROUP PROCESSING

*  summarizing a variable by gender
bysort female: summarize write

* 2-way frequencies by ses
bysort ses: tab prog female

* mean of math by program
bysort prog: egen meanmath = mean(math)
tab prog meanmath

* rank based on math by program
bysort prog: egen mathrank = rank(math)
* get lowest math score in each program
li prog math if mathrank == 1
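
* a note added as a sketch: by default rank() gives tied values their
*  average rank, so a lowest score shared by several students will not
*  have rank exactly 1; the track option gives ties the same lowest rank
*  (mathrank2 is a name chosen here for illustration)
bysort prog: egen mathrank2 = rank(math), track
li prog math if mathrank2 == 1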

* histograms of write by gender
* notice that by is now an option
hist write, by(female)


*************** BASIC STATISTICAL ANALYSIS *******************

** ANALYSIS OF CONTINUOUS OUTCOMES
* mean reports a 95% CI by default, as do many estimation commands
mean read

* 99% CI for read
ci means read, level(99)

* testing if means are different 
*  across groups
* independent samples t-test
ttest read, by(female)

* paired samples t-test
ttest read == write

* 2-way ANOVA of write by female and prog
anova write female prog

* correlation of write and math
correlate write math

* correlation matrix of 5 variables
corr read write math science socst

* linear regression of write on continuous
*  predictor math and categorical predictor prog
regress write c.math i.prog

* postestimation examples
* add variable of predicted values of write 
*  for each observation
predict predwrite
* look at first 5 predicted values
li predwrite write math female in 1/5


* test of whether 2 prog coefficients are 
*  jointly significant
test 2.prog 3.prog

* test whether 2 prog coefs are different 
test 2.prog-3.prog = 0
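
* a sketch of two related postestimation commands (not part of the seminar):
* lincom estimates the difference between the 2 prog coefficients,
*  with a standard error and confidence interval
lincom 2.prog - 3.prog
* margins reports the predicted mean of write for each level of prog,
*  averaging over the observed values of math
margins prog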



** ANALYSIS OF CATEGORICAL OUTCOMES

* chi square test of independence
tab prog ses, chi2

* logistic regression of being in academic program
*  on female and math score
*  coefficients as odds ratios
logit academic i.female c.math, or
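
* a sketch (not part of the seminar): logistic fits the same model
*  and reports odds ratios by default
logistic academic i.female c.math
* predicted probability of being in the academic program
*  (pacad is a name chosen here for illustration)
predict pacad, pr
summarize pacad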




****************** IN CLASS EXERCISE *****************************
**************** SCROLL DOWN FOR SOLUTION ************************

* 1. load the nhanes2 data, clear memory first


* 2. Imagine we are only interested in people in the West
*    The variable "region" codes where people live.
*    How many observations are from region=West?


* 3. Now keep only observations from the West and drop all others
*    You'll need to know the numeric code for West first
*    How can you see that?


* 4. Now use keep or drop to filter out everything but observations from the West


* 5. We want the variables which code for the following:
*    region, sex, race, age, height, weight, systolic blood pressure, diastolic blood pressure, diabetes
*    The variable names are abbreviated, so how can you quickly determine which variables we want?
*   (Hint: How can we see the variable labels of all variables?)


* 6. Now drop all variables but 
*    region, sex, race, age, height, weight, systolic blood pressure, diastolic blood pressure, diabetes



* 7. What are the counts of race by sex in this Western sample?


* 8. What percentage of males and females are represented by each race?



* 9. Create a blood pressure category variable called "bpcat"
*    Use the following numeric codes:
*    1 = systolic bp < 120 AND diastolic bp < 80
*    2 = (120 <= systolic bp < 130) AND diastolic bp < 80
*    3 = systolic bp >= 130 OR diastolic bp >= 80
*    Start by creating a bpcat variable and setting it to missing
*      then use replace to code for the category codes above


* 10. label the variable bpcat with the description "blood pressure category"


* 11. Create a set of value labels called bp for blood pressure category with the following
*     codes:
*     1 = "normal"
*     2 = "elevated"
*     3 = "high"


* 12. Apply the bp value label to the bpcat variable


* 13. Create a bar plot of counts of each blood pressure category


* 14. Create a bar plot of counts of each bp category for each
*     race, with bars colored by blood pressure category


* 15. Create a scatter plot of height and weight, with weight on the y-axis


* 16. Create a scatter plot of height and weight, but split the plot by 
*     blood pressure category


* 17. get the mean and standard deviation of age and weight of all observations


* 18. now get the mean and standard deviation of weight by blood pressure category



* 19. create a variable called "highbp" that equals 1 if blood pressure category is high, 0 otherwise


* 20. Visually inspect whether weight has a normal distribution


* 21. Run a t-test to see if mean weight is different between those with and without
*     high blood pressure








****************** SOLUTION TO IN-CLASS EXERCISE *****************************

* 1. load the nhanes2 data, clear memory first
webuse nhanes2, clear

* 2. Imagine we are only interested in people in the West
*    The variable "region" codes where people live.
*    How many observations are from region=West?
tab region

* 3. Now keep only observations from the West and drop all others
*    You'll need to know the numeric code for West first
*    How can you see that?
tab region, nolabel

* 4. Now use keep or drop to filter out everything but observations from the West
*    (either command alone is enough; drop removes nothing once keep has run)
keep if region == 4
drop if region != 4

* 5. We want the variables which code for the following:
*    region, sex, race, age, height, weight, systolic blood pressure, diastolic blood pressure, diabetes
*    The variable names are abbreviated, so how can you quickly determine which variables we want?
*   (Hint: How can we see the variable labels of all variables?)
describe
codebook

* 6. Now drop all variables but 
*    region, sex, race, age, height, weight, systolic blood pressure, diastolic blood pressure, diabetes
keep region sex race age height weight bpsystol bpdiast diabetes  



* 7. What are the counts of race by sex in this Western sample?
tab race sex

* 8. What percentage of males and females are represented by each race?
tab race sex, col


* 9. Create a blood pressure category variable called "bpcat"
*    Use the following numeric codes:
*    1 = systolic bp < 120 AND diastolic bp < 80
*    2 = (120 <= systolic bp < 130) AND diastolic bp < 80
*    3 = systolic bp >= 130 OR diastolic bp >= 80
*    Start by creating a bpcat variable and setting it to missing
*      then use replace to code for the category codes above
gen bpcat = .
replace bpcat = 1 if bpsystol < 120 & bpdiast < 80
replace bpcat = 2 if bpsystol >= 120 & bpsystol < 130 & bpdiast < 80
replace bpcat = 3 if bpsystol >= 130 | bpdiast >= 80
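
*    a check added as a sketch (not part of the seminar): Stata treats
*    missing (.) as larger than any number, so an observation with a missing
*    blood pressure would be coded 3 by the last replace above;
*    count such observations and show any missing bpcat values
count if missing(bpsystol, bpdiast)
tab bpcat, missing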

* 10. label the variable bpcat with the description "blood pressure category"
label var bpcat "blood pressure category"

* 11. Create a set of value labels called bp for blood pressure category with the following
*     codes:
*     1 = "normal"
*     2 = "elevated"
*     3 = "high"
label define bp 1 "normal" 2 "elevated" 3 "high"

* 12. Apply the bp value label to the bpcat variable
label values bpcat bp

* 13. Create a bar plot of counts of each blood pressure category
graph bar (count), over(bpcat)

* 14. Create a bar plot of counts of each bp category for each
*     race, with bars colored by blood pressure category
graph bar (count), over(bpcat) over(race) asyvars

* 15. Create a scatter plot of height and weight, with weight on the y-axis
scatter weight height

* 16. Create a scatter plot of height and weight, but split the plot by 
*     blood pressure category
scatter weight height, by(bpcat)

* 17. get the mean and standard deviation of age and weight of all observations
summ age weight

* 18. now get the mean and standard deviation of weight by blood pressure category
bysort bpcat: summ weight


* 19. create a variable called "highbp" that equals 1 if blood pressure category is high, 0 otherwise
gen highbp = 0
replace highbp = 1 if bpcat == 3

* 20. Visually inspect whether weight has a normal distribution
histogram weight
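
*     two further visual checks, added as a sketch (not part of the seminar):
*     overlay a normal density on the histogram
histogram weight, normal
*     quantile-normal plot; points close to the reference line
*     suggest an approximately normal distribution
qnorm weight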

* 21. Run a t-test to see if mean weight is different between those with and without
*     high blood pressure
ttest weight, by(highbp)
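
*     a sketch of a variation (not part of the seminar): the default t-test
*     assumes equal variances in the two groups; the unequal option
*     relaxes that assumption
ttest weight, by(highbp) unequal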