*****NOTE: THERE ARE MORE COMMANDS HERE THAN DISPLAYED IN THE SEMINAR

*************** STATA *******************

* load a Stata dataset over the internet
webuse auto, clear

* change directory (not run)
* cd "C:/path/to/directory"

* histogram command
histogram weight

* comments are not executed

/* this kind of comment 
   can span
   multiple lines */
   
* use /// to continue a command over multiple lines
summarize weight ///
  length

  
***** IMPORTING DATA

* loading Stata data files
* read from hard drive; uncomment and change path below before executing
* use "C:/path/to/myfile.dta"

* load data over internet
* notice .dta extension can be omitted
use https://stats.idre.ucla.edu/stat/data/hs0

* save data, replace if it exists
save hs0, replace

* clear data from memory
clear

* load data but clear memory first
use https://stats.idre.ucla.edu/stat/data/hs0, clear

* import excel file; uncomment and change path below before executing
* import excel using "C:/path/myfile.xlsx", sheet("mysheet") firstrow clear


* import csv file; uncomment and change path below before executing
* import delimited using "C:/path/myfile.csv", clear

***** HELP FILES

* open help file for command summarize
help summarize

* summary statistics for all variables
summarize
* summary statistics for just variables read and write (using abbreviated command)
summ read write
* provide additional statistics for variable read
summ read, detail
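
* (added sketch) summarize leaves its results in r(); the lines below assume
*  the hs0 data loaded above are still in memory
summarize read
* show everything summarize stored
return list
* use a stored result in a later expression
display "mean of read = " r(mean)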



*************** GETTING TO KNOW YOUR DATA *******************

***** VIEWING DATA

* seminar dataset
use https://stats.idre.ucla.edu/stat/data/hs0, clear

* browse dataset
browse

* list all observations and all variables
list

* list read and write for first 5 observations
li read write in 1/5

***** SELECTING OBSERVATIONS

* list science for last 3 observations
li science in -3/L

* list gender, ses, and math if math > 70 
* with clean output
li gender ses math if math > 70, clean

* browse gender, ses, and read 
*  for females (gender=2) who have read > 70
browse gender ses read if gender == 2 & read > 70
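
* (added sketch) Stata treats missing values as larger than any number,
*  so a condition like math > 70 would also select observations with missing math;
*  adding !missing() excludes them
li gender ses math if math > 70 & !missing(math), clean
* count the observations that satisfy a condition without listing them
count if gender == 2 & read > 70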


*** EXERCISE 1 ***
* 1.  Use the browse command to examine the ses values for students with write 
*     score greater than 65.
*     Then, use the help file for the browse command to rewrite the command to 
*     examine the ses values without labels.


***** EXPLORING DATA

* inspect values of variables read gender and prgtype 
codebook read gender prgtype
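
* (added sketch) describe shows storage type, display format, and any labels,
*  which complements the value summaries from codebook
describe read gender prgtype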

* summarize continuous variables
summarize read math

* summarize read and math for females
summarize read math if gender == 2

* detailed summary of read for females
summarize read if gender == 2, detail

* tabulate frequencies of ses
tabulate ses

* remove labels
tab ses, nolabel

* two-way tab of race and ses
tab race ses 

* with row percentages
tab race ses, row
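
* (added sketch) by default tabulate omits missing values;
*  the missing option gives them their own row
tab ses, missing
* row and column percentages can be requested together
tab race ses, row column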

*** EXERCISE 2 ***
* 2. Use the tab command to determine the numeric code for “Asians” in the race variable
*    Then use summarize to estimate the mean of the variable science for Asians


***** DATA VISUALIZATION

* histogram of write
histogram write

* histogram of write with normal density 
*  and intervals of length 5
hist write, normal width(5)

* boxplot of all test scores
graph box read write math science socst

* scatter plot of write vs read
scatter write read

* bar graphs
* bar graph of count of ses
graph bar (count), over(ses) 

* frequencies of gender by ses
*   asyvars colors bars by ses
graph bar (count), over(ses) over(gender) asyvars 

* layered graph of scatter plot and lowess curve
twoway (scatter write read) (lowess write read)

* layered scatter plots of write and read
*   colored by gender
twoway (scatter write read if gender == 1, mcolor(blue)) ///
(scatter write read if gender == 2, mcolor(red))
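
* (added sketch) same graph with legend labels and axis titles;
*  the label text is illustrative, and 1 = male is assumed from the
*  gender == 2 coding for females used above
twoway (scatter write read if gender == 1, mcolor(blue)) ///
(scatter write read if gender == 2, mcolor(red)), ///
legend(order(1 "male" 2 "female")) ///
xtitle("read") ytitle("write")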


*** EXERCISE 3 ***
* 3. Use the scatter command to create a scatter plot of math on the x-axis vs write on the y-axis
*    Use the help file for scatter to change the shape of the markers to triangles


*************** DATA MANAGEMENT *******************

***** CREATING AND TRANSFORMING VARIABLES

* generating variables
* generate a sum of 3 variables
generate total = math + science + socst

* it seems 5 missing values were generated
*  let's look at variables
summarize total math science socst

* list variables when science is missing
li math science socst if science == .

* same as above, using missing() function
li math science socst if missing(science)

* replace total with just (math+socst)
*  if science is missing
replace total = math + socst if science == .

* no missing totals now
summarize total
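
* (added sketch) egen's rowtotal() treats missing values as 0, so it can build
*  the same kind of total in one step; total2 is a hypothetical name used only here
egen total2 = rowtotal(math science socst)
* compare: the two totals should agree if science is the only variable with missing values
summarize total total2
drop total2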

* egen with function rowmean generates variable that
* is mean of all non-missing values of those variables
egen meantest = rowmean(read math science socst)
summarize meantest read math science socst

* standardize read
egen zread = std(read)
summarize zread

* renaming variables
rename gender female
* recode values to 0,1
recode female (1=0)(2=1)
tab female

* labeling variables (description)
label variable math "9th grade math score"
label variable schtyp "public/private school"

* the variable label will be used in some output
histogram math
tab schtyp

* schtyp values before value labels are applied
tab schtyp

* create and apply labels for schtyp
label define pubpri 1 public 2 private
label values schtyp pubpri
tab schtyp

* encoding string prgtype into
*  numeric variable prog
encode prgtype, gen(prog)

* we see that prog is numeric with labels (blue)
*  while the old variable prgtype is string (red)
browse prog prgtype

* we see labels by default in prog
tab prog

* use option nolabel to remove the labels
tab prog, nolabel
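
* (added sketch) encode stores the string-to-number mapping in a value label;
*  by default the label has the same name as the new variable (prog here)
label list prog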


*** EXERCISE 4 ***
* 4. Use the generate and replace commands to create a variable called “highmath” 
*	  that takes on the value 1 if math is greater than 60, and 0 otherwise.
*	 Then use the label define command to create a set of value labels called “mathlabel”, 
*     which labels the value 1 “high” and the value 0 “low”
*    Finally, use the label values command to apply the “mathlabel” labels to the 
*     newly generated variable “highmath”.
*    Use the tab command on highmath to check your results.


***** DATASET OPERATIONS

* save dataset, overwrite existing file
save hs1, replace

* drop prgtype from dataset
drop prgtype

* keep just id read and math
keep id read math

* keep observations if read > 40
keep if read > 40
summ read

* now drop if math outside range [30,70]
drop if math < 30 | math > 70
summ math
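
* (added sketch) keep and drop cannot be undone; preserve/restore lets you
*  try a subset temporarily and then return to the data as they were
*  (the cutoff 60 is arbitrary and only for illustration)
preserve
keep if read > 60
count
restore
* the observations dropped after preserve are back
count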

* sorting
* first look at unsorted
li in 1/5

* now sort by read and then math
sort read math
li in 1/5

* sort descending read then ascending math
gsort -read +math
li in 1/5


*** EXERCISE 5 ***
* 5. Reload the hs0 data set fresh using the following command:
use https://stats.idre.ucla.edu/stat/data/hs0, clear

*    Subset the dataset to observations with write score greater than or equal to 60.  
*     Then remove all variables except for id and write.  
*     Save this as a Stata dataset called “highwrite”

*    Reload the hs0 dataset, subset to observations with write score less than 60, 
*      remove all variables except id and write, and save this dataset as “lowwrite”
use https://stats.idre.ucla.edu/stat/data/hs0, clear

*    Reload the hs0 dataset.  Drop the write variable.  Save this dataset as “nowrite”.
use https://stats.idre.ucla.edu/stat/data/hs0, clear


***** APPENDING AND MERGING
* append highwrite and lowwrite datasets
* first load highwrite 
use highwrite, clear

* append lowwrite
append using lowwrite
* summarize write shows 200 observations and write scores above and below 60
summ write

* merge in nowrite dataset using id to link
merge 1:1 id using nowrite
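
* (added sketch) merge adds a variable _merge that records where each
*  observation came from (1 = master only, 2 = using only, 3 = matched)
tab _merge
* drop _merge before running another merge on this dataset
drop _merge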


*************** BASIC STATISTICAL ANALYSIS *******************
* load new dataset
use https://stats.idre.ucla.edu/stat/data/hs1, clear

** ANALYSIS OF CONTINUOUS OUTCOMES
* many commands provide 95% CI
mean read

* testing if means are different 
*  across groups
* independent samples t-test
ttest read, by(female)

* correlation matrix of 5 variables
corr read write math science socst
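
* (added sketch) corr uses only observations complete on all listed variables;
*  pwcorr uses all available pairs and can report p-values and pair counts
pwcorr read write math science socst, obs sig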

* linear regression of write on continuous
*  predictor math and categorical predictor ses
regress write c.math i.ses

* look at what postestimation commands are available after regress
help regress postestimation

* postestimation examples
* predicted dependent variable
predict pred

* get residuals
predict res, residuals

* first 5 predicted values and residuals with observed write
li pred res write in 1/5
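
* (added sketch) joint test that all ses coefficients are zero,
*  using the regression fit above (predict does not change the stored estimates)
testparm i.ses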


*** EXERCISE 6 ***
* 6. Use the regress command to determine if the variables female (categorical) and
*      science (continuous) are predictive of the dependent variable math.
*    One of the assumptions of linear regression is that the errors 
*     (estimated by residuals) are normally distributed.  Use the predict command
*     and the histogram command to assess this assumption.



** ANALYSIS OF CATEGORICAL OUTCOMES

* chi square test of independence
tab prog ses, chi2

* uncomment and run these lines if highmath is not in your dataset
* gen highmath = 0
* replace highmath = 1 if math > 60
* replace highmath = . if missing(math)

*  logistic regression of binary outcome highmath predicted by 
*    write (continuous) and female (categorical)
logit highmath c.write i.female, or
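
* (added sketch) logistic fits the same model and reports odds ratios by default,
*  so it should reproduce the output of logit with the or option
logistic highmath c.write i.female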


*** EXERCISE 7 ***
* 7. Use the tab command to run a chi-square test of independence to test for association between ses 
*     and race.
*    Fisher's exact test is often used in place of the chi-square test of
*     independence when the (expected) cell sizes are small.  Use the help file
*     for "tabulate twoway" (which is just the tabulate command for 2 variables) 
*     to run a Fisher's exact test to test the association
*     between ses and race. How does the p-value compare to the result of the
*     chi-square test?



****************** SOLUTIONS TO IN-CLASS EXERCISES *****************************
* 1.  Use the browse command to examine the ses values and write scores  
*     for students with write score greater than 65.
*     Then, use the help file for the browse command to rewrite the command to 
*     examine the ses values without labels.
browse ses write if write > 65
help browse
browse ses write if write > 65, nolabel


* 2. Use the tab command to determine the numeric code for “Asians” in the race variable
*    Then use summarize to estimate the mean of the variable science for Asians
tab race
tab race, nolabel
summ science if race == 2


* 3. Use the scatter command to create a scatter plot of math on the x-axis vs write on the y-axis
*    Use the help file for scatter to change the shape of the markers to triangles
scatter write math
scatter write math, msymbol(triangle)


*** EXERCISE 4 ***
* 4. Use the generate and replace commands to create a variable called “highmath” 
*	  that takes on the value 1 if math is greater than 60, and 0 otherwise.
*    Then use the label define command to create a set of value labels, which you can
*     call “mathlabel”, which labels the value 1 as “high” and the value 0 as “low”.
*    Finally, use the label values command to apply the “mathlabel” labels to the 
*     newly generated variable “highmath”.
*    Use the tab command on highmath to check your results.
gen highmath = 0
replace highmath = 1 if math > 60
replace highmath = . if math == .

label define mathlabel 0 "low" 1 "high"
label values highmath mathlabel
tab highmath


*** EXERCISE 5 ***
* 5. Reload the hs0 data set fresh using the following command:
use https://stats.idre.ucla.edu/stat/data/hs0, clear

*    Subset the dataset to observations with write score greater than or equal to 60.  
*     Then remove all variables except for id and write.  
*     Save this as a Stata dataset called “highwrite”
keep if write >= 60
keep id write
save highwrite, replace

*    Reload the hs0 dataset, subset to observations with write score less than 60, 
*      remove all variables except id and write, and save this dataset as “lowwrite”
use https://stats.idre.ucla.edu/stat/data/hs0, clear
keep if write < 60
keep id write
save lowwrite, replace

*    Reload the hs0 dataset.  Drop the write variable.  Save this dataset as “nowrite”.
use https://stats.idre.ucla.edu/stat/data/hs0, clear
drop write
save nowrite, replace


* 6. Use the regress command to determine if the variables female (categorical) and
*      science (continuous) are predictive of the dependent variable math.
*    One of the assumptions of linear regression is that the errors 
*     (estimated by residuals) are normally distributed.  Use the predict command
*     and the histogram command to assess this assumption.
regress math i.female c.science
predict mathres, residuals
histogram mathres, normal


*** EXERCISE 7 ***
* 7.  Use the tab command to run a chi-square test of independence to test for association between ses 
*     and race.
*    Fisher's exact test is often used in place of the chi-square test of
*     independence when the (expected) cell sizes are small.  Use the help file
*     for tabulate twoway to run a Fisher's exact test to test the association
*     between ses and race. How does the p-value compare to the result of the
*     chi-square test?
tab ses race, chi2
help tabulate twoway
tab ses race, exact