***CODE FILE FOR UCLA IDRE STATA DATA MANAGEMENT SEMINAR*** ***PRELIMINARY ADVICE*** *help files help describe *comments start with * (or can be enclosed in /* and */ ) /* a multi-line comment */ *a comment won't be run by Stata *break up command across multiple lines describe /// age using /// http://stats.idre.ucla.edu/stat/data/patient_pt2_stata_dm.dta *describe can be abbreviated to d d ***INPUTTING DATA INTO STATA*** *any data in memory must be cleared before loading new data clear *Also try the File menu to import files! *import excel import excel "http://stats.idre.ucla.edu/stat/data/hsb2.xls", firstrow clear *import csv import delimited "http://stats.idre.ucla.edu/stat/data/hsb2.csv", clear *Getting data in *From keyboard with input input age weight 8 11 9 12 8 10 9 11 10 15 end *get in strings by putting strxxx before variable name input str10 words "These" "are" "words." end *Stata data files have .dta extension, which can be omitted for use use "http://stats.idre.ucla.edu/stat/data/hsb2", clear ***LOAD IN FIRST DATASET*** use http://stats.idre.ucla.edu/stat/data/patient_pt1_stata_dm, clear ***VIEWING THE DATASET*** *print variables to screen list hospital-pain *open the browser browse ***CHARACTERIZING VARIABLES*** *describe how variables are stored describe *describe the values of a variable codebook ***SELECTING OBSERVATIONS*** *in operator to select range of obs list age in 1/10 *can use negative for distance from end, and L or l for end list age in -10/L *if to select by condition li age if sex == "female" & pain > 8, clean ***APPENDING FILES*** *files to be appended follow using append using http://stats.idre.ucla.edu/stat/data/patient_pt2_stata_dm *check appended datasets describe *variables not in both datasets will have missing where absent tab nmorphine, miss ***MERGING FILES*** *look at doctor file (using file, or file to be merged) describe using http://stats.idre.ucla.edu/stat/data/doctor_stata_dm *check if docid is repeated -- it is tab docid *merge the files merge m:1 docid using http://stats.idre.ucla.edu/stat/data/doctor_stata_dm *drop unmatched doctor drop if _merge == 2 ***DUPLICATES START*** *duplicates suite help duplicates *count how many duplicated observations there are duplicates report *the keyword _all duplicates report _all *list observations duplicated on hospid duplicates report hospid *maybe we can check on them later duplicates tag, gen(dup) tab dup *drop all but the first copy of each observation duplicates drop ***MISSING VALUES*** **all nonmissing numbers < . < .a < .b < ... < .z *codebook reports missing codebook *summarize to detect missing data codes summarize *boxplots to detect missing data codes graph box co2 lungcapacity test1 test2 *checking categorical variables tab smokinghx, miss tab familyhx, miss *change -99 codes to missing mvdecode _all, mv(-99) *mvdecode does not work for string variables -- this doesn't work mvdecode _all, mv("-99") *use replace to change user missing to Stata missing for strings replace familyhx = "" if familyhx == "-99" replace smokinghx = "" if smokinghx == "-99" *missing for strings is empty string tab familyhx, miss tab smokinghx, miss *now use a special missing value .a to mean refused to answer mvdecode lungcapacity co2, mv(-98=.a) *seting any out of range ages to missing value .b to mean data error summ age replace age = .b if age > 120 | age < 18 *count missing values misstable summarize lungcapacity test1 test2 *profile missing data patterns misstable patterns ***CREATING and TRANSFORMING VARIABLES*** *generate an average and check missing gen average = (test1 + test2)/2 misstable patterns average test1 test2 *be careful when using logical comparisons with missing values *this is wrong gen above50_wrong = 0 replace above50_wrong = 1 if age > 50 misstable summarize age above50_wrong *correct way to handle if missing gen above50 = age > 50 replace above50 = age if age >= . misstable summarize age above50 *useful functions with generate help functions *functions *running sum gen marsum=sum(married) *random uniform gen random=runiform() list married marsum random in 1/10 **egen** *egen functions help egen *sum across variables egen total = rowtotal(test1 test2) *average across variables egen mean = rowmean(test1 test2) *notice that mean only missing if both test1 and test2 missing misstable patterns test1 test2 average mean total *count number of missing values across variables egen nummiss = rowmiss(lungcapacity test1 test2 familyhx smokinghx) li lungcapacity test1 test2 familyhx smokinghx nummiss in 1/10 *cut continuous variable into categorical variable egen bmi_cat = cut(bmi), at(0, 15, 16, 18.5, 25, 30, 35, 40, 500) label tab bmi_cat, miss tab bmi_cat, nolabel *cross grouping variables egen family_smoking = group(familyhx smokinghx), label tab family_smoking, miss ***RECODING VARIABLES*** *recoding variables *recode -- consolidate some categories together tab bmi_cat, nolabel recode bmi_cat (0 1 2=3) (6 7 = 5) tab bmi_cat, nolabel *labels change too! tab bmi_cat ***RENAME *rename rename test1 il6 rename test2 crp *rename group help rename group *apply prefix "doc" to experience and school rename (experience school) d_= desc d_* *replace doc prefix with d_ rename d_* doc_* desc doc_* **STRINGS *string variables tab sex, miss replace sex = "" if sex == "12.2" tab sex, miss *trim variables tab hospital replace hospital = strtrim(hospital) tab hospital *extract doctor id tab docid gen doc_id = substr(docid, 3, 3) tab doc_id *remake the old docid from parts (and string function) gen newdocid = string(hospid) + "-" + doc_id li newdocid hospid doc_id in 1/10 *regular expressions *use regexm to match and regexs to extract capture group gen regxdocid = regexs(1) if regexm(docid, "[0-9]-([0-9]+)") list docid regxdocid in 1/10 *encoding string categorical variables into numeric variables with labels encode cancerstage, gen(stage) tab cancerstage stage desc stage tab cancerstage stage, nolab *destring number variables stored as strings tab wbc replace wbc = "" if wbc == "not assessed" destring wbc, replace describe wbc *****LABELS***** *variable labels label var il6 "Concentration of interleukin 6" hist il6 *value labels *see existing labels label list *create labels label define other_miss .a "refused" .b "error" *labeling turns a variable blue label values lungcapacity co2 age other_miss tab lungcapacity, miss ***BY-GROUP PROCESSING*** *usually need to sort first sort docid **stats by group (doctor) by docid: egen mean_age = mean(age) by docid: egen max_age = max(age) by docid: egen sd_age = sd(age) li docid age mean_age max_age sd_age in 1/10 **counting within group *first generate a 0/1 variable for female gen female = sex == "female" replace female = . if sex == "" *then create within-group running sum of females by docid: gen num_female = sum(female) li docid sex female num_female in 1/10 *replace all values of num_female, with final value from its group, * its group's sum *_N is number of last observation (within group when used with by) by docid: replace num_female = num_female[_N] li docid sex female num_female in 1/10 **lagging within group *sort by docid and time first sort docid dis_date *_n is number of current observation (within group when used with by) *lag dis_date by setting lag_date to previous dis_date (within docid) by docid: gen lag_date = dis_date[_n-1] *format number to appear as date format lag_date %td *create a time lag variable gen time_lag = dis_date - lag_date li docid dis_date lag_date time_lag in 1/10 ***MACROS START*** *globals global greeting Hello world! *this becomes display "Hello world!" display "$greeting" *this becomes display Hello world! and doesn't work display $greeting *use macros to group variables together global demographics age married female summ $demographics *local macros are deleted after code is run if declared in do-file local outcomes tumorsize pain lungcapacity summ `outcomes' *use = to set local equal to expression *expression is evaluated and result stored local add_us = 2 + 3 display "sum = `add_us'" *compare to without = local print_us 2 + 3 display "sum = `print_us'" ***LOOPING*** *forvalues loops over numbers forvalues i = 1/5 { display "Hello world!" } *the loop control variable is a local macro *let's access its contents forvalues i = 1/5 { display "i = `i'" } *loop over variables with numbered names forvalues i = 50(10)70 { gen age`i' = age > `i' replace age`i' = age if missing(age) label values age`i' other_miss tab age`i', miss } *foreach loops over generic list of items *looping across list of variables *create standardized versions of variables foreach var of varlist wbc rbc il6 crp { egen std_`var' = std(`var') } summ std_* *we can also loop across elements of locals and globals foreach var of global demographics { tabstat `var', by(stage) } *use the keyword in to loop across a generic list gen fixed_docid = docid foreach id in "1-11" "1-21" "1-57" { replace fixed_docid = docid + "0" if docid == "`id'" list fixed_docid docid if docid == "`id'" }