Outline of this seminar:
- common techniques for dealing with missing data
- multiple imputation
- missing mechanism and missing data patterns
- An example using proc mi assuming multivariate normal distribution
  - setting up an imputation model
  - understanding the output
  - checking convergence
- An example using proc mi using the fully conditional specification
  - setting up an imputation model
  - understanding the output
  - checking convergence
- Introduction to direct/full information maximum likelihood (FIML)
- Simple example of parameter estimation with FIML
- An example using FIML
  - setting up structural equation model (SEM)
- Implementation of FIML
- Improving the accuracy of FIML
An alternative to MI is direct maximum likelihood, also known as full information maximum likelihood (FIML). It is an estimation method that uses information from both complete and incomplete observations to obtain unbiased and fully efficient parameter estimates. The goal of FIML is to estimate the population parameters that have the highest likelihood of producing the observed sample. Unlike MI, FIML does not need to impute or "fill in" missing values; it uses all of the available data to obtain the parameter estimates with the highest probability of producing the observed sample. In short, FIML improves estimates by "borrowing" information from the observed data, even when an observation is incomplete. We can use maximum likelihood estimation to get the variance-covariance matrix for the variables in the model based on all available data points, and then use that variance-covariance matrix to estimate our regression model. The maximum likelihood method requires a large sample size and data that are missing at random (MAR). For more information on this approach, see Applied Missing Data Analysis by Craig K. Enders, 2010 or….
Full Information Maximum Likelihood/Direct Maximum Likelihood
Full information maximum likelihood, or FIML, is an estimation method that uses information from both the incomplete and the complete observations, thus maximizing statistical power (Enders, 2010). Like MI, FIML produces unbiased parameter estimates under the assumption of MAR. Even under the stricter assumption of MCAR, it is still superior to some of the more traditional methods because it utilizes all the available information.
How does it work with missing data?
The goal of ML is to identify the population parameter values that have the highest relative probability, or likelihood, of producing a particular sample of data. This can be applied to missing data as well, with some changes to the usual computation of the log-likelihood. In general terms, the two parameters that make up the log-likelihood are the mean vector (μ) and the covariance matrix (Σ). When there are no missing data, the log-likelihood for each observation is based on a single set of elements of μ and Σ. With missing data, the log-likelihood for an observation with missing information is computed using only the values that are observed for that observation, so the log-likelihood formula differs for each missing data pattern. Thus, observations with different missing data patterns use different subsets of the elements of μ and Σ (differing in size and content), based on their observed data. The individual log-likelihoods are then summed to create the overall sample log-likelihood. The sample likelihood is therefore still a summary measure representing the joint probability of the observed data given a set of parameter values (μ, Σ). This computation is repeated iteratively (for example, with the EM algorithm), trying different sets of candidate population parameter values until the highest likelihood is reached. No missing values need to be filled in or imputed for this method (Enders, 2010; Handling Missing Data by Allison, 2012 SAS Global Forum).
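Under the multivariate normal assumption, the per-observation contribution described above can be written out explicitly. The notation below is a sketch introduced here, not taken from the seminar: y_i is the vector of values observed for observation i, k_i is the number of observed variables, and μ_i and Σ_i are the elements of μ and Σ that correspond to those observed variables.

```latex
\log L_i = -\frac{k_i}{2}\log(2\pi)
           - \frac{1}{2}\log\lvert\Sigma_i\rvert
           - \frac{1}{2}\,(y_i - \mu_i)^{\top}\Sigma_i^{-1}(y_i - \mu_i),
\qquad
\log L = \sum_{i=1}^{N} \log L_i
```

An observation with complete data uses all of μ and Σ, while an observation missing one variable drops the corresponding element of μ and the corresponding row and column of Σ.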
How can we implement FIML when conducting analyses?
This method can be employed using Proc Calis, the SAS procedure for linear structural equation modeling (SEM). In general, FIML is implemented in a SEM framework because of the flexibility of SEM to handle a variety of different analytic models. In the following example we will demonstrate the use of Proc Calis.
In a 2012 SAS Global Forum paper, Allison presents several situations where the use of FIML would be advantageous. First, when you have missing values on the DV but not on the IVs. Second, when the data include repeated measures on the DV and there is missing information on some of the measures; this is commonly known as drop-out. Third, when you have missing values on the IVs. In this section we will discuss the methods available to you in SAS to address these three common situations.
Example 1: Missing on outcome but not predictors
Let's assume you are interested in running some type of regression model and you discover that your outcome has missing information that you believe is MAR. Usually, multiple imputation would be an option, but you find that you also don't have any good auxiliary variables to include in your imputation model. Thus, imputation may add variability or noise to your parameter estimates. In this particular case, listwise deletion will produce direct maximum likelihood estimates without any additional modifications on your part (Little, 1992; Handling Missing Data by Allison, 2012 SAS Global Forum). The appeal of this is that it is easy and you will be able to obtain unbiased estimates, assuming MAR. The downside is that, depending on the proportion of missing values, your sample size may be reduced considerably.
Let's demonstrate this by running a linear regression model where write, which has 17 missing observations, is predicted by race, schtyp, and socst which all have complete information.
First, we will run our model using Proc Calis, which by default estimates models using listwise deletion. We will also need to create dummy variables for our 4-level race variable because Proc Calis does not have a class statement.
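DATA fiml_1;
  SET ats.hsb_mar;            /* input data set assumed to be ats.hsb_mar, as in the later examples */
  race_hispanic = (race=1);
  race_asian = (race=2);      /* assumed coding (2 = Asian); this dummy is used in the models below */
  race_afam = (race=3);
  race_white = (race=4);      /* reference category, not entered in the model */
  IF schtyp = 2 THEN schtyp=0;
RUN;

TITLE "Incomplete Data - listwise deletion";
PROC CALIS DATA = fiml_1;
  PATH write <- race_hispanic race_asian race_afam schtyp socst;
RUN;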
Second, we will run our model using Proc Calis, adding the option method=fiml, which will now use full information maximum likelihood to obtain unbiased parameter estimates. Instead of deleting observations with missing values, the full information maximum likelihood method uses all available information in all observations.
TITLE "Incomplete Data - FIML";
PROC CALIS DATA = fiml_1 METHOD=fiml;
  PATH write <- race_hispanic race_asian race_afam schtyp socst;
RUN;
Let's compare the model information output produced by SAS for each model. You will notice a few differences in the "Modeling Information" output between the two models. The model using FIML now includes information on the number of records/observations with missing values in the data. Here we see that SAS correctly reports the 17 observations that are missing on our outcome write. The analysis type is "Means and Covariances" instead of "Covariances" because, with FIML, the sample means need to be analyzed for proper estimation (SAS documentation). This difference will be more meaningful when we explore examples with missing predictor variables.
Now let's take a look at the parameter estimates for each model. You will see that the estimates for each are identical for the reasons we discussed earlier. The SE's are slightly different because the computational approach used to estimate them differs.
Example 2: Missing on outcome only with auxiliary.
Example 3: Missing on predictors only.
A common problem with data is missingness of predictors (independent variables).
TITLE "Incomplete Data on Predictors- Listwise Deletion";
PROC CALIS DATA = fiml_1 PSHORT NOSTAND ;
PATH socst <- race_hispanic race_asian race_afam write female;
RUN;

TITLE "Incomplete Data on Predictors - FIML";
PROC CALIS DATA = fiml_1 METHOD=fiml PSHORT NOSTAND ;
PATH socst <- race_hispanic race_asian race_afam write female;
RUN;

TITLE "Complete Data on Predictors";
PROC CALIS DATA = fiml_comp PSHORT NOSTAND ;
PATH socst <- race_hispanic race_asian race_afam write female ;
RUN;
                  Complete                      Listwise                      FIML                          Multiple Imputation
Parameter         Estimate        SE   t-value  Estimate        SE   t-value  Estimate        SE   t-value  Estimate        SE   t-value
Race-Hispanic      -0.5163    1.9175  -0.26924   -0.9705     2.106     -0.46  -0.27454     1.984     -0.14  -0.32094   2.02051     -0.16
Race-Asian         -5.1169    2.6433  -1.93582  -3.87356         3     -1.29  -3.19677     2.711     -1.18  -3.08538   2.82884     -1.09
Race-Afro-Amer      0.3039    2.0548    0.1479   1.12876     2.331     0.484   1.24456     2.143     0.581   1.12794   2.17088      0.52
Write               0.7291    0.0688   10.5973   0.68605     0.076     9.034   0.72942     0.074     9.833   0.73019   0.07489      9.75
Female             -2.2519    1.2443  -1.80973   -2.0847     1.347     -1.55  -1.86111     1.355     -1.37  -1.71803    1.3535     -1.27
Example 4: Missing on predictors and outcome, with auxiliary variable
If you do have good auxiliary variables, then multiple imputation is still an option. Alternatively, you could use FIML to obtain your parameter estimates.
data fiml_3;
  set ats.hsb_mar;
  if prog ^=. then do;
    if prog =1 then progcat2=1; else progcat2=0;
    if prog =2 then progcat1=1; else progcat1=0;
  end;
  format progcat2 prog_gen. progcat1 prog_ac.;
  label progcat2 ="general" progcat1 ="academic";
run;

TITLE " Incomplete Data on Predictors and Outcome- Listwise Deletion";
PROC CALIS DATA = fiml_3 PSHORT NOSTAND ;
  PATH read <- science female math progcat1 progcat2;
RUN;

TITLE " Incomplete Data on Predictors and Outcome - FIML";
PROC CALIS DATA = fiml_3 METHOD=fiml PSHORT NOSTAND ;
  PATH read <- science female math progcat1 progcat2;
RUN;

TITLE " Incomplete Data on Predictors and Outcome - With auxiliary- FIML";
PROC CALIS DATA = fiml_3 METHOD=fiml PSHORT NOSTAND ;
  PATH read <- science female math progcat1 progcat2,
       socst <- read science female math progcat1 progcat2;
RUN;
             Complete                    Listwise                    FIML w/Auxiliary            Multiple Imputation
Parameter    Estimate       SE  t-value  Estimate       SE  t-value  Estimate       SE  t-value  Estimate       SE  t-value
Science         0.388    0.066    5.877     0.430    0.081    5.324     0.405    0.070    5.788     0.401    0.072    5.570
Female          0.052    1.004    0.051    -0.285    1.238   -0.231     0.368    1.079    0.341     0.596    1.236    0.480
Math            0.382    0.076    5.033     0.321    0.094    3.425     0.370    0.080    4.639     0.362    0.085    4.280
Academic        3.468    1.354    2.562     3.753    1.706    2.200     3.793    1.423    2.666     3.776    1.527    2.470
General         0.151    1.464    0.103     1.252    1.826    0.686     0.296    1.568    0.189     0.256    1.595    0.160
Maximum Likelihood versus Multiple Imputation:
Downside of ML: it is difficult to incorporate auxiliary variables. Omitting a cause of missingness tends to be problematic if the correlation between the omitted variable and the analysis variables is relatively strong (r > .4) or if the missing data rate is greater than 25%.
In SAS, you are not able to use Proc Calis with non-linear outcomes.
"Maximum likelihood uses a log-likelihood function to identify the population parameter values that are most likely to have produced the observed data. The estimation process essentially auditions different parameter values until it identifies the estimates that minimize the standardized distance to the observed data. ML estimates the parameters directly from the observed data and therefore does not require values to be imputed" (Enders, 2010).
How does FIML work?
Let's step through a simple example of how ML estimation works in the context of missing data. Let's say we have 2 binary variables X and Z represented by the 2x2 table:
The likelihood of a set of parameter values (i.e. regression coefficients) that we will call θ is a function of the observed outcomes represented by X and Z. The likelihood function L(θ) is equal to the probability of those observed outcomes given those parameter values: L(θ) = P(X,Z|θ). Think of this joint probability of X and Z as the cells in our 2x2 table, such that L(θ) = P(X=1,Z=1)·P(X=1,Z=2)·P(X=2,Z=1)·P(X=2,Z=2) or:
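Written out with one factor per observation, and assuming the notation p_ij = P(X=i, Z=j) and n_ij = the number of observations in cell (i, j) (both introduced here for illustration), the complete-data likelihood takes the usual multinomial form:

```latex
L(\theta) = \prod_{i=1}^{2}\prod_{j=1}^{2} p_{ij}^{\,n_{ij}}
          = p_{11}^{\,n_{11}}\; p_{12}^{\,n_{12}}\; p_{21}^{\,n_{21}}\; p_{22}^{\,n_{22}}
```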
Assume that we have missing observations on Z such that 10% of Z=1 and 20% of Z=2 have missing information. Before, our likelihood function assumed complete information, but now we need an additional set of probabilities representing our missing information. Let's now assume that n represents the number of non-missing observations and m represents the number of missing observations. We will now use the overall, or marginal, probability of X=1 and X=2 in our likelihood function, since we do not know the joint probability of X and Z for those observations with missing data on Z:
Now the likelihood for the missing observations would be:
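A sketch of this term, assuming m_1 and m_2 denote the numbers of observations with X=1 and X=2 whose Z value is missing (a split of m introduced here) and p_ij = P(X=i, Z=j): each such observation contributes only the marginal probability of its X value, which is the sum of the two cell probabilities in its row.

```latex
L_{\text{missing}}(\theta) = (p_{11} + p_{12})^{\,m_1}\,(p_{21} + p_{22})^{\,m_2}
```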
The full likelihood would then be :
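Assuming p_ij = P(X=i, Z=j), n_ij complete observations in cell (i, j), and m_i observations with X=i and Z missing (notation introduced here for illustration), the full likelihood is the product of the complete-case and incomplete-case contributions:

```latex
L(\theta) = p_{11}^{\,n_{11}}\; p_{12}^{\,n_{12}}\; p_{21}^{\,n_{21}}\; p_{22}^{\,n_{22}}
            \;(p_{11} + p_{12})^{\,m_1}\,(p_{21} + p_{22})^{\,m_2}
```

Maximizing this over the four cell probabilities, subject to their summing to 1, gives estimates that use every observation, including those with Z missing.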
Now let's use real numbers:
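A minimal sketch of the calculation follows. The cell counts are hypothetical and are not the seminar's data; n_ij is the number of complete observations with X=i and Z=j, and m_i is the number of observations with X=i and Z missing. For this simple pattern the ML estimate has a closed form: the marginal probability of X is estimated from all observations, and the conditional probability of Z given X from the complete observations only.

```sas
/* Hypothetical counts for illustration only */
data ml_2x2;
  /* complete cases: n_ij = count with X=i and Z=j */
  n11 = 40; n12 = 30;
  n21 = 20; n22 = 50;
  /* cases with Z missing: m_i = count with X=i */
  m1 = 10; m2 = 20;

  n1 = n11 + n12;              /* complete cases with X=1 */
  n2 = n21 + n22;              /* complete cases with X=2 */
  total = n1 + n2 + m1 + m2;   /* all cases */

  /* ML estimates: P(X=i) from all cases times P(Z=j | X=i) from complete cases */
  p11 = ((n1 + m1) / total) * (n11 / n1);
  p12 = ((n1 + m1) / total) * (n12 / n1);
  p21 = ((n2 + m2) / total) * (n21 / n2);
  p22 = ((n2 + m2) / total) * (n22 / n2);

  /* listwise deletion estimates for comparison: complete cases only */
  lw11 = n11 / (n1 + n2);
  lw12 = n12 / (n1 + n2);
  lw21 = n21 / (n1 + n2);
  lw22 = n22 / (n1 + n2);
run;

proc print data=ml_2x2 noobs;
  var p11 p12 p21 p22 lw11 lw12 lw21 lw22;
run;
```

The two sets of estimates differ whenever the proportion of missing Z values differs between X=1 and X=2, because listwise deletion then misstates the marginal distribution of X.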
FIML assumes a multivariate normal distribution for all the variables with missing data.
Implementing FIML in SAS?
Proc Calis is a procedure in SAS that will perform path analyses as well as structural equation modeling. Below we will compare a simple linear regression of socst on write, read, female and math, using listwise deletion to handle missing data, with the same model using FIML.
TITLE "ML - FULL";
PROC CALIS DATA= ats.hsb2;
  path socst <- write read female math;
run;

                                                      Standard
---------Path----------    Parameter     Estimate        Error     t Value
socst  <===  write         _Parm1         0.37575      0.08435     4.45468
socst  <===  read          _Parm2         0.36968      0.07679     4.81427
socst  <===  female        _Parm3        -0.23405      1.19579    -0.19573
socst  <===  math          _Parm4         0.12090      0.08528     1.41765

TITLE "ML - LISTWISE";
PROC CALIS DATA= ats.hsb_mar;
  path socst <- write read female math;
run;

                                                      Standard
---------Path----------    Parameter     Estimate        Error     t Value
SOCST  <===  WRITE         _Parm1         0.32128      0.10060     3.19370
SOCST  <===  READ          _Parm2         0.30477      0.08871     3.43552
SOCST  <===  FEMALE        _Parm3         0.22336      1.38452     0.16132
SOCST  <===  MATH          _Parm4         0.19881      0.10025     1.98312

PROC CALIS DATA= ats.hsb_mar METHOD=FIML;
  path socst <- write read female math;
run;

                                                      Standard
---------Path----------    Parameter     Estimate        Error     t Value
SOCST  <===  WRITE         _Parm1         0.35463      0.09225     3.84443
SOCST  <===  READ          _Parm2         0.37049      0.07903     4.68807
SOCST  <===  FEMALE        _Parm3         0.42728      1.28457     0.33262
SOCST  <===  MATH          _Parm4         0.14071      0.09139     1.53970
Which is better, FIML or MI?
Simulation studies (Johnson and Young, 2011; Collins, Schafer & Kam, 2001; Graham, 2003) have repeatedly demonstrated that FIML and MI will provide nearly identical results when using the same models, provided the number of imputed datasets (m) is sufficiently large. A "sufficient" value of m depends on the fraction of missing information (FMI) and the size of the effect of interest (Graham et al., 2007). The larger the FMI and the smaller the effect size, the more imputations are needed for equivalence to FIML.
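To make the comparison concrete, the following is a rough sketch of an MI analysis for the same regression of socst on write, read, female and math that was fit with FIML above. The number of imputations (20), the seed, and the data set names mi_out and reg_est are arbitrary choices made here for illustration, and the imputation model simply assumes multivariate normality for all five variables.

```sas
/* Step 1: create 20 imputed data sets (nimpute and seed are arbitrary choices) */
proc mi data=ats.hsb_mar nimpute=20 seed=12345 out=mi_out;
  var socst write read female math;
run;

/* Step 2: fit the analysis model separately in each imputed data set */
proc reg data=mi_out outest=reg_est covout noprint;
  model socst = write read female math;
  by _imputation_;
run;

/* Step 3: pool the estimates; the output also reports the fraction of missing information */
proc mianalyze data=reg_est;
  modeleffects intercept write read female math;
run;
```

The pooled estimates from Proc Mianalyze can then be set beside the Proc Calis METHOD=FIML estimates, and the reported fraction of missing information gives a sense of whether more imputations are needed.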
Both FIML and MI allow for the incorporation of auxiliary variables. In SEM, this is performed by adding auxiliary variables that "condition estimates of the covariance matrix without entering them into the analysis" (Acock, 2005; Graham, 2003).