A researcher estimated the following model, which predicts high versus low writing
scores on a standardized test (**hiwrite**) using students’ gender (**female**) and
their standardized test scores in reading (**read**), math (**math**),
and science (**science**). The output for the model looks like this:

Logistic regression. Number of obs = 200; LR chi2(4) = 105.99; Prob > chi2 = 0.0000; Log likelihood = -84.419842; Pseudo R2 = 0.3857

| hiwrite | Coef. | Std. Err. | z | P>\|z\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| female | 1.805528 | .4358101 | 4.14 | 0.000 | .9513555 | 2.6597 |
| read | .0529536 | .0275925 | 1.92 | 0.055 | -.0011268 | .107034 |
| math | .1319787 | .0318836 | 4.14 | 0.000 | .069488 | .1944694 |
| science | .0577623 | .027586 | 2.09 | 0.036 | .0036947 | .1118299 |
| _cons | -13.26097 | 1.893801 | -7.00 | 0.000 | -16.97275 | -9.549188 |

The researcher would like to know whether this model (with four predictor
variables) fits significantly better than a model with just **female** and
**read** as predictors. How can the researcher accomplish this? Three common tests
can be used to answer this type of question: the likelihood ratio (lr) test, the Wald test, and
the Lagrange multiplier test (sometimes called a score test). These tests
are sometimes described as tests for differences among nested models, because
one of the models can be said to be nested within the other. The null
hypothesis for all three tests is that the smaller model is the "true" model;
a large test statistic is evidence that the null hypothesis is false. While all
three tests address the same basic question, they are slightly different. On
this page we
will describe how to perform these tests and discuss the similarities and
differences among them. (Note: these tests are very general and can be used to test
other types of hypotheses that involve testing whether fixing a parameter
significantly harms model fit.)

## The likelihood

All three tests use the likelihood of the models being compared to assess their fit. The likelihood is the probability of the data given the parameter estimates. The goal of estimation is to find values for the parameters (coefficients) that maximize the value of the likelihood function, that is, to find the set of parameter estimates that make the data most likely. Many procedures use the log of the likelihood, rather than the likelihood itself, because it is easier to work with. For models of discrete outcomes, such as the logistic regression above, the log likelihood (i.e., the log of the likelihood) is never positive, and higher values (closer to zero) indicate a better fitting model.

The above example involves a logistic regression model; however, these tests are very general and can be applied to any model with a likelihood function. Note that even models for which a likelihood or a log likelihood is not typically displayed by statistical software (e.g., ordinary least squares regression) have likelihood functions.

As mentioned above, the likelihood is a function of the coefficient estimates
and the data. The data are fixed, that is, you cannot change them, so one
changes the estimates of the coefficients in such a way as to maximize
the probability (likelihood). Different parameter estimates, or sets of
estimates, give different values of the likelihood. In the figure below, the arch
or curve shows how the value of the likelihood changes as one
parameter (**a**) changes. On the x-axis
are values of **a**, while the y-axis is the value of the likelihood at the
corresponding value of **a**. Most models have more than one parameter, but if the values
of all the other coefficients in the model are fixed, changes in a given **a** will show a similar picture.
The vertical line marks the value of **a** that maximizes the likelihood.
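The idea of tracing the likelihood across values of a single parameter can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data (7 successes in 10 trials) are hypothetical, and the model is a one-parameter Bernoulli model in which **a** is the probability of success, not the logistic regression from the example.

```python
import math

# Hypothetical data for illustration: 7 successes in 10 independent trials.
successes, trials = 7, 10

def log_likelihood(a):
    """Bernoulli log likelihood of the data when the success probability is a."""
    return successes * math.log(a) + (trials - successes) * math.log(1 - a)

# Evaluate the log likelihood over a grid of candidate values of a.
grid = [i / 100 for i in range(1, 100)]

# The value of a with the highest log likelihood plays the role of the
# vertical line in the figure.
best_a = max(grid, key=log_likelihood)

print(best_a, log_likelihood(best_a))
```

Every value on the grid gives a negative log likelihood, and the curve peaks at the observed proportion of successes.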

## The likelihood ratio test

The lr test is performed by estimating two models and comparing the fit of one model to the fit of the other. Removing predictor variables from a model will almost always make the model fit less well (i.e., the model will have a lower log likelihood), but it is necessary to test whether the observed difference in model fit is statistically significant. The lr test does this by comparing the log likelihoods of the two models. If the difference is statistically significant, then the less restrictive model (the one with more variables) is said to fit the data significantly better than the more restrictive model. If one has the log likelihoods from the models, the lr test is fairly easy to calculate. The formula for the lr test statistic is:

lr = -2 ln(L(m1)/L(m2)) = 2(ll(m2)-ll(m1))

Here L(m*) denotes the likelihood of the respective model (either model 1 or model 2), and ll(m*) denotes the natural log of that model’s final likelihood (i.e., the log likelihood). Model m1 is the more restrictive model, and m2 is the less restrictive model.

The resulting test statistic is distributed chi-squared, with degrees of freedom equal to the number of parameters that are constrained (in the current example, the number of variables removed from the model, i.e., 2).

Using the same example as above, we will run both the full and the restricted
model, and assess the difference in fit using the lr test. Model 1 is the model using **female** and **read** as predictors (by
not including **math** and **science** in the model, we restrict their coefficients
to zero). Below is the output for model 1.
We will skip the
interpretation of the results because that is not the focus of our discussion,
but we will make note of the final log likelihood printed just above the table of
coefficients ( ll(m1) = -102.45 ).

Logistic regression. Number of obs = 200; LR chi2(2) = 69.94; Prob > chi2 = 0.0000; Log likelihood = -102.44518; Pseudo R2 = 0.2545

| hiwrite | Coef. | Std. Err. | z | P>\|z\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| female | 1.403022 | .3671964 | 3.82 | 0.000 | .6833301 | 2.122713 |
| read | .1411402 | .0224042 | 6.30 | 0.000 | .0972287 | .1850517 |
| _cons | -7.798179 | 1.235685 | -6.31 | 0.000 | -10.22008 | -5.376281 |

Now we can run model 2, in which coefficients for **science** and **math**
are freely estimated, that is, a model with the full set of predictor variables. Below
is output for model 2. Again, we will skip the interpretation, and just make note of the log likelihood ( ll(m2) = -84.42 ).

Logistic regression. Number of obs = 200; LR chi2(4) = 105.99; Prob > chi2 = 0.0000; Log likelihood = -84.419842; Pseudo R2 = 0.3857

| hiwrite | Coef. | Std. Err. | z | P>\|z\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| female | 1.805528 | .4358101 | 4.14 | 0.000 | .9513555 | 2.6597 |
| read | .0529536 | .0275925 | 1.92 | 0.055 | -.0011268 | .107034 |
| math | .1319787 | .0318836 | 4.14 | 0.000 | .069488 | .1944694 |
| science | .0577623 | .027586 | 2.09 | 0.036 | .0036947 | .1118299 |
| _cons | -13.26097 | 1.893801 | -7.00 | 0.000 | -16.97275 | -9.549188 |

Now that we have both log likelihoods, calculating the test statistic is simple:

lr = 2 * (-84.419842 - (-102.44518)) = 2 * (-84.419842 + 102.44518) = 36.050676

So our likelihood ratio test statistic is 36.05 (distributed chi-squared), with two degrees of freedom. We can now use a table or some other method to find the associated p-value, which is p < 0.001, indicating that the model with all four predictors fits significantly better than the model with only two predictors. Note that many statistical packages will perform an lr test comparing two models. We have done the test by hand because it is easy to calculate, and because doing so makes it clear how the lr test works.
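The hand calculation can also be reproduced in a short Python sketch. The two log likelihoods are copied from the outputs above; the variable names are our own. The p-value step relies on the fact that, for exactly two degrees of freedom, the upper-tail probability of a chi-squared distribution is exp(-x/2), so no statistics library is needed in this particular case.

```python
import math

# Log likelihoods copied from the two outputs above.
ll_m1 = -102.44518   # restricted model: female, read
ll_m2 = -84.419842   # full model: female, read, math, science
df = 2               # number of coefficients constrained to zero

# Likelihood ratio test statistic.
lr = 2 * (ll_m2 - ll_m1)

# For exactly 2 degrees of freedom the chi-squared upper-tail probability
# has the closed form exp(-x / 2); other df would need a statistics library.
p_value = math.exp(-lr / 2)

print(f"lr = {lr:.6f}, df = {df}, p = {p_value:.2e}")
```

The resulting p-value is far below 0.001, in line with the conclusion above.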

## The Wald test

The Wald test approximates the lr test, but with the advantage that it only requires estimating one model. The Wald test works by testing the null hypothesis that a set of parameters is equal to some value. In the model being tested here, the null hypothesis is that the two coefficients of interest are simultaneously equal to zero. If the test fails to reject the null hypothesis, this suggests that removing the variables from the model will not substantially harm the fit of that model, since a predictor with a coefficient that is very small relative to its standard error is generally not doing much to help predict the dependent variable. The formula for a Wald test is a bit more daunting than the formula for the lr test, so we won’t write it out here (see Fox, 1997, p. 569, or other regression texts if you are interested). Intuitively, the test measures how far the estimated parameters are from zero (or any other value specified under the null hypothesis) in units of standard errors, similar to the hypothesis tests typically printed in regression output. The difference is that the Wald test can be used to test multiple parameters simultaneously, while the tests typically printed in regression output only test one parameter at a time.

Returning to our example, we will use a statistical package to run our model and then to perform the Wald test. Below we see output for the model with all four predictors (the same output as model 2 above).

Logistic regression. Number of obs = 200; LR chi2(4) = 105.99; Prob > chi2 = 0.0000; Log likelihood = -84.419842; Pseudo R2 = 0.3857

| hiwrite | Coef. | Std. Err. | z | P>\|z\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| female | 1.805528 | .4358101 | 4.14 | 0.000 | .9513555 | 2.6597 |
| read | .0529536 | .0275925 | 1.92 | 0.055 | -.0011268 | .107034 |
| math | .1319787 | .0318836 | 4.14 | 0.000 | .069488 | .1944694 |
| science | .0577623 | .027586 | 2.09 | 0.036 | .0036947 | .1118299 |
| _cons | -13.26097 | 1.893801 | -7.00 | 0.000 | -16.97275 | -9.549188 |

After running the logistic regression model, the Wald test can be used. The
output below shows the results of the Wald test. The first items listed in this
particular output (the method of obtaining the Wald test and the output may vary
by package) are the specific parameter constraints being tested (i.e., the null
hypothesis), which are that the coefficients for **math** and **science** are simultaneously equal to zero. Below
the list of constraints we see the chi-squared value generated by the Wald test,
27.53 with two degrees of freedom, as well as the p-value associated with it.
The p-value is less than the generally used criterion of 0.05, so
we are able to reject the null hypothesis, indicating that the coefficients are
not simultaneously equal to zero. Because including statistically significant
predictors should lead to better prediction (i.e., better model fit), we can conclude that including
**math** and **science** results in a statistically significant
improvement in the fit of the model.

| Constraint | Null hypothesis |
|---|---|
| ( 1) | math = 0 |
| ( 2) | science = 0 |

chi2( 2) = 27.53; Prob > chi2 = 0.0000
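For readers who want to see the arithmetic behind a two-parameter Wald test, the sketch below uses the standard form of the statistic, b' V⁻¹ b, where b holds the two coefficients being tested against zero and V is their variance-covariance matrix. The function name `wald_2` is our own. The regression output does not report the covariance between the **math** and **science** estimates, so the example sets it to zero purely as an assumption for illustration; it therefore does not reproduce the 27.53 reported above.

```python
# Coefficients and standard errors for math and science from the output above.
b = [0.1319787, 0.0577623]
se = [0.0318836, 0.027586]

def wald_2(b, v11, v22, v12):
    """Wald statistic b' V^{-1} b for two coefficients tested against zero.

    v11 and v22 are the variances of the two estimates, v12 their covariance.
    """
    det = v11 * v22 - v12 * v12
    return (v22 * b[0] ** 2 - 2 * v12 * b[0] * b[1] + v11 * b[1] ** 2) / det

# ASSUMPTION for illustration only: the covariance is set to zero because
# the regression output does not report it.
w_no_cov = wald_2(b, se[0] ** 2, se[1] ** 2, 0.0)

print(round(w_no_cov, 2))
```

With a zero covariance the statistic reduces to the sum of the two squared z statistics, roughly 21.5. The gap between that figure and the reported 27.53 comes from the covariance between the two estimates, which the package takes into account.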

## The Lagrange multiplier or score test

As with the Wald test, the Lagrange multiplier test requires estimating only
a single model. The difference is that with the Lagrange multiplier test, the
model estimated does not include the parameter(s) of interest. This means, in our
example, we can use the Lagrange multiplier test to test whether adding **science** and **math** to the model will result in a significant
improvement in model fit, after running a model with just **female**
and **read** as predictor variables. The test statistic is calculated based
on the slope of the log likelihood
function with respect to the omitted coefficients, evaluated at the estimates from the restricted model (that is, with the coefficients for **math** and **science** held at zero).
This estimated slope, or "score," is the reason the Lagrange multiplier test is sometimes
called the score test. The scores are then used to estimate the improvement in
model fit if additional variables were included in the
model. The test statistic is the expected change in the chi-squared statistic for the
model if a variable or set of variables is added to the model. Because it tests for improvement of model fit if variables that are
currently omitted are added to the model, the Lagrange multiplier test is sometimes also
referred to as a test for omitted variables. Score tests are also sometimes referred to
as modification indices, particularly in the structural equation modeling
literature.
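For readers who want the general form, the score statistic is commonly written as shown below. This is a standard textbook expression, not something taken from the output on this page, and the symbols S, U, I, and θ̃ are notation introduced here for illustration.

$$
S = U(\tilde{\theta})^{\top} \, I(\tilde{\theta})^{-1} \, U(\tilde{\theta}),
\qquad
U(\theta) = \frac{\partial \, ll(\theta)}{\partial \theta}
$$

Here θ̃ is the parameter vector estimated under the null hypothesis (in our example, the coefficients for **math** and **science** fixed at zero and the remaining coefficients at their restricted estimates), U is the slope (score) of the log likelihood, and I is the information matrix. If the restricted estimates already sat at the maximum of the unrestricted log likelihood, the slope would be zero and so would S.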

Below is output for the logistic regression model using the
variables **female** and **read** as predictors of **hiwrite** (this is
the same as model 1 from the lr test).

Logistic regression. Number of obs = 200; LR chi2(2) = 69.94; Prob > chi2 = 0.0000; Log likelihood = -102.44518; Pseudo R2 = 0.2545

| hiwrite | Coef. | Std. Err. | z | P>\|z\| | [95% Conf. | Interval] |
|---|---|---|---|---|---|---|
| female | 1.403022 | .3671964 | 3.82 | 0.000 | .6833301 | 2.122713 |
| read | .1411402 | .0224042 | 6.30 | 0.000 | .0972287 | .1850517 |
| _cons | -7.798179 | 1.235685 | -6.31 | 0.000 | -10.22008 | -5.376281 |

After running the above model, we can look at the results of the Lagrange
multiplier test. The Lagrange multiplier test can be used to test
the expected change in model fit if one or more parameters which are currently
constrained are allowed to be estimated freely. In our example, this means
testing whether adding **math** and **science** to the
model would significantly improve model fit. Below is the output for the score
test. The first two rows in the table give the test statistics (or scores) for
adding either variable alone to the model. To carry on with our example, we will focus on the results in the
third row, labeled
"simultaneous test," which shows the test statistic for adding both
**math** and **science** to our model. This test statistic is 35.51. It is distributed
chi-squared, with degrees of freedom equal to the number of variables being
added to the model, so in our example, 2. The p-value is below the typical
cutoff of 0.05, suggesting that including the variables **math** and
**science** in the model would create a statistically significant
improvement in model fit. This conclusion is consistent with the results of both
the lr and Wald tests.

logit: score tests for omitted variables

| Term | score | df | p |
|---|---|---|---|
| math | 28.94 | 1 | 0.0000 |
| science | 15.39 | 1 | 0.0001 |
| simultaneous test | 35.51 | 2 | 0.0000 |

## A comparison of the three tests

As discussed above, all three tests address the same basic question: does constraining parameters to zero (i.e., leaving out these predictor variables) reduce the fit of the model? The difference between the tests is how they go about answering that question. As you have seen, in order to perform a likelihood ratio test, one must estimate both of the models one wishes to compare. The advantage of the Wald and Lagrange multiplier (or score) tests is that they approximate the lr test, but require that only one model be estimated.

Both the Wald and the Lagrange multiplier tests are asymptotically equivalent to the lr test, that is, as the sample size becomes infinitely large, the values of the Wald and Lagrange multiplier test statistics will become increasingly close to the test statistic from the lr test. In finite samples, the three will tend to generate somewhat different test statistics, but will generally come to the same conclusion. An interesting relationship between the three tests is that, when the model is linear, the three test statistics have the following relationship: Wald ≥ LR ≥ score (Johnston and DiNardo 1997 p. 150). That is, the Wald test statistic will always be at least as large as the LR test statistic, which will, in turn, always be at least as large as the test statistic from the score test.

When computing power was much more limited, and many models took a long time to run, being able to approximate the lr test using a single model was a fairly major advantage. Today, for most of the models researchers are likely to want to compare, computational time is not an issue, and we generally recommend running the likelihood ratio test in most situations. This is not to say that one should never use the Wald or score tests. For example, the Wald test is commonly used to perform multiple degree of freedom tests on sets of dummy variables used to model categorical predictor variables in regression (for more information see our webbooks on Regression with Stata, SPSS, and SAS, specifically Chapter 3 – Regression with Categorical Predictors). The advantage of the score test is that it can be used to search for omitted variables when the number of candidate variables is large.

Figure based on a figure in Fox (1997, p. 570); used with author’s permission.

One way to better understand how the three tests are related, and how they
are different, is to look at a graphical representation of what they are
testing. The figure above illustrates what each of the three tests does.
Along the x-axis (labeled "a") are possible values of the parameter **a** (in our
example, this would be the regression coefficient for either **math** or **science**). Along the y-axis are the values of the log likelihood
corresponding to those values of **a**. The lr test
compares the log likelihood of a model with the parameter **a**
constrained to some value (in our example zero) to that of a model where **a** is freely
estimated. It does this by comparing the heights of the log likelihoods for the two
models to see if the
difference is statistically significant (remember, higher values of the
likelihood indicate better fit). In the figure above, this corresponds
to the vertical distance between the two dotted lines. In contrast, the Wald test compares the parameter
estimate **a-hat** to **a_0**; **a_0** is the value of **a** under
the null hypothesis, which generally states that **a**
= 0. If **a-hat** is significantly different from **a_0**, this suggests that freely
estimating **a** (using **a-hat**) significantly improves model fit. In the figure, this
is shown as the distance between **a_0** and **a-hat** on the x-axis
(highlighted by the solid lines). Finally, the score
test looks at the slope of the log likelihood when **a** is constrained (in our
example to zero). That is, it looks at
how quickly the likelihood is changing at the (null) hypothesized value of **a**. In
the figure above this is shown as the tangent line at **a_0**.
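The three quantities in the figure (vertical distance, horizontal distance, and slope at **a_0**) can be computed side by side for a simple one-parameter model. The sketch below is illustrative only: the data (14 successes in 20 trials) and the null value 0.5 are hypothetical, and the model is a Bernoulli model in which **a** is a probability, not a coefficient from the logistic regression example.

```python
import math

# Hypothetical data for illustration: 14 successes in 20 trials, H0: a = 0.5.
successes, trials = 14, 20
a_0 = 0.5
a_hat = successes / trials  # maximum likelihood estimate of a

def ll(a):
    """Bernoulli log likelihood at success probability a."""
    return successes * math.log(a) + (trials - successes) * math.log(1 - a)

# lr test: vertical distance between the log likelihoods at a_hat and a_0.
lr = 2 * (ll(a_hat) - ll(a_0))

# Wald test: horizontal distance from a_hat to a_0, scaled by the variance
# of the estimate evaluated at a_hat.
wald = (a_hat - a_0) ** 2 / (a_hat * (1 - a_hat) / trials)

# Score test: squared slope of the log likelihood at a_0, scaled by the
# information evaluated at a_0.
slope = successes / a_0 - (trials - successes) / (1 - a_0)
information = trials / (a_0 * (1 - a_0))
score = slope ** 2 / information

print(f"lr = {lr:.3f}, Wald = {wald:.3f}, score = {score:.3f}")
```

All three statistics are on the same chi-squared scale (here with one degree of freedom), yet they take somewhat different values in this small sample, which is the finite-sample behavior described in the comparison above.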

## References

Fox, J. (1997). *Applied regression analysis, linear models, and related methods.* Thousand Oaks, CA: Sage Publications.

Johnston, J. and DiNardo, J. (1997). *Econometric Methods* (4th ed.). New York, NY: The McGraw-Hill Companies, Inc.