This page shows an example of a discriminant analysis in SPSS with footnotes
explaining the output. The data used in this example are from a data file,
https://stats.idre.ucla.edu/wp-content/uploads/2016/02/discrim.sav, with 244 observations on four variables. The variables include
three continuous, numeric variables (**outdoor**, **social** and
**conservative**) and one categorical variable (**job**) with three
levels: 1) customer service, 2) mechanic and 3) dispatcher.

We are interested in the relationship between the three continuous variables and our categorical variable. Specifically, we would like to know how many dimensions we would need to express this relationship. Using this relationship, we can predict a classification based on the continuous variables or assess how well the continuous variables separate the categories in the classification. We will be discussing the degree to which the continuous variables can be used to discriminate between the groups. Some options for visualizing what occurs in discriminant analysis can be found in the Discriminant Analysis Data Analysis Example.

To start, we can examine the overall means of the continuous variables.

get file='C:\temp\discrim.sav'.

descriptives variables=outdoor social conservative /statistics=mean stddev min max .

We are interested in how job relates to outdoor, social and conservative. Let’s look at summary statistics of these three continuous variables for each job category.

means tables=outdoor social conservative by job /cells mean count stddev .

From this output, we can see that some of the means of **outdoor**, **social**
and **conservative** differ noticeably from group to group in **job**.
These differences will hopefully allow us to use these predictors to distinguish
observations in one **job** group from observations in another **job**
group. Next, we can look at the correlations between these three predictors.
These correlations will give us some indication of how much unique information
each predictor will contribute to the analysis. If two predictor variables are
very highly correlated, then they will be contributing shared information to the
analysis. Uncorrelated variables are likely preferable in this respect. We
will also look at the frequency of each job group.

correlations variables=outdoor social conservative .

frequencies variables=job .

The **discriminant** command in SPSS
performs canonical linear discriminant analysis which is the classical form of
discriminant analysis. In this example, we specify in the **groups**
subcommand that we are interested in the variable **job**, and we list
in parentheses the minimum and maximum values seen in **job**. We next list
the discriminating variables, or predictors, in the **variables** subcommand.
In this example, we have selected three predictors: **outdoor**, **social**
and **conservative**. We will be interested in comparing the actual groupings
in **job** to the predicted groupings generated by the discriminant analysis.
For this, we use the **statistics** subcommand. This will provide us with
classification statistics in our output.

discriminant /groups=job(1 3) /variables=outdoor social conservative /statistics=table.

## Data Summary

a. **Analysis Case Processing Summary** – This table summarizes the
analysis dataset in terms of valid and excluded cases. The reasons why
SPSS might exclude an observation from the analysis are listed here, and the
number (“N”) and percent of cases falling into each category (valid or one of
the exclusions) are presented. In this example, all of the observations in
the dataset are valid.

b. **Group Statistics** – This table presents the distribution of
observations into the three groups within **job**. We can see the
number of observations falling into each of the three groups. In this example,
we are using the default weight of 1 for each observation in the dataset, so the
weighted number of observations in each group is equal to the unweighted number
of observations in each group.

## Eigenvalues and Multivariate Tests

c. **Function** – This indicates the first or second canonical linear
discriminant function. The number of functions is the smaller of the number of
discriminating variables and one less than the number of levels in the group
variable. In this example, **job** has three levels and three discriminating
variables were used, so two functions are calculated. Each function acts as a
projection of the data onto a dimension that best separates or discriminates
between the groups.
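
The counting rule can be written as a one-line function. This is a minimal Python sketch; the function name is illustrative and not part of SPSS.

```python
def n_discriminant_functions(n_predictors: int, n_groups: int) -> int:
    """Number of canonical discriminant functions: min(p, g - 1)."""
    return min(n_predictors, n_groups - 1)

# Three predictors (outdoor, social, conservative) and three job groups.
n_functions = n_discriminant_functions(n_predictors=3, n_groups=3)
print(n_functions)  # 2
```

With a single predictor and many groups, the same rule gives one function, because the data can only be projected onto one dimension.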

d. **Eigenvalue** – These are the eigenvalues of the matrix product of the
inverse of the within-group sums-of-squares and cross-product matrix and the
between-groups sums-of-squares and cross-product matrix. These eigenvalues are
related to the canonical correlations and describe how much discriminating
ability a function possesses. The magnitudes of the eigenvalues are indicative of the
functions’ discriminating abilities. See superscript **e** for
underlying calculations.

e. **% of Variance** – This is the proportion of discriminating ability of
the three continuous variables found in a given function. This proportion is
calculated as the proportion of the function’s eigenvalue to the sum of all the
eigenvalues. In this analysis, the first function accounts for 77% of the
discriminating ability of the discriminating variables and the second function
accounts for 23%. We can verify this by noting that the sum of the eigenvalues
is 1.081+.321 = 1.402. Then (1.081/1.402) = 0.771 and (0.321/1.402) = 0.229.

f. **Cumulative %** – This is the cumulative proportion of discriminating
ability. For any analysis, the proportions of discriminating ability will sum to
one (100%). Thus, the last entry in the cumulative column will also be one (100%).
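
The percentage and cumulative columns can be reproduced from the two eigenvalues quoted above. A minimal Python sketch, assuming only those rounded eigenvalues (variable names are illustrative):

```python
eigenvalues = [1.081, 0.321]  # eigenvalues of function 1 and function 2

total = sum(eigenvalues)
pct_of_variance = [100 * ev / total for ev in eigenvalues]

# Running total of the percentages gives the cumulative column.
cumulative_pct = []
running = 0.0
for pct in pct_of_variance:
    running += pct
    cumulative_pct.append(running)

print([round(p, 1) for p in pct_of_variance])  # [77.1, 22.9]
print([round(c, 1) for c in cumulative_pct])   # [77.1, 100.0]
```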

g. **Canonical Correlation** –
These are the canonical correlations of our predictor variables (**outdoor**, **social**
and **conservative**) and the groupings in
**job**. If we consider our discriminating variables to be
one set of variables and the set of dummies generated from our grouping
variable to be another set of variables, we can perform a canonical correlation
analysis on these two sets. From this analysis, we would arrive at these
canonical correlations.
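
Each canonical correlation is tied to its function's eigenvalue by the standard identity r = sqrt(eigenvalue / (1 + eigenvalue)). The Python sketch below applies that identity to the rounded eigenvalues reported in this example, so the results agree with the table only to about three decimals.

```python
import math

eigenvalues = [1.081, 0.321]  # from the Eigenvalues table

# r = sqrt(lambda / (1 + lambda)) for each discriminant function
canonical_correlations = [math.sqrt(ev / (1 + ev)) for ev in eigenvalues]

print([round(r, 3) for r in canonical_correlations])  # [0.721, 0.493]
```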

h. **Test of Function(s)** – These are the functions included in a given
test with the null hypothesis that the canonical correlations associated with
the functions are all equal to zero. In this example, we have two
functions. Thus, the first test presented in this table tests both canonical
correlations (“1 through 2”) and the second test presented tests the second
canonical correlation alone.

i. **Wilks’ Lambda** – Wilks’ Lambda is one of the multivariate statistics calculated by SPSS. It is the
product of the values of $(1 - r_i^2)$, where $r_i$ is the canonical correlation of each function included in the test.
In this example, our canonical correlations are 0.721 and 0.493, so
the Wilks’ Lambda testing both canonical correlations is $(1 - 0.721^2)(1 - 0.493^2) \approx 0.364$
(0.363 if computed from these rounded correlations), and the Wilks’ Lambda testing the second canonical correlation is
$1 - 0.493^2 = 0.757$.
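
The same products can be checked in a few lines of Python. This is a sketch using only the rounded values quoted in this example. The second calculation uses the equivalent form 1 / product(1 + eigenvalue), which shows how rounding the inputs moves the third decimal.

```python
import math

canonical_correlations = [0.721, 0.493]
eigenvalues = [1.081, 0.321]

def wilks_lambda(correlations):
    """Product of (1 - r^2) over the canonical correlations being tested."""
    return math.prod(1 - r**2 for r in correlations)

lambda_1_through_2 = wilks_lambda(canonical_correlations)   # both functions
lambda_2 = wilks_lambda(canonical_correlations[1:])         # second function only

# Equivalent form from the eigenvalues: 1 / prod(1 + eigenvalue)
lambda_from_eigenvalues = 1 / math.prod(1 + ev for ev in eigenvalues)

print(round(lambda_1_through_2, 3))       # 0.363 (rounded correlations)
print(round(lambda_from_eigenvalues, 3))  # 0.364 (rounded eigenvalues)
print(round(lambda_2, 3))                 # 0.757
```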

j. **Chi-square** – This is the Chi-square statistic testing that the
canonical correlation of the given function is equal to zero. In other words,
the null hypothesis is that the function, and all functions that follow, have no
discriminating ability. This hypothesis is tested using this Chi-square
statistic.

k. **df** – This is the effect degrees of freedom for the given function.
It is based on the number of groups present in the categorical variable and the
number of continuous discriminant variables. The Chi-square statistic is
compared to a Chi-square distribution with the degrees of freedom stated here.

l. **Sig.** – This is the p-value
associated with the Chi-square statistic of a given test. The null
hypothesis that a given function’s canonical correlation and all smaller
canonical correlations are equal to zero is evaluated with regard to this
p-value. For a given alpha level, such as 0.05, if the p-value is less
than alpha, the null hypothesis is rejected. If not, then we fail to reject the
null hypothesis.
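
The chi-square, df and p-value columns fit together as sketched below. The sketch makes two assumptions that this page does not confirm: that the chi-square statistic comes from Bartlett's approximation, -(N - 1 - (p + g)/2) ln(Lambda), and that the degrees of freedom are (p - k)(g - k - 1) when the first k functions are excluded from the test. The Wilks' Lambda inputs are the rounded values from this example, so the statistics are approximate. All function names are illustrative.

```python
import math

def bartlett_chi_square(wilks_lambda, n_cases, n_predictors, n_groups):
    """Bartlett's chi-square approximation for a Wilks' Lambda (assumed formula)."""
    multiplier = n_cases - 1 - (n_predictors + n_groups) / 2
    return -multiplier * math.log(wilks_lambda)

def degrees_of_freedom(n_predictors, n_groups, n_removed):
    """df when the first `n_removed` functions are excluded from the test."""
    return (n_predictors - n_removed) * (n_groups - n_removed - 1)

def chi_square_upper_tail(x, df):
    """P(X > x) for a chi-square variable; this closed form holds for even df only."""
    if df % 2 != 0:
        raise ValueError("closed form used here requires even df")
    half = x / 2
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))

N_CASES, N_PREDICTORS, N_GROUPS = 244, 3, 3
tests = {"1 through 2": (0.364, 0), "2": (0.757, 1)}  # (Wilks' Lambda, functions removed)

results = {}
for label, (lam, removed) in tests.items():
    chi2 = bartlett_chi_square(lam, N_CASES, N_PREDICTORS, N_GROUPS)
    df = degrees_of_freedom(N_PREDICTORS, N_GROUPS, removed)
    p = chi_square_upper_tail(chi2, df)
    results[label] = (chi2, df, p)
    print(f"{label}: chi-square={chi2:.2f}, df={df}, reject at 0.05: {p < 0.05}")
```

Under these assumptions the two tests have 6 and 2 degrees of freedom, and both p-values fall far below 0.05.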

## Discriminant Function Output

m. **Standardized Canonical Discriminant Function Coefficients** – These
coefficients can be used to calculate the discriminant score for a given
case. The score is calculated in the same manner as a predicted value from a
linear regression, using the standardized coefficients and the standardized
variables. For example, let **zoutdoor, zsocial **and** zconservative**
be the variables created by standardizing our discriminating variables. Then,
for each case, the function scores would be calculated using the following
equations:

$$\text{Score1} = 0.379 \cdot \text{zoutdoor} - 0.831 \cdot \text{zsocial} + 0.517 \cdot \text{zconservative}$$

$$\text{Score2} = 0.926 \cdot \text{zoutdoor} + 0.213 \cdot \text{zsocial} - 0.291 \cdot \text{zconservative}$$

The distribution of the scores from each function is standardized to have a
mean of zero and standard deviation of one. The magnitudes of these
coefficients indicate how strongly the discriminating variables affect the
score. For example, we can see that the standardized coefficient for **zsocial**
in the first function is greater in magnitude than the coefficients for the
other two variables. Thus, **social** will have the greatest impact of the
three on the first discriminant score.

n. **Structure Matrix** – This is the canonical structure, also known as
canonical loading or discriminant loading, of the discriminant functions. It
represents the correlations between the observed variables (the three continuous
discriminating variables) and the unobserved discriminant functions
(dimensions).

o. **Functions at Group Centroids** – These are the means of the
discriminant function scores by group for each function calculated. If we
calculated the scores of the first function for each case in our dataset, and
then looked at the means of the scores by group, we would find that the
**customer service** group has a mean of -1.219, the **mechanic** group has a
mean of 0.107, and the **dispatch** group has a mean of 1.420. We know that
the function scores have a mean of zero, and we can check this by looking at the
sum of the group means multiplied by the number of cases in each group:
(85 × -1.219) + (93 × 0.107) + (66 × 1.420) = 0.056, which is zero up to rounding of the centroids.
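
The weighted-sum check is easy to reproduce. This Python sketch uses only the group sizes and first-function centroids quoted above; the small nonzero remainder comes from the centroids being rounded to three decimals.

```python
group_sizes = {"customer service": 85, "mechanic": 93, "dispatch": 66}
centroids_f1 = {"customer service": -1.219, "mechanic": 0.107, "dispatch": 1.420}

# Sum of (group size x group mean) equals N times the overall mean score.
weighted_sum = sum(group_sizes[g] * centroids_f1[g] for g in group_sizes)
overall_mean = weighted_sum / sum(group_sizes.values())

print(round(weighted_sum, 3))  # 0.056
print(round(overall_mean, 4))  # 0.0002
```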

## Predicted Classifications

p. **Classification Processing Summary** – This is similar to the Analysis
Case Processing Summary (see superscript **a**), but in this table,
“Processed” cases are those that were successfully classified based on the
analysis. The reasons why an observation may not have been processed are listed
here. We can see that in this example, all of the observations in the
dataset were successfully classified.

q. **Prior Probabilities for Groups** – This is the distribution of
observations into the **job** groups used as a starting point in the
analysis. The default prior distribution is an equal allocation into the
groups, as seen in this example. SPSS allows users to specify different
priors with the **priors** subcommand.
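
To make the contrast concrete, the Python sketch below computes the equal priors used in this example alongside one common alternative, priors proportional to the observed group sizes. The alternative is shown for illustration only and was not used in this analysis.

```python
group_sizes = {"customer service": 85, "mechanic": 93, "dispatch": 66}

# Default in this example: every group gets the same prior probability.
equal_priors = {g: 1 / len(group_sizes) for g in group_sizes}

# Illustrative alternative: priors proportional to group size.
total = sum(group_sizes.values())
size_priors = {g: n / total for g, n in group_sizes.items()}

print({g: round(p, 3) for g, p in equal_priors.items()})
# {'customer service': 0.333, 'mechanic': 0.333, 'dispatch': 0.333}
print({g: round(p, 3) for g, p in size_priors.items()})
# {'customer service': 0.348, 'mechanic': 0.381, 'dispatch': 0.27}
```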

r. **Predicted Group Membership** – These are the predicted frequencies of
groups from the analysis. The numbers going down each column indicate how many
were correctly and incorrectly classified. For example, of the 89 cases that
were predicted to be in the **customer service** group, 70 were correctly
predicted, and 19 were incorrectly predicted (16 cases were in the **mechanic**
group and three cases were in the **dispatch** group).

s. **Original** – These are the frequencies of groups found in the data.
We can see from the row totals that 85 cases fall into the **customer service**
group, 93 fall into the **mechanic** group, and 66 fall into the **dispatch**
group. These match the results we saw earlier in the output for
the **frequencies** command. Across each row, we see how many of the
cases in the group are classified by our analysis into each of the different groups.
For example, of the 85 cases that are in the **customer service** group, 70
were predicted correctly and 15 were predicted incorrectly (11 were predicted to
be in the **mechanic** group and four were predicted to be in the **dispatch**
group).

t. **Count** – This portion of the table presents the number of
observations falling into the given intersection of original and predicted group
membership. For example, we can see in this portion of the table that the
number of observations originally in the **customer service** group, but
predicted to fall into the **mechanic** group is 11. The row totals of these
counts are presented, but column totals are not.

u. **%** – This portion of the table presents the percent of observations
originally in a given group (listed in the rows) predicted to be in a given
group (listed in the columns). For example, we can see that the percent of
observations in the **mechanic** group that were predicted to be in the
**dispatch** group is 16.1%. This is NOT the same as the percent of observations
predicted to be in the **dispatch** group that were in the **mechanic**
group. The latter is not presented in this table.
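
The difference between row percentages and column percentages can be illustrated with the **customer service** counts quoted earlier. This Python sketch uses only those counts; cells of the table that the text does not quote are left out rather than guessed.

```python
# Row: cases ORIGINALLY in customer service, split by predicted group (total 85).
original_cs_by_predicted = {"customer service": 70, "mechanic": 11, "dispatch": 4}

# Column: cases PREDICTED as customer service, split by original group (total 89).
predicted_cs_by_original = {"customer service": 70, "mechanic": 16, "dispatch": 3}

row_total = sum(original_cs_by_predicted.values())
col_total = sum(predicted_cs_by_original.values())

# Row percentages are what the "%" portion of the table reports.
row_pct = {g: round(100 * n / row_total, 1) for g, n in original_cs_by_predicted.items()}
# Column percentages are NOT in the table; they answer a different question.
col_pct = {g: round(100 * n / col_total, 1) for g, n in predicted_cs_by_original.items()}

print(row_pct)  # {'customer service': 82.4, 'mechanic': 12.9, 'dispatch': 4.7}
print(col_pct)  # {'customer service': 78.7, 'mechanic': 18.0, 'dispatch': 3.4}
```

So 12.9% of customer service cases were predicted to be mechanics, while 18.0% of the cases predicted to be customer service were actually mechanics.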

### Appendix

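The following code can be used to calculate the scores manually. Note that the Score2 statement in this code uses signs for Zsocial and Zconservative that are the reverse of those in the Score2 equation under superscript **m**; the listing and summary statistics that follow reflect the code as written.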

DESCRIPTIVES VARIABLES=outdoor social conservative /SAVE /STATISTICS=MEAN STDDEV MIN MAX.

COMPUTE Score1 = 0.379*Zoutdoor - 0.831*Zsocial + 0.517*Zconservative.

COMPUTE Score2 = 0.926*Zoutdoor - 0.213*Zsocial + 0.291*Zconservative.

Let’s take a look at the first two observations of the newly created scores:

LIST VARIABLES=Zoutdoor Zsocial Zconservative Score1 Score2 /CASES=FROM 1 TO 2.

| Zoutdoor | Zsocial | Zconservative | Score1 | Score2 |
|---|---|---|---|---|
| -1.16517 | .24160 | -1.49999 | -1.42 | -1.57 |
| -.33871 | -.67094 | -1.23167 | -.21 | -.53 |

Number of cases read: 2 Number of cases listed: 2
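
The two listed cases can be checked by hand. This Python sketch applies the coefficients exactly as they appear in the COMPUTE statements of this appendix to the standardized values in the listing. The function names are illustrative.

```python
def score1(z_outdoor, z_social, z_conservative):
    """First score, with coefficients as written in the appendix COMPUTE statement."""
    return 0.379 * z_outdoor - 0.831 * z_social + 0.517 * z_conservative

def score2(z_outdoor, z_social, z_conservative):
    """Second score, with coefficients as written in the appendix COMPUTE statement."""
    return 0.926 * z_outdoor - 0.213 * z_social + 0.291 * z_conservative

# (Zoutdoor, Zsocial, Zconservative) for the first two observations in the listing.
cases = [
    (-1.16517, 0.24160, -1.49999),
    (-0.33871, -0.67094, -1.23167),
]

scores = [(round(score1(*z), 2), round(score2(*z), 2)) for z in cases]
for s1, s2 in scores:
    print(s1, s2)
# -1.42 -1.57
# -0.21 -0.53
```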

Verify that the mean of the scores is zero and the standard deviation is roughly 1.

DESCRIPTIVES VARIABLES=Score1 Score2 /STATISTICS=MEAN STDDEV MIN MAX.

| | N | Minimum | Maximum | Mean | Std. Deviation |
|---|---|---|---|---|---|
| Score1 | 244 | -3.20 | 3.31 | .0000 | 1.17481 |
| Score2 | 244 | -3.52 | 2.55 | .0000 | 1.04292 |
| Valid N (listwise) | 244 | | | | |