**Chapter Outline**
5.1 Simple Coding

5.2 Forward Difference Coding

5.3 Backward Difference Coding

5.4 Helmert Coding

5.5 Reverse Helmert Coding

5.6 Deviation Coding

5.7 Orthogonal Polynomial Coding

5.8 User-Defined Coding

5.9 Summary

**Please note:** This page makes use of the program **xi3**, which is no longer being maintained and has been removed from our
archives. References to **xi3** will be left on this page because they illustrate specific principles of coding categorical
variables.

**5.0 Introduction**

Categorical variables require special attention in regression analysis because,
unlike dichotomous or continuous variables, they cannot be entered into the
regression equation just as they are. For example, if you have a
variable called **race** that is coded 1 = Hispanic, 2 = Asian, 3 = Black, 4 =
White,
then entering **race** in your regression will look at the linear
effect of race, which is probably not what you intended. Instead, categorical variables like this need to be
recoded into a series of variables which can then be
entered into the regression model. There are a variety of coding systems that can be used when
coding categorical
variables. Ideally, you would choose a
coding system that reflects the comparisons that you want to make. In Chapter
3 of the Regression with
Stata Web Book
we covered the use of categorical variables in regression analysis focusing on
the use of dummy variables, but that is not the only coding scheme that you can
use. For example,
you may want to compare each level to the next higher
level, in which case you would want to use "forward difference" coding, or you
might want to compare each level to the mean of the subsequent levels of the
variable, in which case you would want to use "Helmert" coding. By
deliberately choosing a coding system, you can obtain comparisons that are most
meaningful for testing your hypotheses. Regardless of the coding system you choose, the
test of the overall effect
of the categorical variable (i.e., the overall effect of **race**) will remain the same.
Below is a table listing various types of contrasts and the
comparison that they make.

| Name of contrast | Comparison made |
|---|---|
| Simple Coding | Compares each level of a variable to the reference level |
| Forward Difference Coding | Compares adjacent levels of a variable (each level minus the next level) |
| Backward Difference Coding | Compares adjacent levels of a variable (each level minus the prior level) |
| Helmert Coding | Compares each level of a variable with the mean of the subsequent levels of the variable |
| Reverse Helmert Coding | Compares each level of a variable with the mean of the previous levels of the variable |
| Deviation Coding | Compares deviations from the grand mean |
| Orthogonal Polynomial Coding | Orthogonal polynomial contrasts |
| User-Defined Coding | User-defined contrast |

There are a couple of notes to be made about the coding systems listed
above. The first is that they represent planned comparisons and not post
hoc comparisons. In other words, they are comparisons that you plan to do
before you begin analyzing your data, not comparisons that you think of once you have seen
the results of preliminary analyses. Also, some forms of coding
make more sense with ordinal categorical variables than with nominal categorical
variables. Below we will show examples using **race**, which is a nominal
categorical variable. Because simple coding compares the mean of the
dependent variable for each level of the categorical variable to the mean of the
dependent variable for the reference level, it makes sense with a nominal
variable. However, it may not make as much sense to use a coding scheme that tests the linear
effect of **race**. As we describe each type of coding system, we note
when it makes less sense to use it with a nominal
variable. Also, you may notice that we follow several rules when
creating the contrast coding schemes. For more information about these
rules, please see the section on User-Defined Coding.

This page will illustrate
two ways that you can conduct analyses using
these coding schemes: 1) using the **xi3** command (an extended version of
the **xi** command) and 2) manually coding the variables and entering them using
the **regress** command.
When using **regress** to do contrasts, you first need to create k-1 new variables (where k is the number of
levels of the categorical variable) and use
these new variables as predictors in your regression model.
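Two points above are worth making concrete: the k-1 coded variables enter the model as ordinary predictors, and the overall fit does not depend on which coding scheme you choose. Here is a small illustration in Python rather than Stata (ours, not part of the original chapter; the toy data are made up) that fits the same response under two different codings and checks that the residual sum of squares is identical even though the coefficients differ:

```python
# Sketch: different full-rank coding schemes give different coefficients
# but an identical overall model fit (same fitted values, same RSS, same F test).
def ols_fit(X, y):
    """Least squares via normal equations with Gaussian elimination."""
    n, p = len(X), len(X[0])
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    c = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    for col in range(p):                       # forward elimination with partial pivoting
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        c[col], c[piv] = c[piv], c[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for k in range(col, p):
                A[r][k] -= f * A[col][k]
            c[r] -= f * c[col]
    b = [0.0] * p
    for r in range(p - 1, -1, -1):             # back substitution
        b[r] = (c[r] - sum(A[r][k] * b[k] for k in range(r + 1, p))) / A[r][r]
    return b

def rss(X, y, b):
    return sum((yi - sum(xij * bj for xij, bj in zip(xi, b))) ** 2
               for xi, yi in zip(X, y))

# toy data: a group label per observation and a made-up response
groups = [1, 1, 2, 2, 3, 3, 4, 4, 4]
y      = [45, 48, 57, 59, 47, 49, 53, 55, 54]
simple  = {1: [-1/4, -1/4, -1/4], 2: [3/4, -1/4, -1/4],
           3: [-1/4, 3/4, -1/4], 4: [-1/4, -1/4, 3/4]}
helmert = {1: [3/4, 0, 0], 2: [-1/4, 2/3, 0],
           3: [-1/4, -1/3, 1/2], 4: [-1/4, -1/3, -1/2]}
Xs = [[1.0] + simple[g] for g in groups]       # intercept + k-1 simple codes
Xh = [[1.0] + helmert[g] for g in groups]      # intercept + k-1 Helmert codes
bs, bh = ols_fit(Xs, y), ols_fit(Xh, y)
assert bs[1:] != bh[1:]                               # the contrasts differ ...
assert abs(rss(Xs, y, bs) - rss(Xh, y, bh)) < 1e-8    # ... but the fit is identical
```

Because each coding spans the same column space as the usual dummy variables, the fitted values (and hence the overall test of the categorical variable) are the same under every scheme.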

## The Example Data File

The examples in this page use a dataset called hsb2.dta, which you can load from within Stata like this:

```
use http://www.ats.ucla.edu/stat/stata/notes/hsb2
```

Within this data file, we will focus on the categorical variable **race**, which has four levels (1 =
Hispanic, 2 = Asian, 3 = African American and 4 = white) and we will use **write**
as our dependent variable. Although our
example uses a variable with four levels, these coding systems work with
variables that have more or fewer categories. No matter which coding system you select, you will always have one fewer recoded variable
than levels of the original variable. In our example, our categorical
variable has four levels, so we will have three new variables (a variable corresponding to the final level of the categorical
variable would be redundant and therefore unnecessary).

Before considering any analyses, let’s look at the mean of the dependent
variable, **write**, for each level of **race**. This will help in interpreting
the output from later analyses.

```
tabulate race, summarize(write)

            |   Summary of writing score
       race |        Mean   Std. Dev.       Freq.
------------+------------------------------------
   hispanic |   46.458333   8.2724223          24
      asian |          58   7.8993671          11
  african-a |        48.2   9.3222992          20
      white |   54.055172   9.1725582         145
------------+------------------------------------
      Total |      52.775    9.478586         200
```

## 5.1 Simple Coding

The results of simple coding are very similar to dummy coding in that each level is compared to the reference level. In the example below, level 1 is the reference level and the first comparison compares level 2 to level 1, the second comparison compares level 3 to level 1, and the third comparison compares level 4 to level 1.

**Method 1: Using xi3**

When using **xi3**, we can refer to **g.race** to indicate that we wish
to code race using simple coding comparing each group to a reference group, as shown in the example below.

```
xi3: regress write g.race

s.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_2 |   11.54167   3.286129     3.51   0.001     5.060956    18.02238
    _Irace_3 |   1.741667   2.732488     0.64   0.525    -3.647186    7.130519
    _Irace_4 |   7.596839    1.98887     3.82   0.000     3.674507    11.51917
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------
```

The coefficient for **_Irace_2** compares the mean of the dependent
variable, **write**, for levels 2 and 1, yielding 58 - 46.458 = 11.54, and
is statistically significant (p = 0.001). The coefficient for **_Irace_3**
compares the mean of the dependent variable, **write**, for levels 3 and 1,
yielding 48.2 - 46.46 = 1.74, and this is not statistically significant. Finally,
the coefficient for **_Irace_4** compares the mean of the dependent variable,
**write**, for levels 4 and 1, yielding 7.59, and that is statistically significant.

**Method 2: Manual Coding**

If we wished, we could manually code **race** instead of allowing **xi3**
to do the coding for us. Below we see the coding that replicates the
results we saw in the example above. In the coding below, level 1 is the reference level:
**x1** compares level 2 to level 1, **x2** compares level 3 to level 1, and
**x3** compares level 4 to level 1. For **x1** the coding is
3/4 for level 2 and -1/4 for all other levels. Likewise, for
**x2** the coding is 3/4 for level 3 and -1/4 for all other levels, and for
**x3** the coding is 3/4 for level 4 and -1/4
for all other levels. It is not intuitive that this regression coding
scheme yields these comparisons; however, if you desire simple comparisons, you
can follow this general rule to obtain them.

SIMPLE regression coding

| Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
|---|---|---|---|
| 1 (Hispanic) | -1/4 | -1/4 | -1/4 |
| 2 (Asian) | 3/4 | -1/4 | -1/4 |
| 3 (African American) | -1/4 | 3/4 | -1/4 |
| 4 (white) | -1/4 | -1/4 | 3/4 |

Below we show the more general rule for creating this kind of coding scheme using regression coding, where k is the number of levels of the categorical variable (in this instance, k = 4).

SIMPLE regression coding

| Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
|---|---|---|---|
| 1 (Hispanic) | -1/k | -1/k | -1/k |
| 2 (Asian) | (k-1)/k | -1/k | -1/k |
| 3 (African American) | -1/k | (k-1)/k | -1/k |
| 4 (white) | -1/k | -1/k | (k-1)/k |

Below we illustrate how to create **x1**, **x2** and **x3** and enter
these new variables into the regression model using the **regress**
command.

```
generate x1 = -1/4
replace x1 = 3/4 if race==2
generate x2 = -1/4
replace x2 = 3/4 if race==3
generate x3 = -1/4
replace x3 = 3/4 if race==4
regress write x1 x2 x3
```

As you can see, the results below match those when we used the **xi3**
command above.

```
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   11.54167   3.286129     3.51   0.001     5.060956    18.02238
          x2 |   1.741667   2.732488     0.64   0.525    -3.647186    7.130519
          x3 |   7.596839    1.98887     3.82   0.000     3.674507    11.51917
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------
```
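As a cross-check on the arithmetic, note that an intercept plus k-1 coded variables is a saturated model, so the fitted value for each group is exactly that group's mean. The reported coefficients can therefore be reproduced from the cell means alone, as this short Python sketch (ours, not part of the original chapter) verifies:

```python
# Verify that the simple-coding scheme maps the cell means of write onto the
# reported coefficients: b0 + codes . b must equal the cell mean at every level.
means = [46.458333, 58.0, 48.2, 54.055172]   # hispanic, asian, african american, white
codes = [[-1/4, -1/4, -1/4],                 # simple coding from the table above
         [ 3/4, -1/4, -1/4],
         [-1/4,  3/4, -1/4],
         [-1/4, -1/4,  3/4]]
b0 = sum(means) / 4                          # _cons: the mean of the four cell means
b = [m - means[0] for m in means[1:]]        # x1, x2, x3: each level minus level 1
for row, m in zip(codes, means):
    fitted = b0 + sum(c * bj for c, bj in zip(row, b))
    assert abs(fitted - m) < 1e-9            # the coding reproduces every cell mean
```

Here `b0` comes out to 51.678 and `b` to (11.542, 1.742, 7.597), matching `_cons`, `x1`, `x2` and `x3` in the output above.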

## 5.2 Forward Difference Coding

In this coding system, the mean of the dependent variable for one level
of the categorical variable is compared to the mean of the dependent variable
for the next (adjacent) level. In our example below, the first comparison
compares the mean of **write** for level 1 with the mean of **write** for level 2 of
**race** (Hispanics minus Asians). The second comparison compares the mean of
**write** for level 2 minus level 3, and the third comparison compares the mean of
**write** for level 3 minus level 4. This type of
coding may be useful with either a nominal or an ordinal
variable.

**Method 1: Using xi3**

We can indicate that we want forward adjacent difference coding for race by specifying
**a.race** as shown below.

```
xi3: regress write a.race

f.race            _Irace_1-4          (naturally coded; _Irace_4 omitted)

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_1 |  -11.54167   3.286129    -3.51   0.001    -18.02238   -5.060956
    _Irace_2 |        9.8   3.387834     2.89   0.004     3.118714    16.48129
    _Irace_3 |  -5.855172    2.15276    -2.72   0.007    -10.10072   -1.609626
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------
```

With this coding system, adjacent levels of the categorical variable are
compared. Hence, the mean of the dependent variable at level 1 is compared
to the mean of the dependent variable at level 2: 46.4583 - 58 = -11.542,
which is statistically significant. For the comparison between levels 2
and 3, the calculation of the contrast coefficient would be 58 - 48.2 = 9.8,
which is also statistically significant. Finally, comparing levels 3 and
4, 48.2 - 54.0552 = -5.855, a statistically significant difference. One
would conclude from this that each adjacent level of ** race** is statistically
significantly different.

**Method 2: Manual Coding**

For the first
comparison, where the first and second levels are compared, **x1** is coded
3/4 for level 1 and the other levels are coded -1/4. For the second comparison, where level
2 is compared with level 3, **x2** is coded 1/2 1/2 -1/2 -1/2, and for the
third comparison, where level 3 is compared with level 4, **x3** is
coded 1/4 1/4 1/4 -3/4.

FORWARD DIFFERENCE regression coding

| Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
|---|---|---|---|
| Comparison | Level 1 v. Level 2 | Level 2 v. Level 3 | Level 3 v. Level 4 |
| 1 (Hispanic) | 3/4 | 1/2 | 1/4 |
| 2 (Asian) | -1/4 | 1/2 | 1/4 |
| 3 (African American) | -1/4 | -1/2 | 1/4 |
| 4 (white) | -1/4 | -1/2 | -3/4 |

The general rule for this regression coding scheme is shown below, where k is the number of levels of the categorical variable (in this case k = 4).

FORWARD DIFFERENCE regression coding

| Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
|---|---|---|---|
| Comparison | Level 1 v. Level 2 | Level 2 v. Level 3 | Level 3 v. Level 4 |
| 1 (Hispanic) | (k-1)/k | (k-2)/k | (k-3)/k |
| 2 (Asian) | -1/k | (k-2)/k | (k-3)/k |
| 3 (African American) | -1/k | -2/k | (k-3)/k |
| 4 (white) | -1/k | -2/k | -3/k |

```
generate x1 = 3/4 if race==1
replace x1 = -1/4 if inlist(race,2,3,4)
generate x2 = 1/2 if inlist(race,1,2)
replace x2 = -1/2 if inlist(race,3,4)
generate x3 = 1/4 if inlist(race,1,2,3)
replace x3 = -3/4 if race==4
regress write x1 x2 x3
```

```
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |  -11.54167   3.286129    -3.51   0.001    -18.02238   -5.060956
          x2 |        9.8   3.387834     2.89   0.004     3.118714    16.48129
          x3 |  -5.855172    2.15276    -2.72   0.007    -10.10072   -1.609626
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------
```

You can see that the regression coefficient for **x1** is the mean of **write** for level 1 (Hispanic) minus the mean of **write**
for level 2 (Asian). Likewise, the
regression coefficient for **x2** is the mean of **write** for level 2 (Asian) minus the mean of **write**
for level 3 (African American), and the
regression coefficient for **x3** is the mean of **write** for level 3 (African American) minus the mean
of **write** for level 4 (white).

## 5.3 Backward Difference Coding

In this coding system, the mean of the dependent variable for one level
of the categorical variable is compared to the mean of the dependent variable
for the prior adjacent level. In our example below, the first comparison
compares the mean of **write** for level 2 with the mean of **write** for level 1 of
**race** (Asians minus Hispanics). The second comparison compares the mean of
**write** for level 3 minus level 2, and the third comparison compares the mean of
**write** for level 4 minus level 3. This type of
coding may be useful with either a nominal or an ordinal
variable.

**Method 1: Using xi3**

We can indicate that we want backward difference coding for race by
specifying **b.race** as shown below.

```
xi3: regress write b.race

b.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_2 |   11.54167   3.286129     3.51   0.001     5.060956    18.02238
    _Irace_3 |       -9.8   3.387834    -2.89   0.004    -16.48129   -3.118714
    _Irace_4 |   5.855172    2.15276     2.72   0.007     1.609626    10.10072
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------
```

With this coding system, adjacent levels of the categorical variable are
compared, with each level compared to the prior level. Hence, the mean of the dependent variable at level
2 is compared
to the mean of the dependent variable at level 1: 58-46.4583 = 11.542,
which is statistically significant. For the comparison between levels 3
and 2, we calculate 48.2 - 58 = -9.8,
which is also statistically significant. Finally, comparing levels 4 and
3, 54.0552 - 48.2 = 5.855, a statistically significant difference. One
would conclude from this that each adjacent level of ** race** is statistically
significantly different.

**Method 2: Manual Coding**

For the first
comparison, where the second level is compared with the first, **x1** is coded -3/4
for level 1 while the other levels are coded 1/4. For the second comparison, where level
3 is compared with level 2, **x2** is coded -1/2 -1/2 1/2 1/2, and for the
third comparison, where level 4 is compared with level 3, **x3** is
coded -1/4 -1/4 -1/4 3/4.

BACKWARD DIFFERENCE regression coding

| Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
|---|---|---|---|
| Comparison | Level 2 v. Level 1 | Level 3 v. Level 2 | Level 4 v. Level 3 |
| 1 (Hispanic) | -3/4 | -1/2 | -1/4 |
| 2 (Asian) | 1/4 | -1/2 | -1/4 |
| 3 (African American) | 1/4 | 1/2 | -1/4 |
| 4 (white) | 1/4 | 1/2 | 3/4 |

The general rule for this regression coding scheme is shown below, where k is the number of levels of the categorical variable (in this case, k = 4).

BACKWARD DIFFERENCE regression coding

| Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
|---|---|---|---|
| Comparison | Level 2 v. Level 1 | Level 3 v. Level 2 | Level 4 v. Level 3 |
| 1 (Hispanic) | -(k-1)/k | -(k-2)/k | -(k-3)/k |
| 2 (Asian) | 1/k | -(k-2)/k | -(k-3)/k |
| 3 (African American) | 1/k | 2/k | -(k-3)/k |
| 4 (white) | 1/k | 2/k | 3/k |

```
generate x1 = -3/4 if race==1
replace x1 = 1/4 if inlist(race,2,3,4)
generate x2 = -1/2 if inlist(race,1,2)
replace x2 = 1/2 if inlist(race,3,4)
generate x3 = -1/4 if inlist(race,1,2,3)
replace x3 = 3/4 if race==4
regress write x1 x2 x3
```

```
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   11.54167   3.286129     3.51   0.001     5.060956    18.02238
          x2 |       -9.8   3.387834    -2.89   0.004    -16.48129   -3.118714
          x3 |   5.855172    2.15276     2.72   0.007     1.609626    10.10072
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------
```

In the above example, the
regression coefficient for **x1** is the mean of **write** for level
2 minus the mean of **write**
for level 1 (58 - 46.4583 = 11.542). Likewise, the
regression coefficient for **x2** is the mean of **write** for level 3
minus the mean of **write**
for level 2, and the
regression coefficient for **x3** is the mean of **write** for level 4
minus the mean
of **write** for level 3.

## 5.4 Helmert Coding

Helmert coding compares each level of a categorical variable to the mean of the subsequent levels.
Hence, the first
contrast compares the mean of
the dependent variable for level 1 of **race** with the mean of all of the subsequent levels of
**race** (levels 2, 3, and 4), the second contrast compares the mean of
the dependent variable for level 2 of **race** with the mean of all of the subsequent levels of
**race** (levels 3 and 4), and the third contrast compares the mean of
the dependent variable for level 3 of **race** with the mean of all of the subsequent levels of
**race** (level 4). While this type of coding system does not make much sense
with a nominal variable like **race**, it is useful in
situations where the levels of the categorical variable are ordered, say, from
lowest to highest or smallest to largest.

**Method 1: Using xi3**

We can specify Helmert coding for **race** using **h.race** as shown
below.

```
xi3: regress write h.race

h.race            _Irace_1-4          (naturally coded; _Irace_4 omitted)

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_1 |  -6.960057   2.175211    -3.20   0.002    -11.24988   -2.670234
    _Irace_2 |   6.872414   2.926325     2.35   0.020     1.101287    12.64354
    _Irace_3 |  -5.855172    2.15276    -2.72   0.007    -10.10072   -1.609626
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------
```

The regression coefficient for the comparison between level 1 and the remaining
levels is calculated by taking the mean of the dependent variable for level 1
and subtracting the
mean of the dependent variable for levels 2, 3 and 4: 46.4583 - [(58 + 48.2 + 54.0552) / 3] =
-6.960, which is statistically significant. This means that the mean of **
write** for level 1 of ** race** is statistically significantly different from the mean
of ** write** for levels 2 through 4. As noted above, this comparison probably
is not meaningful because the variable ** race** is nominal. This type of
comparison would be more meaningful if the categorical variable was
ordinal.

To calculate the contrast coefficient for the comparison between level 2 and the later levels, you subtract the mean of the dependent variable for levels 3 and 4 from the mean of the dependent variable for level 2: 58 - [(48.2 + 54.0552) / 2] = 6.872, which is statistically significant. The regression coefficient for the comparison between level 3 and level 4 is the difference between the mean of the dependent variable for the two levels: 48.2 - 54.0552 = -5.855, which is also statistically significant.

**Method 2: Manual Coding**

Below we see an example of Helmert regression coding. For the first comparison (comparing level 1 with levels 2, 3 and 4) the codes are 3/4 and -1/4 -1/4 -1/4. The second comparison compares level 2 with levels 3 and 4 and is coded 0 2/3 -1/3 -1/3. The third comparison compares level 3 to level 4 and is coded 0 0 1/2 -1/2.

HELMERT regression coding

| Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
|---|---|---|---|
| Comparison | Level 1 v. Later | Level 2 v. Later | Level 3 v. Later |
| 1 (Hispanic) | 3/4 | 0 | 0 |
| 2 (Asian) | -1/4 | 2/3 | 0 |
| 3 (African American) | -1/4 | -1/3 | 1/2 |
| 4 (white) | -1/4 | -1/3 | -1/2 |

Below we illustrate how to create **x1**, **x2** and **x3** and enter
these new variables into the regression model using the **regress**
command.

```
generate x1 = 3/4 if race==1
replace x1 = -1/4 if inlist(race,2,3,4)
generate x2 = 0 if race==1
replace x2 = 2/3 if race==2
replace x2 = -1/3 if inlist(race,3,4)
generate x3 = 0 if inlist(race,1,2)
replace x3 = 1/2 if race==3
replace x3 = -1/2 if race==4
regress write x1 x2 x3
```

```
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |  -6.960057   2.175211    -3.20   0.002    -11.24988   -2.670234
          x2 |   6.872414   2.926325     2.35   0.020     1.101287    12.64354
          x3 |  -5.855172    2.15276    -2.72   0.007    -10.10072   -1.609626
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------
```

As you see above, the regression coefficient for **x1** is the mean of **write**
for level 1 (Hispanic) minus the mean of **write** for all subsequent levels (levels 2, 3 and 4).
Likewise, the regression coefficient for **x2** is the mean of **write**
for level 2 minus the mean of **write**
for levels 3 and 4. Finally, the regression coefficient for **x3**
is the mean of **write** for level 3 minus the mean of **write**
for level 4.

## 5.5 Reverse Helmert Coding

Reverse Helmert coding (also known as difference coding) is just the opposite of Helmert coding: instead of
comparing each level of a categorical variable to the mean of the subsequent
level(s),
each level is compared to the mean of the previous level(s). In our example, the first contrast codes the comparison of the mean of the
dependent variable for level 2 of **race** to the mean of the dependent variable for
level 1 of **race**. The second comparison compares the mean of the
dependent variable for level 3 of **race** with levels 1 and 2 of **race**, and the third comparison compares the
mean of the dependent variable for level 4 of **race** with levels 1, 2 and 3. Clearly, this coding system does not make much sense with our
example of **race** because it is a nominal variable. However, this system is
useful when the levels of the categorical variable are ordered in a meaningful
way. For example, if we had a categorical variable in which work-related
stress was coded as low, medium or high, then comparing each level with the means of the
previous levels of the variable would make more sense.

**Method 1: Using xi3**

We can specify reverse Helmert coding for **race** using **r.race** as shown
below.

```
xi3: regress write r.race

r.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_2 |   11.54167   3.286129     3.51   0.001     5.060956    18.02238
    _Irace_3 |  -4.029167   2.602363    -1.55   0.123    -9.161394    1.103061
    _Irace_4 |   3.169061   1.487987     2.13   0.034     .2345401    6.103582
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------
```

The regression coefficient for the first comparison shown in this output was
calculated by subtracting the mean of the dependent variable for level 1 of the
categorical variable from the mean of the dependent variable for level 2: 58 - 46.4583 = 11.542.
This result is statistically significant. The regression coefficient for the second comparison (between level 3 and the previous
levels) was calculated by subtracting the mean of the dependent variable for
levels 1 and 2 from that of level 3: 48.2 - [(46.4583 + 58) / 2] =
-4.029. This result is not statistically significant, meaning that there
is not a reliable difference between the mean of ** write** for level 3 of ** race**
compared to the mean of ** write** for levels 1 and 2 (Hispanics and Asians).
As noted above, this type of coding system does not make much sense for a
nominal variable such as **race**. For the comparison of level 4 and the
previous levels, you take the mean of the dependent variable for those
levels and subtract it from the mean of the dependent variable for level
4: 54.0552 - [(46.4583 + 58 + 48.2) / 3] = 3.169. This result is
statistically significant.

**Method 2: Manual Coding**

The regression coding for reverse Helmert coding is shown below. For the first comparison, where the
first and second levels are compared, **x1** is coded -1/2 for level 1, 1/2 for level 2, and 0
otherwise. For the second comparison, the values of **x2**
are coded -1/3 -1/3 2/3 and 0.
Finally, for the third comparison, the values of **x3** are coded -1/4 -1/4
-1/4 and 3/4.

REVERSE HELMERT regression coding

| Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
|---|---|---|---|
| 1 (Hispanic) | -1/2 | -1/3 | -1/4 |
| 2 (Asian) | 1/2 | -1/3 | -1/4 |
| 3 (African American) | 0 | 2/3 | -1/4 |
| 4 (white) | 0 | 0 | 3/4 |

Below we illustrate how to create **x1**, **x2** and **x3** and enter
these new variables into the regression model using the **regress**
command.

```
generate x1 = -1/2 if race==1
replace x1 = 1/2 if race==2
replace x1 = 0 if inlist(race,3,4)
generate x2 = -1/3 if inlist(race,1,2)
replace x2 = 2/3 if race==3
replace x2 = 0 if race==4
generate x3 = -1/4 if inlist(race,1,2,3)
replace x3 = 3/4 if race==4
regress write x1 x2 x3
```

```
      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   11.54167   3.286129     3.51   0.001     5.060956    18.02238
          x2 |  -4.029167   2.602363    -1.55   0.123    -9.161394    1.103061
          x3 |   3.169061   1.487987     2.13   0.034     .2345401    6.103582
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------
```

In the above example, the
regression coefficient for **x1** is the mean of **write** for level 2 (Asian) minus the mean of
**write** for level 1 (Hispanic). Likewise, the
regression coefficient for **x2** is the mean of **write** for level 3 minus the mean of
**write** for levels 1 and 2 combined. Finally, the
regression coefficient for **x3** is the mean of **write** for level 4 minus the mean of
**write** for levels 1, 2 and 3 combined.

## 5.6 Deviation Coding

This coding system compares the mean of the dependent variable for a
given level to the mean of the dependent variable across all levels of the
variable (the grand mean). In our example below, the first comparison compares level 2 (Asians) to
all levels of **race**, the second compares level 3 (African Americans) to
all levels of **race**, and the third comparison compares level 4 (whites) to
all levels of **race**.

**Method 1: Using xi3**

We indicate that we would like **race** to be coded using deviation (effect) coding by specifying **e.race**, as shown below.

```
. xi3 : regress write e.race

e.race            _Irace_1-4          (naturally coded; _Irace_1 omitted)

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_2 |   6.321624   2.160314     2.93   0.004     2.061179    10.58207
    _Irace_3 |  -3.478376   1.732305    -2.01   0.046    -6.894726     -.062027
    _Irace_4 |   2.376796   1.115991     2.13   0.034     .1759051    4.577687
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------
```

The regression coefficient for **_Irace_2** is the mean for level 2 minus the grand mean. However, this grand mean is not the overall mean of the dependent variable that you would get from the **summarize** command. Rather, it is the mean of the means of the dependent variable at each level of the categorical variable: (46.4583 + 58 + 48.2 + 54.0552) / 4 = 51.678375. This regression coefficient is then 58 - 51.678375 = 6.32. Likewise, the coefficient for **_Irace_3** is the mean for level 3 of **race** minus this grand mean, i.e., 48.2 - 51.678 = -3.48, and **_Irace_4** is the mean for level 4 of **race** minus this grand mean, 54.055 - 51.678 = 2.38.
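The "mean of the means" arithmetic can be sketched in a few lines of Python (not Stata), assuming the rounded cell means given above:

```python
# Cell means of write for the four levels of race, from the text above.
means = [46.4583, 58.0, 48.2, 54.0552]

# The grand mean used by deviation coding is the mean of the cell means,
# not the raw mean of write that summarize would report.
grand = sum(means) / len(means)

b2 = means[1] - grand  # _Irace_2: level 2 vs. grand mean
b3 = means[2] - grand  # _Irace_3: level 3 vs. grand mean
b4 = means[3] - grand  # _Irace_4: level 4 vs. grand mean

print(round(grand, 6), round(b2, 2), round(b3, 2), round(b4, 2))
```

With unrounded cell means these differences would match the regression coefficients above exactly.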

**Method 2: Manual Coding**

As you see in the example below, the regression coding is accomplished by assigning 1 to level 2 for the first comparison (because level 2 is the level to be compared with all others), 1 to level 3 for the second comparison (because level 3 is to be compared with all others), and 1 to level 4 for the third comparison (because level 4 is to be compared with all others). Note that -1 is assigned to level 1 for all three comparisons (because it is the level that is never compared to the others) and all other values are assigned a 0. This regression coding scheme yields the comparisons described above.

DEVIATION regression coding

| Level of race | New variable 1 (x1) | New variable 2 (x2) | New variable 3 (x3) |
|---|---|---|---|
|  | Level 2 v. Mean | Level 3 v. Mean | Level 4 v. Mean |
| 1 (Hispanic) | -1 | -1 | -1 |
| 2 (Asian) | 1 | 0 | 0 |
| 3 (African American) | 0 | 1 | 0 |
| 4 (white) | 0 | 0 | 1 |

Below we illustrate how to create **x1**, **x2** and **x3** and enter
these new variables into the regression model using the **regress**
command.

```
generate x1 = -1 if race==1
replace x1 = 1 if race==2
replace x1 = 0 if inlist(race,3,4)

generate x2 = -1 if race==1
replace x2 = 1 if race==3
replace x2 = 0 if inlist(race,2,4)

generate x3 = -1 if race==1
replace x3 = 1 if race==4
replace x3 = 0 if inlist(race,2,3)

regress write x1 x2 x3

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   6.321624   2.160314     2.93   0.004     2.061179    10.58207
          x2 |  -3.478376   1.732305    -2.01   0.046    -6.894726     -.062027
          x3 |   2.376796   1.115991     2.13   0.034     .1759051    4.577687
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------
```

The regression coefficients for this analysis match those in the example above and have the same interpretation.

## 5.7 Orthogonal Polynomial Coding

Orthogonal polynomial coding is a form of trend analysis in that it is looking for the linear, quadratic and cubic trends in the categorical variable. This type of coding system should be used only with an ordinal variable in which the levels are equally spaced. Examples of such a variable might be income or education. The table below shows the contrast coefficients for the linear, quadratic and cubic trends for the four levels. These could be obtained from most statistics books on linear models.

POLYNOMIAL

| Level of race | Linear (x1) | Quadratic (x2) | Cubic (x3) |
|---|---|---|---|
| 1 (Hispanic) | -.671 | .5 | -.224 |
| 2 (Asian) | -.224 | -.5 | .671 |
| 3 (African American) | .224 | -.5 | -.671 |
| 4 (white) | .671 | .5 | .224 |
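For four equally spaced levels, these values are the classic integer polynomial contrasts (-3, -1, 1, 3), (1, -1, -1, 1) and (-1, 3, -3, 1) rescaled to unit length. A short Python (not Stata) sketch of that rescaling:

```python
import math

# Classic orthogonal polynomial contrasts for four equally spaced levels.
raw = {
    "linear":    [-3, -1,  1, 3],
    "quadratic": [ 1, -1, -1, 1],
    "cubic":     [-1,  3, -3, 1],
}

# Divide each contrast by its Euclidean norm so the coefficients have
# unit length, matching the table above.
coded = {}
for name, cs in raw.items():
    norm = math.sqrt(sum(c * c for c in cs))
    coded[name] = [c / norm for c in cs]
    print(name, [round(v, 3) for v in coded[name]])
```

Rounding to three decimals reproduces the table entries (e.g., -3/sqrt(20) = -.671).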

**Method 1: Using xi3**

We indicate that we would like **race** to be coded using orthogonal polynomials by specifying **o.race**, as shown below.

```
. xi3 : regress write o.race

o.race            _Irace_1-4          (naturally coded; _Irace_4 omitted)

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0000
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_1 |   2.080058   .6381718     3.26   0.001     .8214929    3.338622
    _Irace_2 |  -.2159021   .6381718    -0.34   0.735    -1.474467    1.042663
    _Irace_3 |   2.279811   .6381718     3.57   0.000     1.021246    3.538375
       _cons |     52.775   .6381718    82.70   0.000     51.51644    54.03356
------------------------------------------------------------------------------
```

The three coded variables, **_Irace_1**, **_Irace_2** and **_Irace_3**, represent the linear, quadratic and cubic trends, respectively. Of course, the term "trend" doesn't make sense if the variable is nominal, like **race**. But if we pretend that **race** is ordinal, then there would be a significant linear and cubic trend. It is also easy to test for nonlinear trend.

```
. test _Irace_2 _Irace_3

 ( 1)  _Irace_2 = 0.0
 ( 2)  _Irace_3 = 0.0

       F(  2,   196) =    6.44
            Prob > F =    0.0020
```

The test for nonlinear trend is statistically significant. This example worked well enough to show how to use **xi3**, but we need an ordered example that can be interpreted.

**Example 2**

We will create our own categorical variable, **readcat**, from the continuous variable
**read**.

```
. gen readcat = read
. recode readcat 1/43=1 44/49=2 50/59=3 60/100=4
. tab readcat

    readcat |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |         39       19.50       19.50
          2 |         44       22.00       41.50
          3 |         61       30.50       72.00
          4 |         56       28.00      100.00
------------+-----------------------------------
      Total |        200      100.00
```

Now we can run the regression with **xi3**.

```
. xi3: regress write o.readcat

o.readcat         _Ireadcat_1-4       (naturally coded; _Ireadcat_4 omitted)

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =   29.64
       Model |  5579.22989     3   1859.7433           Prob > F      =  0.0000
    Residual |  12299.6451   196  62.7532914           R-squared     =  0.3121
-------------+------------------------------           Adj R-squared =  0.3015
       Total |   17878.875   199   89.843593           Root MSE      =  7.9217

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
 _Ireadcat_1 |    5.27249   .5601486     9.41   0.000     4.167798    6.377182
 _Ireadcat_2 |   .3097532   .5601486     0.55   0.581     -.794939    1.414445
 _Ireadcat_3 |  -.0324612   .5601486    -0.06   0.954    -1.137153    1.072231
       _cons |     52.775   .5601486    94.22   0.000     51.67031    53.87969
------------------------------------------------------------------------------
```

We see from the significant **_Ireadcat_1** that the linear trend is significant, while neither the quadratic nor the cubic trend (**_Ireadcat_2** and **_Ireadcat_3**) is significant. The joint test for nonlinear trend is also nonsignificant.

```
. test _Ireadcat_2 _Ireadcat_3

 ( 1)  _Ireadcat_2 = 0.0
 ( 2)  _Ireadcat_3 = 0.0

       F(  2,   196) =    0.15
            Prob > F =    0.8569
```

**Method 2: Manual Coding**

For the moment we are skipping manual coding.

## 5.8 User-Defined Coding

You can use the **xi3** command to create your own regression coding
system. For
our example, we will make the following three comparisons:

1) level 1 to level 3

2) level 2 to levels 1 and 4

3) levels 1 and 2 to levels 3 and 4.

In order to compare level 1 to level 3, we use the contrast coefficients 1 0 -1 0. To compare level 2 to levels 1 and 4, we use the contrast coefficients -1/2 1 0 -1/2. Finally, to compare levels 1 and 2 with levels 3 and 4, we use the coefficients 1/2 1/2 -1/2 -1/2. Before proceeding to the Stata code necessary to conduct these analyses, let's take a moment to more fully explain the logic behind the selection of these contrast coefficients.

For the first contrast, we are comparing level 1 to level 3, and the contrast coefficients are 1 0 -1 0. The levels whose contrast coefficients have opposite signs are being compared: each contrast coefficient multiplies the mean of the dependent variable for the corresponding level, and the products are summed. Hence, levels 2 and 4 are not involved in the comparison: they are multiplied by zero and "drop out." You will also notice that the contrast coefficients sum to zero. This is necessary: if the contrast coefficients do not sum to zero, the contrast is not estimable and Stata will issue an error message. Which level of the categorical variable is assigned a positive or negative value is not terribly important: 1 0 -1 0 is the same as -1 0 1 0 in that both codings compare the first and third levels of the variable; only the sign of the regression coefficient would change.

Now let's look at the contrast coefficients for the second and third comparisons. You will notice that in both cases the positive coefficients sum to one and the negative coefficients sum to minus one, although they do not have to. You may wonder why we would use fractions like -1/2 1 0 -1/2 instead of whole numbers such as -1 2 0 -1. While -1/2 1 0 -1/2 and -1 2 0 -1 both compare level 2 with levels 1 and 4, and both will give you the same t-value and p-value for the regression coefficient, the regression coefficients themselves differ, as does their interpretation. The coefficient for the -1/2 1 0 -1/2 contrast is the mean of level 2 minus the mean of the means for levels 1 and 4: 58 - (46.4583 + 54.0552)/2 = 7.74325. (Alternatively, you can multiply each level's mean of the dependent variable by its contrast coefficient and sum the products: -1/2*46.4583 + 1*58.00 + 0*48.20 + -1/2*54.0552 = 7.74325. These are equivalent ways of thinking about how the coefficient is calculated.) By comparison, the coefficient for the -1 2 0 -1 contrast is two times the mean for level 2 minus the sum of the means for levels 1 and 4: 2*58 - (46.4583 + 54.0552) = 15.4865, which is the same as -1*46.4583 + 2*58 + 0*48.20 - 1*54.0552 = 15.4865. Note that the regression coefficient using the contrast coefficients -1 2 0 -1 is twice the regression coefficient obtained when -1/2 1 0 -1/2 is used.
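The two equivalent calculations above can be verified in a few lines of Python (not Stata), using the rounded cell means from the text:

```python
# Cell means of write for race levels 1-4, from the text above.
means = [46.4583, 58.0, 48.2, 54.0552]

def contrast_value(coefs, means):
    """Multiply each cell mean by its contrast coefficient and sum."""
    # An estimable contrast must have coefficients that sum to zero.
    assert abs(sum(coefs)) < 1e-12
    return sum(c * m for c, m in zip(coefs, means))

b_frac  = contrast_value([-0.5, 1, 0, -0.5], means)  # fractional coefficients
b_whole = contrast_value([-1, 2, 0, -1], means)      # whole-number coefficients

# The whole-number contrast gives exactly twice the fractional one.
print(round(b_frac, 5), round(b_whole, 4))
```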

**Method 1: Using xi3**

We use the **char** command to specify the contrast coefficients to be used for **race**, as shown below. In order to compare level 1 to level 3, we use the contrast coefficients 1 0 -1 0. To compare level 2 to levels 1 and 4, we use the contrast coefficients -1/2 1 0 -1/2. Finally, to compare levels 1 and 2 with levels 3 and 4, we use the coefficients 1/2 1/2 -1/2 -1/2. These coefficients appear in the **char race[user]** command below, which defines the user-defined coding for **race** as three contrasts (because **race** has four levels): (1 0 -1 0 -.5 1 0 -.5 .5 .5 -.5 -.5).

```
char race[user] (1 0 -1 0 -.5 1 0 -.5 .5 .5 -.5 -.5)
xi3 : regress write u.race

u.race            _Irace_1-4          (naturally coded; _Irace_4 omitted)

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    _Irace_1 |  -1.741667   2.732488    -0.64   0.525    -7.130519    3.647186
    _Irace_2 |   7.743247   2.897186     2.67   0.008     2.029588    13.45691
    _Irace_3 |    1.10158   1.964244     0.56   0.576    -2.772186    4.975347
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------
```

The coefficient for **_Irace_1** corresponds to the first contrast, comparing level 1 to level 3 of **race**. The coefficient is the mean of **write** for level 1 minus the mean of **write** for level 3, and this difference is not significant (p = .525). The coefficient for **_Irace_2** is 7.743, which is the mean of level 2 minus the mean of the means of levels 1 and 4, and this difference is significant (p = .008). The final regression coefficient is 1.1, which is the mean of the means of levels 1 and 2 minus the mean of the means of levels 3 and 4; this contrast is not statistically significant (p = .576).

**Method 2: Manual Coding**

As in the prior examples, we will make the following three comparisons:

1) level 1 to level 3,

2) level 2 to levels 1 and 4 and

3) levels 1 and 2 to levels 3 and 4.

The **xi3** command converts the contrast coding into regression coding
for us. However, we could do this process manually as well.

For the previous methods it was quite easy to translate the comparisons we wanted to make into contrast codings, but it is not as easy to translate those comparisons into a regression coding scheme. If we know the contrast coding system, we can convert it into a regression coding system using the Stata commands shown below. As you can see, we place the three contrast codings we want into the matrix **c** and then perform a set of matrix operations on **c**, yielding the matrix **x**. We then display **x** using the **matrix list** command.

```
matrix input c = (1 0 -1 0 \ -.5 1 0 -.5 \ .5 .5 -.5 -.5)
matrix x = c'*inv(c*c')
matrix list x

x[4,3]
        r1    r2    r3
c1     -.5    -1   1.5
c2      .5     1   -.5
c3    -1.5    -1   1.5
c4     1.5     1  -2.5
```
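The same conversion, x = c'*inv(c*c'), can be reproduced outside of Stata. Below is a sketch in pure Python using exact fractions; the `matmul`, `transpose` and `inverse` helpers are written here for illustration, not taken from any library:

```python
from fractions import Fraction as F

# The three contrast rows from the text (levels 1-4 of race).
C = [[F(1),     F(0),    F(-1),    F(0)],
     [F(-1, 2), F(1),    F(0),     F(-1, 2)],
     [F(1, 2),  F(1, 2), F(-1, 2), F(-1, 2)]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def inverse(A):
    """Gauss-Jordan inverse of a small square matrix of Fractions."""
    n = len(A)
    M = [row[:] + [F(int(i == j)) for j in range(n)] for i, row in enumerate(A)]
    for i in range(n):
        M[i] = [v / M[i][i] for v in M[i]]        # normalize the pivot row
        for r in range(n):
            if r != i:                            # eliminate the other rows
                M[r] = [a - M[r][i] * b for a, b in zip(M[r], M[i])]
    return [row[n:] for row in M]

# x = c' * inv(c * c') -- the regression coding matrix
X = matmul(transpose(C), inverse(matmul(C, transpose(C))))
for row in X:
    print([float(v) for v in row])
```

The four printed rows match the `matrix list x` output above.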

This converts the contrast coding into the regression coding that we need for running this analysis with the **regress** command. Below, we use the **generate** and **replace** commands to create **x1**, **x2** and **x3** according to the coding shown above and then enter them into the regression analysis.

```
generate x1 = -.5 if race == 1
replace x1 = .5 if race == 2
replace x1 = -1.5 if race == 3
replace x1 = 1.5 if race == 4

generate x2 = -1 if race == 1
replace x2 = 1 if race == 2
replace x2 = -1 if race == 3
replace x2 = 1 if race == 4

generate x3 = 1.5 if race == 1
replace x3 = -.5 if race == 2
replace x3 = 1.5 if race == 3
replace x3 = -2.5 if race == 4

regress write x1 x2 x3

      Source |       SS       df       MS              Number of obs =     200
-------------+------------------------------           F(  3,   196) =    7.83
       Model |  1914.15805     3  638.052682           Prob > F      =  0.0001
    Residual |   15964.717   196  81.4526375           R-squared     =  0.1071
-------------+------------------------------           Adj R-squared =  0.0934
       Total |   17878.875   199   89.843593           Root MSE      =  9.0251

------------------------------------------------------------------------------
       write |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |  -1.741667   2.732488    -0.64   0.525    -7.130519    3.647186
          x2 |   7.743247   2.897186     2.67   0.008     2.029588    13.45691
          x3 |    1.10158   1.964244     0.56   0.576    -2.772186    4.975347
       _cons |   51.67838    .982122    52.62   0.000     49.74149    53.61526
------------------------------------------------------------------------------
```

As you can see, the results of this analysis match those produced using **xi3**.

## 5.9 Summary

This page has described a number of different coding systems that you can use for categorical data, along with two different strategies for performing the analyses. You can choose a coding system that yields the comparisons that make the most sense for testing your hypotheses. Between the two strategies, **xi3** automates the creation of the coded variables but gives up a certain amount of control, while manual coding gives you more control over the coding of the variables but can be more laborious and tedious. In general, we recommend using the easiest method that accomplishes your goals.

## 5.10 Additional Information

Here are some additional resources.

- Stata Textbook Examples from Design and Analysis: Chapter 6
- Stata Textbook Examples from Design and Analysis: Chapter 7
- Stata Textbook Examples: Applied Regression Analysis, Chapter 8
- One-Way ANOVA Contrast Code Problems From Charles Judd and Gary McClelland
- Two-way contrast code solutions