Data Setup for Comparing Means in SPSS
David P. Nichols
Senior Support Statistician
SPSS, Inc.
April 1994

Testing hypotheses about the equality of means is one of the most common applications of statistical software. SPSS offers a variety of procedures capable of performing mean comparisons. Several of these procedures are fairly simple and designed to handle specific problems easily, while others are more general and necessarily more complex. In order to employ any of these options successfully, users need to be familiar with the data structure required by SPSS. Judging by the number of statistical support calls that involve questions of data setup for procedures ranging from T-TEST to MANOVA, many users are not clear on the logic of this structure.

SPSS, like most other statistical software, primarily works on a rectangular cases by variables format. That is, rows of the rectangular data matrix represent cases, while columns denote variables. (Even though data sets are occasionally large enough to require multiple records or lines per case, the logic remains as if we were still using one line and simply wrapping it around as many times as necessary.)

The decisive question when we look to compare two or more means is whether they represent means of independent or related samples. The independent vs. related samples distinction is usually equivalent to the question of whether we want to compare the means of two or more groups of cases or the means of the same group of cases under two or more conditions. For this reason the terms between subjects and within subjects are commonly used to denote the type of comparison(s) desired. In the T-TEST procedure these two kinds of analysis are referred to as independent vs. related samples tests. The generalization of the related samples (within subjects) situation to more than two time points or conditions is handled most generically in the MANOVA procedure via the WSFACTORS specification, though the RELIABILITY procedure's STATISTICS=ANOVA option also provides some tests of means of related samples.

Setup for Independent Samples (Between Subjects Analyses)

If the desired comparison(s) involve between subjects or independent samples data, the appropriate data structure uses one or more grouping variables to identify what kind of case each line of data represents, with the values of the variable(s) on which we wish to compare the groups listed in one or more separate measured variables. Thus the proper data setup for a comparison of the means of two groups of cases would be along the lines of:

DATA LIST FREE / GROUP Y.
BEGIN DATA
1 5.2
1 4.3
...
2 7.1
2 6.9
END DATA.

In other words, SPSS needs something to tell it which group a case belongs to (this variable, called GROUP in our example, is often referred to as a factor variable), as well as the value of the measured variable(s) of interest (Y). Once the data are successfully entered in this format, any of the following procedure commands can be used to obtain a test of the null hypothesis of equal population means for the two groups:

T-TEST GROUPS=GROUP /VAR=Y.

MEANS Y BY GROUP /STATISTICS=ANOVA.

ONEWAY Y BY GROUP(1,2).

ANOVA Y BY GROUP(1,2).

MANOVA Y BY GROUP(1,2).

For situations in which there are three or more groups the same structure would prevail, except that there would be more than two values for the GROUP variable, and of course the T-TEST procedure could no longer be used, since it compares only two means at a time.
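For instance, a minimal sketch of a three group comparison might look like the following; the data values here are made up purely for illustration, and any of the other procedures listed above (except T-TEST) could be substituted for ONEWAY:

* Illustrative values only; the three groups are coded 1, 2 and 3.
DATA LIST FREE / GROUP Y.
BEGIN DATA
1 5.2
1 4.3
2 7.1
2 6.9
3 6.5
3 5.8
END DATA.
ONEWAY Y BY GROUP(1,3).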
If there are data groupings defined by more than one type of factor, such as gender and geographical region, then we simply have more grouping variables (such as GENDER with two categories and REGION with several) entered in our data set. In this case we move to either ANOVA or MANOVA, since MEANS and ONEWAY are designed specifically for use with one grouping factor.

Setup for Paired or Related Samples (Within Subjects Analyses)

Suppose that instead of comparing the means of two or more groups of cases, we now want to make comparisons among measurements taken on the same cases at different times or under different conditions. Since the repeated measures or time example is so common, we will call the factor of interest here TIME. The difference between this situation and the between subjects case is that here we are comparing related measurements on the same cases, so the data setup is different. Rather than having one variable distinguish among the cases on the basis of group membership, we simply have two measured variables for each case. If we call these TIME1 and TIME2, the data setup might look like:

DATA LIST FREE / TIME1 TIME2.
BEGIN DATA
1.5 3.8
2.1 4.2
...
3.2 4.7
END DATA.

The MEANS, ONEWAY and ANOVA procedures are not useful here, as they do not handle within subjects data. Instead we could obtain the same results, in varying forms of presentation, from any of the following specifications:

T-TEST PAIRS=TIME1 TIME2.

RELIABILITY VARIABLES=TIME1 TIME2 /STATISTICS=ANOVA.

MANOVA TIME1 TIME2 /WSFACTORS=TIME(2).

Should we move to a comparison involving more than two related means, we would no longer be able to use the T-TEST procedure. The results produced by the RELIABILITY procedure, though presented in a format more familiar to many people than the MANOVA output, provide only part of the information given by MANOVA, and that information is strictly valid only under some fairly severe assumptions. For this reason users are generally much safer working with MANOVA for within subjects analyses. Adding more time points would produce no structural changes in the MANOVA specification, only a longer list of dependent variables and a change in the number of levels of the within subjects factor TIME; a sketch of such an extension follows the final example below. Note that this name is arbitrary: we can call the factor anything we want, as long as the name is eight characters or less and does not match any reserved word in MANOVA.

If we are using data in which there are both grouping (between subjects) factors and related or repeated variables forming within subjects factors, MANOVA is the only procedure we can use. If we had two groups measured at two time points and wished to perform a factorial analysis of variance on these data, comparing groups across time, time changes across groups, and the interaction of the two, we would use syntax such as:

DATA LIST FREE / GROUP TIME1 TIME2.
BEGIN DATA
1 2.1 4.2
1 3.0 3.6
...
2 2.5 2.1
2 3.1 2.6
END DATA.

MANOVA TIME1 TIME2 BY GROUP(1,2) /WSFACTORS=TIME(2).
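To illustrate the earlier point about adding time points, suppose (hypothetically) that the same two groups had also been measured on a third occasion, recorded in a variable we might call TIME3; TIME3 is not part of the original example. The specification would simply grow to:

* TIME3 is a hypothetical third measurement for each case.
MANOVA TIME1 TIME2 TIME3 BY GROUP(1,2) /WSFACTORS=TIME(3).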