How can I use the star join command to merge data?

The SPSS command star join was added in version 21. Like the match files command, the star join command is used to merge data files. Below is a short list of the similarities and differences between these two commands.

Similarities:

Multiple SPSS datasets can be merged at a time.
Not all variables from the datasets to be merged need to be included in the new, merged dataset.
An in keyword or subcommand is available to create a new variable in the merged data file that indicates from which file each case came.
Neither the match files nor the star join command will do a many-to-many merge.

Differences:

The datasets do not have to be sorted on the variable(s) that will be used to merge the files.
String variables can be of different lengths with the star join command, but not with the match files command.
Variables can be renamed within the match files command but not within the star join command.
The variable on which cases are matched does not need to have the same name in the datasets using the star join command.
Binary variables indicating the first or last case in a grouping can be created with the match files command but not with the star join command.
The star join command can be used to password protect the resulting file.
The data file resulting from the match cases command becomes the active dataset; the data file resulting from the star join command does not.
The match files command has a limit of 50 datasets that can be merged in one call to the command; the limit for the number of datasets that can be merged using the star join command is set by the operating system (in other words, there is no set limit to the number of datasets that can be merged with one call to star join).
There is no in keyword on the from subcommand (meaning that you don’t get a binary variable indicating if a case came from that dataset or not).

The hsbdemo , hsb_school and hsb_tests1 datasets will be used in the following examples.

Subcommands

The select subcommand is used to determine which variables from each dataset will be included in the final, merged dataset. All of the specifications on this subcommand should be of the general form “alias.variable_name”, where “alias” is the name given to each dataset on the from and join subcommands. The entries on this subcommand must be separated by commas. The select subcommand must be the first subcommand listed, and there can only be one select subcommand in the call to the procedure. An asterisk (*) can be used to specify all variables (except the key, or identifying variable) in a dataset.

The from subcommand must immediately follow the select subcommand and specifies the name of the case data file. The keyword as is also necessary, and it is followed by the alias used to name/identify that dataset. External datasets must be enclosed in quotes. Only one from subcommand is allowed.

The table lookup file is specified on the join subcommand. After the name of the file, the keyword as is used, and then the alias of the dataset is given. At least one join subcommand is required, but many join subcommands can be specified. The maximum number of join subcommands permitted is determined by the operating system. Like the from subcommand, external datasets must be listed in quotation marks. An asterisk can be used to specify a dataset that is already open in SPSS. The values in the key (or identifying) variable must be unique. If multiple variables are used to identify cases in the dataset, the combination of values must be unique. The on keyword is used to name the keys used to match cases. The variable type (either string or numeric) must be the same in all files, but the names of the variable (or variables) do not have to be the same. The in keyword is optional and can be used to create a binary variable that indicates if the case came from that dataset (1 = present, 0 = not present).

On the outfile subcommand is the name of the new, merged dataset. The keyword file is necessary and must be followed by an equals (=) sign. An external dataset (which must be in quotes), a previously declared dataset (need to use the dataset declare command before the star join command) or an asterisk (indicating the currently active dataset) may be specified.

Additionally, the passprotect subcommand can be used to save the resulting, merged file in encrypted format. An external file must be specified on the outfile subcommand if the passprotect subcommand is used. The password keyword is necessary, and the password can be up to 10 characters long. The password itself can be encrypted using encryptedpw = yes.

Example 1: A one-to-one merge with two datasets creating an external dataset

In this example, we will use the star join command to do a one-to-one merge with two SPSS datasets. A one-to-one merge matches one case in one dataset to one case in the other dataset. This matching is done based on a specified variable or variables that uniquely identifies each case in the dataset. In our example, we are matching cases in the hsb_demo dataset (with the alias d1) to cases in the hsb_school dataset (with the alias d2) based on the values in the variable id. The variable id is present in both datasets, and it is a unique identifier of cases in each dataset. The external dataset demo_school.sav contains the merged dataset.

get file "D:\data\hsbdemo.sav".

star join
/select d1.female d1.ses, d2.prog d2.schtyp
/from "D:\data\hsbdemo.sav" as d1
/join "D:\data\hsb_school.sav" as d2
on d1.id = d2.id
/outfile file = "D:\data\demo_school.sav".

get file "D:\data\demo_school.sav".

list 
/cases from 1 to 10.

 
       id    female       ses      prog    schtyp 
 
     1.00      1.00      1.00      3.00      1.00 
     2.00      1.00      2.00      3.00      1.00 
     3.00       .00      1.00      2.00      1.00 
     4.00      1.00      1.00      2.00      1.00 
     5.00       .00      1.00      2.00      1.00 
     6.00      1.00      1.00      2.00      1.00 
     7.00       .00      2.00      2.00      1.00 
     8.00      1.00      1.00      2.00      1.00 
     9.00       .00      2.00      3.00      1.00 
    10.00      1.00      2.00      1.00      1.00 
 
 
Number of cases read:  10    Number of cases listed:  10

Now let’s look at the syntax that does the same thing using the match files command.

get file "D:\data\hsbdemo.sav".
sort cases by id.
save outfile = "D:\data\hsbdemo.sav".

get file "D:\data\hsb_school.sav".
sort cases by id.
save outfile = "D:\data\hsb_school.sav".

match files file = "D:\data\hsb_demo.sav"
/file = "D:\data\hsb_school.sav"
/by id.

list
/cases from 1 to 10.

       id    female       ses      race      prog    schtyp 
 
     1.00      1.00      1.00      1.00      3.00      1.00 
     2.00      1.00      2.00      1.00      3.00      1.00 
     3.00       .00      1.00      1.00      2.00      1.00 
     4.00      1.00      1.00      1.00      2.00      1.00 
     5.00       .00      1.00      1.00      2.00      1.00 
     6.00      1.00      1.00      1.00      2.00      1.00 
     7.00       .00      2.00      1.00      2.00      1.00 
     8.00      1.00      1.00      1.00      2.00      1.00 
     9.00       .00      2.00      1.00      3.00      1.00 
    10.00      1.00      2.00      1.00      1.00      1.00 
 
 
Number of cases read:  10    Number of cases listed:  10

As you can see, we need to open each dataset (using the get file command), sort it (using the sort cases command) and save the file (using the save command). Once we have done that for each dataset we wish to merge, we can use the match files command to merge the files. (The list command is not a necessary part of this process; we use it here only to display the resulting output.)

Example 2: A one-to-one merge with three datasets creating an external dataset

In this example, we merge three datasets, hsbdemo, hsb_school and hsb_tests1. Note that the identification variable is called id in hsbdemo and hsb_school, but it is called sid in hsb_tests1. Also, we have added two new variables, called first and second, to our new, merged dataset. We did this with the in keyword on the join subcommand. These variables will have a value of 1 when the case comes from that dataset and a 0 when it does not.

star join
/select d1.female d1.ses, d2.prog d2.schtyp, d3.read d3.write
/from "D:\data\hsbdemo.sav" as d1
/join "D:\data\hsb_school.sav" as d2
on d1.id = d2.id in=first
/join "D:\data\hsb_tests1.sav" as d3
on d1.id = d3.sid in = second
/outfile file = "D:\data\demo_school_test1.sav".

get file "D:\data\demo_school_test1.sav".

list 
/cases from 1 to 10.

       id    female       ses      prog    schtyp      read     write first second 
 
     1.00      1.00      1.00      3.00      1.00     34.00     44.00   1      1 
     2.00      1.00      2.00      3.00      1.00     39.00     41.00   1      1 
     3.00       .00      1.00      2.00      1.00     63.00     65.00   1      1 
     4.00      1.00      1.00      2.00      1.00     44.00     50.00   1      1 
     5.00       .00      1.00      2.00      1.00     47.00     40.00   1      1 
     6.00      1.00      1.00      2.00      1.00     47.00     41.00   1      1 
     7.00       .00      2.00      2.00      1.00     57.00     54.00   1      1 
     8.00      1.00      1.00      2.00      1.00     39.00     44.00   1      1 
     9.00       .00      2.00      3.00      1.00     48.00     49.00   1      1 
    10.00      1.00      2.00      1.00      1.00     47.00     54.00   1      1 
 
 
Number of cases read:  10    Number of cases listed:  10

To use the match files command, we will need to open, sort and save each of the three datasets. We will also have to rename sid to id, so that the identification variable has the same name in each of the datasets. If we wanted to limed the number of variables in the resulting, merged dataset as we did using the star join command, we would need to use the keep subcommand on the save command to keep only those variables in the dataset.

get file "D:\data\hsbdemo.sav".
sort cases by id.
save outfile = "D:\data\hsbdemo_sorted.sav"
/keep = id female ses.

get file "D:\data\hsb_school.sav".
sort cases by id.
save outfile = "D:\data\hsb_school_sorted.sav".

get file "D:\data\hsb_tests1.sav".
rename variables sid = id.
sort cases by id.
save outfile = "D:\data\hsb_tests1_sorted.sav"
/keep = id read write.

match files file = "D:\data\hsb_demo_sorted.sav"
/in = demo
/file = "D:\data\hsb_school_sorted.sav"
/in = school
/file = "D:\data\hsb_tests1_sorted.sav"
/in = test1
/by id.

list 
/cases from 1 to 10.

       id    female       ses      prog    schtyp      read     write demo school test1 
 
     1.00      1.00      1.00      3.00      1.00     34.00     44.00   1     1     1 
     2.00      1.00      2.00      3.00      1.00     39.00     41.00   1     1     1 
     3.00       .00      1.00      2.00      1.00     63.00     65.00   1     1     1 
     4.00      1.00      1.00      2.00      1.00     44.00     50.00   1     1     1 
     5.00       .00      1.00      2.00      1.00     47.00     40.00   1     1     1 
     6.00      1.00      1.00      2.00      1.00     47.00     41.00   1     1     1 
     7.00       .00      2.00      2.00      1.00     57.00     54.00   1     1     1 
     8.00      1.00      1.00      2.00      1.00     39.00     44.00   1     1     1 
     9.00       .00      2.00      3.00      1.00     48.00     49.00   1     1     1 
    10.00      1.00      2.00      1.00      1.00     47.00     54.00   1     1     1 
 
 
Number of cases read:  10    Number of cases listed:  10

Example 3: A one-to-many merge creating an active dataset

For this example, we will input two small datasets. Next, we need to name and declare the dataset (using the dataset name and dataset declare commands) we are creating in the star join command. This gives a name to a dataset that is not yet open or created. In our example, we use the name one_many for our new dataset.

data list list
/id test1 test2.
begin data.
1 50 80
2 55 88
3 44 77
4 66 33
end data.
dataset name one.

data list list
/id test3 test4.
begin data.
1 11 22
2 22 11
2 55 66
3 44 55
3 44 33
3 66 77
4 88 99
4 77 22
4 11 55
4 77 88
end data.

dataset name many.

dataset declare one_many.

star join
/select m2.test1 m2.test2, m1.test3 m1.test4
/from many as m1
/join one as m2
on m1.id = m2.id
/outfile file = one_many.

dataset activate one_many.

list.

    id  test1  test2  test3  test4 
 
  1.00  50.00  80.00  11.00  22.00 
  2.00  55.00  88.00  22.00  11.00 
  2.00  55.00  88.00  55.00  66.00 
  3.00  44.00  77.00  44.00  55.00 
  3.00  44.00  77.00  44.00  33.00 
  3.00  44.00  77.00  66.00  77.00 
  4.00  66.00  33.00  88.00  99.00 
  4.00  66.00  33.00  77.00  22.00 
  4.00  66.00  33.00  11.00  55.00 
  4.00  66.00  33.00  77.00  88.00 
 
 
Number of cases read:  10    Number of cases listed:  10

Example 4: Multiple identifying variables

In this example, we use two identifying variables. In other words, we need two variables to uniquely identify each case. Additionally, these variables are string variables with different lengths: the variable id1 has a length of three in the x_data dataset and a length of four in the y_data dataset. The variable id2 has a length of two in the x_data dataset and a length of three in the y_data dataset.

data list 
/id1 1-3 (A) id2 5-7 (A) x 9.
begin data.
a1a b1b 5
c2c d2d 6
a1a d2d 7
c2c e3e 8
end data.

dataset name x_data.

data list 
/id1 1-4 (A) id2 6-9 (A) y 11.
begin data.
a1a  b1b  1
c2c  d2d  2
a1a  d2d  3
c2c  e3e  4
end data.

dataset name y_data.

dataset declare x_y.

star join
/select data1.x , data2.y
/from x_data as data1
/join y_data as data2
on data1.id1 = data2.id1 and data1.id2 = data2.id2
/outfile file = x_y.

dataset activate x_y.

list.

id1 id2 x y 
 
a1a b1b 5 1 
a1a d2d 7 3 
c2c d2d 6 2 
c2c e3e 8 4 
 
 
Number of cases read:  4    Number of cases listed:  4