SPSS Tutorials: Data Creation in SPSS

1. Data Creation in SPSS

When you open the SPSS program, you will see a blank spreadsheet in Data View. If you already have another dataset open but want to create a new one, click File > New > Data to open a blank spreadsheet.

You will notice that each of the columns is labeled “var.” The column names will represent the variables that you enter in your dataset. You will also notice that each row is labeled with a number (“1,” “2,” and so on). The rows will represent cases that will be a part of your dataset. When you enter values for your data in the spreadsheet cells, each value will correspond to a specific variable (column) and a specific case (row).
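The same cases-by-variables layout can also be built with syntax instead of typing into the grid. The sketch below is only an illustration: the variable names id and score and the three data rows are made up, and DATA LIST FREE is just one of several ways to read inline data. Note that running it replaces the active dataset, so try it in a fresh session.

```spss
/*Hypothetical example: create a small dataset with three cases (rows) and two variables (columns).*/
DATA LIST FREE / id (F4.0) score (F5.1).
BEGIN DATA
1 82.5
2 90.0
3 77.5
END DATA.

/*Print the cases to the Output window to confirm the layout.*/
LIST.
```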

Follow these steps to enter data:

2. Inserting or Deleting Single Cases

Sometimes you may need to add new cases or delete existing cases from your dataset. For example, perhaps you notice that one observation in your data was accidentally left out of the dataset. In that situation, you would refer to the original data collection materials and enter the missing case into the dataset (as well as the associated values for each variable in the dataset). Alternatively, you may realize that you have accidentally entered the same case in your dataset more than once and need to remove the extra case.

2.1. Inserting a Case

To insert a new case into a dataset:

2.2. Deleting a Case

To delete an existing case from a dataset:
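A case can also be removed with syntax. The sketch below assumes the dataset has a numeric ID variable named ids (as in the sample data) and that the case to drop has the ID value 20183, which is a made-up value for illustration. SELECT IF permanently discards every case that fails the condition, so it is worth saving a copy of the data first.

```spss
/*Keep every case except the one whose ID is 20183 (hypothetical value).*/
SELECT IF (ids NE 20183).
EXECUTE.
```

Selecting on an ID variable, rather than on a row position, keeps the result the same no matter how the data are currently sorted.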

3. Inserting or Deleting Single Variables

Sometimes you may need to add new variables or delete existing variables from your dataset. For example, perhaps you are in the process of creating a new dataset and you must add many new variables to your growing dataset. Alternatively, perhaps you decide that some variables are not very useful to your study and you decide to delete them from the dataset. Or, similarly, perhaps you are creating a smaller dataset from a very large dataset in order to make the dataset more manageable for a research project that will only use a subset of the existing variables in the larger dataset.

3.1. Inserting a Variable

To insert a new variable into a dataset:

New variables will be given a generic name (e.g. VAR00001). You can enter a new name for the variable on the Variable View tab. You can quick-jump to the Variable View screen by double-clicking on the generic variable name at the top of the column. Once in the Variable View, under the column “Name,” type the new name for the variable. You should also define the variable’s other properties (type, label, values, etc.) at this time.

All values for the newly created variable will be missing (indicated by a “.” in each cell in Data View, by default) since you have not yet entered any values. You can enter values for the new variable by clicking the cells in the column and typing the values associated with each case (row).

Is it possible to insert a variable using syntax? Technically, there’s no direct syntax command to do so. Instead, you’ll need to use two syntax commands. You’ll first use the COMPUTE command to initialize the new variable. You’ll then use the MATCH FILES command to actually re-order the variables. Suppose we want to insert a new column of blank values into the sample dataset after the first variable, ids. We can use this syntax to perform these tasks:

/*Compute new variable containing blanks (system-missing values).*/
COMPUTE newvar=$SYSMIS.
EXECUTE.

/*Reorder the variables to place the new variable in the desired position.*/
MATCH FILES
  FILE = *
  /KEEP = ids newvar ALL.
EXECUTE.

In the MATCH FILES command, FILE=* says to act on the current active dataset. The /KEEP statement tells SPSS the specific order of the variables you want: we list the variables by name, in the order we want, separated by spaces, on the right side of the equals sign. The ALL option at the end of the line says to retain all remaining variables in their current order. The ALL option can only be used at the end of the line; the code will fail if you try to put it before other variable names. If we do not include ALL, SPSS will throw out any variables not named in the /KEEP statement.

3.2. Deleting a Variable

To delete an existing variable from a dataset:

Alternatively, you can delete a variable through the Variable View window:

You can also delete variables using command syntax.

/*Delete one variable.*/
DELETE VARIABLES var1.

/*Delete several variables.*/
DELETE VARIABLES var1 var2 var3.
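When the goal is a smaller file for one project rather than a trimmed working dataset, one option is to leave the active dataset alone and write only the needed variables to a new file. This is a sketch under assumptions: the file name subset.sav and the variable names var1 and var2 are placeholders, and ids is the ID variable from the sample data.

```spss
/*Write only the listed variables to a new file; the active dataset keeps all of its variables.*/
SAVE OUTFILE='subset.sav'
  /KEEP=ids var1 var2.
```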

4. ID Variables versus Row Numbers

Now that you know how to enter data, it is important to discuss a special type of variable called an ID variable. When data are collected, each piece of information is tied to a particular case. For example, perhaps you distributed a survey as part of your data collection, and each survey was labeled with a number (“1,” “2,” etc.). In this example, the survey numbers essentially represent ID numbers: numbers that help you identify which pieces of information go with which respondents in your sample. Without these ID numbers, you would have no way of tracking which information goes with which respondent, and it would be impossible to enter the data accurately into SPSS.

When you enter data into SPSS, you will need to make sure that you are entering values for each variable that correspond to the correct person or object in your sample. It might seem like a simple solution to use the conveniently labeled rows in SPSS as ID numbers; you could enter your first respondent’s information in the row that is already labeled “1,” the second respondent’s information in the row labeled “2,” etc. However, you should never rely on these pre-numbered rows for keeping track of the specific respondents in your sample. This is because the numbers for each row are visual guides only—they are not attached to specific lines of data, and thus cannot be used to identify specific cases in your data. If your data become rearranged (e.g., after sorting data), the row numbers will no longer be associated with the same case as when you first entered the data. Again, the row numbers in SPSS are not attached to specific lines of data and should not be used to identify certain cases. Instead, you should create a variable in your dataset that is used to identify each case—for example, a variable called StudentID.

Here is an example that illustrates why using the row numbers in SPSS as case identifiers is flawed:

Let’s say that you have entered values for each person for the School_Class variable. You relied on the row numbers in SPSS to correspond to your survey ID numbers. Thus, for survey #1, you entered the first respondent’s information in row 1, for survey #2 you entered the second person’s information in row 2, and so on. Now you have entered all of your data.

But suppose the data get rearranged in the spreadsheet view. A common way of rearranging data is by sorting—and you may very well need to do this as you explore and analyze your data. Sorting will rearrange the rows of data so that the values appear in ascending or descending order. If you right-click on any variable name, you can select “Sort Ascending” or “Sort Descending.” In the example below, the data are sorted in ascending order on the values for the variable School_Class.

But what happens if you need to view a specific respondent’s information? Or perhaps you need to double-check your entry of the data by comparing the original survey to the values you entered in SPSS. Now that the data have been rearranged, there is no way to identify which row corresponds to which participant/survey number.

The main point is that you should not rely on the row numbers in SPSS since they are merely visual guides and not part of your data. Instead, you should create a specific variable that will serve as an ID for each case so that you can always identify certain cases in your data, no matter how much you rearrange the data. In the sample data file, the variable ids acts as the ID variable.

If you do not have an ID variable in your dataset, a convenient way to generate one is to use the system variable $CASENUM. You can either use the Compute Variables procedure (simply enter $CASENUM in the Numeric Expression box) or run the following syntax after all of your data has been entered:

COMPUTE id=$CASENUM.
EXECUTE.
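To see why a stored ID is useful, the sorting scenario described earlier can be reproduced in syntax. This sketch assumes the id variable created by the COMPUTE command above and the School_Class variable from the example.

```spss
/*Sorting rearranges the rows, so the row numbers no longer match the original entry order.*/
SORT CASES BY School_Class (A).

/*Because id is stored with each case, sorting on it restores the original entry order.*/
SORT CASES BY id (A).
```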

SPSS Tutorials: Variable Types

A variable’s type determines whether a variable is numeric or character, quantitative or qualitative. It also dictates what type of statistical analysis methods are appropriate for that data. This tutorial covers the variable types that SPSS recognizes.

Variable Types

In order for your data analysis to be accurate, it is imperative that you correctly identify the type and formatting of each variable. SPSS has special restrictions in place so that statistical analyses can’t be performed on inappropriate types of data: for example, you won’t be able to use a continuous variable as a “grouping” variable when performing a t-test.

Information for the type of each variable is displayed in the Variable View tab. Under the “Type” column, simply click the cell associated with the variable of interest. A blue “…” button will appear.

Click this and the Variable Type window will appear. You can use this dialog box to define the type for the selected variable, and any associated information (e.g., width, decimal places).

The two common types of variables that you are likely to see are numeric and string.

Numeric

Numeric variables, as you might expect, have data values that are recognized as numbers. This means that they can be sorted numerically or entered into arithmetic calculations. When viewed in the Data View window, system-missing values for numeric variables will appear as a dot (i.e., “.”). (Note that one should not type in a period character in a cell to specify a missing value. Simply leave the cell blank, and SPSS will recognize it as system-missing.)

Importantly, numeric variables in SPSS can also be used to denote nominal (unordered) or ordinal categorical variables. In those cases, it is almost always inappropriate to treat those variables as numbers, even though SPSS may not stop you from doing so. For example, it’s extremely common to record demographic variables like sex using the number codes 1 and 2 instead of the words “male” and “female”. Although these would be defined as numeric variables in your SPSS dataset, it would not be appropriate to use them in arithmetic operations, since the number codes are stand-ins for nominal categories (and nominal categories can’t be used in arithmetic operations). So if you are examining a new dataset, you should not assume that all numeric variables represent interval or ratio variables.

All of the following are examples of variables that could be entered as numeric variables in an SPSS dataset:

Example: Continuous variables that can take on any number in a range (e.g., height in centimeters and weight in kilograms) should be treated as numeric variables. The researcher can choose as many or as few decimal places as they feel are necessary. In this situation, the Measure setting should be defined as Scale; see the Defining Variables tutorial for more information on how to set measurement levels. This particular type of numeric variable is appropriate to use in arithmetic operations (adding, subtracting, multiplying, dividing).

Example: Counts (e.g., number of people living in a household) should be treated as numeric variables with zero decimal places. In this situation, the Measure setting should be defined as Scale. Certain mathematical calculations are valid when applied to count variables (e.g., mean and standard deviation), but count variables may not be suitable for some statistical procedures that require continuous numeric variables (e.g., as the dependent variable in a linear regression), depending on the distribution of the variable.

Example: Nominal categorical variables that have been coded numerically (e.g., recording a subject’s gender as 1 if male or 2 if female) should be treated as numeric variables with zero decimal places. In this situation, the Measure setting must be defined as Nominal. This type of numeric variable should never be used in mathematical calculations, nor used in any statistical procedure requiring continuous numeric variables (e.g. the dependent variable of a linear regression).

Example: Ordinal categorical variables that have been coded numerically (e.g., a questionnaire item with responses 1=Small, 2=Medium, 3=Large) should be treated as numeric variables with zero decimal places. In this situation, the number codes allow us to correctly convey that Large is “greater than” Small in a meaningful way; however, it is not safe to assume that the “distance” between Large and Medium is the same as the “distance” between Medium and Small. (This is because our choice of number codes is arbitrary and not tied to any physical meaning.) In this situation, the Measure setting must be defined as Ordinal. This type of numeric variable should never be used in any statistical procedure requiring continuous numeric variables (e.g. the dependent variable of a linear regression), and in most situations it is not appropriate to use ordinal variables in mathematical calculations, though there are some notable exceptions. (One such example is computing a composite score for a validated survey instrument by summing or averaging its constituent Likert items, though this is not without controversy.)
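For numerically coded categories, the codes, their labels, and the measurement level can be declared together in syntax. The following is a sketch only: sex and shirt_size are hypothetical variable names that are assumed to already exist in the dataset with the codes described in the examples above.

```spss
/*Hypothetical nominal variable: the number codes stand in for unordered categories.*/
VALUE LABELS sex 1 'Male' 2 'Female'.
VARIABLE LEVEL sex (NOMINAL).
FORMATS sex (F1.0).

/*Hypothetical ordinal variable: the codes are ordered, but the distances between them are not meaningful.*/
VALUE LABELS shirt_size 1 'Small' 2 'Medium' 3 'Large'.
VARIABLE LEVEL shirt_size (ORDINAL).
FORMATS shirt_size (F1.0).
```

Attaching value labels does not change the stored numbers; it only documents what each code means.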

String

String variables — which are also called alphanumeric variables or character variables — have values that are treated as text. This means that the values of string variables may include numbers, letters, or symbols. In the Data View window, missing string values will appear as blank cells. However, note that these blank cells are not recognized by SPSS as system-missing values (i.e., SPSS considers even blank strings to be non-missing)! This has important implications if you plan to use a string variable in an analysis, since it will affect your sample size.

Example: Zip codes and phone numbers, although composed of numbers, would typically be treated as string variables because their values cannot be used meaningfully in calculations.

Example: Any written text is considered a string variable, including free-response answers to survey questions.
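Because blank strings are not counted as missing, it can help to count them explicitly before using a string variable in an analysis. This is a minimal sketch, assuming a hypothetical string variable named free_response; the comparison is true when the value contains only blanks.

```spss
/*Flag blank string values: 1 = blank, 0 = not blank. free_response is a hypothetical string variable.*/
COMPUTE is_blank = (free_response = ' ').
EXECUTE.

/*Count how many cases are blank.*/
FREQUENCIES VARIABLES=is_blank.
```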

The next few variable types are all technically numeric, but indicate special formatting. If your data has been recorded in one of these formats, you must set the variable type appropriately so that SPSS can interpret the variables correctly. (For example, SPSS cannot correctly use dates in calculations unless the variables are specifically defined as date variables.)

Comma

Numeric variables that include commas that delimit every three places (to the left of the decimals) and use a period to delimit decimals. SPSS will recognize these values as numeric even if they contain commas or use scientific notation.

Example: Thirty-thousand and one half: 30,000.50

Example: One million, two hundred thirty-four thousand, five hundred sixty-seven and eighty-nine hundredths: 1,234,567.89

Dot

Numeric variables that include periods that delimit every three places and use a comma to delimit decimals. SPSS will recognize these values as numeric even if they contain periods or use scientific notation.

Example: Thirty-thousand and one half: 30.000,50

Example: One million, two hundred thirty-four thousand, five hundred sixty-seven and eighty-nine hundredths: 1.234.567,89

Note about comma versus dot notation: comma notation is standard in the United States. Oracle’s International Language Environments Guide gives a list of countries and what form of notation is typically found in each.
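Comma and dot notation can also be applied with the FORMATS command. In this sketch, income is a hypothetical numeric variable, and the width of 12 with 2 decimal places is an arbitrary choice that is wide enough for the examples above.

```spss
/*income is a hypothetical numeric variable.*/
/*Comma format: commas every three places, period for decimals.*/
FORMATS income (COMMA12.2).

/*Dot format: periods every three places, comma for decimals.*/
FORMATS income (DOT12.2).
```

Only the way the values are displayed changes; the numbers stored in the dataset are the same under either format.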

Scientific notation

Numeric variables whose values are displayed with an E and power-of-ten exponent. Exponents can be preceded by either an E or a D, with or without a sign, or only with a sign (no E or D). SPSS will recognize these values as numeric, with or without an exponent.

Example: 1.23E2, 1.23D2, 1.23E+2, 1.23+2.

Date

Numeric variables that are displayed in any standard calendar date or clock-time formats. Standard formats may include commas, blank spaces, hyphens, periods, or slashes as space delimiters.

Example: Dates: 01/31/2013, 31.01.2013

Example: Time: 01:02:33.7

Dollar

Numeric variables that contain a dollar sign (i.e., $) before numbers. Commas may be used to delimit every three places, and a period can be used to delimit decimals.

Example: Thirty-three thousand dollars and thirty-three cents: $33,000.33

Example: One million dollars and twelve point three cents: $1,000,000.123

Custom currency

Numeric variables that are displayed in a custom currency format. You must define the custom currency in the Variable Type window. Custom currency characters are displayed in the Data Editor but cannot be used during data entry.

Restricted number

Numeric variables whose values are restricted to non-negative integers (in standard format or scientific notation). The values are displayed with leading zeroes padded to the maximum width of the variable.
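The restricted number type corresponds to the N format in syntax. This is a sketch, assuming a hypothetical numeric variable named participant_code that holds non-negative integers of at most five digits.

```spss
/*participant_code is a hypothetical numeric variable.*/
/*With a width of 5, the value 501 is displayed with leading zeroes as 00501.*/
FORMATS participant_code (N5).
```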

SPSS Tutorials: Date-Time Variables in SPSS

This tutorial covers how SPSS treats Date-Time variables, and how to use the Date & Time Wizard to create and compute variables using dates and times.

1. Working with Dates and Times in SPSS

Datasets often include variables that denote dates or time. Thus, it is important to know how SPSS treats and works with such variables. In the following sections, we will discuss:

2. Date-Time Variables in SPSS

In SPSS, date-time variables are treated as a special type of numeric variable. All SPSS date-time variables, regardless of whether they’re a date or a duration, are stored in SPSS as the number of seconds since October 14, 1582. This means that “under the hood”, date-time variables are actually just numbers! This might not seem important, but it’s what makes it possible to do “date arithmetic”, such as computing the elapsed time between two dates, or adding and subtracting units of time from a date.
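One way to see the stored value is to copy a date into a new variable and display the copy with an ordinary numeric format. This sketch assumes the sample dataset's date variable bday; the name bday_seconds and the F15.0 format are choices made here for illustration.

```spss
/*Copy the date into a new variable and display the copy as a plain number of seconds.*/
COMPUTE bday_seconds = bday.
FORMATS bday_seconds (F15.0).
EXECUTE.

/*List both versions side by side for the first five cases.*/
LIST VARIABLES=bday bday_seconds /CASES=FROM 1 TO 5.
```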

Fortunately, you as the user do not normally need to interact with the underlying integers, and you can type in data values for date and time variables using normal date-time conventions. However, dates and times can be written using a number of different conventions, so we need a way to tell SPSS how to read and parse our date strings. That’s where the concept of date formats comes in.

3. Standard Formats for Dates and Times

When reading data containing dates or using certain date-time functions, we need to tell SPSS which date format to use, so that it knows how to correctly parse the components of the input string. A format is a named, pre-defined pattern that tells SPSS how to interpret and/or display different types of variables. There are different formats for different variable types, and each format in SPSS has a unique name.

Date-time formats are used in several situations:

Your choice of format will depend on whether the input is a date or a duration, as well as the time units included in the data value, the order of the units (e.g. month-day-year versus year-month-day), and the presence or absence of delimiters [1].

Date Formats

The actual date formats that you will use in your SPSS syntax are as follows.

SPSS date formats as applied to an example date/time of Thursday, January 31, 2013, 1:02:33.72 (AM).

In the “Date-Time Unit” column, the date components are represented using the following codes:

In the “general form” column, the name of the format appears first, followed by the letter w (or w.d). The letter w denotes the number of “columns” (typically the number of characters in the input string), and the letter d represents the number of decimal places, if present. You will replace these with the appropriate number to use for the width of the date.

You’ll see an example of how date-time formats are used in the example of converting a string variable to a date variable.

Durations

The actual duration formats that you will use in your SPSS syntax are as follows.

In the “Duration Unit” column, the time components are represented using the following codes:

Just as with date formats, the “general form” of the format name contains w (or w.d). The letter w denotes the number of “columns” (typically the number of characters in the input string), and the letter d represents the number of decimal places, if present. You will replace these with the appropriate number to use for the width of the duration.

Notice how in the column of examples, SPSS took the same underlying data and automatically converted the time units based on the formats we chose. When we used the DTIME format, it knew that 29 hours should “roll over” to 1 day, 5 hours. When we used the MTIME format, it knew that 29 hours, 14 minutes is equal to (29*60) + 14 = 1754 minutes. This is one of the benefits of using date-time variables to represent dates and durations: they give us the option to change how the data is displayed without needing to do the conversion arithmetic ourselves.
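The roll-over behavior can be tried directly by storing one duration and displaying copies of it under different formats. This is a sketch under assumptions: the variable names are made up, the duration is entered as a number of seconds (29 hours, 14 minutes), and the widths 8, 12, and 8 are simply chosen to be wide enough for this value.

```spss
/*Store 29 hours, 14 minutes as a number of seconds.*/
COMPUTE dur_time = (29 * 60 * 60) + (14 * 60).

/*Make two copies so the same value can be shown under other duration formats.*/
COMPUTE dur_dtime = dur_time.
COMPUTE dur_mtime = dur_time.

/*TIME shows hours:minutes:seconds, DTIME shows days and hours, MTIME shows minutes:seconds.*/
FORMATS dur_time (TIME8) dur_dtime (DTIME12) dur_mtime (MTIME8).
EXECUTE.
```

All three variables hold the same stored value; only the display differs.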

Note: As of SPSS version 24, the above date formats will correctly recognize date strings without delimiters as long as the lengths of the other elements are correct (i.e., leading zeroes where necessary in the day, month, hour, minute, and second, so that those components are each two characters long). (Source) In previous versions, these date formats would not recognize dates that did not contain the appropriate delimiters.

4. Defining Date-Time Variables in the Variable View Tab

It is important to specify which variables in your data are dates/times so that SPSS can recognize and use these variables appropriately. However, the procedure for defining a variable as date/time depends on its currently defined type (e.g., string, numeric, date/time). The following sections outline how to define a variable as date/time based on the variable’s current type.

Changing a variable type from string or numeric to date/time

If your dataset includes a variable whose values represent dates or time, but the variable is currently defined as string or numeric, you should specify that the variable is actually a date/time. You can specify the variable type as date/time by clicking the Variable View tab, locating the variable, and clicking on the cell beneath the “Type” column. A blue “…” button will appear. Clicking the blue “…” button opens the Variable Type window. Select “Date” from the list of variable types. Then, on the right, select the format in which the date/time for that variable should appear (by selecting the date/time format in which the values already appear). Click OK. Now SPSS will recognize the variable as date/time.

Note: These steps work only if the variable values are already in a standard date/time format but are currently defined as string/numeric, and only if you define the variable as date/time by selecting a date/time format that mirrors the existing format. For example, if the values appear as “Aug 1991”, you should select the date/time format that matches that month-and-year layout. If you try to select a format that includes additional or different information, the change in format may fail and blank out the data.

Example: This scenario is likely if you import data from another file source, such as Excel, and SPSS does not immediately define the variable type as date/time, even though the values are in a standard date/time format.

Thus, the following criteria must apply in order to use the steps outlined above:

Changing the variable type from string or numeric to a date/time format that is different from the date/time format in which the values currently appear

If the variable is already in a standard date/time format but is currently defined as string or numeric, and you wish to both A) define the variable as date/time, and B) choose a different date/time format than the one that matches the current format, you must proceed in two steps.

Note: If the dates for a selected variable appear as mm/dd/yyyy and are currently defined as “String” in the “Type” of variable in Variable View, you cannot change the “Type” to “Date” and select the new format in which you want the date/time values to appear. You must first select the format in which the dates/times currently appear. Then, you can repeat this process to select the new format in which you want the dates to appear. If you do not first define a variable as a “Date” and select the current date/time format before selecting the format to which you want to change it, the values for that variable will be defined as missing.

Example: If a variable with date/time values is currently defined as string or numeric, but all the values follow the form mm/dd/yyyy (e.g., 01/31/2013), then you must select this format (mm/dd/yyyy) when you change the variable’s type to date/time. Do not select a format that does not match the current format of the values.
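The two-step logic can also be written in syntax. The sketch below assumes a hypothetical string variable named visit_date whose values all follow mm/dd/yyyy; ALTER TYPE performs the conversion and FORMATS changes only the display. Values that do not match the stated format are likely to become missing, so it is safer to try this on a copy of the data first.

```spss
/*Step 1: convert the string to a date using the format the values already follow (mm/dd/yyyy).*/
ALTER TYPE visit_date (ADATE10).

/*Step 2: now that it is a date, switch the display to another format (here yyyy/mm/dd).*/
FORMATS visit_date (SDATE10).
```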

Thus, the following criteria must apply in order to use the steps outlined above:

Changing variables defined as dates/time from one date format to another date format

If a variable type is already defined as date/time, then changing the format of the values to a different date/time format is simple. In Variable View, under the column “Type,” select the cell that corresponds to the variable you want to change. A blue “…” button will appear, which opens the “Variable Type” dialog box. “Date” should already be selected from the list of variable types on the left. On the right, select the new date/time format in which you would like the variable values to appear. Click OK. Now click the Data View tab to view your data; your dates should now appear in the format you selected.

Note: If you select a new format that includes space for information that does not actually appear in your dataset, it will appear as 0s in the data. For example, if your data only includes information about the month, day, and year, and you select a format that also includes space for the hour, minute, and second, values will appear like this one: 31-JAN-2013 00:00:00.

Example: Perhaps your date is defined as date/time and appears as “01/31/2013,” but you would like it to appear as “2013/01/31,” instead.
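For a variable that is already a date, the display format can also be switched back and forth with FORMATS alone. This sketch uses the sample dataset's bday variable and assumes it currently contains only month, day, and year information.

```spss
/*A format with room for hours, minutes, and seconds shows the unused time portion as zeroes.*/
FORMATS bday (DATETIME20).

/*Switch back to a date-only display (mm/dd/yyyy).*/
FORMATS bday (ADATE10).
```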

5. Setting the Century Range for Two-Digit Years

When writing dates, it’s common to see individuals abbreviate the year to two digits, especially in contexts where the century is “obvious” to the reader. This is fine when making notes to yourself, but when you’re trying to compile data for analysis, this can be hugely problematic, especially when working with data that covers a large time range, or is very far in the past.

In general, we recommend always using four-digit years when entering data for dates. But sometimes you may not be in control of how the data was entered — you may receive or request a dataset where the dates only used two-digit years. For these situations, it’s important to know how to appropriately define the century range in SPSS.

In SPSS, the century range refers to the 100-year range that SPSS will assume when parsing date variables with two-digit years. For example: when you read the date 1/1/80, do you assume that I mean 1/1/1980 or 1/1/2080? If you didn’t have any other context clues, you’d probably base your guess on the current year (2020). You might go with the century that makes the two-digit year closer to the current year, which would mean 1/1/1980. Or, you might assume that the century should match the current century, which would mean 1/1/2080.

The default century range in SPSS is based on the current year: it will start the range at 69 years prior to the current year and end the range at 30 years after the current year (source). So if you are using SPSS in the year 2020, it will assume that the century range is 1951 to 2050; but if you open SPSS a year later, SPSS will assume that the century range is 1952 to 2051.

Why does the century range matter? If you are going to compute elapsed time, or want to use your date variables as a predictor in a model, you can imagine how problematic it would be if one of the dates was off by 100 years! For this reason, it’s critical that you specify the appropriate century range when working with dates containing two-digit years.

To change the century range for two-digit years, follow these steps:

Using the Dialog Windows

Using Syntax

Alternatively, you can set the century range using the SET EPOCH command:

SET EPOCH=yyyy.

The yyyy to the right of the equals sign is the desired beginning year for the century range. For example, SET EPOCH=1900 would set the century range to 1900 to 1999, while SET EPOCH=1950 would set the century range to 1950 to 2049.
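A short sequence for setting, checking, and resetting the century range might look like the following. The starting year 1950 is just the example value used above.

```spss
/*Interpret two-digit years as falling in 1950 through 2049.*/
SET EPOCH=1950.

/*Display the current setting to confirm the change.*/
SHOW EPOCH.

/*Return to the default range, which is based on the current year.*/
SET EPOCH=AUTOMATIC.
```

Setting the range before reading or converting any two-digit-year dates is the safer order, since the setting is applied when the dates are parsed.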

6. Date and Time Wizard

SPSS conveniently includes a Date and Time Wizard that can assist with transformations and calculations that involve date and time variables. To access the Date and Time Wizard, click Transform > Date and Time Wizard.

The Date and Time Wizard window will appear.

Although there are many options, it is useful to begin by first reading about how dates and times are represented in SPSS. We have selected this option (Learn how dates and times are represented) in the Date and Time Wizard window (depicted above). Now, click Next. You will see the following window.

When you are finished reading, click Back to return to the main Date and Time Wizard menu.

Note that the Date and Time Wizard can assist with many tasks related to dates and time, including:

We will not cover each of these options in this tutorial, but we will cover one of the most common uses for the Date and Time Wizard: calculations involving dates and times.

7. Example: Converting a string variable to a date variable

Problem Statement

If you have datetime variables in a text or CSV file, SPSS will often read those variables in as string or character variables, instead of treating them as actual dates. In order to have those variables correctly recognized, you’ll need to convert them from string to date.

In the sample dataset, the variable enrolldate (date of college enrollment) contains dates in the form dd-mmm-yyyy, but was read into the dataset as a string variable. Let’s convert that variable from a string to a numeric date.

Running the Procedure

Using the Date & Time Wizard

Using Syntax

COMPUTE date_of_enrollment=number(enrolldate, DATE11).
VARIABLE LABELS date_of_enrollment 'Date of college enrollment'.
VARIABLE LEVEL  date_of_enrollment (SCALE).
FORMATS date_of_enrollment (DATE11).
VARIABLE WIDTH  date_of_enrollment(11).
EXECUTE.

What’s going on in this syntax?

8. Example: Computing Elapsed Time between Two Date-Time Variables

Problem Statement

Sometimes you may need to calculate the length of time that has passed between two points in time. For example, you may wish to calculate the ages of people in your sample based on information you have about when they were born and what the current day/time/year is (or another date of your choosing). Any unit of time can be used. This means that you can calculate how many years, months, days, hours, minutes, or even seconds old each person is.

Before we can perform a calculation with dates and times, we first need to make sure that our dataset has at least two variables that represent time points. If you completed the above example, you will now have at least two date variables in the sample dataset: bday (the person’s date of birth) and now date_of_enrollment (the date the person enrolled in college). We can compute the age that each person was when they enrolled in college using these two time points.
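Before computing anything, it can be worth confirming that both variables really are stored as dates rather than strings. One way to check, assuming the variable names from the sample dataset, is to print their dictionary information:

```spss
/*Show the type, format, and label of both date variables.*/
DISPLAY DICTIONARY /VARIABLES=bday date_of_enrollment.
```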

Running the Procedure

Using the Date & Time Wizard

Once your new variable has been created, it is always a good idea to check that the calculation was accurate. You can do this by spot-checking some of the rows in your data. You can manually calculate the time between date_of_enrollment and bday for some of the cases in the data and then compare the manual calculation to the value SPSS created in the new variable age_at_enrollment.
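Part of this spot-check can be done in syntax by computing the age a second way and comparing. This is a sketch, assuming age_at_enrollment has already been created; age_check is a made-up name, and DATEDIFF returns whole years (truncated, not rounded), so it should equal the integer part of age_at_enrollment in most cases rather than match it exactly.

```spss
/*Whole years between the two dates, truncated rather than rounded.*/
COMPUTE age_check = DATEDIFF(date_of_enrollment, bday, "years").
EXECUTE.

/*Compare the two results for the first ten cases.*/
LIST VARIABLES=bday date_of_enrollment age_at_enrollment age_check /CASES=FROM 1 TO 10.
```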

Using Syntax

COMPUTE age_at_enrollment=(date_of_enrollment - bday) / (365.25 * time.days(1)).
VARIABLE LABELS age_at_enrollment "Age at time of enrollment (years)".
VARIABLE LEVEL age_at_enrollment (SCALE).
FORMATS age_at_enrollment (F8.2).
VARIABLE WIDTH  age_at_enrollment(8).
EXECUTE.

What’s going on in this syntax?

Computing Variables in SPSS

This tutorial shows how to compute new variables in SPSS using formulas and built-in functions. The data file used in the examples is available to download here.

1. Computing Variables

Sometimes you may need to compute a new variable based on existing information (from other variables) in your data. For example, you may want to:

In this tutorial, we’ll discuss how to compute variables in SPSS using numeric expressions, built-in functions, and conditional logic.

To compute a new variable, click Transform > Compute Variable.

The Compute Variable window will open where you will specify how to calculate your new variable.

A- Target Variable: The name of the new variable that will be created during the computation. Simply type a name for the new variable in the text field. Once a variable is entered here, you can click on “Type & Label” to assign a variable type and give it a label. The default type for new variables is numeric.

B- The left column lists all of the variables in your dataset. You can use this menu to add variables into a computation: either double-click on a variable to add it to the Numeric Expression field, or select the variable(s) that will be used in your computation and click the arrow to move them to the Numeric Expression text field (C).

C- Numeric Expression: Specify how to compute the new variable by writing a numeric expression. This expression must include one or more variables from your dataset, and can use arithmetic or functions.

When writing an expression in the Compute Variables dialog window:

D- The center of the window includes a collection of arithmetic operators, Boolean operators, and numeric characters, which you can use to specify how your new variable will be calculated. There are many kinds of calculations you can specify by selecting a variable (or multiple variables) from the left column, moving them to the center text field, and using the blue buttons to specify values (e.g., “1”) and operations (e.g., +, *, /).

E- If: The If option allows you to specify the conditions under which your computation will be applied.

F- Function group: You can also use the built-in functions in the Function group list on the right-hand side of the window. The function group contains many useful, common functions that may be used for calculating values for new variables (e.g., mean, logarithm). To find a specific function, simply click one of the function groups in the Function Group list. You will now see a list of functions that belong to that function group in the Functions and Special Variables area. If you click on a specific function, a description of that function will appear in the text field to the left.

Click If (indicated by letter E in the above image) to open the Compute Variable: If Cases window.

1- The left column displays all of the variables in your dataset. You will use one or more variables to define the conditions under which your computation should be applied to the data.

2- The default specification is to Include all cases. To specify the conditions under which your computation should be applied, however, you will need to click Include if case satisfies condition. This will allow you to specify the conditions under which the computation will be applied to your data.

3- The center of the dialog box includes a collection of arithmetic operators, Boolean operators, and numeric characters, which you can use to specify the conditions under which your computation will be applied to the data. There are many kinds of conditions you can specify by selecting a variable (or multiple variables) from the left column, moving them to the center text field, and using the blue buttons to specify values (e.g., “1”) and operations (e.g., +, *, /). You can also use the built-in functions in the Function Group list under the right column.

After you are finished defining the conditions under which your computation will be applied to the data, click Continue. Note that when you specify a condition in the Compute Variable: If Cases window, the computation will only be performed on the cases meeting the specified condition. If a case does not meet that condition, it will be assigned a missing value for the new variable.
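
In syntax, a conditional computation of this kind can be written with the IF command. The following is a minimal sketch; attended, raw_score, and adjusted_score are hypothetical variable names used only for illustration.

* Hypothetical variables: adjusted_score is computed only where attended equals 1.
* Cases that do not meet the condition are left system-missing on adjusted_score.
IF (attended = 1) adjusted_score = raw_score + 5.
EXECUTE.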

2. Computing Variables using Syntax

You do not necessarily need to use the Compute Variables dialog window in order to compute variables or generate syntax. You can write your own syntax expressions to compute variables (and it is often faster and more convenient to do so!). Syntax expressions can be executed by opening a new Syntax file (File > New > Syntax), entering the commands in the Syntax window, and then pressing the Run button.

The general form of the syntax for computing a new (numeric) variable is:

COMPUTE NewVariableName = formula.
EXECUTE.

The first line gives the COMPUTE command, which specifies the name of the new variable on the left side of the equals sign, and its formula on the right side of the equals sign. The formula on the right side of the equals sign corresponds to what you would enter in the Numeric Expression field in the Compute Variables dialog window.

The EXECUTE command on the second line is what actually carries out the computation and adds the variable to the active dataset. (If you have tried to run COMPUTE syntax but do not see variables added to your dataset and do not also see error or warning messages in the Output Viewer, you may have forgotten to run the EXECUTE statement.)

Notice how each line of syntax ends in a period.

It’s also possible to use COMPUTE syntax to compute or transform string variables (i.e., variables containing characters other than numbers). To compute string variables, the general syntax is virtually identical. However, with string variables, you must first “declare” a new variable as a string variable before you can define it using a COMPUTE statement:

STRING  NewVariableName (A20).
COMPUTE NewVariableName = formula.
EXECUTE.

On the first line, the STRING statement declares the new string variable’s name (NewVariableName) and its format (A20). Note that the format must be put inside parentheses. The format specification for strings will always start with the letter A, followed by a number giving the “width” of the string (the maximum number of characters that variable can contain). In this case, the new variable will have a width of 20, so data values can contain up to 20 characters. When declaring a new string variable, you should take care to set the width of the string to be wide enough so that your data values aren’t accidentally cut short. On the second line, the COMPUTE statement gives the actual formula for the variable declared in the STRING statement. On the third line, the EXECUTE command tells SPSS to carry out the computation.
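
As an illustration of the pattern, here is a minimal sketch that joins two string variables into one. The variable names first_name, last_name, and full_name are hypothetical, and the width of 40 is an assumption that the combined text never exceeds 40 characters; anything longer would be cut short.

* Hypothetical variables: declare the new string before computing it.
STRING full_name (A40).
* RTRIM removes trailing blanks from first_name before the pieces are joined.
COMPUTE full_name = CONCAT(RTRIM(first_name), " ", last_name).
EXECUTE.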

In general, when writing an expression or formula using COMPUTE syntax:

3. Example: Computing a New Variable Using Arithmetic

Now we will use what we have learned throughout this tutorial to demonstrate how to compute a new variable. In this example, we wish to compute BMI for the respondents in our sample. The height (in inches) and weight (in pounds) of the respondents were observed; so to compute BMI, we want to plug those values into the formula BMI = (Weight × 703) / Height², where Weight is in pounds and Height is in inches.

Using the Compute Variables Dialog Window

Using Syntax

Alternatively, you can produce the same result by opening a syntax window (File > New > Syntax) and executing the following code:

COMPUTE BMI=(Weight*703)/(Height**2). 
EXECUTE.

This syntax can be generated automatically by following the dialog window steps above and clicking Paste instead of OK.

4. Example: Computing a New Variable Using a Built-In Function

Using the Compute Variables Dialog Window

Let’s instead try computing the average test score using the built-in mean function.

Using Syntax

Alternatively, you can produce the same result by opening a syntax window (File > New > Syntax) and executing the following code:

COMPUTE AverageScore2=MEAN(English, Reading, Math, Writing). 
EXECUTE.

This syntax can be generated automatically by following the dialog window steps above and clicking Paste instead of OK.

5. Example: Referring to a Range of Variables in a Function

Notice that the test score variables in the sample dataset are all next to each other. In the previous example, we explicitly specified all four test score variables in the MEAN function. But what if there had been ten or twenty test score variables? It would take much longer to manually enter all twenty variable names.

What if we wanted to refer to the entire range of test score variables, beginning with English and ending with Writing, without having to type out each variable’s name?

When using SPSS’s special built-in functions, you can refer to a range of variables by using the keyword TO. Let’s repeat the previous example and show how the TO keyword is used to refer to a range of variables inside a function.

This method is dependent on the positions of the variables in the dataset. If the variables are not in sequential order, this method may not work correctly.

Using the Compute Variables Dialog Window

If you’ve already verified the computation for AverageScore2, then you should be able to verify that AverageScore2 and AverageScore3 are identical.

Using Syntax

Alternatively, you can produce the same result by opening a syntax window (File > New > Syntax) and executing the following code:

COMPUTE AverageScore3=MEAN(English TO Writing).
EXECUTE.

This syntax can be generated automatically by following the dialog window steps above and clicking Paste instead of OK.

6. Example: Computing Subscale Scores when Some Values Missing

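In the previous examples, we did not talk about what happens when one or more of the variables has missing values for a given case. In an arithmetic expression (one built from operators such as + and /), if there is a missing value for one or more of the input variables, SPSS assigns the new variable a missing value. That is, there must be valid values for each input variable in order for the computation to work. This is called listwise exclusion. Built-in statistical functions such as MEAN() are more lenient by default: they are computed from whichever arguments have valid values, even if only one of them is valid.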

Listwise exclusion can end up throwing out a lot of data, especially if you are computing a subscale from many variables.

In SPSS, you can modify any function that takes a list of variables as arguments using the .n suffix, where n is an integer indicating how many nonmissing values a given case must have. As long as a case has at least n valid values, the computation will be carried out using just the valid values.

In the previous example, we used the built-in MEAN() function to compute the average of the four placement test scores. If we change the formula for AverageScore3 to MEAN.3(English TO Writing), then any case with three or more nonmissing values will have a successful, nonmissing value for AverageScore3. (Stated another way, a given case could have at most one missing test score and still be OK.)

Alternatively, using the formula MEAN.2(English TO Writing) would require that two or more of the test score variables have valid values (i.e., a given case could have at most two missing test scores).

Syntax

If you click Paste after revising the formula, the following syntax will be written to the syntax editor window:

COMPUTE AverageScore3=MEAN.3(English TO Writing).
EXECUTE.
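
To see how missing values affect each approach, the following minimal sketch computes three helper variables side by side. The names n_valid_scores, avg_arithmetic, and avg_mean3 are hypothetical; the sketch assumes English, Reading, Math, and Writing are the four adjacent test score variables used in the earlier examples.

* Count how many of the four test scores are nonmissing for each case.
COMPUTE n_valid_scores = NVALID(English TO Writing).
* Arithmetic version: missing if any one of the four scores is missing.
COMPUTE avg_arithmetic = (English + Reading + Math + Writing) / 4.
* Function version: computed whenever at least three scores are valid.
COMPUTE avg_mean3 = MEAN.3(English TO Writing).
EXECUTE.

Comparing the three new columns in Data View for cases where n_valid_scores is less than 4 shows which cases each method keeps.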

7. Example: Computing a New Indicator from Several Existing Indicators

A common scenario on health questionnaires is to have multiple questions about risk factors for a certain disease. These questions may originally be coded as 0 (absent) and 1 (present); or 0 (no) and 1 (yes). For example, on a questionnaire about ADHD, we may ask three questions about whether an individual’s biological parents or siblings have been diagnosed with ADHD:

Suppose we want to only have a single indicator variable, where 0 = does not have any risk factors, and 1 = has one or more risk factors. The function ANY() is a convenient way to compute this indicator. The ANY function is designed to return the following:

The application we will demonstrate is intended to be used when you want to check for one specific value across many variables.

For this example, we will use this tiny dataset. Each variable represents a “yes/no” question, with 1=No, 2=Yes.

You can copy, paste, and execute the following syntax to generate this dataset in SPSS, or you can download the linked SPSS datafile below.

DATA LIST FREE (",") / q1 to q3.
BEGIN DATA.
1,2,2,
2,1,,
1,1,1,
2,,1,
1,,2,
1,1,,
1,2,1,
2,,2,
1,1,2,
,,,
1,,,
,,2,
2,2,2,
END DATA.
VALUE LABELS q1 to q3 1 'No' 2 'Yes'.

Using the Compute Variables Dialog Window

Using Syntax

Alternatively, you can produce the same result by opening a syntax window (File > New > Syntax) and executing the following code:

COMPUTE any_yes=ANY(2, q1, q2, q3).
EXECUTE.

/*Optional: add labels to the new indicator variable*/
VALUE LABELS any_yes 0 'No' 1 'Yes'.

This syntax (minus the VALUE LABELS line) can be generated automatically by following the dialog window steps above and clicking Paste instead of OK.

Result

Let’s check that the ANY() function produced the results that we expected. If you run the above code, you should get results that look like the following:

You should see that as long as a particular row has a value of Yes for at least one of q1, q2, or q3, it will have a value of 1 for any_yes. Notice that in rows 6 and 11, nonmissing values are all equal to No, so the resulting value of any_yes is 0. Also notice that the only case with a missing value for any_yes is row 10, which has missing values for all three of q1, q2, and q3.

What does this mean? If we go back to the ADHD example used at the start of this section, it implies that anyone whose mother, father, or biological sibling has been diagnosed with ADHD, is themselves considered to have a risk factor for ADHD. It does not assign “extra risk” if someone has two or more relatives that have been diagnosed.
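
If you did want to distinguish between one and several affected relatives, one alternative (not used in this example) is the COUNT command, which tallies how many variables equal a given value. A minimal sketch, where n_yes is a hypothetical variable name:

* Hypothetical variable n_yes: number of Yes (2) answers across q1 to q3.
COUNT n_yes = q1 q2 q3 (2).
EXECUTE.

Note that COUNT returns 0 rather than a missing value when all three questions are missing, so row 10 would need to be handled separately.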

8. Example: Using String Functions to Transform a Character Variable

When working with string variables — and especially when working with text data that’s been manually typed into the computer — your data values may have variation in capitalization. If you want to use this type of variable in an analysis, you’ll have to “standardize” the data values so that they all have the same patterns of capitalization, because SPSS considers each unique capitalization to be a different data value (even if the strings are otherwise identical).

A common string transformation is to convert a string to all uppercase or all lowercase characters. In SPSS, the functions UPCASE() and LOWER() will convert a string variable’s values to all uppercase characters or all lowercase characters, respectively.

In the sample dataset, the variable Major is a string variable containing open-ended, write-in responses asking for the person’s college major. If you create a frequency table of this variable (Analyze > Descriptive Statistics > Frequencies), you’ll notice that there are many rows of the table, and that some of the rows of the table are identical except for differences in capitalization:

If we want to merge the otherwise-identical categories of “Art History” and “Art history”, we’ll need to transform this variable so that the characters are all uppercased or all lowercased.

Using the Compute Variables Dialog Window

Using Syntax

STRING major_lowercase (A58).
COMPUTE major_lowercase = LOWER(Major).
EXECUTE.

Result

After executing the transformation and rerunning the frequency table on the transformed variable, we should see that the counts and frequencies of the previously duplicated categories are now combined:

While this variable is still not ready for analysis — for example, several duplicated categories exist because of misspellings or minor variations in wording — we have now completed the first step.
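
The before-and-after frequency tables can also be requested in syntax. A minimal sketch, assuming the major_lowercase variable was created as shown above:

* Frequency tables for the original and the lowercased variable.
FREQUENCIES VARIABLES=Major major_lowercase.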

Recoding Variables in IBM SPSS

This tutorial shows how to use Recode into Different Variables and DO IF syntax to change or merge the categories of string or numeric variables in SPSS. The data used in the examples is available to download here.

1. Recoding (Transforming) Variables

Sometimes you will want to transform a variable by combining some of its categories or values together. For example, you may want to change a continuous variable into an ordinal categorical variable, or you may want to merge the categories of a nominal variable. In SPSS, this type of transform is called recoding.

In SPSS, there are three basic options for recoding variables:

Each of these options allows you to re-categorize an existing variable. Recode into Different Variables and DO IF syntax create a new variable without modifying the original variable, while Recode into Same Variables will permanently overwrite the original variable. In general, it is best to recode a variable into a different variable so that you never alter the original data and can easily access the original data if you need to make different changes later on.

Recode into Different Variables

Recoding into a different variable transforms an original variable into a new variable. That is, the changes do not overwrite the original variable; they are instead applied to a copy of the original variable under a new name.

To recode into different variables, click Transform > Recode into Different Variables.

The Recode into Different Variables window will appear.

(A) The left column lists all of the variables in your dataset. Select the variable you wish to recode by clicking it. Click the arrow in the center to move the selected variable to the center text box, (B).

(B) Input Variable -> Output Variable: The center text box lists the variable(s) you have selected to recode, as well as the name your new variable(s) will have after the recode. You will define the new name in (C).

(C) Output Variable: Define the name and label for your recoded variable(s) by typing them in the text fields. Once you are finished, click Change. Now the center text box, (B), will display both the name of the original variable as well as the name for the new variable (e.g., “Height -> Height_categ”).

(D) Old and New Values: Click Old and New Values to specify how you wish to recode the values for the selected variable.

(E) If: The If option allows you to specify the conditions under which your recode will be applied. (We discuss the If option in more detail later in this tutorial.)

Old and New Values

Once you click Old and New Values, a new window will appear where you will specify how to transform the values.

(1) Old Value: Specify the type of value you wish to recode (e.g., a specific value, missing data, or a range of values) and the specific value to be recoded (e.g., a value of “1” or a range of “1-5”).

When recoding variables, always handle the missing values first! The most common recoding errors happen when you don’t tell SPSS explicitly what to do with missing values: SPSS may recode missing values into one of the new valid categories. This is especially true if using the “Lowest thru”, “thru Highest”, or “Range – through” options.

(2) New Value: Specify the new value for your variable (i.e., a specific numeric code such as “2,” system-missing, or copy old values).

(3) Old -> New: Once you have selected the old and new values for your selected variable in (1) and (2), click Add in area (3), Old–>New. The recode that you have specified now appears in the text field. If you need to change one of the recodes that you have added to the Old–>New area section, simply click on the one you wish to change and make changes in (1) and (2) as necessary.

You will need to repeat these steps for each value that you wish to recode. Once you have specified all the transformations that you wish to make for the selected variable, click the “Continue” button.

(4) Output variables are strings and Convert numeric strings to numbers: These options change the variable type of the new variable.

The “If” option

Sometimes you may wish to recode values for a specific variable only when other conditions in your data are satisfied. This means that cases meeting the conditions will be recoded, and cases not meeting the conditions will be assigned a missing value. To specify such conditions, click If to bring up the Recode into Different Variables: If Cases window.

(1) The left column displays all of the variables in your dataset. You will use one or more variables to define the conditions under which your recode should be applied to the data.

(2) The default specification for a recode is to Include all cases. To specify the conditions under which the recode should be applied, however, you will need to click Include if case satisfies condition. This will allow you to specify the conditions under which the recode will be applied to your data.

(3) The center of the window includes a collection of arithmetic operators, Boolean operators, and numeric characters, which you can use to specify the conditions under which your recode will be applied to the data. There are many kinds of conditions you can specify by selecting a variable (or multiple variables) from the left column, moving them to the center text field, and using the blue buttons to specify values (e.g., “1”) and operations (e.g., +, *, /). You can also use the options in the Function group list.

(4) The Function Group box contains common functions that may be used for calculating values for new variables (e.g., mean, logarithm, sine). After selecting a category, you will see function names appear in the Functions and Special Variables box. Double-clicking on a function name will add it to the “Include if case satisfies condition” box.

When you are finished defining the conditions under which your recode will be applied to the data, click Continue.

Note: Recode into Different Variables does not include the ability to add value labels to the new categories, so immediately after recoding, you should add value labels to your new numeric codes.

When you are ready to run the procedure, click OK. Now your new variable will be recoded according to the criteria you specified. You can find your new variable in the last column in Data View or in the last row of Variable View.

2. Recode into Same Variables

Recoding into the same variable (Transform > Recode into Same Variables) works the same way as described above, except that any changes made will permanently alter the original variable. That is, the original values will be replaced by the recoded values.

In general, it is good practice not to recode into the same variable because it overwrites the original variable. If you ever needed to use the variable in its original form (or wanted to double-check your steps), that information would be lost.

3. DO IF – ELSE IF Syntax

DO IF-ELSE IF syntax performs similarly to the Recode procedures, but allows for more control over specifying numeric ranges. If you want to discretize a numeric variable into more than three categories, or if you want to perform a recoding based on more than one variable, you’ll need to use DO IF-ELSE IF syntax. (You could use DO IF-ELSE IF for recoding a categorical variable, but there’s no real reason to use it over Recode; the Recode syntax is shorter and more efficient for that situation.)

The DO IF-ELSE IF syntax is:

DO IF (conditional statement).
  COMPUTE (variable assignment statement).
ELSE IF (conditional statement).
  COMPUTE (variable assignment statement).
...
ELSE.
  COMPUTE (variable assignment statement).
END IF.
EXECUTE.

The DO IF and ELSE IF lines tell SPSS to perform the nested computation if certain conditions are true. These conditions are statements (or chains of statements) that evaluate as true or false. For example:

A list of operators that SPSS recognizes in conditional (or logical) statements is given in the following table. Note that you can use the letter combinations or the mathematical symbols in your statements. You can also use parentheses to group or distribute the effects of an operator.

 

Operators for logical statements.

Operator   Symbol   Definition
EQ         =        Equal to
NE         ~=       Not equal to
LT         <        Less than
LE         <=       Less than or equal to
GT         >        Greater than
GE         >=       Greater than or equal to
AND        &        Both statements must be true
OR         |        One or both statements must be true
NOT        ~        Negation (must not be true)

The ELSE line tells SPSS to perform its nested computation on all other values not accounted for by the previous conditional statements. ELSE is optional: you don’t necessarily have to use it, but it is often more convenient than addressing every possible outcome using ELSE IF. If you do use ELSE, it must be at the very end of the block (right before the END IF statement).

When using DO IF, any conditions based on missing values must be included in the DO IF step; they cannot be included in ELSE IF statements. If missing value conditions are used in ELSE IF statements, they are ignored.

The COMPUTE statements are where the new variable(s) are actually computed or set. Note that if you want to set a variable equal to a missing value in a COMPUTE statement, use the syntax var=$SYSMIS. The term $SYSMIS refers to system-missing values. (Note that although SPSS indicates numeric missing values using period characters (.), you would not use the assignment statement var=.; this will return a syntax error.)
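
Putting these pieces together, a minimal sketch of a DO IF block that handles missing values in the first step might look like the following. The variable names score and score_group and the cut point of 50 are hypothetical.

* Hypothetical variables: score is the input, score_group is the new variable.
DO IF (MISSING(score)).
  COMPUTE score_group = $SYSMIS.
ELSE IF (score < 50).
  COMPUTE score_group = 1.
ELSE.
  COMPUTE score_group = 2.
END IF.
EXECUTE.

Because the missing-value test sits in the DO IF step, the ELSE branch only ever receives cases with a valid score.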

You may encounter this syntax error after executing a DO IF block:

Error # 4095.  Command name: EXECUTE
The transformations program contains an unclosed LOOP, DO IF, or complex file structure. Use the level-of-control shown to the left of the SPSS Statistics commands to determine the range of LOOPs and DO IFs. Execution of this command stops.

If this happens, you may need to add a hyphen (-) before the COMPUTE statement(s).

4. Example: Merging Categories

Problem Statement

Class ranks for high schools and colleges are nicknames for what year of study the person is completing: “freshman” (first-year), “sophomore” (second-year), “junior” (third-year), “senior” (fourth-year). Class ranks are also sometimes divided into “underclassmen” (first or second-year students) and “upperclassmen” (third or fourth-year students).

In the sample dataset, the variable Rank has the categories Freshman (1), Sophomore (2), Junior (3), and Senior (4). Let’s use Recode into Different Variables to merge the categories and create a new indicator variable called RankIndicator with the levels Underclassman (1) and Upperclassman (2).

Running the Procedure

We will show three different ways of defining the categories that produce identical results. You only have to use one of these; we show multiple methods to show that there is flexibility in how you define the groups.

Method 1

Using the Dialog Windows

This method tells SPSS exactly how to map each old category onto a new category.

Using Syntax

RECODE Rank (SYSMIS=SYSMIS) (1=1) (2=1) (3=2) (4=2) INTO RankIndicator.
VARIABLE LABELS  RankIndicator 'Class Rank (binary)'.
EXECUTE.

Method 2

This method uses ranges. Note that this method works OK for integers, but will often yield unexpected results when used on variables that have one or more nonzero decimal places.

Using the Dialog Windows

Using Syntax

RECODE Rank (SYSMIS=SYSMIS) (1 thru 2=1) (3 thru 4=2) INTO RankIndicator2.
VARIABLE LABELS  RankIndicator2 'Class Rank (binary)'. 
EXECUTE.

Method 3

This method uses the “Lowest thru” and “thru Highest” ranges. The “Lowest thru” option acts as “less than or equal to some-number“, and the “thru Highest” option acts as “greater than or equal to some-number“.

Using the Dialog Windows

Using Syntax

RECODE Rank (SYSMIS=SYSMIS) (Lowest thru 2=1) (3 thru Highest=2) INTO RankIndicator3.
VARIABLE LABELS  RankIndicator3 'Class Rank (binary)'. 
EXECUTE.

After recoding, we should be able to compare the frequencies of the old and new variables. There should be an identical number of missing values; the number of underclassmen should equal the sum of the number of freshmen and sophomores; and the number of upperclassmen should equal the sum of the number of juniors and seniors.
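
This comparison can be sketched in syntax as follows, assuming the new variable from Method 1 (RankIndicator). The value labels follow the category names given in the problem statement.

* Label the new codes so the output tables are readable.
VALUE LABELS RankIndicator 1 'Underclassman' 2 'Upperclassman'.
* Frequency tables show the valid and missing counts for both variables.
FREQUENCIES VARIABLES=Rank RankIndicator.
* The crosstab shows which old categories ended up in which new category.
CROSSTABS /TABLES=Rank BY RankIndicator.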

5. Example: Dichotomizing a Continuous Variable

Problem Statement

One important use of the Recode procedure is dichotomizing or discretizing a continuous variable. Dichotomizing a continuous variable transforms a scale variable into a binary categorical variable by splitting the values into two groups based on a cut point. Discretizing a continuous variable transforms a scale variable into an ordinal categorical variable by splitting the values into three or more groups based on several cut points.

In the sample dataset, the variable CommuteTime represents the amount of time (in minutes) it takes the respondent to commute to campus. Let’s try recoding this variable into three ordinal groups:

Running the Procedure

To check your work, go to the Variable View tab in the Data Editor window. Right-click on the new CommuteLength variable and click Descriptive Statistics. This will create a quick frequency table and summary statistics of the new variable. Make sure that the new variable has the same number of missing values as the original variable. You will also want to set the value labels for the new variable before doing any analysis using this variable.

Syntax

RECODE CommuteTime (SYSMIS=SYSMIS) (Lowest thru 30=1) (60 thru Highest=3) (ELSE=2) INTO CommuteLength.
EXECUTE.

Discussion

Why didn’t we use the “Range” option to specify category 2?

The “Range” option can be used when a recoded group includes the endpoints (i.e., is defined by “greater than or equal to” AND “less than or equal to” statements). However, it should NOT be used if one or both of the endpoints is “open” (which happens if a group is defined by a “[strictly] greater than” or “[strictly] less than” statement).

Using “All other values” to define group 2 was completely dependent on us correctly accounting for all other possible categories first, including the missing values. Had we not first handled the missing values, category 2 would have included all of the cases with 30 < time < 60 and all of the cases with missing values.
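
One way to confirm that the cut points landed where intended is to look at the smallest and largest commute time inside each new group. The following is a minimal sketch using the Means procedure; it assumes CommuteLength was created with the RECODE syntax above.

* Count, minimum, and maximum of the original variable within each new group.
MEANS TABLES=CommuteTime BY CommuteLength
  /CELLS=COUNT MIN MAX.

If the recode worked, the maximum in group 1 should be no more than 30, the minimum in group 3 should be at least 60, and every value in group 2 should fall strictly between 30 and 60.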

6. Example: Discretizing a Continuous Variable with DO IF Syntax

The above example showed how to discretize a continuous variable into three categories using Recode into Different Variables. Recode into Different Variables was able to correctly account for all possible values in that situation. However, if we wanted to discretize into four or more categories, Recode into Different Variables isn’t equipped to properly define each range. We’ll illustrate this with a test case, then show how to use DO IF syntax to properly implement the desired recoding scheme.

Problem Statement

Suppose we have test scores as percentages, and want to convert those percentages to a letter grade. A typical grading scheme in the United States is:

Recall that the Range specification in Recode into Different Variables allows us to specify a range of values which includes both endpoints. With that constraint, how would we achieve a grouping that was intended to have an open endpoint? For the “D” and “C” grades, we could try specifying the ranges as [60, 69.9] -> D and [70, 79.9] -> C. This could work if scores were only recorded to one decimal place, but what would happen to a score with two decimal places, say, 69.99? Imagine a number line:

In that instance, the score 69.99 would fall into a “gap” not covered by any recoding rules. In general, your instructions to SPSS should be specified in such a way that all possible outcomes are accounted for, regardless of whether you’re using the menus or syntax.
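
The gap can be reproduced with a short sketch. The variable names pct_score and letter_code are hypothetical; the ranges are the ones discussed above.

* Hypothetical variables: pct_score is the input, letter_code is the new variable.
* A score of 69.99 matches neither range, so letter_code is system-missing for that case.
RECODE pct_score (60 thru 69.9=2) (70 thru 79.9=3) INTO letter_code.
EXECUTE.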

In the sample dataset, the variable Math represents the subjects’ scores (out of 100 points) on a math placement test. Suppose we want to recode these scores to have a letter grade using the scheme described above. Let’s use DO IF syntax to perform this recode and save the results as a new variable, MathGrade.

Running the Procedure

This computation must be done using syntax.

NOTE: This syntax has been tested in SPSS Version 22 and 23. We have found that it may not work properly in SPSS Version 20. If you are using version 20, you may need to put dashes before each COMPUTE statement inside the DO IF block.
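A DO IF block for this recode might look like the following minimal sketch. The cutoffs (60, 70, 80, 90) and the numeric codes 1 through 5 for F through A are assumptions based on a typical grading scheme, so adjust them to match your own scheme. Each ELSE IF is only evaluated when every earlier condition was false, so the upper endpoint of each range is left open and no score can fall into a gap.

```spss
* Assumed scheme: F below 60, D 60 to under 70, C 70 to under 80, B 80 to under 90, A 90 and up.
* Missing scores are handled first so that they stay missing in the new variable.
DO IF (MISSING(Math)).
  COMPUTE MathGrade = $SYSMIS.
ELSE IF (Math < 60).
  COMPUTE MathGrade = 1.
ELSE IF (Math < 70).
  COMPUTE MathGrade = 2.
ELSE IF (Math < 80).
  COMPUTE MathGrade = 3.
ELSE IF (Math < 90).
  COMPUTE MathGrade = 4.
ELSE.
  COMPUTE MathGrade = 5.
END IF.
EXECUTE.
```

Because the conditions are checked in order, a score such as 69.99 fails the test Math < 60, passes the test Math < 70, and is coded as 2.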

Output

If the recode was performed successfully, we should see the new variable in the Data Editor window.

If the new variable appeared but all of the values are missing, then there is something wrong with your code; you may have forgotten an EXECUTE statement.

We should also check the new variable to make sure the recode behaved as expected. There should be the same number of missing values that we started with, and each of the original scores should be classified into exactly one of the grade categories. We can check this with the Compare Means procedure, either through the syntax below or through the menus (Analyze > Compare Means > Means, with Math as the dependent variable and MathGrade as the layer/independent variable):

MEANS TABLES=Math BY MathGrade
  /CELLS=COUNT MIN MAX.

Remember that before you perform any further analysis with this variable, you’ll want to add value labels showing 1=’F’, 2=’D’, and so on.
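As a sketch, the value labels could be added with the VALUE LABELS command. Only 1=F and 2=D are stated above; the codes 3=C, 4=B, and 5=A are assumed here to continue the same pattern.

```spss
* Codes 3 through 5 are assumed to follow the same ordering as 1=F and 2=D.
VALUE LABELS MathGrade
  1 'F'
  2 'D'
  3 'C'
  4 'B'
  5 'A'.
```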

Recoding String Variables (Automatic Recode) in SPSS

In SPSS, recoding categorical string variables to numeric codes and converting blank strings to missing values can be done automatically using Automatic Recode. The data used in this example is available to download here.

1. Recoding Categorical String Variables to Labeled Numeric Variables using Automatic Recode

When writing down the observed values of a categorical variable, you can choose to write the data values as words or as numeric codes. Either method of recording categorical variables is valid, but it is often easier to work with numeric codes in SPSS than it is to work with strings. This is because, when referring to the content of a string during a computation, the content must match exactly. If the content of the two strings is not an exact match, the computer will not recognize them as identical. This includes placing an extra space at the end of a string: the human eye won’t detect the discrepancy, but the computer will. (Note that if your data was originally recorded in Excel, it is very easy for the values of string variables to accidentally be recorded with extra spaces at the end.)

If you have already recorded your categorical variables as strings, you can easily convert them to a labeled, numerically coded variable using the Automatic Recode procedure. This procedure assigns each unique category a numeric code, then saves the converted values as a new variable. It also automatically adds value labels: whatever the string value was before becomes the value label.

Additionally, if you have used blanks to indicate missing values for string variables, you may have noticed that SPSS doesn’t automatically recognize those observations as missing. This is because SPSS, by default, recognizes “blank” strings as valid values. In this situation, you must use Automatic Recode in order for SPSS to recognize blank strings as missing values. Otherwise, SPSS will consider the “blank” category as a valid category.

Note: Before using this procedure, you should resolve any issues with “mismatched” category strings. For example, if there are different capitalizations of the same word, you should “normalize” them to all use the same capitalization before you enter the variable into Automatic Recode. Functions like UPCASE() and LOWCASE() can perform these transformations (see the Compute Variables tutorial).  A frequency table will help determine which categories, if any, are mismatched.
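One possible way to normalize capitalization before running Automatic Recode is sketched below. The variable names Color and Color_clean and the width A20 are hypothetical and are used only for illustration. Saving the result to a new string variable leaves the original values untouched.

```spss
* Color is a hypothetical string variable with inconsistent capitalization.
* The width A20 is an assumption and should be at least as wide as the original variable.
STRING Color_clean (A20).
COMPUTE Color_clean = UPCASE(LTRIM(Color)).
EXECUTE.

* A frequency table shows whether any mismatched categories remain.
FREQUENCIES VARIABLES=Color_clean.
```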

1. Running the Automatic Recode Procedure

To open the Automatic Recode procedure, click Transform > Automatic Recode.

(A) Variable -> New Name: The original variable(s) being transformed, and the name of the new variable(s) that the results will be saved as.

(B) New Name field and Add New Name button: These fields will activate after at least one variable has been added to the Variable -> New Name box. You will need to supply a new variable name and click Add New Name for each variable being recoded.

(C) Recode Starting from: Should the new category numbering be in alphabetical order (Lowest value) or reverse alphabetical order (Highest value) with respect to the original values? This setting is applied to all of the variables being recoded.

(D) Use the same recoding scheme for all variables: When checked, the same numeric code is never re-used across variables, unless the category names are identical.

(E) Treat blank string values as user-missing: When checked, the numeric category assigned to blank strings will be set as a special missing value. This setting must be checked in order for missing values to be properly recognized.
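For reference, the dialog options described above correspond roughly to subcommands of the AUTORECODE command. The following is a sketch only: the variable names are placeholders, and you would normally include just the subcommands you need.

```spss
* VARIABLES and INTO correspond to the Variable -> New Name box.
* DESCENDING corresponds to Recode Starting from: Highest value.
* GROUP corresponds to Use the same recoding scheme for all variables.
* BLANK=MISSING corresponds to Treat blank string values as user-missing.
AUTORECODE VARIABLES=VAR00001 VAR00002
  /INTO v1 v2
  /DESCENDING
  /GROUP
  /BLANK=MISSING
  /PRINT.
```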

To automatically recode variables:

2. Example: Recognizing Blank Strings as User-Missing Values

Problem Statement

In the sample data file, the variable State is a string variable representing whether the student is an in-state student or an out-of-state student. If you create a frequency table of this variable (Analyze > Descriptive Statistics > Frequencies), you will notice something strange:

The dataset has 435 observations in all, and SPSS reports that there are zero missing values. But there is an apparently unlabeled category listed under the “Valid” categories in the frequency table that has 27 observations. This is because SPSS does not automatically recognize blanks as missing values. (Note: this behavior is different than SAS, which automatically recognizes blanks as missing values for string variables.)

In order for our analyses to be accurate, we’ll need to fix this issue.

Running the Procedure

Using the Automatic Recode Dialog Window

Using Syntax

AUTORECODE VARIABLES=State 
  /INTO state_code 
  /BLANK=MISSING 
  /PRINT. 

Output

Running the procedure will produce the following message in the Output Viewer window:

State into state_code (State Residency) 
Old Value     New Value  Value Label 
 
In state              1  In state 
Out of state          2  Out of state 
            M         3M 

This message tells us the mapping scheme that SPSS generated for the categories: “In state” became 1, “Out of state” became 2, and blanks became 3, which was set as a special missing value code. In the Variable View window, you can confirm that, in addition to copying the original string values to the value labels, SPSS also defined category 3 as “missing.”

Now when we create a frequency table for the recoded variable, it should reflect the proportion of values that are missing:

In the output, the recoded blank values are correctly counted as missing, but show their assigned numeric code in the Value Labels column in the output. You can improve the appearance of the missing value category by simply adding a value label (e.g. “Missing”) for that particular code.
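A short sketch of both steps in syntax, using the state_code variable and the missing value code 3 from the output above:

```spss
* Label the user-missing code without changing the existing labels for 1 and 2.
ADD VALUE LABELS state_code 3 'Missing'.

* Re-run the frequency table on the recoded variable.
FREQUENCIES VARIABLES=state_code.
```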

3. What if I use Automatic Recode when I’ve already defined special missing value strings?

If you recorded the missing values for a string variable using some kind of non-blank indicator (for example, 999 or -999) and have already defined that user-missing value in the Variable View window, Automatic Recode will preserve the ‘missing’ designation, but will still convert the category code to be in the range of the other categories.

Suppose that the below syntax was applied to a string variable with the valid categories “blue”, “green”, and “red”, with missing values recorded using the code “999”.

AUTORECODE VARIABLES=VAR00001
  /INTO v1
  /GROUP
  /BLANK=MISSING
  /PRINT.

Running that syntax will produce the following output:

User-missing values from VAR00001
Old Value  New Value  Value Label

blue               1  blue
green              2  green
red                3  red
999      M         4M 999

Here, you can see that observations originally coded as “999” have been recoded to the numeric indicator 4 with the value label “999”. The letter M indicates that the label or code is a missing value indicator.

4. When would I use the same recoding scheme for all variables?

We have not yet discussed the option Use the same recoding scheme for all variables. What are some reasons to use this option?

Here is a simple example of using the same recoding scheme for all variables.

Notice that both VAR00001 and VAR00002 have a category called “red”. Since both variables represent colors, it makes sense to use a single coding scheme for all of the possible color categories (i.e., the category “red” will be represented by the numeric code 4 for both variables, rather than having a different code in VAR00001 versus VAR00002). Executing the recoding syntax produces the following output:

AUTORECODE VARIABLES=VAR00001 VAR00002
  /INTO v1 v2
  /GROUP
  /PRINT.

User-missing values from VAR00001
Old Value  New Value  Value Label

blue               1  blue
green              2  green
orange             3  orange
red                4  red
violet             5  violet
999      M         6M 999

As you can see from the output, SPSS first alphabetizes all possible unique nonmissing category values across the two variables, then assigns numeric codes to each category. Notice that the “missing” category (“999”) was recoded last, even though it is alphabetically before the category names.

Weighting Cases in IBM SPSS

This SPSS tutorial shows how to use Weight Cases to apply a weighting variable, especially when your data measures counts.

1. Weighting Cases

In SPSS, weighting cases allows you to assign “importance” or “weight” to the cases in your dataset. Some situations where this can be useful include:

Weighting cases in SPSS works the same way for both situations.

To turn on case weights, click Data > Weight Cases.

2. Example: Reproducing an Existing 3×2 Table

Problem Statement

Suppose you are helping a friend with their statistics homework, and see that they have included the following write-up in their report:

You immediately notice several things wrong with this report so far:

You get the feeling that they may have used the Chi-square test of independence, but want to verify this for yourself.

Running the Procedure

Whenever you want to re-create a frequency table or crosstab, you first need to figure out how many unique combinations of the factors there are, and how many observations there were for each factor combination. In this situation, we have two variables: ClassRank (which has three levels) and PickedAMajor (which has two levels). So there are 3*2 = 6 unique factor combinations. They are:

ClassRank   PickedAMajor   Responses
Freshman    No             212
Freshman    Yes            114
Sophomore   No             171
Sophomore   Yes            168
Junior      No              92
Junior      Yes            198

When we go to enter our data in SPSS, we will need to create three new variables: ClassRank, PickedAMajor, and a frequency variable (let’s name it “Freq”). After entering the data, your Data View window should look like this:
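If you would rather enter the six rows with syntax than type them into the Data Editor, something like the following sketch should work. The numeric codes (1 to 3 for class rank, 0 and 1 for No and Yes) are an assumption made for illustration; any consistent coding would do.

```spss
* Each data line is one factor combination followed by its count.
DATA LIST FREE / ClassRank PickedAMajor Freq.
BEGIN DATA
1 0 212
1 1 114
2 0 171
2 1 168
3 0 92
3 1 198
END DATA.

* The numeric codes below are assumed for illustration.
VALUE LABELS ClassRank 1 'Freshman' 2 'Sophomore' 3 'Junior'
  /PickedAMajor 0 'No' 1 'Yes'.
```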

Now we need to weight the cases with respect to Freq. Click Data > Weight Cases.

Click Weight cases by, then double-click Freq to move it to the Frequency Variable field. Click OK.

Now we can run our crosstab and verify your friend’s results. Click Analyze > Descriptive Statistics > Crosstabs.

When the Crosstabs window opens, select the variable ClassRank in the left column and move it to the Row(s) field using the first arrow button, then select variable PickedAMajor in the left column and transfer it to the Column(s) field using the second arrow button. Doing this will reproduce the 3×2 table that your friend made.

To produce a Chi-square test of independence, click Statistics. This will open the Crosstabs: Statistics window. Select the Chi-square check box in the upper left-hand corner, then click Continue.

Click OK to run the crosstab.

Syntax

WEIGHT BY Freq.

CROSSTABS
  /TABLES=ClassRank BY PickedAMajor
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ
  /CELLS=COUNT
  /COUNT ROUND CELL.

Output

Within your output, you should see the following two tables:

The 3×2 table matches your friend’s output. From the Chi-square Tests table, we see that this test result was significant at the 5% level (χ²(2) = 68.207, p < .001). From this result, we infer that there is a significant association between a student’s class rank and whether or not they have picked a major.
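Case weights stay in effect for later procedures until they are turned off. Once you are finished with the weighted analysis, you can turn weighting off with a single command:

```spss
* Stop weighting cases by Freq so that later procedures treat each row as one case.
WEIGHT OFF.
```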

How to Rank Cases in SPSS

In its simplest form, a rank transform converts a set of data values by ordering them from smallest to largest, and then assigning a rank to each value. In SPSS, the Rank Cases procedure can be used to compute the rank transform of a variable. The data used in this example is available to download here.

1. Rank Cases

A rank variable represents the ordering of the values of a numeric variable. Because ranks are the cornerstone of many nonparametric statistical methods, it is useful to know how to compute the rank transform of a variable in your dataset.

In SPSS, rank variables can be computed using the Rank Cases procedure. To open Rank Cases, click Transform > Rank Cases.

(A) Variables: The variables to compute rank transforms on. The new ranks will be saved to new variables (whose names will be automatically generated).

(B) By: (Optional) Assign ranks within groups. By variables should be nominal or ordinal, and have a small number of categories.

(C) Assign Rank 1 to: Should ranks be assigned in increasing or decreasing order? By default, ranks are assigned by ordering the data values in ascending order (smallest to largest), then labeling the smallest value as rank 1. Alternatively, Largest value orders the data in descending order (largest to smallest), and assigns the largest value the rank of 1.

(D) Display summary tables: When checked, a summary of the new rank variables is printed to the Output window. The summary includes the original variables, the name of the new variables, the rank order, the ranking method, and the method used for ties. This option is on by default.


(E) Rank types: (Optional) Choose one or more formulas to compute the ranks. Each box you check on this screen will add another rank variable to your dataset.

By default, only the “Rank” option is selected; this computes simple ranks. For details about the other rank types and the proportion estimation formulas, please see the official SPSS documentation for Rank Cases. Note that the Proportion Estimation Formula options are inactive unless Proportion estimates and/or Normal scores are selected.

(F) Ties: How should ranks be assigned in the case of ties? (A tie occurs when two or more observations share the exact same value.) There are four options for how to resolve ties: Mean, Low, High, and Sequential ranks to unique values. By default, mean ranks are assigned to ties.
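As a sketch of how these options look in syntax, the command below ranks a variable within groups, assigns rank 1 to the largest value, and gives tied values the lowest rank. It assumes the dataset contains the variables MileMinDur and Gender, and the new variable name MileRankG is hypothetical.

```spss
* (D) assigns rank 1 to the largest value, and BY ranks separately within each group.
* INTO names the new rank variable instead of letting SPSS generate a name.
RANK VARIABLES=MileMinDur (D) BY Gender
  /RANK INTO MileRankG
  /PRINT=YES
  /TIES=LOW.
```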

2. Example: Rank Transforms for Non-Normal Data

Many hypothesis tests require assumptions about the distribution of the data or residuals. A common way to adjust for non-normality is to perform a transform on that variable; for example, taking the log, square root, or square of a variable. Rank transforms are another type of transform. Suppose we want to perform a rank transform on a variable in the sample dataset that is non-normally distributed: MileMinDur.

Before the Procedure

Before we compute the ranks, let’s check how many nonmissing values MileMinDur has. Let’s also check the distribution of the mile run times graphically. The Frequencies procedure makes it easy to do both of these things at once:
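One way to request this from syntax is sketched below. The Statistics table reports the number of valid and missing values, and the histogram shows the shape of the distribution. Suppressing the frequency table with NOTABLE is optional and is an assumption about what output you want.

```spss
* NOTABLE hides the long frequency table while keeping the N valid and N missing summary.
FREQUENCIES VARIABLES=MileMinDur
  /FORMAT=NOTABLE
  /HISTOGRAM.
```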

There are two important things we want to take note of:

This means that after we run the Rank Cases procedure, the resulting variable will only have assigned ranks for the 392 cases with nonmissing mile run times.

Running the Procedure

Syntax

RANK VARIABLES=MileMinDur (A)
  /RANK
  /PRINT=YES
  /TIES=MEAN.

Output

After executing the procedure, SPSS will add a new variable at the end of your dataset, and will print a table summarizing the computation in the Output window:

This table summarizes what the Rank Cases procedure did. It created a new variable named RMileMin, and assigned it the variable label “Rank of MileMinDur”. It ranked the values in ascending order (i.e., the smallest value has rank 1), and it used the mean rank for values with ties.

We can inspect the new variable using the Descriptives procedure to get the sample size, minimum, maximum, mean, and standard deviation of the new variable:

Notice that we have the same sample size as the original variable (392).
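The Descriptives check described above can be run with syntax along these lines, using the variable name RMileMin that SPSS generated:

```spss
* N in the output should match the number of nonmissing values of MileMinDur.
DESCRIPTIVES VARIABLES=RMileMin
  /STATISTICS=MEAN STDDEV MIN MAX.
```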

Sorting cases or variables in SPSS

Sorting a dataset rearranges the rows with respect to one or more variables. This tutorial discusses how to sort data using the drop-down menus in SPSS. The data used in this example is available to download here.

1. Sorting Data

Sorting data allows us to re-organize the data in ascending or descending order with respect to a specific variable. Some procedures in SPSS require that your data be sorted in a certain way before the procedure will execute. There are two options for sorting data:

We cover how to perform each sorting option below.

2. Sort Cases

Sorting cases will rearrange the rows based on a given variable (or variables). The values for the selected variables can be sorted in ascending (smallest to largest, or alphabetical) or descending order (largest to smallest, or reverse alphabetical).

Once you sort the cases of a dataset, it is not possible to “un-sort” the data to its original order. If the original order of your rows is important, make sure you have a variable that specifically and uniquely identifies the correct order of the cases first! That way, you can return to the original row order by sorting on the “order identifier” variable.

If you do not have an “order ID” variable in your dataset, a convenient way to generate one is to use the system variable $CASENUM. You can do this with the Compute Variables procedure (simply enter $CASENUM in the Numeric Expression box), or by running the following syntax after all of your data has been entered:

COMPUTE id=$CASENUM.
EXECUTE.
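To return to the original row order later, sort on that identifier. A minimal sketch, assuming the identifier was named id as in the COMPUTE statement above:

```spss
* Restore the original row order by sorting on the saved case number.
SORT CASES BY id (A).
```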

Sorting from the Data View

In the Data View, you can quickly sort your data with respect to a single variable by right-clicking on the variable name and selecting Sort Ascending or Sort Descending.

Sorting with the Sort Cases procedure

If you want to sort your data with respect to two or more variables, or if you want to have the sorted data written to a new file, you’ll want to use the Sort Cases procedure:

You can check that your cases were sorted correctly by visually inspecting the data in Data View. In this example, cases have been sorted according to class rank, and then sorted by most to least recent birthdate.

Note: SPSS considers missing values the “smallest” value, so they will appear first if sorting in ascending order, and will appear last if sorting in descending order.

Sorting with Syntax

This syntax performs the same sort shown in the screencap above (sort by ascending class rank, then descending birth date).

SORT CASES BY Rank(A) bday(D).

3. Sort Variables

Sorting variables will rearrange the order of the variables (columns) in your data. Variables can be sorted on only one attribute at a time: Name, Type, Width, Decimals, Label, Values, Missing, Columns, Align, Measure, or a custom attribute. Variables can be sorted in ascending or descending order with respect to the selected attribute.

To sort variables, follow these steps:

Now your variables will be sorted according to the attribute you selected. In this example, the variables are sorted in ascending order according to their names (i.e., alphabetically).

Syntax

SORT VARIABLES BY NAME (A).

Grouping or Splitting Data in SPSS

In SPSS, Split File is used to run statistical analyses on subsets of data without separating your data into two different files.

1. Grouping or Splitting Data

When analyzing data, it is sometimes useful to temporarily “group” or “split” your data in order to compare results across different subsets. This can be useful when you want to compare frequency distributions or descriptive statistics with respect to the categories of some variable (e.g., Gender) – especially if you want separate tables of results for each group.

To split your dataset, click Data > Split File.

SPSS Version 25 Drop-Down Menu

SPSS Version 22 Drop-Down Menu

The Split File window will appear. By default, the dataset is not split according to any criteria; this is indicated by Analyze all cases, do not create groups.

You can choose one of two ways to split the data:

For both splitting methods, there are two considerations to be made:

2. Turning Off Split File

When you no longer want to split your analyses by group, you can turn Split File off through the same window you used to turn it on.

You can now run all analyses normally again.

Syntax

SPLIT FILE OFF.

Example

What are the differences in the split file options?

The Compare and Organize options produce numerically identical results when the same grouping variable(s) are applied. This is true regardless of what statistical analysis is used. The difference between the two options is how the numeric results are presented.

The choice of which splitting method to use is entirely about what format the user wants their results in. Do you want a single table with all results, or separate tables for each group’s results? A good rule of thumb is to choose Compare Groups if you want to be able to directly compare the results of your groups, and to choose Organize Output by Groups if the information is from separate trials or samples (such as cohorts from different years).

Problem Statement

Suppose that we want to get a summary of the differences in height between males and females in the sample data. Let’s couple the Split File procedure with the Descriptives procedure to get summary statistics for the two groups. We’ll use both Split File methods so that we can compare what their outputs look like.

Splitting using Compare Groups

If you choose to split your data using the Compare groups option and then run a statistical analysis in SPSS, your output will be displayed in a single table that organizes the results according to the grouping variable(s) you specified.

Running the Procedure

To split the data in a way that will facilitate group comparisons:

After splitting the file, the only change you will see in the Data View is that data will be sorted in ascending order by the grouping variable(s) you selected.

Now let’s view the aforementioned descriptive statistics for the variable Height with respect to Gender. Select Analyze > Descriptive Statistics > Descriptives. Double click on the Height variable, then click OK.

Syntax

SORT CASES BY Gender.

SPLIT FILE LAYERED BY Gender.

DESCRIPTIVES VARIABLES=Height
  /STATISTICS=MEAN STDDEV MIN MAX.

Output

This table gives us a breakdown of how many observations were in each group (N), and the minimum, maximum, average, and standard deviation of each group. The ‘.’ group contains cases with missing gender values and nonmissing height values. At a glance, we can quickly take note that in this sample:

Note: This combination of Split File: Compare Groups with Descriptives is very similar to what you would get with the Compare Means procedure. The major difference is that Split File includes the missing values in the grouping/splitting variable, whereas Compare Means excludes missing values in the grouping variable.
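For comparison, a Compare Means version of the same summary might look like the sketch below. Split File is turned off first so that the two grouping mechanisms are not combined.

```spss
* Turn off any active split before using Gender as the grouping variable in MEANS.
SPLIT FILE OFF.

* Cases with missing Gender are excluded from this table.
MEANS TABLES=Height BY Gender
  /CELLS=COUNT MEAN STDDEV MIN MAX.
```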

Splitting using Organize Output by Groups

If you choose to split your data using the Organize output by groups option and then run a statistical analysis in SPSS, your output will be broken into separate tables for each category of the grouping variable(s) specified.

Running the Procedure

To split the data in a way that separates the output for each group:

After splitting the file, the only change you will see in the Data View is that data will be sorted in ascending order by the grouping variable(s) you selected.

Now we will re-run the same descriptive statistics procedure that we ran before. You can go through the menu system again (Analyze > Descriptive Statistics > Descriptives), or you can click on the Recall recently used dialogs icon, which will bring up a list of recently used procedures:

Syntax

SORT CASES BY Gender.

SPLIT FILE SEPARATE BY Gender.

DESCRIPTIVES VARIABLES=Height
  /STATISTICS=MEAN STDDEV MIN MAX.

Output

After re-running the descriptive statistics, we see that the output is broken into three sections based on values of the Gender variable. The first section (“Gender = .”) reports the minimum, maximum, average, and standard deviation of Height for the students who had missing values for Gender. The second section reports those same statistics for the male students; the third section reports the statistics for the females.