Data Coding, Entry, and Checking in SPSS: Code Data for Data Entry

1. Guidelines for Data Coding

Coding is the process of assigning numbers to the values or levels of each variable. Before starting the coding process, we want to present some broad suggestions or rules to keep in mind as you proceed. These suggestions are adapted from rules proposed in Newton and Rudestam’s (1999) useful book entitled Your Statistical Consultant. We believe that our suggestions are appropriate, but some researchers might propose alternatives, especially for guidelines 1, 2, 4, 5, and 7.

  1. All data should be numeric. Even though it is possible to use letters or words (string variables) as data, it is not desirable to do so. For example, we could code gender as M for male and F for female, but in order to do most statistics you would have to convert the letters or words to numbers. It is easier to do this conversion before entering the data into the computer, as we have done with the HSB data set (see Fig. 1.3). You will see in Fig. 2.3 that we decided to code females as 1 and males as 0. This is called dummy coding. In essence, the 0 means “not female.” Dummy coding is useful both for some types of analyses and for obtaining descriptive statistics. For example, the mean of a variable coded this way is the proportion of participants who fall in the category coded as “1” (multiply by 100 to get the percentage). We could, of course, code males as 1 and females as 0, or we could code one gender as 1 and the other as 2. However, it is crucial that you be consistent in your coding (e.g., for this study, all males are coded 0 and all females 1) and that you have a way to remind yourself and others of how you did the coding. Later in this chapter, we show how you can provide such a record, called a codebook or dictionary.
  2. Each variable for each case or participant must occupy the same column in the Data Editor. It is important that data from each participant occupy only one line (row), and each column must contain data on the same variable for all the participants. The data editor, into which you will enter data, facilitates this by putting the short variable names that you choose at the top of each column, as you saw in Chapter 1, Fig. 1.3. If a variable is measured more than once (e.g., pretest and posttest), it will be entered in two columns with somewhat different names, such as mathpre and mathpost.
  3. All values (codes) for a variable must be mutually exclusive. That is, only one value or number can be recorded for each variable. Some items, like our item 6 in Fig. 2.3, allow for participants to check more than one response. In that case, the item should be divided into a separate variable for each possible response choice, with one value of each variable (usually 1) corresponding to yes (i.e., checked) and the other to no (usually 0, for not checked). For example, item 6 becomes variables 6, 7, and 8 (see Fig. 2.3). Items should be phrased so that persons would logically choose only one of the provided options, and all possible options should be provided. A final category labeled “other” may be provided in cases where all possible options cannot be listed, but these “other” responses are usually quite diverse and thus may not be very useful for statistical purposes.
  4. Each variable should be coded to obtain maximum information. Do not collapse categories or values when you set up the codes for them. If needed, let the computer do it later. In general, it is desirable to code and enter data in as detailed a form as available. Thus, enter actual test scores, ages, GPAs, and so forth, if you know them. It is good practice to ask participants to provide information that is quite specific. However, you should be careful not to ask questions that are so specific that the respondent may not know the answer or may not feel comfortable providing it. For example, you will obtain more information by asking participants to state their GPA to two decimals (as in Figs. 2.1 and 2.2) than by asking them to select from a few broad categories (e.g., less than 2.0, 2.0-2.49, 2.50-2.99, etc.). However, if students don’t know their GPA or don’t want to reveal it precisely, they may leave the question blank or write in a difficult-to-interpret answer, as discussed later. These issues might lead you to provide a number of categories, each with a relatively narrow range of values, for variables such as age, weight, and income. Never collapse such categories before you enter the data into the data editor. For example, if you have age categories for university undergraduates of 16-17, 18-20, 21-23, and so forth, and you realize that there are only a few students younger than 18, keep the codes as they are for now. Later you can make a new category of 20 or younger by using the Transform => Recode function. If you collapse categories before you enter the data, the extra information will no longer be available.
  5. For each participant, there must be a code or value for each variable. These codes should be numbers, except for variables for which the data are missing. We recommend using blanks when data are missing or unusable because SPSS is designed to handle blanks as missing values. However, sometimes you may have more than one type of missing data, such as items left blank and those that had an answer that was not appropriate or usable. In this case you may assign numeric codes such as 98 and 99 to them, but you must tell the program that these codes are for missing values, or it will treat them as actual data.
  6. Apply any coding rules consistently for all participants. This means that if you decide to treat a certain type of response as, say, missing for one person, you must do the same for all other participants.
  7. Use high numbers (values or codes) for the “agree,” “good,” or “positive” end of a variable that is ordered. Sometimes you will see questionnaires that use 1 for “strongly agree,” and 5 for “strongly disagree.” This is not wrong as long as you are clear and consistent. However, you are less likely to get confused when interpreting your results if high values have a positive meaning.
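
Several of these guidelines can be illustrated outside SPSS with a short sketch. The Python below is not part of the original materials; the participants, the variable names (female, item_a, age_group, and so on), and the meaning given to codes 98 and 99 are all invented for illustration. It shows dummy coding (guideline 1), splitting a check-all-that-apply item into one 0/1 variable per option (guideline 3), collapsing categories only after entry (guideline 4), and declaring missing-value codes so that they are not treated as data (guideline 5).

```python
from statistics import mean

# Hypothetical raw responses; every value here is invented for illustration.
MISSING_CODES = {98, 99}          # assumed: 98 = left blank, 99 = unusable answer
GENDER_CODES = {"M": 0, "F": 1}   # dummy coding: 0 means "not female"
OPTIONS = ("a", "b", "c")         # choices of a check-all-that-apply item

participants = [
    {"id": 1, "gender": "F", "checked": {"a", "c"}, "age_group": 1, "gpa": 3.25},
    {"id": 2, "gender": "M", "checked": {"b"}, "age_group": 2, "gpa": 99},
    {"id": 3, "gender": "F", "checked": set(), "age_group": 3, "gpa": 2.80},
    {"id": 4, "gender": "F", "checked": {"a"}, "age_group": 2, "gpa": 98},
]

def code_row(raw):
    """Turn one questionnaire into one row of numeric codes."""
    row = {"id": raw["id"], "female": GENDER_CODES[raw["gender"]]}
    # Guideline 3: one 0/1 variable per response option.
    for opt in OPTIONS:
        row[f"item_{opt}"] = 1 if opt in raw["checked"] else 0
    # Guideline 4: keep the detailed age categories at entry time.
    row["age_group"] = raw["age_group"]
    # Guideline 5: declared missing codes become blanks (None), not data.
    row["gpa"] = None if raw["gpa"] in MISSING_CODES else raw["gpa"]
    return row

coded = [code_row(p) for p in participants]

# Guideline 1: the mean of a 0/1 variable is the proportion coded 1.
proportion_female = mean(r["female"] for r in coded)

# Guideline 4: collapse later, in a new variable, leaving the original intact.
# Assumed groups: 1 = 16-17, 2 = 18-20, 3 = 21-23; new code 1 = 20 or younger.
AGE_RECODE = {1: 1, 2: 1, 3: 2}
for r in coded:
    r["age_collapsed"] = AGE_RECODE[r["age_group"]]

# Guideline 5: statistics use only the non-missing values.
mean_gpa = mean(r["gpa"] for r in coded if r["gpa"] is not None)
```

If the codes 98 and 99 were left in the GPA column, they would be averaged as if they were real GPAs, which is exactly the error guideline 5 warns about.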

2. Make a Coding Form

Now you need to make some decisions about how to code the data provided in Figs. 2.1 and 2.2, especially data that are not already in numerical form. When the responses provided by participants are numbers, the variable is said to be “self-coding.” You can just enter the number that was circled or checked. On the other hand, variables such as gender or college have no intrinsic numeric value associated with them, so you must decide which number to assign to each category. See Fig. 2.3 for the decisions we made about how to number the variables, code the values, and name the eight variables. Don’t forget to number each of the questionnaires so that you can later check the entered data against the questionnaires.
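
One way to picture a coding form is as a lookup table from each variable to its allowed codes, which can then be used to flag entries that need to be checked against the numbered questionnaires. The Python sketch below is only an illustration under stated assumptions: the gender codes (0 = male, 1 = female) come from the text, while the variable names college and item6_opt1, the college labels, and the entered rows are hypothetical and do not reproduce Fig. 2.3.

```python
# Hypothetical codebook: only the gender codes (0 = male, 1 = female) come
# from the text; every other name, label, and code is invented.
CODEBOOK = {
    "gender": {0: "male", 1: "female"},
    "college": {1: "college A", 2: "college B", 3: "college C"},
    "item6_opt1": {0: "not checked", 1: "checked"},
}

def find_invalid(row):
    """Return names of coded variables whose entry is not a listed code.

    None stands for a blank (missing) cell and is always accepted.
    """
    return [
        name
        for name, codes in CODEBOOK.items()
        if row.get(name) is not None and row[name] not in codes
    ]

# Each row carries the questionnaire number so it can be traced back to paper.
entered = [
    {"id": 1, "gender": 1, "college": 2, "item6_opt1": 0},
    {"id": 2, "gender": 3, "college": None, "item6_opt1": 1},  # 3 is a typo
]

# Map questionnaire number -> variables that need to be rechecked.
problems = {row["id"]: find_invalid(row) for row in entered}
```

Here questionnaire 2 is flagged for gender because 3 is not a defined code, while its blank college entry is accepted as missing.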

Fig. 2.3. A blank survey showing how to code the data.

Source: Morgan, George A., Leech, Nancy L., Gloeckner, Gene W., & Barrett, Karen C. (2012). IBM SPSS for Introductory Statistics: Use and Interpretation (5th ed.). Routledge.
