Exploratory Data Analysis (EDA) with SPSS

1. What Is EDA?

After the data are entered into SPSS, the first step to complete (before running any inferential statistics) is EDA, which involves computing various descriptive statistics and graphs. Exploratory data analysis is used to examine and get to know your data. Chapters 2 and 3, and especially this chapter, focus on ways to do exploratory data analysis. EDA is important for several reasons:

  1. To see if there are problems in the data such as outliers, non-normal distributions, problems with coding, missing values, and/or errors from inputting the data.
  2. To examine the extent to which the assumptions of the statistics that you plan to use are met.

In addition to these two reasons, which are discussed in this chapter, one could also do EDA for other purposes, such as:

  1. To get basic information regarding the demographics of subjects to report in the Method or Results section.
  2. To examine relationships between variables to determine how to conduct the hypothesis-testing analyses. For example, correlations can be used to see if two or more variables are so highly related that they should be combined for further analyses and/or if only one of them should be included in the central analyses. In Chapter 5, we create parents’ education by combining father’s education and mother’s education because the two are quite highly correlated.
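The book carries out this kind of correlation check through the SPSS menus; as an illustrative sketch of the underlying computation, the Python fragment below uses made-up education scores (not the actual HSB data) to compute Pearson's r and, if the two measures overlap enough, average them into a single score.

```python
# Sketch: checking whether two variables are correlated highly enough to be
# combined, as described above. Scores are hypothetical, not from hsbdata.sav.
import math

fathers_ed = [10, 12, 12, 14, 16, 16, 18, 20]   # years of schooling (made up)
mothers_ed = [ 8, 12, 13, 12, 16, 14, 18, 19]

n = len(fathers_ed)
mx = sum(fathers_ed) / n
my = sum(mothers_ed) / n
# Pearson r: covariance of the two variables over the product of their spreads.
r = (sum((x - mx) * (y - my) for x, y in zip(fathers_ed, mothers_ed))
     / math.sqrt(sum((x - mx) ** 2 for x in fathers_ed)
                 * sum((y - my) ** 2 for y in mothers_ed)))
print(f"r = {r:.2f}")

# If r is high, the two measures largely overlap, so averaging them into one
# "parents' education" score (as Chapter 5 does in SPSS) is reasonable.
parents_ed = [(f + m) / 2 for f, m in zip(fathers_ed, mothers_ed)]
```

With these made-up scores r comes out well above .90, which is the situation the text describes: the two variables are so highly related that one combined variable serves better in the central analyses.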

2. How to Do EDA

There are two general methods used for EDA: generating plots of the data and generating numbers from your data. Both are important and can be very helpful methods of investigating the data. Descriptive statistics (including the minimum, maximum, mean, standard deviation, and skewness), frequency distribution tables, boxplots, histograms, and stem and leaf plots are a few procedures used in EDA.
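In SPSS these numbers come from the Descriptives or Explore procedures; purely as a sketch of what those procedures compute, the fragment below calculates the statistics named above for one made-up variable. Note that the simple skewness formula used here differs slightly from the adjusted version SPSS reports.

```python
# Sketch: the descriptive statistics named above (minimum, maximum, mean,
# standard deviation, skewness) for a single hypothetical variable.
import statistics

scores = [55, 61, 64, 65, 67, 68, 70, 72, 75, 98]  # note the one high score

n = len(scores)
mean = statistics.mean(scores)
sd = statistics.stdev(scores)          # sample SD (n - 1 denominator)
# Simple (uncorrected) skewness; SPSS applies a small-sample adjustment,
# so its value would differ slightly.
skew = (sum((x - mean) ** 3 for x in scores) / n) / sd ** 3

print(f"N={n}  min={min(scores)}  max={max(scores)}  "
      f"mean={mean:.1f}  SD={sd:.1f}  skew={skew:.2f}")
```

The single high score of 98 pulls the skewness above +1, the kind of signal that the EDA steps below are designed to catch before any inferential tests are run.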

After collecting data and inputting them, many students jump immediately to inferential statistics (e.g., t tests and ANOVAs). Don’t do this! There are often errors or problems with the data that need to be located and either fixed or at least noted before doing any inferential statistics.

At this point, you are probably asking “Why?” or thinking “I’ll do that boring descriptive stuff later, while I am writing the Methods section.” Wait! Being patient now can prevent many problems down the road.

In the next two sections, we discuss checking for errors and checking assumptions. Some of this discussion reviews material presented in Chapters 2 and 3, but it is so important that it is worth repeating.

3. Check for Errors

There are many ways to check for errors. For example:

  1. Look over the raw data (questionnaires, interviews, or observation forms) to see if there are inconsistencies, double coding, obvious errors, and so forth. Do this before entering the data into the computer.
  2. Check some, or preferably all, of the raw data (e.g., questionnaires) against the data in your Data Editor file to be sure that errors were not made in the data entry.
  3. Compare the minimum and maximum values for each variable in your Descriptives output with the allowable range of values in your codebook.
  4. Examine the means and standard deviations to see if they look reasonable, given what you know about the variables.
  5. Examine the N column to see if any variables have a lot of missing data, which can be a problem when you do statistics with two or more variables. Missing data could also indicate that there was a problem in data entry.
  6. Look for outliers (i.e., extreme scores) in the data.
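Checks 3 and 5 above (comparing observed values against the codebook range and counting missing data) are done in SPSS by reading the Descriptives output; as a sketch of the same logic, the fragment below screens a hypothetical variable, with None marking a missing entry and an invented codebook range.

```python
# Sketch of checks 3 and 5 above: compare each value with the allowable
# range from the codebook, and count missing values.
# The data and the codebook range are hypothetical.
math_ach = [5.0, 7.5, None, 12.0, 14.5, 3.0, 99.0, 10.5]  # None = missing
LOW, HIGH = -8.33, 25.0   # allowable range from a hypothetical codebook

valid = [x for x in math_ach if x is not None]
print("N =", len(valid), " missing =", len(math_ach) - len(valid))
print("min =", min(valid), " max =", max(valid))

# Any value outside the codebook range points to a coding or data-entry error.
out_of_range = [x for x in valid if not LOW <= x <= HIGH]
print("out of range:", out_of_range)
```

Here the maximum of 99.0 falls outside the allowable range, exactly the kind of entry error (perhaps a missing-value code typed into the wrong column) that comparing Descriptives output against the codebook is meant to catch.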

4. Statistical Assumptions and Checking Assumptions

Statistical Assumptions

Every inferential statistical test has assumptions. Statistical assumptions are much like the directions for appropriate use of a product found in an owner’s manual. Assumptions explain when it is and isn’t reasonable to perform a specific statistical test. When the t test was developed, for example, the person who developed it needed to make certain assumptions about the distribution of scores in order to be able to calculate the statistic accurately. If the assumptions are not met, the value that the program calculates (which tells the researcher whether or not the results are statistically significant) will not be completely accurate and may even lead the researcher to draw the wrong conclusion about the results. Inferential statistics and their assumptions are described in Chapters 7-10.

Parametric tests. These include most of the familiar ones (e.g., t test, ANOVA, correlation). They usually have more assumptions than do nonparametric tests. Parametric tests were designed for data that have certain characteristics, including approximately normal distributions.

Some parametric statistics have been found to be “robust” with regard to one or more of their assumptions. Robust means that the assumption can be violated quite a lot without damaging the validity of the statistic. For example, one assumption of the t test and ANOVA is that the dependent variable is normally distributed for each group. Statisticians who have studied these statistics have found that even when data are not normally distributed (e.g., are skewed), these tests can still be used under many circumstances.

Nonparametric tests. These tests (e.g., chi-square, Mann-Whitney U, Spearman rho) have fewer assumptions and often can be used when the assumptions of a parametric test are violated. For example, they do not require normal distributions of variables or homogeneity of variances.

Check Assumptions

Homogeneity of variances. Both the t test and ANOVA may be affected if the variances (standard deviations squared) of the groups to be compared are substantially different. Thus, this is a critical assumption to meet or correct for. Fortunately, the program provides Levene’s test to check this assumption, and it offers ways to adjust the results if the variances are significantly different.
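SPSS reports Levene's test automatically; to sketch what that test actually examines, the fragment below computes the Levene W statistic directly for two hypothetical groups. The idea is to run a one-way ANOVA not on the scores themselves but on each score's absolute deviation from its group mean, so a large W signals unequal spreads. (SPSS also reports a p value, which requires the F distribution and is omitted here.)

```python
# Sketch of Levene's test: an ANOVA on the absolute deviations of each score
# from its group mean. Groups and scores are made up.
import statistics

groups = [
    [12.0, 14.0, 15.0, 13.0, 16.0],   # hypothetical group 1 (tight spread)
    [ 5.0, 19.0, 11.0, 22.0,  8.0],   # hypothetical group 2 (wide spread)
]

# Transform each score into its absolute deviation from its own group mean.
z = [[abs(x - statistics.mean(g)) for x in g] for g in groups]

k = len(z)                              # number of groups
n = sum(len(g) for g in z)              # total N
grand = statistics.mean([v for g in z for v in g])

between = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in z) / (k - 1)
within = sum((v - statistics.mean(g)) ** 2 for g in z for v in g) / (n - k)
w = between / within
print(f"Levene W = {w:.2f}")   # a large W suggests unequal variances
```

With these made-up groups W is large, so a researcher would use the adjusted ("equal variances not assumed") results that the program provides alongside the standard ones.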

Normality. As mentioned previously, many parametric statistics assume that certain variables are distributed approximately normally. That is, the frequency distribution would look like a symmetrical bell-shaped or normal curve, with most subjects having values in the middle range and with similarly small numbers of participants having very high and very low scores. A distribution that is asymmetrical, with more high than low scores (or vice versa), is skewed. Thus, it is important to check skewness. There are also several other ways to check for normality, some of which are presented in Chapter 3. In this chapter, we look in detail at one graphic method, boxplots. However, remember that the t test (if two-tailed) and ANOVA are quite robust to violations of normality.
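A boxplot summarizes a distribution with its quartiles and flags as potential outliers any scores beyond the "whisker" fences (conventionally 1.5 interquartile ranges past the quartiles). As a sketch of what SPSS draws, the fragment below computes those same numbers for a made-up variable.

```python
# Sketch: the numbers behind a boxplot (quartiles, fences, flagged outliers),
# computed for a hypothetical variable.
import statistics

scores = [35, 41, 44, 46, 48, 50, 52, 55, 57, 60, 63, 89]

q1, median, q3 = statistics.quantiles(scores, n=4)  # quartiles
iqr = q3 - q1
lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(f"Q1={q1}  median={median}  Q3={q3}")
# Scores beyond the fences are the circles/asterisks on an SPSS boxplot.
outliers = [x for x in scores if x < lo_fence or x > hi_fence]
print("potential outliers:", outliers)
```

Note that quartiles can be defined in several slightly different ways; the exact fence values here may therefore differ a little from what SPSS draws, but any extreme score will be flagged by both.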

Check other assumptions of the specific statistic. In later chapters, we discuss other assumptions as they are relevant to the problem posed.

The type of variable you are exploring (whether it is nominal, ordinal, dichotomous, or normal/scale) influences the type of exploratory data analysis (EDA) you will want to do. Thus, we have divided the problems in this chapter by the measurement levels of the variable because, for some types of variables, certain descriptive statistics or plots will not make sense (e.g., a mean for a nominal variable, or a boxplot for a dichotomous variable). Remember that the researcher has labeled the type of measurement as either nominal, ordinal, or scale when completing the Data Editor Variable View. Because the researcher is the one making these decisions, the label can and should change if the results of the EDA indicate that the variables are not labeled correctly. Remember also that we decided to label dichotomous variables as nominal, and variables that we assumed were normally distributed were labeled as scale.

For all the problems in Chapter 4, you will be using the HSB data file.

  • Retrieve hsbdata.sav from the Web site. It is desirable to make a working copy of this file. See Appendix A for instructions if you need help with this or getting started. Appendix A also shows how to set your computer to print the syntax.

Source: Morgan, G. A., Leech, N. L., Gloeckner, G. W., & Barrett, K. C. (2012). IBM SPSS for Introductory Statistics: Use and Interpretation (5th ed.). Routledge.
