Randomness of Missing Data with SPSS

Are data missing completely at random (MCAR)? If data are MCAR, then whether or not a value is missing (missingness) is not related systematically to the values of that variable or any other variables (see Little & Rubin, 2002). If such a condition holds, then the only problem created by missing data is reduction of power. However, it is not possible to determine whether missingness on variable y is related to the values of y since the values for y are missing for everyone who has missing data for y! Thus, one can never be assured that missingness is random. Moreover, most behavioral scientists recognize that MCAR is almost never true when large amounts of data are missing. More typically, data are missing because of specific factors.

For example, longitudinal studies can be expected to have fewer participants at later time points than at the first time point, for a variety of reasons. Participants may be missing because they moved, and people do not usually move for random reasons. Certain groups are more mobile than others, and people typically move for specific reasons such as job loss, divorce, military transfer, or job transfer. Yet employment and marital status, being in the military or in a company that requires frequent transfers, and so on are associated with many differences that are relevant to behavioral science.

Participants who did not move may decline further participation in a longitudinal study, and this is also unlikely to be random with respect to the study's measures: they may have had a bad experience, another child might have been born so they were too busy, an at-home parent may have begun working and become too busy, or a child participant may have begun having difficulties at school or home.

Similarly, intervention studies are likely to have more participant drop-out in the non-intervention condition if non-intervention is associated with escalating difficulties or in the intervention condition if the intervention is unpleasant or time-consuming or creates difficulties. Yet, those who drop out for such reasons are very important to include in outcomes. If data are not MCAR and one simply eliminates from analyses all participants who have at least one missing datum (listwise deletion, which is the default for many SPSS programs), then results will not accurately represent the population.

Are data missing at random (MAR)? A term that sounds similar to MCAR, but is quite different, is data missing at random (MAR). MAR means that although the data are not missing for completely random reasons, missingness of variable y is not related to values of y; it is related only to other variables in the dataset (or, at least, once you have taken into account the other variables, missingness is unrelated to values of the target variable; see Little & Rubin, 2002). So, in contrast to MCAR, which requires that missingness be unrelated to any variables in the dataset, MAR requires only that missingness be unrelated to the variable that is missing the data. Because it is acceptable for missingness to be related to other variables in the dataset besides the one being imputed, one should be able to impute the missing values of the target variable from those other variables. Once one has modeled or imputed the missing data from the other variables, the resulting dataset should be only randomly different from the complete dataset. If data are MAR, then it is reasonable to impute them, as we will do in this chapter.
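To make the idea of imputing y from other variables concrete, here is a minimal Python sketch. It is not the algorithm SPSS uses, and everything in it is an assumption made for illustration: the data are simulated, the names x, y_full, and y_obs are hypothetical, and the imputation is a simplified stochastic regression imputation (predicted value plus a random residual) repeated m = 5 times. A full multiple imputation procedure would also account for uncertainty in the regression coefficients.

```python
import random
import statistics

rng = random.Random(42)

# Simulated data (an assumption for illustration): y depends on x,
# and y is missing more often when x > 0. Missingness depends only
# on x, never on y itself, so the data are MAR.
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
y_full = [2.0 + xi + rng.gauss(0, 1) for xi in x]
y_obs = [yi if rng.random() >= (0.6 if xi > 0 else 0.1) else None
         for xi, yi in zip(x, y_full)]

# Fit y = a + b*x by ordinary least squares on the complete cases.
pairs = [(xi, yi) for xi, yi in zip(x, y_obs) if yi is not None]
mean_x = statistics.fmean(p[0] for p in pairs)
mean_y = statistics.fmean(p[1] for p in pairs)
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in pairs)
     / sum((xi - mean_x) ** 2 for xi, _ in pairs))
a = mean_y - b * mean_x
resid_sd = statistics.stdev(yi - (a + b * xi) for xi, yi in pairs)


def impute_once():
    """Return one completed copy of y: observed values are kept, and
    missing values are replaced by a prediction plus random noise."""
    return [yi if yi is not None else a + b * xi + rng.gauss(0, resid_sd)
            for xi, yi in zip(x, y_obs)]


m = 5  # number of imputed datasets
imputed_means = [statistics.fmean(impute_once()) for _ in range(m)]
pooled_mean = statistics.fmean(imputed_means)

full_mean = statistics.fmean(y_full)  # known only because we simulated
complete_case_mean = mean_y           # what listwise deletion would report

print(f"full data mean:      {full_mean:.3f}")
print(f"complete-case mean:  {complete_case_mean:.3f}")
print(f"pooled imputed mean: {pooled_mean:.3f}")
```

In this simulation the complete-case mean falls well below the full-data mean because cases with high x (and therefore high y) are missing more often, whereas the pooled mean from the imputed datasets should land close to the full-data mean. That is the sense in which imputation from other variables is reasonable under MAR.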

Are data not missing at random (NMAR)? Not missing at random means that the missingness of the variable is systematically related to scores on the very variable that has missing data. If this is the case, the assumptions of multiple imputation are not satisfied, and imputation could introduce some bias. Some would argue that one should still impute the values because listwise or pairwise deletion is likely to bias results to an even greater extent under these circumstances. Ways of dealing directly with NMAR missing data are quite complex and not available in SPSS.
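The three mechanisms can be contrasted with a small simulation. The sketch below is an illustration under assumed settings, not part of any SPSS procedure: the variables x and y, the missingness probabilities (0.3, 0.6, and 0.1), and the function names simulate and bias are all hypothetical. For each mechanism it reports how far the mean of the observed y values is from the mean of all y values, first overall and then within the subgroup with x > 0.

```python
import random
import statistics


def simulate(mechanism, n=20000, seed=1):
    """Return (x, y, observed) rows under one missingness mechanism.

    The probabilities below are arbitrary illustrative choices.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = x + rng.gauss(0, 1)
        if mechanism == "MCAR":    # unrelated to x or y
            p_missing = 0.3
        elif mechanism == "MAR":   # depends only on the other variable x
            p_missing = 0.6 if x > 0 else 0.1
        elif mechanism == "NMAR":  # depends on y itself
            p_missing = 0.6 if y > 0 else 0.1
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        rows.append((x, y, rng.random() >= p_missing))
    return rows


def bias(rows, keep=lambda x: True):
    """Mean of observed y minus mean of all y, among rows where keep(x)."""
    all_y = [y for x, y, _ in rows if keep(x)]
    seen_y = [y for x, y, observed in rows if observed and keep(x)]
    return statistics.fmean(seen_y) - statistics.fmean(all_y)


results = {}
for mechanism in ("MCAR", "MAR", "NMAR"):
    rows = simulate(mechanism)
    overall = bias(rows)
    within = bias(rows, keep=lambda x: x > 0)
    results[mechanism] = (overall, within)
    print(f"{mechanism}: overall bias {overall:+.3f}, "
          f"bias within x > 0 {within:+.3f}")
```

Under MCAR the observed cases should look like the full sample. Under MAR the overall mean is biased, but within a level of x the observed cases are again representative, which is why x can be used to impute y. Under NMAR the bias remains even after conditioning on x.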

If Maximum Likelihood (ML) estimation is selected, certain programs, such as Mixed Models (see Chapter 12), use all data that are available from each participant, implicitly imputing missing dependent variable data as they create best-fit models and parameters based on all of the available data. However, they usually do not use cases that are missing predictor data, so they still cannot compensate for missing predictors. Moreover, they do not actually create or save imputed values for missing data that then can be used in other analyses.

The SPSS Missing Values Analysis program allows you to do both multiple and single imputation of data. However, in this chapter, we will show you only how to do multiple imputation of data, because this is considered the most accurate way to impute both independent and dependent variables that are nominal, dichotomous, ordinal, or scale data. First, we will examine the patterns of missing values in the data, using descriptive statistics and figures. This will help us decide whether multiple imputation is necessary.

Assumptions for Multiple Imputation of Data

There are two assumptions for Multiple Imputation of Data: (1) data cannot be NMAR and (2) data should have a multivariate normal distribution. Multiple imputation of data is appropriate when missingness of the data on a particular variable, y, is not systematically related to the values of variable y.

The data may be MCAR or MAR, but they should not be NMAR (the values on the very variable with missing data are systematically related to whether or not they are missing).

Although, as mentioned before, it is not possible to directly assess whether missingness of y is related to y (since y is missing for all of the relevant individuals), it may be possible to determine that it is likely that missingness on y is related to values of y. For example, you might be interested in how much participants valued your 6-week intervention, based on a self-report Likert scale. You might collect the data only at the end of the intervention period and only collect data from those who completed all sessions, thinking that participants couldn’t really determine how valuable the intervention was if they didn’t participate in all of the sessions. Although at first glance, the logic behind this approach seems sensible, it seems quite likely that the missing data (from those who declined to participate further sometime after the first session and/or missed some of the sessions) would have been lower values in comparison to those of participants who did participate in all sessions. The fact that they declined to fully participate is likely to be related to their finding the intervention less valuable. Thus, the assumption of multiple imputation that data are not NMAR would be violated.

Nevertheless, you might choose to do multiple imputation anyway and just acknowledge the limitations of the imputation under the circumstances, since the bias in the data would be even greater if you simply used listwise deletion to eliminate all cases with missing data. In the latter case, you clearly would be misrepresenting the level of value participants placed on the intervention. To the extent that missingness could be predicted by other variables, imputation would at least partially compensate for the bias in the data. However, you would need to acknowledge that data almost certainly were NMAR, so results probably do not accurately represent the low end of the measure and should be interpreted with caution. We will show you how to look at the patterns of missing data for the variables you plan to impute, which can help you in deciding whether it is reasonable to assume that data are not NMAR.

Multiple imputation also assumes a multivariate normal distribution. Some sense of this can be obtained by looking at Descriptives for the variables or by plotting the distributions of the variables and doing matrix scatterplots between pairs of the variables (see earlier chapters). Some corrections are possible if this assumption is violated; however, if no correction is made, the standard error (SE) may be too high or too low. These corrections, such as repeatedly drawing subsets of a fixed sample size from the data and bootstrapping the SE from these samples, are beyond the scope of this chapter but should be considered if it is apparent that there are important violations of normality in the data. The accuracy of the imputations can be improved by careful selection of the variables to include in the imputations. You should always include all of the variables that will be used in your analysis, but in addition, you may want to include other variables as predictors that you think may be related to missingness.
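As a rough illustration of the resampling idea mentioned above, the following Python sketch estimates a standard error with a generic nonparametric bootstrap. It is not the specific correction available in SPSS, and the function name bootstrap_se, the number of resamples, and the skewed example data are all assumptions made for this example.

```python
import random
import statistics


def bootstrap_se(values, statistic=statistics.fmean,
                 n_resamples=2000, seed=0):
    """Estimate the SE of a statistic by resampling with replacement.

    Each resample has the same fixed size as the original sample.
    """
    rng = random.Random(seed)
    n = len(values)
    estimates = [statistic(rng.choices(values, k=n))
                 for _ in range(n_resamples)]
    return statistics.stdev(estimates)


# Hypothetical, clearly non-normal (right-skewed) data.
data_rng = random.Random(7)
sample = [data_rng.expovariate(1.0) for _ in range(200)]

se_boot = bootstrap_se(sample)
se_formula = statistics.stdev(sample) / len(sample) ** 0.5

print(f"bootstrap SE of the mean: {se_boot:.4f}")
print(f"formula SE of the mean:   {se_formula:.4f}")
```

For the mean, the bootstrap SE and the usual formula should agree closely. The appeal of the bootstrap is that the same function can be passed a different statistic (for example, statistics.median) for which no simple formula is available.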

First Steps in Multiple Imputation

We will start by assessing the characteristics of the missing data to see whether it seems reasonable to consider the data missing at random (MAR). There are tests to assess whether the missing values are missing completely at random, but this condition is extremely rare in behavioral research unless one uses a planned missing values design (beyond the scope of this chapter), and it is not truly needed for multiple imputation. So, the best way to determine whether it is reasonable to consider the missing data MAR is to think logically about the likely reasons for missing data and to conduct analyses to see if the missing values show a clear pattern that can be explained in terms of the dependent variable.
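The kind of pattern check described here can be sketched outside SPSS as well. The Python example below is a minimal sketch under stated assumptions: the records and the variable names age, weight_t1, and weight_t4 are hypothetical (they are not the variables in the data file used in this chapter), and None marks a missing value. It counts missing values per variable, tabulates the patterns of missingness across variables, and compares an earlier measure between cases that are and are not missing a later measure.

```python
from collections import Counter
import statistics

# Hypothetical records; None marks a missing value.
records = [
    {"age": 14, "weight_t1": 40.0, "weight_t4": 45.0},
    {"age": 15, "weight_t1": 42.0, "weight_t4": None},
    {"age": 16, "weight_t1": 38.0, "weight_t4": None},
    {"age": 13, "weight_t1": 44.0, "weight_t4": 47.0},
    {"age": 17, "weight_t1": 36.0, "weight_t4": None},
    {"age": 14, "weight_t1": None, "weight_t4": 46.0},
    {"age": 15, "weight_t1": 43.0, "weight_t4": 48.0},
    {"age": 16, "weight_t1": 41.0, "weight_t4": 44.0},
]
variables = ["age", "weight_t1", "weight_t4"]


def missing_counts(rows, names):
    """Number of missing values for each variable."""
    return {v: sum(r[v] is None for r in rows) for v in names}


def pattern_counts(rows, names):
    """How many cases share each pattern of missingness.

    A pattern is a tuple of True/False flags (True = missing),
    one flag per variable, in the order given by names.
    """
    return Counter(tuple(r[v] is None for v in names) for r in rows)


def mean_by_missingness(rows, target, other):
    """Mean of `other` for cases missing vs. not missing `target`."""
    groups = {"missing": [], "observed": []}
    for r in rows:
        if r[other] is not None:
            key = "missing" if r[target] is None else "observed"
            groups[key].append(r[other])
    return {k: statistics.fmean(v) for k, v in groups.items() if v}


counts = missing_counts(records, variables)
patterns = pattern_counts(records, variables)
means = mean_by_missingness(records, "weight_t4", "weight_t1")

print(counts)
print(patterns)
print(means)
```

In these made-up data, cases missing weight_t4 have a lower mean weight_t1 than cases with weight_t4 observed. A difference like this suggests that missingness is related to another measured variable, which is consistent with MAR rather than MCAR, although it cannot rule out NMAR.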

We will use the anorectic3.sav data file, which is a modified version of a file that is provided with the SPSS program; it was altered to create missing values. It is a longitudinal dataset with four timepoints. At each timepoint, variables that might affect the tendency of participants with eating disorders to lose or gain weight were assessed.

In Problem 13.1, we will examine the data to see if the patterns of missing data are consistent with MAR, and thus whether multiple imputation is advisable. In Problem 13.2, we will conduct the imputation to obtain datasets with missing data “filled in”. Then, in Problem 13.3, we will use this new dataset with the imputed values in it to conduct an analysis.

Source: Leech, Nancy L. (2014). IBM SPSS for Intermediate Statistics (5th ed.). Routledge.
