Multiple Imputation of Missing Values - Logit Regression by using Stata

Chapter 8 introduced Stata’s methods for multiple imputation of missing values, illustrated by a regression example. Multiple-imputation methods work with other types of analysis as well, including the logit-type models discussed in this chapter. For an illustration, we return to the Granite State Poll data and the climate-change belief indicator warmop2. The previous section tested age, gender, education and political party as possible predictors of responses to the climate-knowledge question warmice. Those four background characteristics are the usual suspects in research on the social bases of environmental concern, so it is reasonable to guess that one or more will be related to warmop2. Should we also consider household income, another important background characteristic, as a possible predictor? One problem with income on surveys is that it tends to have many missing values, because many respondents are reluctant to answer the question.

Ten variables from Granite2011_06.dta will be used in this analysis. Four of these (employ, ownrent, married and yrslive) hold no theoretical interest with respect to climate change beliefs, but might prove helpful for imputing the missing values of income.
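
A missing-value report for these ten variables can be requested with misstable summarize. The line below is a sketch rather than the source's verbatim command: the ten variable names are taken from the text, and it assumes Granite2011_06.dta is already in memory.

. misstable summarize age sex educ party warmop2 income employ ownrent married yrslive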

Although we listed warmop2, sex and married in the misstable command, Stata detects that they have no missing values and does not include them in the output. On the other hand, out of 516 interviews, we have 171 missing values on income. If we regress the dichotomous variable warmop2 on income along with the usual suspects, our estimation sample includes just 340 observations.
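
The complete-case model just described could be fit along these lines. This is a sketch that assumes the usual suspects enter under the variable names age, sex, educ and party; the source's exact command and options are not reproduced here.

. logit warmop2 age sex educ party income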

Only political party appears to have a significant effect. Would we reach the same conclusion if we could run this analysis without setting so much of the data aside? Multiple imputation provides a way to approach that question.

As a first step for imputation, we drop the 42 observations that have missing values on any of the variables of interest, except for income. After doing this we have a dataset with 516 – 42 = 474 observations, including 137 for which income is missing.
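
One way to carry out this step, offered as a sketch rather than the source's own commands, is to count missing values across the nine variables other than income and drop any observation with a nonzero count. The variable name nmiss is hypothetical.

. egen nmiss = rowmiss(age sex educ party warmop2 employ ownrent married yrslive)

. drop if nmiss > 0

. drop nmiss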

Next we set the multiple imputation data format as mlong, a memory-efficient choice. income is registered as imputed, meaning we will try to fill in its missing values. Other variables are registered as regular, so they will not be imputed.

. mi set mlong

. mi register imputed income

(137 m=0 obs. now marked as incomplete)

. mi register regular warmop2 sex educ party employ ownrent married yrslive
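
As an optional check, not part of the steps described here, mi describe reports the data style and lists which variables are registered as imputed or regular.

. mi describe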

The 137 missing income values are then predicted by a regression of income on employ, ownrent, married and yrslive. Fifty sets of imputed values are created, each consisting of these 137 predicted values plus random noise. The logit regression is estimated separately within each of the fifty imputed datasets, and the results are pooled into a single set of estimates.
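
The imputation and pooled estimation just described might look like the following sketch. The seed value is arbitrary and is included only so that the imputations can be reproduced; the source's actual seed and any further options are not shown here.

. mi impute regress income employ ownrent married yrslive, add(50) rseed(12345)

. mi estimate: logit warmop2 age sex educ party income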

Following multiple imputation, the logit coefficient on party remains similar (-1.15 compared with the previous -1.18) and is still statistically significant. Other results show greater change, including shifts in coefficients and generally smaller standard errors, which reflect more precise estimates from the imputation-enhanced data. With these changes, the coefficients on age (negative) and educ (positive) now appear statistically significant as well, in keeping with results from many previous studies of climate-change beliefs. income, on the other hand, shows little effect in either the pre- or post-imputation model. That finding could support a decision to leave income out of a final model and focus on other, more important and less troublesome predictors.

Source: Hamilton, Lawrence C. (2012), Statistics with STATA: Version 12, 8th edition, Cengage Learning.
