Multiple Imputation of Missing Values in Stata

Nations3.dta contains information about 194 countries, but missing values restrict our analysis in previous sections to a subset of 178 that have complete information on all variables of interest. This listwise deletion approach of setting aside incomplete observations is, out of necessity, a common statistical practice. Its known drawbacks include loss of observations and statistical power. If the observations with missing values differ systematically from other observations, listwise deletion could also bias coefficient estimates.

There may be other variables in the data that are statistically related to those with missing values. In such cases regression could be used to predict what the missing values might be, and those predictions substituted for the missing values in further analytical steps. This regression imputation of missing values can restore observations and apparent statistical power, and reduce the likelihood of biased coefficients. However, the imputed values will generally have lower variance than non-missing values for that variable, leading to standard error estimates that are biased toward zero. In other words, regression imputation may cause us to over-estimate the precision or statistical significance of our results.
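As a rough illustration of the single regression imputation described above (not of Stata's mi procedures), the following sketch fills in missing values of school from a regression on adfert and urban, two variables in this dataset that have no missing values. The names school_hat and school_imp are hypothetical, introduced only for this example, and the code assumes Nations3.dta is in the working directory.

```stata
* Sketch of deterministic regression imputation; school_hat and school_imp
* are hypothetical variable names used only for illustration.
use Nations3.dta, clear

* Fit the prediction equation on observations where school is observed
regress school adfert urban

* Linear predictions are available wherever the predictors are nonmissing
predict double school_hat

* Keep observed values; substitute predictions only where school is missing
generate double school_imp = school
replace school_imp = school_hat if missing(school)

* Compare the spread of the original and filled-in versions
summarize school school_imp
```

Because the substituted values lie exactly on the regression line, they carry no residual variation, which is the source of the understated standard errors noted above.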

Multiple imputation of missing values starts from the core idea of regression imputation, then adds further steps to obtain more realistic estimates of standard errors or uncertainty. These involve creation of multiple sets of artificial observations in which missing values are replaced by regression predictions plus random noise. Then a final step pools the information of these multiple imputations to estimate the regression model along with its standard errors and tests.
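The pooling step is usually carried out with Rubin's combining rules, which is what mi estimate applies. The source text does not give formulas, so the notation below is an assumption made for illustration: q̂_m is a coefficient estimated from imputation m, U_m is its estimated variance (squared standard error), and M is the number of imputations.

```latex
% Pooled point estimate: the average over the M imputations
\bar{q} = \frac{1}{M} \sum_{m=1}^{M} \hat{q}_m

% Within-imputation variance: the average of the M estimated variances
W = \frac{1}{M} \sum_{m=1}^{M} U_m

% Between-imputation variance: how much the estimates vary across imputations
B = \frac{1}{M-1} \sum_{m=1}^{M} \left( \hat{q}_m - \bar{q} \right)^2

% Total variance and pooled standard error
T = W + \left( 1 + \frac{1}{M} \right) B, \qquad
\mathrm{SE}(\bar{q}) = \sqrt{T}
```

The between-imputation term B is what single regression imputation leaves out. It adds the uncertainty due to not knowing the missing values, so the pooled standard error is larger than one computed from a single filled-in dataset.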

Stata’s mi family of multiple-imputation procedures supports a variety of data organizations, estimation methods and modeling techniques including logit-type models for categorical variables. The Stata Multiple-Imputation Reference Manual covers the choices in 365 pages, and could well be supplemented by a companion volume filled with more examples.

For a basic example, we revisit the life expectancy regression.

Three of the variables in this analysis — loggdp, chldmort and school — have missing values that in combination reduce the available sample from 194 to 178 observations.

The misstable summarize command counts three types of observations, depending on their missing-value status:

obs = .          Stata’s default missing value, referred to as “soft missing.”

obs>.           Missing value codes shown as .a, .b, .c etc. which could take value labels, referred to as “hard missing.”

obs<.           Nonmissing values.

Stata can impute only soft missing values, and not the hard missing ones. All missing values in the Nations3.dta example are soft, so the situation is simple. A survey example with hard missing values will be considered in Chapter 9.
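A minimal sketch of how these counts might be requested for the three incomplete variables, assuming Nations3.dta is in the working directory:

```stata
* Count soft missing (Obs=.), hard missing (Obs>.) and nonmissing (Obs<.)
* values for the three variables that the text identifies as incomplete.
use Nations3.dta, clear
misstable summarize loggdp chldmort school
```

If the Obs>. column comes back empty for every variable, all missing values are soft and can be imputed.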

The first step in multiple imputation is to declare the data using an mi set command, which specifies how imputed values will be organized. There are four possible styles, described in the Reference Manual. For this example we choose the memory-efficient style mlong, in which new observations or rows will be added to the data.

. mi set mlong

In multiple-imputation notation, the original un-imputed data with missing values are denoted m = 0. Imputations with sets of filled-in missing values are denoted m = 1, m = 2, m = 3 and so forth. M denotes how many such imputations are done. Before proceeding further we need to register the variables of interest as one of three types.

imputed    Has missing values to be imputed.

passive    A variable that is a function of imputed variables or of other passive variables. It will have missing values in the original data (m = 0), and varying values in each imputation (m = 1, m = 2 etc.).

regular     Neither imputed nor passive, a variable that has the same values (missing or not) across all m.

Our example has missing values for loggdp, chldmort and school, so we register these variables as imputed.

. mi register imputed loggdp chldmort school

The other variables life, adfert, urban and regl contain no missing values and should be registered as regular.

. mi register regular life adfert urban regl
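Before imputing, it can help to confirm that the declarations took effect. The sketch below assumes the mi set and mi register commands above have already been run in the current session.

```stata
* Report the mi style, the number of imputations so far (M = 0 at this
* point), and which variables are registered as imputed or regular.
mi describe

* A one-line summary of the same mi settings
mi query
```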

The next step performs the actual imputation. Missing values of loggdp, chldmort and school (the mi register imputed variables) are imputed by regression on the mi register regular variables regl, adfert and urban. We use a multivariate normal (mvn) regression method. With the original data (including missing values) denoted m = 0, there will be 50 separate imputations, m = 1 through m = 50, each of which contains imputed values for the 16 observations that originally had missing values. Thus, 50*16 = 800 observations will be added to the data, making a total of 194 + 800 = 994 observations.
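The imputation command matching this description would look roughly like the following sketch. It assumes the data have been mi set and the variables registered as shown earlier. The add(50) option requests the 50 imputations, and rseed(12345) sets the random-number seed.

```stata
* Impute loggdp, chldmort and school from adfert, urban and regl using the
* multivariate normal method, adding 50 imputations with a fixed seed.
mi impute mvn loggdp chldmort school = adfert urban regl, add(50) rseed(12345)

* In mlong style the dataset now holds the original observations plus one
* added row per incomplete observation per imputation.
display _N
```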

Having 50 separate imputations, each with the missing values replaced, provides a basis for later estimating the sample-to-sample variation when we pool these values for regression. The rseed(12345) option in this mi impute command sets an arbitrary seed for Stata’s random-number generator. By using rseed() we make the example repeatable, which might be desirable for instructional purposes. Otherwise Stata will choose its own seed, giving slightly different results the next time the same command is given.

A final step uses these imputations to regress life expectancy on the six predictors. In principle the imputation process results in estimates that are more efficient (lower standard errors) and less biased than our earlier regression, which dropped all observations with missing values.
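A sketch of this pooled regression, assuming imputations already exist in memory (created with mi impute as described above) and using the six predictors named in this section:

```stata
* Fit the regression separately within each imputation and pool the
* coefficients and standard errors across imputations.
mi estimate: regress life adfert urban regl loggdp chldmort school
```

The mi estimate: prefix can wrap other estimation commands in the same way, which is how logit-type models are handled.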

These mi estimate results closely resemble those from our earlier ordinary regression. That presents a best-case scenario in which simple and more complicated methods agree: the findings appear reasonably stable. In a research report, we could present either analysis along with a footnote explaining that we had tested another approach as well, and reached the same conclusion.

Chapter 9 presents a second example of multiple imputation, using survey data and logistic instead of linear regression models. Consult help mi and the Stata Multiple-Imputation Reference Manual for more on this topic.

Source: Hamilton, Lawrence C. (2012), Statistics with STATA: Version 12, 8th edition, Cengage Learning.
