Dummy Variables in Linear Regression Using Stata

Categorical variables can become predictors in a regression when they are expressed as one or more {0,1} dichotomies called dummy variables. For example, we have already seen large regional differences in life expectancies (Figure 7.1). The categorical variable region takes values from 1 (Africa) to 5 (Oceania), which can be re-expressed as a set of five {0,1} dummy variables. The tabulate command offers an automatic way to do this, generating one dummy variable for each category of the tabulated variable when we include a gen (generate) option. With gen(reg), the resulting dummy variables are named reg1 through reg5. reg1 equals 1 for African nations and 0 for others; reg2 equals 1 for the Americas and 0 for others; and so forth.
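As a minimal sketch of this step, the commands below build a tiny synthetic region variable and let tabulate create the dummies. The data are made up for illustration and are not the book's dataset; only the variable name region and the stub reg follow the text.

```stata
* Synthetic stand-in for the region variable (illustration only, not the book's data)
clear
set obs 10
* Region codes cycle through 1, 2, 3, 4, 5
generate region = mod(_n - 1, 5) + 1

* Frequency table plus one {0,1} dummy per category: reg1, reg2, ..., reg5
tabulate region, gen(reg)

* Inspect the result: reg1 is 1 where region == 1 and 0 elsewhere, and so on
list region reg1-reg5
```

In every row exactly one of the five dummies equals 1, which is why they carry the same information as region itself.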

Regressing life on one dummy variable, reg1 (Africa), is equivalent to performing a two-sample t test of whether mean life is the same across categories of reg1. Is mean life expectancy in Africa significantly different from that in other parts of the world?

The t test confirms that the 16.72-year difference between means for Africa (56.49) and other regions (73.21) is statistically significant (t = 15.17, p < .0005, which Stata displays as 0.000). We get exactly the same result from the dummy variable regression (t = -15.17, p < .0005); the sign differs only because the t test subtracts the Africa mean from the other-regions mean. The intercept (b₀ = 73.21) is the mean life expectancy in other regions, and the coefficient on reg1 (b₁ = -16.72) likewise indicates that mean life expectancy is 16.72 years lower in Africa.
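The equivalence can be checked on any data. The sketch below uses synthetic values (the group sizes, means, and noise level are assumptions chosen for illustration, so the numbers will not match those quoted above) and compares the t test with the dummy variable regression.

```stata
* Synthetic data (illustration only): 40 observations with reg1 = 1, 120 with reg1 = 0
clear
set seed 12345
set obs 160
generate reg1 = mod(_n, 4) == 0
generate life = 73 - 17*reg1 + rnormal(0, 8)

* Two-sample t test (equal variances assumed, the default)
ttest life, by(reg1)
scalar mean0 = r(mu_1)
scalar mean1 = r(mu_2)
scalar t_ttest = r(t)

* Dummy variable regression
regress life reg1

* Each difference should be zero apart from rounding error:
* intercept vs. mean when reg1 == 0, slope vs. difference in means,
* and regression t vs. the t test statistic with its sign reversed
display _b[_cons] - mean0
display _b[reg1] - (mean1 - mean0)
display _b[reg1]/_se[reg1] + t_ttest
```

The match is exact only for the equal-variance t test; adding the unequal option to ttest would give a different standard error than ordinary least squares.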

Figure 7.6 graphs this dummy variable regression. All the data points line up along two vertical bands at reg1 = 1 (Africa) and reg1 = 0 (elsewhere). To spread the points out visually, this graph employs a jitter(5) option, which adds a small amount of spherical random noise to the location of each point so that they do not all plot on top of each other. jitter() does not affect the regression line, which simply connects the mean of life when reg1 = 0 (73.21) with the mean of life when reg1 = 1 (56.49). Both of these means, or predicted values, are plotted as solid squares. The difference between the two means equals the regression slope, -16.72 years. Note that the 0 and 1 values of reg1 are relabeled in the xlabel() option of this graph command.
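A graph of this kind could be drawn along the following lines. This is a sketch on synthetic data, not the command that produced Figure 7.6; the marker choices, the label text, and the variable name lifehat are assumptions. The /// line continuations require running the commands from a do-file.

```stata
* Synthetic data (illustration only, not the book's data)
clear
set seed 12345
set obs 160
generate reg1 = mod(_n, 4) == 0
generate life = 73 - 17*reg1 + rnormal(0, 8)

* Fit the regression and save predicted values (the two group means)
regress life reg1
predict lifehat

* Jittered data points, the fitted line, and the two means as solid squares
graph twoway (scatter life reg1, jitter(5) msymbol(Oh)) ///
    (line lifehat reg1, sort) ///
    (scatter lifehat reg1, msymbol(S) msize(large)), ///
    xlabel(0 "Other regions" 1 "Africa") legend(off)
```

Because jitter() is given only to the first scatter plot, the line and the squares are drawn at exactly 0 and 1.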

The five world regions have been re-expressed as five dummy variables, but it is not possible to include all five in one regression because of perfect multicollinearity: the values of any four of these dummy variables determine the fifth exactly. Consequently, we represent all the information of a k-category categorical variable through k-1 dummy variables; the omitted category (here region 5, Oceania) serves as the reference against which the others are compared. For example, we earlier saw that per capita gross domestic product (in log form, loggdp) and child mortality rate (chldmort) together explain about 88% of the variance in life expectancy. Including four dummy variables for regions 1-4 raises this only to about 89% (R²_a = .8872).
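The collinearity is easy to see directly. In this sketch on synthetic data (the values are made up for illustration; dsum is a hypothetical helper variable), the five dummies sum to 1 in every row, duplicating the regression constant, so Stata omits one of them when all five are requested.

```stata
* Synthetic data (illustration only, not the book's data)
clear
set seed 12345
set obs 200
generate region = mod(_n - 1, 5) + 1
tabulate region, gen(reg)
generate life = 70 - 10*reg1 + rnormal(0, 5)

* The five dummies always sum to 1, the same as the constant term
generate dsum = reg1 + reg2 + reg3 + reg4 + reg5
summarize dsum

* Asking for all five: Stata omits one because of collinearity
regress life reg1 reg2 reg3 reg4 reg5

* k - 1 = 4 dummies: region 5 becomes the reference category
regress life reg1 reg2 reg3 reg4
```

In the second regression each dummy coefficient is the difference between that region's mean and the mean of the omitted region 5.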

None of the regional dummy variables has a significant effect when we include them all and control for loggdp and chldmort. The nonsignificant coefficients suggest that a simpler model might fit just as well and give a clearer picture of those effects that really do matter. The first step toward a reduced model involves dropping reg3, the weakest of these predictors. The resulting model fits just as well (R²_a = .8873) and yields more precise estimates (lower standard errors) of the other region effects. The coefficient on reg1 now appears significant.

Next, dropping reg4 and finally reg2 results in a reduced model that still explains 89% of the variance in life expectancy (R²_a = .8879) but with just three predictors: reg1, loggdp and chldmort.
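The sequence of models described above can be sketched as follows. The data here are synthetic, with made-up distributions and coefficients, so the adjusted R² values will not reproduce those in the text; only the variable names follow it.

```stata
* Synthetic data (illustration only; distributions and coefficients are made up)
clear
set seed 12345
set obs 200
generate region = mod(_n - 1, 5) + 1
tabulate region, gen(reg)
generate loggdp = rnormal(3.8, 0.6)
generate chldmort = exp(rnormal(3, 1))
generate life = 60 + 4*loggdp - 0.15*chldmort - 3*reg1 + rnormal(0, 3)

* Full model: four regional dummies plus the two covariates
regress life reg1 reg2 reg3 reg4 loggdp chldmort
display "Adjusted R-squared: " e(r2_a)

* Drop reg3
regress life reg1 reg2 reg4 loggdp chldmort
display "Adjusted R-squared: " e(r2_a)

* Drop reg4 and reg2 as well, leaving three predictors
regress life reg1 loggdp chldmort
display "Adjusted R-squared: " e(r2_a)
```

Comparing e(r2_a) across the three fits shows how little is lost by removing the nonsignificant dummies.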

From this purely statistical investigation we might conclude that the differences in life expectancy among other regions of the world are largely accounted for by variations in wealth and child mortality, but in Africa there are circumstances at work (such as wars) that further depress life expectancy.

Source: Hamilton, Lawrence C. (2012), Statistics with Stata: Version 12, 8th edition, Cengage Learning.
