Diagnostic Graphs with Linear Regression Using Stata

Stata offers many specialized graphs useful for diagnostic purposes following a regression. A few of these are illustrated in this section; type help regress postestimation for a list. Our example here is an elaboration of the Arctic ice model, in which September sea ice area is predicted from year and year squared (after year is centered), along with annual surface air temperature anomaly (tempN). Centered year (year0) and its square (year02) were calculated earlier, but are generated again here assuming they no longer exist in memory. The three predictors together explain about 82% of the variance in ice area.

. use C:\data\Arctic9.dta, clear
. gen year0 = year - 1995
. gen year02 = year0^2
. regress area year0 year02 tempN

      Number of obs =      33
      F(3, 29)      =   49.72
      Prob > F      =  0.0000
      R-squared     =  0.8372
      Adj R-squared =  0.8204
      Root MSE      =  .35889

------------------------------------------------------------------------------
        area |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       year0 |  -.0601115   .0111399    -5.40   0.000    -.0828951   -.0373279
      year02 |  -.0019336   .0008202    -2.36   0.025    -.0036111   -.0002562
       tempN |  -.2796799   .1538698    -1.82   0.079     -.594338    .0349783
       _cons |    5.24665   .1344514    39.02   0.000     4.971666    5.521634
------------------------------------------------------------------------------

As in the previous section, a test for residual autocorrelation is prudent. The Q test finds no significant autocorrelation at lags 1 through 10 — that is, comparing residuals from each year with residuals from 1 through 10 years previously. Some autocorrelation does appear at lags longer than 10, but that is unlikely to affect our results.

. predict areares2, resid
. corrgram areares2, lag(10)

                                         -1       0       1 -1       0       1
 LAG       AC       PAC      Q     Prob>Q [Autocorrelation]  [Partial Autocor]
-------------------------------------------------------------------------------
   1     0.1140   0.1141   .46911  0.4934
   2    -0.1826  -0.2003   1.7112  0.4250
   3    -0.3273  -0.2968   5.8358  0.1199
   4    -0.0554  -0.0157   5.9581  0.2023
   5     0.0238  -0.1040   5.9816  0.3080
   6    -0.1620  -0.4049   7.1046  0.3113
   7    -0.1077  -0.1646   7.62    0.3673
   8     0.2332   0.3384   10.132  0.2559
   9     0.3583   0.2410   16.309  0.0607
  10    -0.0160  -0.2435   16.322  0.0908

A residual-versus-fitted plot could be drawn by calculating predicted values and graphing areares2 against them. A faster way is the rvfplot command. The example in Figure 7.17 adds a horizontal line marking zero, the residual mean. It also labels the data points by year. The plot reveals one outlier with a high positive residual (1996) but no obvious signs of trouble.
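The manual route would look something like this (a minimal sketch, not from the source; areahat is a hypothetical name for the fitted values):

. predict areahat                       // fitted values (default xb statistic)
. scatter areares2 areahat, yline(0) mlabel(year)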

. rvfplot, yline(0) mlabel(year)

[Figure 7.17: Residual-versus-fitted plot for the area regression, with a horizontal line at zero and markers labeled by year; x axis: Fitted values.]

 

Added-variable plots are valuable diagnostic tools known by different names, including partial-regression leverage plots, adjusted partial residual plots, and adjusted variable plots. They depict the relationship between y and one x variable, adjusting for the effects of other x variables. If we regressed y on x2 and x3, and likewise regressed x1 on x2 and x3, then took the residuals from each regression and graphed these residuals in a scatterplot, we would obtain an added-variable plot for the relationship between y and x1, adjusted for x2 and x3. The avplot command performs the necessary calculations automatically. We can draw the added-variable plot for predictor tempN, for example, just by typing

. avplot tempN
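To see what avplot is doing behind the scenes, the same graph could be pieced together by hand (a minimal sketch, not from the source; e_area and e_tempN are hypothetical variable names):

. regress area year0 year02             // regress y on the other predictors
. predict e_area, resid
. regress tempN year0 year02            // regress tempN on the other predictors
. predict e_tempN, resid
. scatter e_area e_tempN || lfit e_area e_tempN

The slope of the lfit line here should match the tempN coefficient from the full three-predictor regression.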

Speeding the process further, we could type avplots to obtain a complete set of tiny added-variable plots, one for each predictor in the preceding regression. Figure 7.18 shows the results from the regression of area on year0, year02 and tempN. The lines drawn in added-variable plots have slopes equal to the corresponding partial regression coefficients. For example, the slope of the line at lower left in Figure 7.18 equals -.2797, exactly the coefficient on tempN in our 3-predictor regression.

Added-variable plots help to identify observations exerting a disproportionate influence on the regression model. In simple regression with one x variable, ordinary scatterplots suffice for this purpose. In multiple regression, however, the signs of influence become more subtle. An observation with an unusual combination of values on several x variables might have high leverage, or potential to influence the regression, even though none of its individual x values is unusual by itself. High-leverage observations show up in added-variable plots as points horizontally distant from the rest of the data. Most of the horizontally extreme points in Figure 7.18 appear at positions consistent with the rest of the data, however.

One high outlier appears in the upper-right plot of Figure 7.18, suggesting a possible influence that steepens (makes more negative) the coefficient on year02. When we draw that one added-variable plot using the singular avplot command and label its data points, 1996 shows up as the outlier.

. avplot year02, mlabel(year)

 

Component-plus-residual plots (produced by cprplot) take a different approach. A component-plus-residual plot for variable x1 graphs each residual plus its component predicted from x1,

e_i + b_1 x_1i

against values of x1. Such plots might help diagnose nonlinearities and suggest alternative functional forms. An augmented component-plus-residual plot (Mallows 1986) works somewhat better, although both types often seem inconclusive. Figure 7.20 shows an augmented component-plus-residual plot from the regression of area on year0, year02 and tempN.
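The unaugmented version of such a plot could be built by hand (a minimal sketch, not from the source; e and cpr are hypothetical variable names):

. quietly regress area year0 year02 tempN
. predict e, resid                      // residuals e_i
. generate cpr = e + _b[tempN]*tempN    // e_i + b_1*x_1i, with x1 = tempN
. scatter cpr tempN || lfit cpr tempN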

. acprplot tempN, lowess

The straight line in Figure 7.20 corresponds to the regression model. The curved line reflects lowess smoothing, which would show us if there were much nonlinearity. The curve's downturn at extreme left can be disregarded as a lowess artifact, because only a few cases determine its location (see Chapter 8). If more central parts of the lowess curve showed a systematically curved pattern, departing from the linear regression model, we would have reason to doubt the model's adequacy. In Figure 7.20, however, the component-plus-residual medians closely follow the regression model. This plot reinforces the conclusion that the present regression model adequately accounts for the nonlinearity visible in the raw data, leaving none in its residuals.

As its name implies, a leverage-versus-squared-residuals plot graphs leverage (hat matrix diagonals) against the residuals squared. Figure 7.21 shows such a plot for the area regression. To identify individual outliers, we label the markers by year. The option mlabsize(medsmall) calls for medium-small marker labels, somewhat larger than the default size of small. (See help textsizestyle for a list of other choices.) mlabpos(11) places these labels at an 11 o'clock position relative to the marker symbols. Most of the years form a jumble at lower left in Figure 7.21, but 1996 once again stands out.

. lvr2plot, mlabel(year) mlabsize(medsmall) mlabpos(11)

Lines in a leverage-versus-squared-residuals plot mark the means of leverage (horizontal line) and squared residuals (vertical line). Leverage tells us how much potential for influencing the regression an observation has, based on its particular combination of x values. Extreme x values, or unusual combinations of them, give an observation high leverage. A large squared residual indicates an observation with a y value much different from that predicted by the regression model. 1996 has by far the largest squared residual, indicating that the model fits that year least well. But its combination of tempN and year values is middle-of-the-road, so 1996 has below-average leverage.
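Both ingredients of the plot can be computed directly if we want to inspect their values (a sketch, not from the source; hat, res and res2 are hypothetical variable names, and lvr2plot itself uses normalized residuals squared rather than the raw squares shown here):

. predict hat, leverage                 // hat matrix diagonals
. predict res, resid
. generate res2 = res^2
. scatter hat res2, mlabel(year)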

Diagnostic graphs and statistics draw attention to influential or potentially influential observations, but they do not say whether we should set those observations aside. That requires a substantive decision based on careful evaluation of the data and research context. There is no substantive justification for setting aside 1996 in the Arctic ice example, but suppose just for illustration we try that step anyway. As might be expected from 1996's low leverage, omitting this year turns out to make little difference to the area regression results. The coefficients on year0 and tempN remain about the same, with tempN's effect now significant. The coefficient on year02 is a bit closer to zero, but still negative and significant. Adjusted R² increases slightly, from .82 to .85, with this nonconforming year left out. The fact that these differences are minor, and that we have no substantive reason to discard 1996, both argue for keeping it in the analysis.

. regress area year0 year02 tempN if year!=1996

      Source |       SS       df       MS              Number of obs =      32
-------------+------------------------------          F(3, 28)      =   61.82
       Model |  18.9700508     3  6.32335027          Prob > F      =  0.0000
    Residual |   2.8638963    28   .10228201          R-squared     =  0.8688
-------------+------------------------------          Adj R-squared =  0.8548
       Total |  21.8339471    31  .704320873          Root MSE      =  .31982

------------------------------------------------------------------------------
        area |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       year0 |  -.0602946   .0099273    -6.07   0.000    -.0806301   -.0399591
      year02 |  -.0015288   .0007439    -2.05   0.049    -.0030527   -4.87e-06
       tempN |  -.2820721   .1371037    -2.06   0.049    -.5629203   -.0012239
       _cons |   5.182518    .121813    42.54   0.000     4.933014    5.432002
------------------------------------------------------------------------------
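Influence statistics offer a numerical complement to these graphs. As a hedged sketch (not from the source), Stata's dfbeta command creates variables _dfbeta_1, _dfbeta_2, ... giving each observation's scaled influence on each coefficient, which we could list for 1996:

. quietly regress area year0 year02 tempN
. dfbeta                                // one _dfbeta_ variable per predictor
. list year _dfbeta_1 _dfbeta_2 _dfbeta_3 if year == 1996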

 

Chambers et al. (1983) and Cook and Weisberg (1994) provide more detailed examples and explanations of diagnostic plots and other graphical methods for data analysis.

Source: Hamilton Lawrence C. (2012), Statistics with STATA: Version 12, Cengage Learning; 8th edition.
