Confidence Bands in Simple Regression Using Stata

This section introduces some additional graphics that help to visualize a regression model or diagnose possible problems. Continuing with the Arctic9.dta data, variable tempN describes mean annual air temperature anomalies for the entire region from 64 to 90 degrees north latitude, estimated from land and sea surface records by NASA. Temperature anomalies represent differences, in degrees Celsius, from regional temperatures over the reference years 1951-1980. Positive anomalies thus represent above-average temperatures, relative to 1951-1980.

We have seen that area, extent and volume of Arctic sea ice (especially at their September minimum) have declined over the 1979-2011 period of satellite observation. Unsurprisingly, Arctic surface air temperatures warmed over this same period, although this is not the whole cause for sea ice decline. The warming trend amounts to about .058 °C/year, or .58 °C/decade — considerably faster than the globe as a whole. For comparison, other NASA data not shown here indicate a global warming trend of just .16 °C/decade over these years.

Figure 7.15 plots this upward trend in Arctic temperature anomalies. Actual temperatures are overlaid on a regression line with a 95% confidence interval for the conditional mean, as specified by the twoway lfitci, stdp part of this command. Other options call for a medium-thick regression line and medium-large text for the main title. A degree symbol, ASCII character 186, is inserted in the y-axis title (see Figure 3.16 in Chapter 3 for other ASCII characters).

. graph twoway lfitci tempN year, stdp lwidth(medthick)
      || connect tempN year, msymbol(Th)
      || , ytitle("Annual temperature anomaly, `=char(186)'C")
      legend(off) xlabel(1980(5)2010) yline(0)
      title("Arctic temperature trend with 95% c.i. for conditional mean", size(medlarge))

Many yearly values lie outside the confidence intervals in Figure 7.15, emphasizing the fact that these intervals refer to conditional mean values, or the trend itself, rather than to individual predictions. Suppose we wished to make an individual prediction for the year 2012, and also find an appropriate confidence interval for this prediction. One way to do that would be to use the Data Editor to add a new 34th row of data, containing only the year value 2012. Alternatively, the following two commands accomplish the same thing.

. set obs 34

. replace year = 2012 in 34

Then repeat the regression of tempN on year, and use predict to obtain predicted values and standard errors of forecasts (stdf). The lower and upper 95% confidence limits are approximately the predicted values minus and plus twice the standard error of forecast: tempNhat minus or plus 2*tempNse.
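For reference, the two standard errors that predict can return after a simple regression differ by a single term. The expressions below are the standard textbook formulas rather than anything quoted from the source, and the symbols are notation introduced here: s is the residual standard deviation, n the number of observations used in the regression, and x_0 the year for which a prediction is wanted.

```latex
\begin{aligned}
\text{stdp:}\quad \mathrm{SE}_{\text{mean}}(x_0) &= s\sqrt{\frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}} \\[4pt]
\text{stdf:}\quad \mathrm{SE}_{\text{forecast}}(x_0) &= s\sqrt{1 + \frac{1}{n} + \frac{(x_0-\bar{x})^2}{\sum_{i=1}^{n}(x_i-\bar{x})^2}}
\end{aligned}
```

The extra 1 under the square root reflects the scatter of individual years around the trend, which is why forecast intervals are wider than intervals for the conditional mean. Both widen as x_0 moves away from the mean year.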

. predict tempNhat

. label variable tempNhat "Predicted temperature"

. predict tempNse, stdf

. label variable tempNse "Standard error of forecast"

. gen tempNlo = tempNhat - 2*tempNse

. label variable tempNlo "lower confidence limit"

. gen tempNhi = tempNhat + 2*tempNse

. label variable tempNhi "upper confidence limit"

. list year tempN* in -5/l
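The multiplier 2 used above is a convenient approximation. What follows is a minimal sketch of the exact version, assuming the model is regress tempN year and that Arctic9.dta is in memory with the 2012 row appended as described. The variable names ending in X and the scalar tcrit are hypothetical, chosen only so they do not collide with the variables already created.

```stata
* Assumes Arctic9.dta is in memory with the extra 2012 row added above.
regress tempN year

* Hypothetical names (suffix X) to avoid clashing with tempNhat, tempNse, etc.
predict tempNhatX
predict tempNseX, stdf

* Exact two-sided 95% multiplier from the t distribution, using the
* residual degrees of freedom that regress stores in e(df_r).
scalar tcrit = invttail(e(df_r), 0.025)
display "t multiplier = " tcrit

generate tempNloX = tempNhatX - tcrit*tempNseX
generate tempNhiX = tempNhatX + tcrit*tempNseX

* Show the forecast and its limits for the last row (year 2012).
list year tempNhatX tempNloX tempNhiX in l
```

With 33 observations and 2 estimated parameters the multiplier comes out at about 2.04, so these limits differ only slightly from the plus-or-minus-2 version.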

We might now graph the tempNlo and tempNhi values in a range area (twoway rarea), range spike (rspike), capped spike (rcap) or similar plot to show the confidence intervals. Figure 7.16 takes a simpler approach using twoway lfitci, stdf range(1979 2012). We overlay both a connected-line (connect) plot of observed temperatures 1979-2011 with hollow triangles as markers (msymbol(Th)), and a scatterplot of the predicted temperature for 2012 only, with a square marker symbol (msymbol(S)). Added text states the numerical predicted value and confidence limits (tempNhat, tempNlo and tempNhi for 2012, copied from the list output above). Note that the lfitci, stdf confidence band is wide enough to cover roughly 95% of the observations, unlike the lfitci, stdp band in Figure 7.15.
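The command behind Figure 7.16 is not reproduced here. The following is a rough sketch of how such a graph might be drawn from the description above, not the book's exact command. It is written for a do-file (hence the /// line continuations) and assumes tempNhat already exists from the earlier predict step. The title wording is an illustrative choice, and the added text with numerical values is left out.

```stata
* Sketch only: approximates the graph described for Figure 7.16.
* Assumes tempNhat was created by predict and that row 34 holds year 2012.
graph twoway lfitci tempN year, stdf range(1979 2012) lwidth(medthick) ///
    || connect tempN year, msymbol(Th)                                  ///
    || scatter tempNhat year if year == 2012, msymbol(S)                ///
    || , legend(off) xlabel(1980(5)2010) yline(0)                       ///
    ytitle("Annual temperature anomaly, `=char(186)'C")                 ///
    title("Arctic temperature trend with 95% c.i. for predictions", size(medlarge))
```

The predicted value and limits could then be placed on the graph with the text() option, using the numbers shown by list.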

The temperature, ice area and other variables in Arctic9.dta form time series, a type of data that often exhibits autocorrelation, or serial correlation between successive data values. If regression errors are in fact autocorrelated, then the usual formulas for standard errors, confidence intervals and hypothesis tests — such as those used in this section — could be misleading. Consequently, researchers modeling time series data routinely check for residual autocorrelation, and apply specialized time series regression methods when it is present.

Time series regression methods (Chapter 12) require data that are declared as time series using the tsset command. This identifies the variable providing an index of time.

. tsset year

        time variable:  year, 1979 to 2011
                delta:  1 year

For tsset data, several methods become available to check for autocorrelation. One well-known but minimally informative method is the Durbin-Watson test.

. estat dwatson

Durbin-Watson d-statistic( 2, 33) = 2.091689

Textbooks often contain look-up tables for the Durbin-Watson test. With 33 observations and 2 parameters estimated, the calculated value of 2.09 lies well above the α = .05 table’s upper limit (1.51), so we do not reject the null hypothesis that there is no first-order, positive autocorrelation. That is good news for the validity of Figures 7.15-16. It offers no assurance about autocorrelation at other lags such as two, three or four years previously, however.
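A short reminder of why values near 2 are reassuring. The expressions below are the standard definition of the Durbin-Watson statistic and its usual large-sample approximation, not quoted from the source. The symbols are notation introduced here: e_t is the residual for year t and r_1 is the lag-1 autocorrelation of the residuals.

```latex
d = \frac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2} \approx 2(1 - r_1)
\qquad\Longrightarrow\qquad
r_1 \approx 1 - \frac{d}{2} = 1 - \frac{2.091689}{2} \approx -0.046
```

An implied lag-1 autocorrelation this close to zero, and slightly negative, gives no hint of positive first-order autocorrelation.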

A more informative approach calculates autocorrelation coefficients for the residuals at many lags, with a cumulative portmanteau test or Ljung-Box Q statistic. This test is accomplished by applying corrgram to the model residuals (here named tempNres).
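The commands that produce this output are not reproduced here. A minimal sketch follows, assuming the model is regress tempN year and that 14 lags are requested to match the discussion. The name tempNres is the one used in the text.

```stata
* Assumes Arctic9.dta is in memory and has been declared with tsset year.
regress tempN year

* Save the residuals under the name used in the text.
predict tempNres, resid

* Autocorrelations, partial autocorrelations and Ljung-Box Q tests
* for lags 1 through 14.
corrgram tempNres, lags(14)
```

Each row of the corrgram table reports Q and its p-value for the joint hypothesis that the autocorrelations at all lags up to that row are zero.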

The Q tests in this output do not approach statistical significance at any lags from 1 to 14 years. Thus, corrgram more persuasively agrees with estat dwatson in finding no significant autocorrelation among residuals from the temperature model. The tests and confidence intervals in this section are not called into question.

Source: Hamilton, Lawrence C. (2012), Statistics with Stata: Version 12, 8th edition, Cengage Learning.
