Collapsing Data in Stata

Long after a dataset has been created, we might discover that for some purposes it has the wrong organization. Fortunately, several commands facilitate drastic restructuring of datasets. The simplest of these, collapse, aggregates data into means, medians or other statistics for groups defined by one or more variables. For illustration, we return to the data on monthly global temperatures from January 1880 to December 2011 (global2.dta), graphed earlier in Figure 2.1.

With collapse, we could build a simplified dataset containing mean temperature anomalies for 132 years instead of 1,584 separate months.

. collapse (mean) temp, by(year)

. label variable temp "NCDC annual mean temp anomaly, deg C"

. save C:\data\global_yearly.dta, replace

. describe
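collapse can also compute several statistics in one pass. The commands below are a minimal sketch rather than part of the original example. They assume that global2.dta is stored in C:\data (the folder used in the save command above), and the new variable names meantemp, medtemp and maxtemp are our own hypothetical choices.

. use C:\data\global2.dta, clear

. collapse (mean) meantemp = temp (median) medtemp = temp (max) maxtemp = temp, by(year)

. list in 1/5

Because collapse replaces the data in memory, we would need to reload global_yearly.dta before going on to work with the annual temp variable.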

Our new annual dataset might be visualized with a spike plot, in which vertical spikes show how far each year’s temperature anomaly lies above or below the 1901-2000 mean.
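One way to draw such a plot is sketched below, assuming the collapsed annual data (variables temp and year) are in memory. The base(0) and yline(0) options are our own choices: since the anomalies are measured relative to the 1901-2000 mean, zero is the natural origin for the spikes.

. graph twoway spike temp year, base(0) yline(0)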

A wider range of statistics can be collected using the flexible statsby command, which works as a prefix for other analyses. In the following example we return to global2.dta and generate a new variable called decade (1880 for years 1880-1889, 1890 for 1890-1899, and so forth). Then we create a new dataset consisting of summarize statistics for temperature, by decade.
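The steps just described could be carried out along the following lines. This is a sketch under stated assumptions, not necessarily the exact commands behind the published example: the file location C:\data is assumed from the save command shown earlier, and int(year/10) is one of several ways to round years down to their decade.

. use C:\data\global2.dta, clear

. generate decade = 10*int(year/10)

. statsby, by(decade) clear: summarize temp

The clear option tells statsby to replace the data in memory with the new dataset of results, one observation per decade.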

The new dataset contains the number of observations, mean, variance, maximum and other summarize statistics for each decade. Figure 2.3 graphs the maximum monthly temperature anomaly (max) for each decade (setting aside the “2010” decade, which has only two years of data, 2010 and 2011).
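A graph of this kind might be produced as follows. This is a hedged sketch that assumes the statsby results dataset (with variables max and decade) is in memory; the choice of a connected-line plot is ours and may differ from the published figure.

. graph twoway connected max decade if decade < 2010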

statsby can also make datasets of results from regression models or other analyses. Type help statsby or consult the Data Management Reference Manual for more information and examples. Selecting

Statistics > Other > Collect statistics for a command across a by list

from the menus brings up the dialog box for this command. Another useful aggregation command, contract, creates a dataset that resembles a frequency table for any combinations of specified variables (see help contract).
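Two brief sketches illustrate these uses. Neither is taken from the original example. The first assumes global2.dta is in C:\data and fits a separate regression of temp on year within each decade, keeping the coefficients (_b) and standard errors (_se). The second uses auto.dta, an example dataset shipped with Stata, simply because it contains categorical variables suitable for a frequency table.

. use C:\data\global2.dta, clear

. generate decade = 10*int(year/10)

. statsby _b _se, by(decade) clear: regress temp year

. sysuse auto, clear

. contract foreign rep78

. list

After contract, the data in memory hold one observation per combination of foreign and rep78, with the count stored in a new variable named _freq.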

Source: Hamilton, Lawrence C. (2012), Statistics with STATA: Version 12, 8th edition, Cengage Learning.
