Computing g from Commonly Reported Results

As when computing r, you can compute standardized mean differences from a wide range of commonly reported information. Although I have presented three different types of standardized mean differences (g, d, and gGlass), I describe only the computation of g in detail in the following. If you are interested in using gGlass as the effect size in your meta-analysis, the primary studies must report means and standard deviations for both groups; if this is the case, then you can simply compute gGlass using Equation 5.7. If you prefer d to g (although, again, they are virtually identical with larger sample sizes), then you can use the methods described in this section to compute g and then transform g into d using the following equation with no loss of precision:

$$ d = g\sqrt{\frac{n_1 + n_2}{n_1 + n_2 - 2}} $$

1. From Descriptive Data

The most straightforward situation arises when the primary study reports means and standard deviations for both groups of interest. In this situation, you simply compute g directly from this information using Equation 5.5. For convenience, this equation is

$$ g = \frac{M_1 - M_2}{s_{pooled}} \tag{5.5} $$
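As a rough illustration of this computation, the following Python sketch computes g from reported means, standard deviations, and group sizes. It assumes that the pooled standard deviation weights each group's variance by its degrees of freedom (n - 1); the function name and the example values are hypothetical.

```python
import math

def g_from_descriptives(m1, s1, n1, m2, s2, n2):
    """Standardized mean difference g from group means, SDs, and sizes.

    Assumption: the pooled SD weights each group's variance by n - 1.
    """
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)

# Hypothetical study: Group 1 scores one point higher; both SDs equal 2.
g = g_from_descriptives(m1=10.0, s1=2.0, n1=30, m2=9.0, s2=2.0, n2=30)
print(round(g, 3))
```

A positive value indicates that Group 1 has the higher mean, so the order in which the groups are entered determines the sign.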

 

2. From Inferential Tests

2.1. Continuous Dependent Variables

As when computing r, it is possible to compute g from the result of independent sample t-tests or 1 df F-ratios (see below for dependent sample or repeated-measures tests). For the independent sample t-test, the relevant equation is:

$$ g = \frac{t\sqrt{n_1 + n_2}}{\sqrt{n_1 n_2}} = \frac{2t}{\sqrt{N}} \tag{5.20} $$

When these two group sizes are equal, this equation simplifies to the ratio on the right. In instances where the primary studies report the results of the t-test but not the sample sizes for each group (but instead only an overall sample size), this simplification can be used if the group sizes can be assumed to be approximately equal. Figure 5.1 (see similar demonstration in Rosenthal, 1991) shows the percentage underestimation in g when one incorrectly assumes that group sizes are equal. The x-axis shows the percentage of cases in the larger group, beginning at 50% (equal group sizes) to the left and moving to larger discrepancies in group size as one moves right. It can be seen that the amount of underestimation is trivial when groups are similar in size, reaching 5% underestimation at around a 66:34 (roughly 2:1) discrepancy in group sizes. The magnitude of this underestimation increases rapidly after this point, becoming what I consider unacceptably large when group sizes reach 3:1 or 4:1 (i.e., when 75–80% of the sample is in one of the two groups). If this unequal distribution is to be expected (which might be determined by considering the magnitudes of group sizes in other studies reporting sample sizes by group), then it is probably preferable to use r as an index of effect size given that it is not influenced by the magnitude of this unequal distribution.
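The size of this underestimation can be checked numerically. The sketch below assumes the standard conversion g = t√(n1 + n2)/√(n1·n2) and its equal-n shortcut g = 2t/√N; under those assumptions the ratio of the shortcut to the full formula is 2√(p(1 − p)), where p is the proportion of the sample in the larger group. The function names are hypothetical.

```python
import math

def g_from_t(t, n1, n2):
    """g from an independent-samples t when both group sizes are known."""
    return t * math.sqrt(n1 + n2) / math.sqrt(n1 * n2)

def g_from_t_equal_n(t, n_total):
    """Shortcut that assumes the two groups are the same size."""
    return 2 * t / math.sqrt(n_total)

def percent_underestimation(prop_larger):
    """Percent by which the equal-n shortcut underestimates g.

    prop_larger is the proportion of the sample in the larger group.
    """
    return 100 * (1 - 2 * math.sqrt(prop_larger * (1 - prop_larger)))

for prop in (0.50, 0.66, 0.75, 0.80):
    pct = percent_underestimation(prop)
    print(f"{prop:.0%} in larger group: {pct:.1f}% underestimation")
```

Under these assumptions the underestimation is roughly 5% at a 66:34 split, 13% at 75:25, and 20% at 80:20, which agrees with the pattern described for Figure 5.1.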

As expected given the parallel between independent sample t-tests and two-group ANOVAs, it is also possible to compute standardized mean differences from F-ratios with 1 df in the numerator using the following formulas:

$$ g = \sqrt{\frac{F\,(n_1 + n_2)}{n_1 n_2}} = 2\sqrt{\frac{F}{N}} \tag{5.21} $$

Because F-ratios are always positive, it is important that you carefully consider the direction of group differences and take the positive or negative square root of Equation 5.21, depending on whether Group 1 or 2 (respectively) has the higher mean.

Although computing r was equivalent whether the results were from independent (between-group) or dependent (repeated-measures) results, this is not the case when computing g. Therefore, it is critically important that you be sure whether the reported t or F values are from independent or dependent tests. If the results are from dependent, or repeated-measures, tests, the following equations should be used:

$$ g = \frac{t_{dependent}}{\sqrt{N}} = \sqrt{\frac{F_{repeated\text{-}measures}}{N}} \tag{5.22} $$

Unlike Equations 5.20 and 5.21 for the independent sample situation, in which there were separate formulas for unequal and equal group sizes, the dependent (repeated-measures) situation to which Equation 5.22 applies contains only an overall N, the size of the sample over time (or other type of repeated measures). It also merits mention that the same t or F values yield a standardized mean difference that is twice as large in the independent sample (between-groups) situation as in the dependent (repeated-measures) situation, so a mistake in using the wrong formulas would have a dramatic impact on the computed standardized mean effect size.
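The factor-of-two difference between the two designs can be seen in a short sketch. It assumes the conversions g = √(F(n1 + n2)/(n1·n2)) for independent groups and g = t/√N = √(F/N) for dependent tests; the function names and the example values are hypothetical.

```python
import math

def g_from_f_independent(f, n1, n2, group1_higher=True):
    """g from a 1-df between-groups F; the sign comes from the direction of means."""
    magnitude = math.sqrt(f * (n1 + n2) / (n1 * n2))
    return magnitude if group1_higher else -magnitude

def g_from_t_dependent(t, n):
    """g from a dependent (repeated-measures) t with overall sample size n."""
    return t / math.sqrt(n)

def g_from_f_dependent(f, n, first_higher=True):
    """g from a 1-df repeated-measures F; the sign must be supplied."""
    magnitude = math.sqrt(f / n)
    return magnitude if first_higher else -magnitude

# Same test statistic (t = 3, so F = 9) and N = 36 under both designs.
independent = g_from_f_independent(9.0, 18, 18)
dependent = g_from_t_dependent(3.0, 36)
print(independent, dependent)
```

Because F carries no sign, both F-based helpers take an explicit direction argument rather than guessing which group or occasion had the higher mean.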

2.2. Dichotomous Dependent Variables

Primary studies might also compare two groups on the percentage or proportion of participants scoring affirmative on a dichotomous variable. This may come about either because the primary study authors artificially dichotomized the variable or because the variable truly is dichotomous. If the latter case is consistent across all studies, then you might instead choose the odds ratio as a preferred index of effect size (i.e., associations between a dichotomous grouping variable and dichotomous measure). However, there are also instances in which the standardized mean difference is appropriate in this situation, such as when the primary study authors artificially dichotomized the variable on which the groups are compared (in which case corrections for this artificial dichotomization might be considered; see Chapter 6), or when you wish to consider the dichotomous variable of the study in relation to a continuous variable of other studies (in which case it might be useful to consider moderation across studies using continuous versus dichotomous variables; see Chapter 10).

When two groups are compared on a dichotomous or dichotomized variable, you need to identify the 1 df χ² and direction of effect (i.e., which group has a higher percentage or proportion). This information might be reported directly, or you may need to construct a 2 × 2 contingency table from reported results. For instance, a primary study might report that 50% of Group 1 had the dichotomous characteristic, whereas 30% of Group 2 had this characteristic; you would use this information (and sample size) to compute a contingency table and χ². You then convert this χ² to g using the following equation:

$$ g = 2\sqrt{\frac{\chi^2}{N - \chi^2}} $$

As with computing g from the F-ratio, it is critical that you take the correct positive or negative square root. The positive square root is taken if Group 1 has the higher percentage or proportion with the dichotomous characteristic, whereas the negative square root is taken if Group 2 more commonly has the characteristic.
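The 50% versus 30% example can be worked through in a short sketch. It assumes hypothetical group sizes of 100 each (the text gives only percentages), the uncorrected Pearson χ² for a 2 × 2 table, and the conversion g = 2√(χ²/(N − χ²)); the function names are hypothetical.

```python
import math

def chi2_from_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for the table
    [[a, b], [c, d]], where rows are groups and columns are yes/no."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def g_from_chi2(chi2, n_total, group1_higher=True):
    """g from a 1-df chi-square, signed by the direction of the effect."""
    magnitude = 2 * math.sqrt(chi2 / (n_total - chi2))
    return magnitude if group1_higher else -magnitude

# Assumed group sizes of 100 each: 50 of Group 1 and 30 of Group 2
# have the characteristic.
chi2 = chi2_from_2x2(50, 50, 30, 70)
g = g_from_chi2(chi2, n_total=200, group1_higher=True)
print(round(chi2, 2), round(g, 2))
```

With these assumed group sizes, χ² is about 8.33 and g is about 0.42; the sign is positive because Group 1 has the higher percentage.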

3. From Probability Levels of Significance Tests

The practices of computing g from exact significance levels, ranges of significance (e.g., p < .05), and reports that a difference was not significant follow the practices of computing r described earlier. Specifically, you determine the Z for the exact probability (e.g., Z = 2.14 from p = .032), the lower-bound Z when a result is reported significant at a certain level (e.g., Z = 1.96 for p < .05), or the maximum Z when a result is reported as not significant (e.g., maximum Z = ±1.96, although the conservative choice in this option is to assume g = 0), as described in Section 5.2.4. You then use the following equation to estimate the standardized mean difference from this Z:

$$ g = \frac{2Z}{\sqrt{N}} $$
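These steps can be sketched in Python using only the standard library. The sketch assumes two-tailed probabilities and the conversion g = 2Z/√N, which in turn assumes roughly equal group sizes; the function names and the sample size of 100 are hypothetical.

```python
import math
from statistics import NormalDist

def z_from_two_tailed_p(p):
    """Standard normal deviate corresponding to a two-tailed probability."""
    return NormalDist().inv_cdf(1 - p / 2)

def g_from_z(z, n_total):
    """g from Z, assuming two groups of roughly equal size."""
    return 2 * z / math.sqrt(n_total)

z_exact = z_from_two_tailed_p(0.032)  # exact p reported
z_bound = z_from_two_tailed_p(0.05)   # lower bound when only p < .05 is reported
print(round(z_exact, 2), round(z_bound, 2))
print(round(g_from_z(z_bound, n_total=100), 3))
```

The sign of g still has to be set from the reported direction of the group difference, since a probability level alone does not indicate which group scored higher.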

4. From Omnibus Test Results

Although you might consider the grouping variable to most appropriately consist of two levels (i.e., two groups), there is no assurance that primary study authors have all reached the same conclusion. Instead, the groups of interest may be subdivided within primary studies, resulting in omnibus comparisons among three or more groups. For example, you might be interested in comparing aggressive versus nonaggressive children, but a primary study might further subdivide aggressive children into those who are aggressive rarely versus frequently. Another example might be if you wish to compare a certain type of psychotherapy versus control, but the primary study reports results for three groups: control, treatment by graduate students, and treatment by doctoral-level practitioners. Studies might also report omnibus tests involving groups that are not of interest to a particular meta-analysis. For instance, a meta-analysis comparing psychotherapy versus control might include a study reporting outcomes for three different groups: control, psychotherapy, and medication. In each of these cases, it is necessary to reorganize the results of the study to fit the two-group comparison of interest. Next, I describe ways of doing so from reported descriptive statistics and F-ratios with df > 2.

4.1. From Descriptive Statistics

The simplest case is when studies report sample sizes, means, and standard deviations from three or more groups. Here, you can either select specific groups or else aggregate groups to derive data from the two groups of interest.

When you are interested in only two groups from among those reported in a study (e.g., interested in control and psychotherapy from a study reporting control, psychotherapy, and medication), it is straightforward to use the reported means and standard deviations from those two groups to compute g (using Equation 5.5). When doing so, it is important to code the sample size, N (used to compute the standard error of the effect size for subsequent weighting), as the combined sample sizes from the two groups of interest (N = n1 + n2) rather than the total sample size from the study.

When the primary study has subdivided one or both groups of interest, you must combine data from these subgroups to compute descriptive data for the groups of interest. For example, when comparing psychotherapy versus control from a study reporting data from two psychotherapy groups (e.g., those being treated by graduate students versus doctoral-level practitioners), you would need to combine data from these two psychotherapy groups before computing g. Combining the subgroup sample sizes is straightforward, ngroup = nsubgroup1 + nsubgroup2. The group mean is also reasonably straightforward, as it is computed as the weighted (by subgroup size) average of the subgroup means:

$$ M_{group} = \frac{n_{subgroup1}\,M_{subgroup1} + n_{subgroup2}\,M_{subgroup2}}{n_{subgroup1} + n_{subgroup2}} $$

The combined group standard deviation is somewhat more complex to obtain in that it consists of two components: (1) variance within each of the groups you wish to combine and (2) variance between the groups you wish to combine. Therefore, to obtain a combined group standard deviation, you must compute sums of squared deviations (SSs) within and between groups. The SSwithin is computed for each group g as $s_g^2 (n_g - 1)$, and then these are summed across groups. The SSbetween is computed as $\sum_g [(M_g - GM)^2 n_g]$ (where GM is the grand mean of these two groups), summed over groups. You add the two SSs (i.e., SSwithin and SSbetween) to produce the total sum of squared deviations, SStotal. The (population estimate of the) variance for this combined group is then computed as SStotal / (ncombined – 1), and the (population estimate of the) standard deviation (scombined) is the square root of this variance.

Combining more than two subgroups to form a group of interest is straightforward (e.g., averaging three or more groups). It may also be necessary to combine subgroups to form both groups of interest (e.g., multiple treatment and multiple control groups). Once you obtain the descriptive data (sample sizes, means, and standard deviations) for the groups of interest, you use Equation 5.5 to compute g.
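The subgroup-combination steps described in this section can be sketched as follows. The sketch follows the SSwithin and SSbetween definitions given in the text; the function name and the two example subgroups are hypothetical.

```python
import math

def combine_subgroups(subgroups):
    """Combine subgroups into one group.

    subgroups: list of (n, mean, sd) tuples.
    Returns the combined (n, mean, sd).
    """
    n_combined = sum(n for n, _, _ in subgroups)
    # Weighted (by subgroup size) average of the subgroup means.
    grand_mean = sum(n * m for n, m, _ in subgroups) / n_combined
    # Variability within each subgroup, summed across subgroups.
    ss_within = sum(sd ** 2 * (n - 1) for n, _, sd in subgroups)
    # Variability of the subgroup means around the combined mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in subgroups)
    sd_combined = math.sqrt((ss_within + ss_between) / (n_combined - 1))
    return n_combined, grand_mean, sd_combined

# Hypothetical psychotherapy subgroups: (n, mean, sd).
n, mean, sd = combine_subgroups([(20, 10.0, 2.0), (30, 12.0, 3.0)])
print(n, round(mean, 2), round(sd, 3))
```

Because the function accepts a list, the same call handles three or more subgroups; it would be applied once to each side of the comparison before computing g.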

4.2. From df > 2 F-Ratio

As when computing r, it is also possible to compute g from omnibus ANOVAs if the primary study reports the F-ratio and means from each group. Here, you follow the general procedures of selecting or aggregating subgroups as described in the previous section. Doing this for the sample sizes and reported means is identical to the approach described in Section 5.3.4.a. The only difference in this situation is that you must infer the within-group standard deviations from the results of the ANOVA, as these are not reported (if they are, then you can simply use the procedures described in the previous subsection).

Because omnibus ANOVAs typically assume equal variance across groups, this search is in fact for one standard deviation common across groups (equivalent to a pooled standard deviation). The MSwithin of the ANOVA represents this common group variance, so the square root of this MSwithin represents the standard deviation of groups, which is then used as described in the previous subsection (i.e., you must still combine the SSs within and between groups to be combined). If the primary study reports an ANOVA table, you can readily find this MSwithin within the table. If the primary study does not report this MSwithin, it is possible to compute this value from the reported means and F-ratio. As described earlier, you first compute the omnibus $MS_{between\text{-}omnibus} = \sum_g n_g (M_g - GM)^2 / (G - 1)$ across all G groups comprising the reported ANOVA (i.e., this MSbetween-omnibus represents the amount of variance between all groups in the omnibus comparison), and then compute MSwithin = MSbetween-omnibus / F. As mentioned, you then take the square root of MSwithin to obtain spooled, used in computing the SSwithin of the groups to be combined. It is important to note that you should add the SSbetween of just the groups to be combined in computing the SStotal for the combined groups, which is used to estimate the combined group standard deviation.
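A minimal sketch of recovering the common standard deviation from an omnibus F follows. It assumes the usual one-way ANOVA definition in which each squared deviation of a group mean from the grand mean is weighted by that group's size; the function name and the three-group example are hypothetical.

```python
import math

def pooled_sd_from_omnibus_f(f, group_ns, group_means):
    """Infer the common within-group SD from an omnibus F and the group means."""
    n_total = sum(group_ns)
    grand_mean = sum(n * m for n, m in zip(group_ns, group_means)) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2
                     for n, m in zip(group_ns, group_means))
    ms_between = ss_between / (len(group_ns) - 1)
    ms_within = ms_between / f  # because F = MSbetween / MSwithin
    return math.sqrt(ms_within)

# Hypothetical three-group study reporting F = 8.0.
s_pooled = pooled_sd_from_omnibus_f(8.0, [20, 20, 20], [10.0, 12.0, 14.0])
print(round(s_pooled, 3))
```

The resulting value would stand in for each subgroup's standard deviation when computing SSwithin for the groups to be combined.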

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

Computing o from Commonly Reported Results

When you are interested in computing the odds ratio (o, sometimes denoted by OR), or the association between two dichotomous variables, the range of typically reported data is usually more limited than that described in the previous two sections. In this section, I describe computing an odds ratio from three common situations: studies reporting descriptive data such as proportions or percentages in two groups, inferential tests (i.e., χ² statistic) from 2 × 2 contingency tables, and studies reporting only the significance of such a test. I also describe the less common situations of deriving odds ratios from research reports involving larger (i.e., df > 1) contingency tables or those analyzing continuous variables.

1. From Descriptive Data

The most straightforward way of computing o is by constructing a 2 × 2 contingency table from descriptive data reported in primary studies. Many studies will report the actual cell frequencies, making it simple to construct this table. Many studies will alternatively report an overall sample size, the sample sizes of groups from one of the two variables, and some form of prevalence of the second variable by these two groups. For example, a study might report that 50 out of 300 children are aggressive and that 40% of the aggressive children are rejected, whereas 10% of the nonaggressive children are rejected. This information could be used to identify the number of nonaggressive nonrejected children, n00 = (300 – 50)(1 – 0.10) = 225; the number of nonaggressive rejected children, n01 = (300 – 50)(0.10) = 25; the number of aggressive nonrejected children, n10 = (50)(1 – 0.40) = 30; and the number of aggressive rejected children, n11 = (50)(0.40) = 20.

After constructing this 2 × 2 contingency table, you can simply compute o from this information using Equation 5.11, which I reproduce as follows:

$$ o = \frac{n_{00}\,n_{11}}{n_{01}\,n_{10}} \tag{5.11} $$

For example, given the cell frequencies of aggression and rejection described above, you could compute o = (225*20)/(25*30) = 6.0.

Special consideration is needed if one or more cells of this contingency table are 0. In this situation, it is advisable to add 0.5 to each of the cell frequencies (Fleiss, 1994). This solution tends to produce a downward bias in estimating o (Lipsey & Wilson, 2001, p. 54). Although the impact of having a small number of studies for which this is the case is likely negligible, this bias is problematic if many studies in a meta-analysis have small sample sizes (and 0 frequency cells). Meta-analysts for whom this is the case should consult Fleiss (1994) for alternative methods of analysis.
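The table construction and odds ratio computation described in this section can be sketched as follows, using the aggression and rejection example from the text. The function name and the table containing an empty cell are hypothetical; the 0.5 adjustment is applied only when at least one cell is 0.

```python
def odds_ratio(n00, n01, n10, n11):
    """Odds ratio from 2 x 2 cell frequencies.

    Adds 0.5 to every cell if any cell frequency is 0.
    """
    cells = (n00, n01, n10, n11)
    if any(cell == 0 for cell in cells):
        n00, n01, n10, n11 = (cell + 0.5 for cell in cells)
    return (n00 * n11) / (n01 * n10)

# Cell frequencies from the aggression and rejection example.
n_total, n_aggressive = 300, 50
n00 = (n_total - n_aggressive) * (1 - 0.10)  # nonaggressive, nonrejected
n01 = (n_total - n_aggressive) * 0.10        # nonaggressive, rejected
n10 = n_aggressive * (1 - 0.40)              # aggressive, nonrejected
n11 = n_aggressive * 0.40                    # aggressive, rejected
o = odds_ratio(n00, n01, n10, n11)

# Hypothetical table with an empty cell, triggering the 0.5 adjustment.
o_adjusted = odds_ratio(10, 0, 5, 5)
print(round(o, 2), round(o_adjusted, 2))
```

In practice the reconstructed frequencies would usually be rounded to whole numbers, since percentages reported in primary studies are themselves rounded.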

2. From Inferential Tests

Instead of fully reporting the contingency table (or descriptive data sufficient to reconstruct it), some studies might report a test of significance of this contingency, the χ² statistic. In this situation, it is important to ensure that the reported value is from a 1 df χ², meaning that it is from a 2 × 2 contingency table (see Section 5.4.4 for use of larger contingency tables). The χ² statistic by itself is not sufficient to compute o, however; it is also necessary to know the sample size and marginal proportions of this contingency. As described by Lipsey and Wilson (2001, pp. 197–198), values of the χ² statistic, overall sample size (N), and marginal proportions (p0• and p1• for the row, or variable 1, marginal proportions; p•0 and p•1 for the column, or variable 2, marginal proportions) allow you to identify the cell frequencies of a 2 × 2 contingency table. Specifically, you compute the frequency of the first cell using the following equation:

$$ n_{00} = N\left(p_{0\cdot}\,p_{\cdot 0} \pm \sqrt{\frac{p_{0\cdot}\,p_{1\cdot}\,p_{\cdot 0}\,p_{\cdot 1}\,\chi^2}{N}}\right) $$

It is important to use the correct positive or negative square root given the presence of a positive or negative (respectively) association between the two dichotomous variables.

Then you compute the remaining cells of the 2 × 2 contingency table using the following: n01 = p0•N – n00; n10 = p•0N – n00; and n11 = N – n00 – n01 – n10. You then use this contingency table to compute o as described in the previous section (i.e., Equation 5.11).
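The reconstruction of a 2 × 2 table from χ² can be sketched as a round trip. The sketch assumes the first cell is obtained as n00 = N(p0•·p•0 ± √(p0•·p1•·p•0·p•1·χ²/N)), which follows from the relation between χ² and the phi coefficient; the function name is hypothetical, and the example reuses the aggression and rejection frequencies (225, 25, 30, 20).

```python
import math

def cells_from_chi2(chi2, n, p_row0, p_col0, positive=True):
    """Recover 2 x 2 cell frequencies from chi-square, N, and marginal proportions.

    p_row0 and p_col0 are the marginal proportions in the first row and
    first column; positive selects the sign of the square root.
    """
    p_row1, p_col1 = 1 - p_row0, 1 - p_col0
    root = math.sqrt(p_row0 * p_row1 * p_col0 * p_col1 * chi2 / n)
    n00 = n * (p_row0 * p_col0 + (root if positive else -root))
    n01 = p_row0 * n - n00
    n10 = p_col0 * n - n00
    n11 = n - n00 - n01 - n10
    return n00, n01, n10, n11

# Chi-square computed from the known table, then used to recover that table.
chi2 = 300 * (225 * 20 - 25 * 30) ** 2 / (250 * 50 * 255 * 45)
n00, n01, n10, n11 = cells_from_chi2(chi2, 300, p_row0=250 / 300, p_col0=255 / 300)
o = (n00 * n11) / (n01 * n10)
print([round(x, 1) for x in (n00, n01, n10, n11)], round(o, 2))
```

Choosing the negative root instead would return the table implied by an association of the same strength in the opposite direction, which is why the direction of the association must be known.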

3. From Probability Levels of Significance Tests

Given the possibility of computing o from values of χ² (along with N and marginal proportions), it follows that you can compute o from levels of statistical significance of 2 × 2 contingency analyses. Given an exact significance level (p) and sample size (N), you can identify the corresponding χ² by either consulting a table of χ² values (at 1 df) or using a simple computer program like Excel (“chiinv” function).
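As an alternative to a printed table or spreadsheet function, the lookup can be sketched with the Python standard library. The sketch relies on the fact that a χ² variable with 1 df is the square of a standard normal deviate; the function name is hypothetical.

```python
from statistics import NormalDist

def chi2_1df_from_p(p):
    """Chi-square value (1 df) whose upper-tail probability is p."""
    z = NormalDist().inv_cdf(1 - p / 2)
    return z * z

print(round(chi2_1df_from_p(0.05), 2))
```

This shortcut applies only at 1 df; larger contingency tables would need a general χ² quantile function.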

Similarly, you can use a range of significance (e.g., p < .05) and sample size to compute a lower-bound value of χ² (i.e., assuming p = .05) and corresponding o. Given only a reported nonsignificant 2 × 2 contingency, you could compute the minimum (i.e., < 1.0) and maximum (i.e., > 1.0) values of o from the value of χ² at the type I error rate (e.g., p = .05), but a more conservative approach would be to assume o = 1 (null value for o). In both of these situations, however, it would be preferable to request more information (o or a contingency table) from the primary study authors.

4. From Omnibus Results

Some primary studies might report more than two levels of one or both variables that you consider dichotomous. For example, if you are considering associations between dichotomous aggression and dichotomous rejection statuses, you might encounter a primary study presenting results within a 3 (nonaggressive, somewhat aggressive, frequently aggressive) × 3 (nonrejected, modestly rejected, highly rejected) contingency table.

If these larger contingency tables are common among primary studies, this might be cause for you to reconsider whether the variables of interest are truly dichotomous. However, if you are convinced that dichotomous representations of both variables are best, then the challenge becomes one of deciding which of the distinctions made in the primary study are important or real and which are artificial. Given the example of the 3 × 3 aggression by rejection table, I might decide that the distinction between frequent aggression versus other levels (never and sometimes) is important, and that the distinction between nonrejected and the other levels (modestly and highly rejected) is important.

After deciding which distinctions are important and which are not, you then simply sum the frequencies within collapsed groups. Given the aggression and rejection example, I would combine frequencies of the never-aggressive nonrejected and the sometimes-aggressive nonrejected children into one group (n00); combine the frequencies of never-aggressive modestly rejected, sometimes-aggressive modestly rejected, never-aggressive highly rejected, and sometimes-aggressive highly rejected children into another group (n01); and so on. You could then use this reduced table to compute o as described above (Section 5.4.1).
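Collapsing a larger table can be sketched as follows. The 3 × 3 frequencies are invented for illustration, and the function name is hypothetical; the grouping mirrors the decision described above (frequent aggression versus other levels, nonrejected versus other levels).

```python
def collapse_table(table, row_groups, col_groups):
    """Sum cell frequencies within the chosen groups of rows and columns."""
    return [
        [sum(table[r][c] for r in rows for c in cols) for cols in col_groups]
        for rows in row_groups
    ]

# Hypothetical frequencies. Rows: never, sometimes, frequently aggressive.
# Columns: nonrejected, modestly rejected, highly rejected.
table_3x3 = [
    [120, 30, 10],
    [60, 25, 15],
    [10, 12, 18],
]

collapsed = collapse_table(
    table_3x3,
    row_groups=[(0, 1), (2,)],  # never + sometimes vs. frequently aggressive
    col_groups=[(0,), (1, 2)],  # nonrejected vs. modestly + highly rejected
)
(n00, n01), (n10, n11) = collapsed
o = (n00 * n11) / (n01 * n10)
print(collapsed, round(o, 2))
```

Changing the index tuples in row_groups and col_groups is all that is needed to try a different set of distinctions.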

5. From Results Involving Continuous Variables

If you find that many studies represent one of the variables under consideration as continuous, it is important to reconsider whether your conceptualization of dichotomous variables is appropriate. Presumably the representation of variables in studies as continuous suggests that there is an underlying continuity of that variable, in which case you should not artificially dichotomize this continuum (even if many studies in the meta-analysis do). You would then use a standardized mean difference (e.g., g) to represent the association between the dichotomous and continuous variable.

If you are convinced that the association of interest is between two truly dichotomous variables and that a primary study was simply misinformed in analyzing a variable as continuous, then an approximate transformation can be made. You would first compute g from this study, and then estimate $o = e^{\pi g / \sqrt{3}}$. This equation is derived from the logit method of transforming log odds ratios to standardized mean differences (Haddock et al., 1998; Hasselblad & Hedges, 1995; for a comparison of this and other methods of transforming to standardized mean difference, see Sanchez-Meca, Marin-Martinez, & Chacon-Moscoso, 2003) and is not typically used to transform g to o. Again, I stress that the first consideration if you encounter continuous representations of dichotomies in primary studies is to rethink your decision to conceptualize a variable as dichotomous.


Comparisons among r, g, and o

I have emphasized the importance of basing the decision to rely on r, g, or o on conceptualizations of the association involving two continuous variables, a dichotomous and a continuous variable, or two dichotomous variables, respectively. At the same time, it can be useful to understand that you can compute values of one effect size from values of another effect size. For example, r and g can be computed from one another using the following formulas (where df = n1 + n2 – 2):

$$ r = \frac{g\sqrt{n_1 n_2}}{\sqrt{g^2 n_1 n_2 + (n_1 + n_2)\,df}} $$

$$ g = \frac{r}{\sqrt{1 - r^2}}\sqrt{\frac{df\,(n_1 + n_2)}{n_1 n_2}} $$

Similarly, g and o can be computed from one another using the following equations:

$$ g = \frac{\sqrt{3}\,\ln(o)}{\pi} \qquad o = e^{\pi g / \sqrt{3}} $$

Finally, you can transform from o to r by reconstructing the contingency table (if sufficient information is provided), through intermediate transformations to g, or through one of several approximations of the tetrachoric correlation (see Bonett, 2007). An intermediate transformation to g or algebraic rearrangement of the tetrachoric correlation approximations also allows you to transform from r to o.
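These conversions can be sketched in Python. The sketch makes two assumptions that should be checked against the original equations: the r and g conversions are obtained by passing through the independent-samples t (r = t/√(t² + df) and g = t√(n1 + n2)/√(n1·n2), with df = n1 + n2 − 2), and the g and o conversions use the logit method (ln o = πg/√3). All function names are hypothetical.

```python
import math

def g_to_r(g, n1, n2):
    """Convert g to r via the independent-samples t (assumed route)."""
    df = n1 + n2 - 2
    return g * math.sqrt(n1 * n2) / math.sqrt(g ** 2 * n1 * n2 + (n1 + n2) * df)

def r_to_g(r, n1, n2):
    """Convert r to g via the independent-samples t (assumed route)."""
    df = n1 + n2 - 2
    return r / math.sqrt(1 - r ** 2) * math.sqrt(df * (n1 + n2) / (n1 * n2))

def g_to_o(g):
    """Convert g to an odds ratio using the logit method."""
    return math.exp(math.pi * g / math.sqrt(3))

def o_to_g(o):
    """Convert an odds ratio to g using the logit method."""
    return math.sqrt(3) * math.log(o) / math.pi

# o to r through an intermediate transformation to g (hypothetical n of 50 per group).
r_from_o = g_to_r(o_to_g(6.0), 50, 50)
print(round(o_to_g(6.0), 3), round(r_from_o, 3))
```

Each pair of functions is an exact inverse of the other, so a round trip returns the starting value; this says nothing about whether the converted index is conceptually appropriate for the variables involved.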

This mathematical interchangeability among effect sizes has led to arguments that one type of effect size is preferable to another. For example, Rosenthal (1991) has expressed preference for r over d (and presumably other standardized mean differences, including g) based on four features. First, comparisons of Equations 5.13 and 5.14 for r versus 5.20 and 5.21 for g reveal that it is possible to compute r accurately from only the inferential test value and degrees of freedom, whereas computing g requires knowing the group sample sizes or else approximating this value by assuming that the group sizes are equal. To the extent that primary studies do not report group sizes and it is reasonable to expect marked differences in group sizes, r is preferable to d. A second, smaller, argument for preferring r to g is that you use the same equations to compute r from independent sample versus repeated-measures inferential tests, whereas different formulas are necessary when computing g from these tests (see Equations 5.20 and 5.21 vs. 5.22). This should not pose too much difficulty for the competent meta-analyst, but consideration of simplicity is not trivial. A third advantage of r over standardized mean differences, according to Rosenthal (1991), is in ease of interpretation. Whether r or standardized mean differences (e.g., g) are more intuitive to readers is debatable and currently is a matter of opinion rather than careful study. It probably is the case that most scientists have more exposure to r than to g or d, but this does not mean that they cannot readily grasp the meaning of the standardized mean difference. The final, and perhaps most convincing, argument for Rosenthal’s (1991) preference is that r can be used whenever d can (e.g., in describing an association between a dichotomous variable and a continuous variable), but it makes less sense to use g in many situations where r could be used (e.g., in describing an association between two continuous variables).

Arguments have also been put forth for preferring o to standardized mean differences (g or d) or r when both variables are truly dichotomous. The magnitudes of r (typically denoted with φ) or standardized mean differences (g or d) that you can compute from a 2 × 2 contingency table depend on the marginal frequencies of the dichotomies. This dependence leads to attenuated effect sizes as well as extraneous heterogeneity among studies when these effect size indices are used with dichotomous data (Fleiss, 1994; Haddock et al., 1998). This limitation is not present for o, leading many to argue that it is the preferred effect size to index associations between dichotomous data.

I do not believe that any type of effect size index (i.e., r, g, or o) is inherently preferable to another. What is far more important is that you select the effect size that matches your conceptualization of the variables under consideration. Linear associations between two variables that are naturally continuous should be represented with r. Associations between a dichotomous variable (e.g., group) and a continuous variable can be represented with a standardized mean difference (e.g., g) or r, with a standardized mean difference probably more naturally representing this type of association. Associations between two natural dichotomies are best represented with o.

If you wish to compare multiple levels of variables in the same meta-analysis, I recommend using the effect size index representing the more continuous nature for both. For example, associations of a continuous variable (e.g., aggression) with a set of correlates that includes a mixture of continuous and dichotomous variables (e.g., a continuous rejection variable and a dichotomous variable of being classified as rejected) could be well represented with the correlation coefficient, r (Rosenthal, 1991). Similarly, associations of a dichotomous variable (e.g., biological sex) with a set containing a combination of continuous (e.g., rejection) and dichotomous (e.g., rejection classification) variables could be represented with a standardized mean difference such as g (Sanchez-Meca et al., 2003). In both cases, it would be important to evaluate moderation by the type (i.e., continuous versus dichotomous) of correlate.


Practical Matters: Using Effect Size Calculators and Meta-Analysis Programs

As I described in Chapter 1, several computer programs are designed to aid in meta-analysis, some of which are available for free and others for purchase. All meta-analytic programs perform two major steps: effect size calculation and effect size combination. Effect size combination (as well as comparison) is the process of aggregating results across studies, the topic of Chapters 8-10 later in this book. Effect size calculation is the process of taking results from each study and converting these into a common effect size, the focus of this chapter.

Relying on an effect size calculator found in meta-analysis programs to compute effect sizes (as well as to combine results across studies) can be a time-saving tool. However, I discourage beginning meta-analysts from relying on them. All of the calculations described in this chapter can be performed with a simple hand calculator or spreadsheet program (e.g., Excel), and the meta-analytic combination and comparison I describe later in this book can be performed using these spreadsheets or simple statistical analysis software (e.g., SAS or SPSS). In other words, I see little need for specific software when conducting a meta-analysis.

Having said both that these programs can save time but that I recommend not using them initially, you may wonder if I think that you have too much time on your hands. I do not. Instead, my concern is that these programs make it easy for beginning meta-analysts who are less familiar with the calculations to make mistakes. The value of struggling with the equations in this chapter is that doing so forces you to think about what the values mean and where to find them within the research report. The danger of using an effect size calculator is of mindless use, in which users put in whatever values they can find in the report that look similar to what the program asks for.

At the same time, I do not entirely discourage the use of these meta-analysis programs. They can be of great use in reducing the burden of tedious calculations after you understand these calculations. In other words, if you are just beginning to perform meta-analyses, I encourage you to compute some effect sizes by hand (i.e., using a calculator or spreadsheet program) as well as using one of these programs. Inconsistencies should alert you that either your hand calculations are inaccurate or that you are not providing the correct information to the program (or that the program is inaccurate, though this should be uncommon with the more commonly used programs). After you have confirmed that you obtain identical results by hand and the program, then you can decide if using the program is worthwhile. I offer this same advice when combining effect sizes, which I discuss later in this book.


The Controversy of Correction to Effect Sizes in Meta-Analysis

There is some controversy about correcting effect sizes used in meta-analyses for methodological artifacts. In this section I describe arguments for and against correction, and then attempt to reconcile these two positions.

1. Arguments for Artifact Correction

Probably the most consistent advocates of correcting for study artifacts are John Hunter (now deceased) and Frank Schmidt (see Hunter & Schmidt, 2004; Schmidt & Hunter, 1996; as well as, e.g., Rubin, 1990). Their argu­ment, in a simplified form, is that individual primary studies report effect sizes among imperfect measures of constructs, not the constructs themselves. These imperfections in the measurement of constructs can be due to a variety of sources including unreliability of the measures, imperfect validity of the measures, or imperfect ways in which the variables were managed in primary studies (e.g., artificial dichotomization). Moreover, individual studies contain not only random sampling error (due to their finite sample sizes), but often biased samples that do not represent the population about which you wish to draw conclusions.

These imperfections of measurement and sampling are inherent to every primary study and provide a limiting frame within which you must inter­pret the findings. For instance, a particular study does not provide a perfect effect size of the association between X and Y, but rather an effect size of the association between a particular measure of X with a particular measure of Y within the particular sample of the study. The heart of the argument for artifact correction is that we are less interested in these imperfect effect sizes found in primary studies and more interested in the effect sizes between latent constructs (e.g., the correlation between construct X and construct Y).

The argument seems reasonable and in fact provides much of the impetus for the rise of such latent variable techniques as confirmatory factor analysis (e.g., Brown, 2006) and structural equation modeling (e.g., Kline, 2005) in primary research. The theories that we wish to evaluate are almost exclusively about associations among constructs (e.g., aggression and rejection), rather than about associations among measures (e.g., a particular self-report scale of aggression and a particular peer-report method of measuring rejection). As such, it makes sense that we would wish to draw conclusions from our meta-analyses about associations among constructs rather than associations among imperfect measures of these constructs reported in primary studies; thus, we should correct for artifacts within these studies in our meta-analyses.

A corollary to the focus on associations among constructs (rather than imperfect measures) is that artifact correction results in the variability among studies being more likely due to substantively interesting differences rather than methodological differences. For example, studies may differ due to a variety of features, with some of these differences being substantively inter­esting (e.g., characteristics of the sample such as age or income, type of inter­vention evaluated) and others being less so (e.g., the use of a reliable versus unreliable measure of a variable). Correction for these study artifacts (e.g., unreliability of measures) reduces this variability due to likely less interest­ing differences (i.e., noise), thus allowing for clearer illumination of differ­ences between studies that are substantively interesting through moderator analyses (Chapter 9).

2. Arguments against Artifact Correction

Despite the apparent logic supporting artifact correction in meta-analysis, there are some who argue against these corrections. Early descriptions of meta-analysis described the goal of these efforts as integrating the findings of individual studies (e.g., Glass, 1976); in other words, the synthesis of results as reported in primary studies. Although one might argue that these early descriptions simply failed to appreciate the difference between the associations between measures and constructs (although this seems unlikely given the expertise Glass had in measurement and factor analysis), some modern meta-analysts have continued to oppose artifact adjustment even after the arguments put forth by Hunter and Schmidt. Perhaps most pointedly, Rosenthal (1991) argues that the goal of meta-analysis “is to teach us better what is, not what might some day be in the best of all possible worlds” (p. 25, italics in original). Rosenthal (1991) also cautions that these corrections can yield inaccurate effect sizes, such as when corrections for unreliability yield correlations greater than 1.0.

Another, though far weaker, argument against artifact correction is sim­ply that such corrections add another level of complexity to our meta-analytic procedures. I agree that there is little value in making these procedures more complex than is necessary to best answer the substantive questions of the meta-analysis. Furthermore, additional data-analytic complexity often requires lengthier explanation when reporting meta-analyses, and our focus in most of these reports is typically to explain information relevant to our content-based questions rather than data-analytic procedures. At the same time, simplicity alone is not a good guide to our data-analytic techniques. The more important question is whether the cost of additional data-analytic complexity is offset by the improved value of the results yielded.

3. Reconciling Arguments Regarding Artifact Correction

Many of the critical issues surrounding the controversy of artifact correc­tion can be summarized in terms of whether meta-analysts prefer to describe associations among constructs (those for correction) or associations as found among variables in the research (those against correction). In most cases, the questions likely involve associations among latent constructs more so than associations among imperfectly measured variables. Even when questions involve measurement (e.g., are associations between X and Y stronger when X is measured in certain ways than when X is measured in other ways?), it seems likely that one would wish to base this answer on the differences in associations among constructs between the two measurement approaches rather than the magnitudes of imperfections that are common for these mea­surement approaches. Put bluntly, Hunter and Schmidt (2004) argue that attempting to meta-analytically draw conclusions about constructs without correcting for artifacts “is the mathematical equivalent of the ostrich with its head in the sand: It is a pretense that if we ignore other artifacts then their effects on study outcomes will go away” (p. 81). Thus, if you wish to draw conclusions about constructs, which is usually the case, it would appear that correcting for study artifacts is generally valuable.

At the same time, one must consider the likely impact of artifacts on the results. If one is meta-analyzing a body of research that consistently uses reliable and valid measures within representative samples, then the benefits of artifact adjustment are likely small. In these cases, the additional complex­ity of artifact adjustment is likely not warranted. To adapt Rosenthal’s (1991) argument quoted earlier, if what is matches closely with what could be, then there is little value in correcting for study artifacts.

In sum, although I do not believe that all, or even any, artifact adjust­ments are necessary in every meta-analysis, I do believe it is valuable to always consider each of the artifacts that could bias effect sizes. In meta-analyses in which these artifacts are likely to have a substantial impact on at least some of the included primary studies, it is valuable to at least explore some of the following corrections.


Artifact Corrections to Consider in Meta-Analysis

Hunter and Schmidt (2004; see also Schmidt, Le, & Oh, 2009) suggest several corrections to methodological artifacts of primary studies. These corrections involve unreliability of measures, poor validity of measured variables, artificial dichotomization of continuous variables, and range restriction of variables. Next I describe the conceptual justification and computational details of each of these corrections. The computations of these artifact corrections are summarized in Table 6.1.

Before turning to these corrections, however, let us consider the general formula for all artifact corrections. The corrected effect size (e.g., r, g, o), which is the estimated effect size if there were no study artifacts, is the effect size observed in the study divided by the total artifact correction¹:

$$ES_{\text{adjusted}} = \frac{ES_{\text{observed}}}{a} \qquad (6.1)$$

Here, a is the total correction for all study artifacts and is simply the product of the individual artifacts described next (i.e., a = a_1 × a_2 × ..., for the first, second, etc., artifacts for which you wish to correct).² Each individual artifact (a_i) and the total product of all artifacts (a) have values that are 1.0 (no artifact bias) or less (with the possible exception of the correction for range restriction, as described below). The values of these artifacts decrease (and adjustments therefore increase) as the methodological limitations of the studies increase (i.e., larger problems, such as very low reliability, result in smaller values of a and larger corrections).

Artifact adjustments to effect sizes also require adjustments to standard errors. Because standard errors represent the imprecision in estimates of effect sizes, it makes conceptual sense that these would increase if you must make an additional estimate in the form of how much to correct the effect size. Specifically, the standard errors of effect sizes (e.g., r, g, or o; see Chapter 5) are also adjusted for artifact correction using the following general formula:

$$SE_{\text{adjusted}} = \frac{SE_{\text{observed}}}{a} \qquad (6.2)$$

The one exception to this equation is when one is correcting for range restriction. This correction represents an exception to the general rule of Equation 6.2 because the effect size is used in the computation of a, the artifact correction (see Equations 6.7 and 6.8). In this case of correcting for range restriction, you multiply a_range by ES_adjusted/ES_observed prior to correcting the standard error.
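The general rule can be sketched in a few lines of code. This is a minimal illustration rather than a definitive implementation: the function names and the example values (an observed effect size of .30 with a standard error of .05, and two artifacts of .90 and .80) are hypothetical, and the sketch does not handle the range restriction exception for standard errors described above.

```python
import math


def total_artifact(artifacts):
    """Total correction a: the product of the individual artifact values a_1, a_2, ..."""
    return math.prod(artifacts)


def correct_effect_size(es_observed, se_observed, artifacts):
    """Divide the observed effect size and its standard error by the total artifact a."""
    a = total_artifact(artifacts)
    return es_observed / a, se_observed / a, a


# Hypothetical study (values invented for illustration only)
es_adjusted, se_adjusted, a = correct_effect_size(0.30, 0.05, [0.90, 0.80])
print(f"a = {a:.3f}, adjusted ES = {es_adjusted:.3f}, adjusted SE = {se_adjusted:.4f}")
```

Because every artifact value here is below 1.0, both the effect size and its standard error grow after correction, which mirrors the point that corrected estimates are larger but less precise.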

1. Corrections for Unreliability

This correction is for unreliability of measurement of the variables comprising the effect sizes (e.g., variables X and Y that comprise a correlation). Unreliabil­ity refers to nonsystematic error in the measurement process (contrast with systematic error in measurement, or poor validity, described in Section 6.2.4).

Reliability, or the repeatability of a measure (or the part that is not unreliable), can be indexed in at least three ways. Most commonly, reliability is considered in terms of internal consistency, representing the repeatability of measurement across different items of a scale. This type of reliability is indexed as a function of the associations among items of a scale, most commonly through an index called Cronbach’s coefficient alpha, α (Cronbach, 1951; see, e.g., DeVellis, 2003). Second, reliability can be evaluated in terms of agreement between multiple raters or reporters. This interrater reliability can be evaluated with the correlation between sets of continuous scores produced by two raters (or average correlations among more than two raters) or with Cohen’s kappa (κ) representing agreement in categorical assignments between raters (for a full description of methods of assessing interrater reliability, see von Eye & Mun, 2005). A third index of reliability is the test-retest reliability. This test-retest reliability is simply the correlation (r) between repeated measurements, with the time span between measurements being short enough that the construct is not expected to change during this time. Because all three types of reliability have a maximum of 1 and a minimum of 0, the relation between reliability and unreliability can be expressed as reliability = 1 – unreliability.

Regardless of whether reliability is indexed as internal consistency (e.g., Cronbach’s α), interrater agreement (r or κ), or test-retest reliability (r), this reliability impacts the magnitude of effect sizes that a study can find. If reliability is high (e.g., near perfect, or close to 1) for the measurement of two variables, then you expect that the association (e.g., correlation, r) the researcher finds between these variables will be an unbiased estimate of the actual (latent) population effect size (assuming the study does not contain other artifacts described below). However, if the reliability of the measurement of one or both variables comprising the association of interest is low (far below 1, maybe even approaching 0), then the maximum (in terms of absolute value of positive or negative associations) effect size the researcher might detect is substantially lower than the true population effect size. This is because the correlation (or any other effect size) between the two variables of interest reflects not only the true association between the two constructs but also the unreliable aspects of each measure (i.e., the noise, which typically is not correlated across the variables).

If you know (or at least have a good estimate of) the amount of unreliability in a measure, you can estimate the magnitude of this effect size attenuation. This ability is also important for your meta-analysis because you might wish to estimate the true (disattenuated) effect size from a primary study reporting an observed effect size and the reliability of measures. Given the reliability for variables X and Y, with these general reliabilities denoted as r_xx and r_yy, you can estimate the corrected correlation (i.e., true correlation between constructs X and Y) using the following artifact adjustment (Baugh, 2002; Hunter & Schmidt, 2004, pp. 34-36):

$$a_{\text{reliability}} = \sqrt{r_{xx}\, r_{yy}} \qquad (6.3)$$

As described earlier (see Equation 6.1), you estimate the true effect size by dividing the observed effect size by this (and any other) artifact adjust­ment. Similarly, you increase the standard error (SE) of this true effect size estimate to account for the additional uncertainty of this artifact correction by dividing the standard error of the observed effect size (formulas provided in Chapter 5) by this (and any other) artifact adjustment (see Equation 6.2).

An illustration using a study from the ongoing example meta-analysis (Card et al., 2008) helps clarify this point. This study (Hawley, Little, & Card, 2007) reported a bivariate correlation between relational aggression and rejection of r = .19 among boys (results for boys and girls were each corrected and later combined). However, the measures of both relational aggression and rejection exhibited marginal internal consistencies (αs = .82 and .81, respectively), which might have contributed to an attenuated effect size of this correlation. To estimate the adjusted (corrected) correlation, I first compute

$$a_{\text{reliability}} = \sqrt{.82 \times .81} = .815$$

and then

$$r_{\text{adjusted}} = \frac{.19}{.815} = .23$$

The standard error of Fisher’s transformation of this uncorrected correlation is .0498 (based on N = 407 boys in this study); I also adjust this standard error to SE_adjusted = .0498/.815 = .0610. This larger standard error represents the greater imprecision in the adjusted effect size estimated using this correction for unreliability.
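The arithmetic of this example can be checked with a short script. This is a sketch under two assumptions that are mine rather than stated in the passage: the function name is hypothetical, and the standard error of Fisher's Z is taken to be the usual 1/√(N − 3), which reproduces the .0498 reported for N = 407.

```python
import math


def reliability_artifact(r_xx=1.0, r_yy=1.0):
    """Artifact value for unreliability: square root of the product of the two reliabilities.

    Leaving either argument at its default of 1.0 treats that variable as
    perfectly reliable.
    """
    return math.sqrt(r_xx * r_yy)


# Values reported for the boys in the example study
r_observed = 0.19
n = 407
a_rel = reliability_artifact(0.82, 0.81)

r_adjusted = r_observed / a_rel

# Assumed formula for the standard error of Fisher's Z: 1 / sqrt(N - 3)
se_observed = 1 / math.sqrt(n - 3)
se_adjusted = se_observed / a_rel

print(f"a = {a_rel:.3f}, r_adjusted = {r_adjusted:.2f}")
print(f"SE = {se_observed:.4f}, SE_adjusted = {se_adjusted:.4f}")
```

Working from unrounded intermediate values gives an adjusted standard error of about .0610; dividing the rounded .0498 by the rounded .815 by hand gives .0611, a difference due only to rounding.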

This artifact adjustment (Equation 6.3) can also be used if you wish to correct for only one of the variables being correlated (e.g., correction for X but not Y). This may be the preference because the meta-analyst assumes that one of the variables is measured without error, because the primary studies fre­quently do not report reliability estimates of one of the variables, or because the meta-analyst is simply not interested in one of the variables.3 If you are interested in correcting for unreliability in one variable, then you implicitly assume that the reliability of the other variable is perfect. In other words, correction for unreliability in only one variable is equivalent to substituting 1.0 for the reliability of the other variable in Equation 6.3, so this equation simplifies to the artifact correction being the square root of the reliability of the single variable.

Before ending discussion of correction for unreliability, I want to men­tion the special case of latent variable associations. These latent variable asso­ciations include (1) correlations between factors from an exploratory factor analysis with oblique factor rotation (e.g., direct oblimin, promax), and (2) correlations between constructs in confirmatory factor analysis models.4 You should remember that these latent correlations are corrected for measure­ment error; in other words, the reliabilities of the latent variable are perfect (i.e., 1.0). Therefore, these latent correlations are treated as effect sizes already corrected for unreliability, and you should not further correct these effect sizes using Equation 6.3. This point can be confusing because many primary studies will report internal consistencies for these scales; but these internal consistencies are relevant only if the study authors had conducted manifest variable analyses (e.g., using summed scale scores) with these measures.

2. Corrections for Imperfect Validity

The validity of a measure refers to the systematic overlap between the mea­sure and the intended construct (i.e., the thing the measure is meant to mea­sure). It is important to distinguish between validity and reliability. Reli­ability, described earlier, refers to the repeatability of a measure across items, raters, or occasions; high reliability is indicated by different items, raters, or occasions of measurement having high correspondence (i.e., being highly correlated). However, reliability does not tell us whether we are measuring what we intend to measure, but only that we are measuring the same thing (whatever it may be) consistently. In contrast, validity refers to the consis­tency between the measure and the construct. For instance, “Does a par­ticular peer nomination instrument truly measure victimization?”, “Does a parent-report scale really measure depression?”, and “Does a particular IQ test measure intelligence?” are all questions involving validity. Low valid­ity means that the measure is reliably measuring something other than the intended construct. Reliability and validity are entirely independent phe­nomena: A scale can have high reliability and low validity, and another scale can have low reliability and high validity (if one assesses validity by correct­ing for attenuation due to unreliability, as in latent variable modeling; Little, Lindenberger, & Nesselroade, 1999).

You can conceptualize a measure’s degree of validity as the disattenuated correlation between the measure and the construct. In other words, the validity of a measure (X) in assessing a construct (T) is r_XT when the measure is perfectly reliable. If the effect size of interest in your meta-analysis is the association between the true construct (T) and some other variable (Y; assuming for the moment that this variable is measured with perfect validity), then the association you are interested in might be represented as r_TY (you could also apply this correction to other effect sizes, but the use of correlations here facilitates understanding). Therefore, the observed association between the measure (X) and the other variable (Y) is equal to the product of the validity of the measure and the association of the construct with the other variable, r_XY = r_XT × r_TY. To identify the association between the construct (T) and the other variable (Y), which is what you are interested in, you can rearrange this expression to r_TY = r_XY/r_XT. In other words, the adjustment for imperfect validity in a study is:

$$a_{\text{validity}} = r_{XT} \qquad (6.4)$$

Here, r_XT represents the validity coefficient or disattenuated (for measurement unreliability) correlation between the measure (X) and intended construct (T).

This adjustment is mathematically simple, yet its use contains two challenges. The first challenge is that this adjustment assumes that whatever is specific to the measure (X) that is not part of the construct (T) is uncorrelated with the other variable (Y). In other words, the reliable but invalid portion of the measure (e.g., method variance) is assumed to not be related to the other variable of interest (either the construct T_Y or its measure, X_Y). The second challenge in applying this adjustment is simply in obtaining an estimate of the validity coefficient (r_XT). This validity will almost never be reported in the primary studies of the meta-analysis (if a study contained this more valid variable, then you would simply use effect sizes from this variable rather than the invalid proxy). Most commonly, you need to obtain the validity coefficient from another source, such as from validity studies of the measure (X). When using the validity coefficient from other studies, however, you must be aware of both (unreliable) sampling error in the magnitude of this correlation and (reliable but unknown) differences in this correlation between the validity population and that of the particular primary study you are coding. For these reasons, I suspect that many fields will not contain adequate information to obtain a good estimate of the validity coefficient, and therefore this artifact adjustment may be difficult to use.

3. Corrections for Artificial Dichotomization

It is well known that artificial dichotomization of a variable that is naturally continuous attenuates associations that this variable has with others, yet this practice is all too common in primary research (see MacCallum et al., 2002). An important distinction is whether a variable is artificially dichotomized or truly dichotomous. When analyzing associations between two continu­ous variables (typically using r as an index of effect size; see Chapter 5), you might find that a primary study artificially dichotomized one of the vari­ables in one of many possible ways, including median splits, splits at some arbitrary level (e.g., one standard deviation above the mean), or at some rec­ommended cutoff level (e.g., at a level where a variable of maladjustment is considered “clinically significant”). Or you might find that the primary study dichotomized both variables of interest (again, through median splits, etc.). Finally, you might be interested in the extent to which two groups (a natu­rally dichotomous variable) differ on a continuous variable, which is dichoto­mized in some studies (e.g., the studies report the percentages of each group that have “clinically significant” levels of a maladjustment variable). In each of these cases, you need to recognize that the dichotomization of variables in the primary studies is artificial; that it does not represent the true continuous nature of the variable.

Corrections for one variable that is artificially dichotomized are straightforward. You need only to know the numbers, proportions, or percentages of individuals in the two artificial groups. Based on this information, the artifact adjustment for dichotomization of one variable is (Hunter & Schmidt, 1990; Hunter & Schmidt, 2004, p. 36; MacCallum et al., 2002):

$$a_{\text{dichotomization}} = \frac{\phi(c)}{\sqrt{PQ}} \qquad (6.5)$$

The numerator, φ(c), is the normal ordinate (the height of the standard normal density) at the point c that divides the standard normal distribution into proportions P and Q. Because this value is unfamiliar to many, I have listed values of φ(c), as well as the artifact adjustment for dichotomization of one variable (a_dichotomization), for various proportional splits in Table 6.2.

To illustrate this correction using the ongoing example (Card et al., 2008), I consider a study by Crick and Grotpeter (1995). Here, the authors artificially dichotomized the relational aggression variable by classifying children with scores one standard deviation above the mean as relationally aggressive and the rest as not relationally aggressive. Of the 491 children in the study, 412 (83.9%) were thus classified as not aggressive and 79 (16.1%) as relationally aggressive. The numerator of Equation 6.5 for this example is found in Table 6.2 to be .243. The denominator for this example is .368. Therefore, a_dichotomization = .243/.368 = .664 (accepting some rounding error), as shown in Table 6.2. For this example, the uncorrected correlation between relational aggression and rejection was .16 (computed from F(1,486) = 12.3) and the standard error of Z_r was .0453 (from N = 491). The adjustment for artificial dichotomization yields r_adjusted = .16/.664 = .24 and SE_adjusted = .0453/.664 = .0682. (Note that this adjustment is only for artificial dichotomization; we ultimately corrected for unreliability as well, which is why this effect size differs from that used in our analyses and used later.)
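If a table of normal ordinates is not at hand, the adjustment can be computed directly. The following is a minimal sketch, assuming the adjustment is the normal ordinate at the cut point divided by the square root of P × Q; the function name is hypothetical, and the standard normal quantities come from Python's standard-library `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def dichotomization_artifact(p):
    """Artifact value for one artificially dichotomized variable.

    p is the proportion of cases in one of the two groups (0 < p < 1).
    """
    std_normal = NormalDist()          # mean 0, standard deviation 1
    c = std_normal.inv_cdf(p)          # cut point dividing the distribution into p and 1 - p
    ordinate = std_normal.pdf(c)       # height of the normal curve at the cut point
    return ordinate / math.sqrt(p * (1 - p))


# Split reported in the example: 412 of 491 children classified as not aggressive
a_dich = dichotomization_artifact(412 / 491)
r_adjusted = 0.16 / a_dich
se_adjusted = 0.0453 / a_dich

print(f"a = {a_dich:.3f}, r_adjusted = {r_adjusted:.2f}, SE_adjusted = {se_adjusted:.4f}")
```

The computed value agrees with the tabled .664 to within rounding. Because the normal curve is symmetric, passing the proportion in either group gives the same result.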

If both of two continuous variables are dichotomized, the correction becomes complex (specifically, you must compute a tetrachoric correlation; see Hunter & Schmidt, 1990). Fortunately, a simple approximation holds in most cases you are likely to encounter. Specifically, the artifact adjustment for two artificially dichotomized variables can be approximated by:

$$a_{\text{dichotomization, 2 variables}} \approx \frac{\phi(c_X)\,\phi(c_Y)}{\sqrt{P_X Q_X P_Y Q_Y}} \qquad (6.6)$$

In other words, the artifact adjustment for two variables is approximated by the product of each adjustment for the dichotomization of each variable. The feasibility of this approximation depends on the extremity of the dichotomization split and the corrected correlation (r_adjusted). Hunter and Schmidt (1990) showed that this approximation is reasonable when (1) one of the variables has a median (P = .50) split and the corrected correlation is less than .70; (2) one of the variables has an approximately even split (.40 < P < .60) and the corrected correlation is less than .50; or (3) neither of the variables has extreme splits (.20 < P < .80) and the corrected correlation is less than .40. If any of these conditions apply, then this approximation is reasonably accurate (less than 10% bias). If the dichotomizations are more extreme or the corrected correlations are very large, then you should use the tetrachoric correlation described by Hunter and Schmidt (1990).
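The three conditions can be written as a simple screening function. This is a sketch only: the function names and the example values (splits of .50 and .70, an observed correlation of .20) are hypothetical, and comparing the absolute value of the corrected correlation against each threshold is my assumption about how the conditions apply to negative correlations.

```python
import math
from statistics import NormalDist


def dichotomization_artifact(p):
    """Single-variable adjustment: normal ordinate at the cut point over sqrt(P * Q)."""
    c = NormalDist().inv_cdf(p)
    return NormalDist().pdf(c) / math.sqrt(p * (1 - p))


def two_variable_artifact(p_x, p_y):
    """Approximate adjustment for two dichotomized variables: product of the two adjustments."""
    return dichotomization_artifact(p_x) * dichotomization_artifact(p_y)


def product_approximation_ok(p_x, p_y, r_corrected):
    """Return True if any of the three stated conditions for the approximation holds."""
    r = abs(r_corrected)  # assumption: thresholds apply to the magnitude of the correlation
    splits = (p_x, p_y)
    median_split = any(math.isclose(p, 0.50) for p in splits)
    near_even = any(0.40 < p < 0.60 for p in splits)
    none_extreme = all(0.20 < p < 0.80 for p in splits)
    return ((median_split and r < 0.70)
            or (near_even and r < 0.50)
            or (none_extreme and r < 0.40))


# Hypothetical study: median split on X, 70/30 split on Y, observed r = .20
a_two = two_variable_artifact(0.50, 0.70)
r_adjusted = 0.20 / a_two
usable = product_approximation_ok(0.50, 0.70, r_adjusted)
print(f"a = {a_two:.3f}, r_adjusted = {r_adjusted:.2f}, approximation usable: {usable}")
```

Note that the check uses the corrected correlation, so the adjustment must be computed first and then screened; when the check fails, the tetrachoric correlation is the fallback.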

4. Corrections for Range Restriction

Estimates of associations between continuous variables are attenuated in studies that fail to sample the entire range of population variability on these variables. As an example (similar to that described by Hunter, Schmidt, & Le, 2006), it might be the case that GRE scores are strongly related to success in graduate school, but because only applicants with high GRE scores are admitted to graduate school, a sample of graduate students might reveal only small correlations between GRE scores obtained and some index of success (i.e., this association is attenuated due to restriction in range). This does not necessarily mean that there is only a small association between GRE scores and graduate school success, at least if we define our population as all poten­tial graduate students rather than just those admitted. Instead, the estimated correlation between GRE scores and graduate school success is attenuated (reduced) due to the restricted range of GRE scores for those students about whom we can measure success. Aside from GRE scores and graduate school success, it is easy to think of numerous other research foci for which restric­tion of range may occur: Studies of correlates of job performance are limited to those individuals hired, educational research too often includes only chil­dren in mainstream classrooms, and psychopathology research might only sample individuals who seek psychological services. Combination and com­parison of studies using samples of differing ranges might prove difficult if you do not correct for restrictions in range of one or both variables under consideration.

The first step in adjusting effect sizes for range restriction in one variable is to define some amount of typical (standard) deviation of that variable in the population and then determine the amount of deviation within the primary study sample relative to this reference population. This ratio of study (restricted) standard deviation to reference (unrestricted) standard deviation is denoted as u = SD_study/SD_reference (Hunter & Schmidt, 2004, p. 37). With some studies, determining this u may be straightforward. For example, if a study reports the standard deviation of the sample on an IQ test (e.g., 10) with a known population standard deviation (e.g., 15), then we could compare the sample range on IQ relative to the population range (e.g., u = 10/15 = 0.67).

In other situations, the authors of primary studies select individuals scoring in the top or bottom of a certain percentile range (e.g., selection of those above the median of a variable for inclusion is equivalent to selecting the top 50th percentile). In these situations, it is possible to compute the amount that the range is restricted. Although such calculations are complicated (see Barr & Sherrill, 1999), Figure 6.1 shows the values of u given the proportion of individuals selected for the study. The x-axis of this figure represents the proportion of individuals included in the study (e.g., selection of all individuals above the 10th percentile means retention of 0.90 of participants). It can be seen that the less selective a study is (i.e., a higher proportion is retained, shown on the right side of the figure), the less restricted the range (i.e., u approaches 1), whereas the more selective a study is (i.e., a lower proportion is retained, shown on the left side of the figure), the more restricted the range (i.e., u becomes smaller). Note that the computations on which Figure 6.1 is based assume a normal distribution of the variable within the reference population and are only applicable with one-sided truncated data (i.e., the researcher selected individuals based on their falling above or below a single score or percentile cutoff).
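For readers who prefer to compute u rather than read it from a figure, the following sketch uses the standard variance formula for a one-sided truncated normal distribution. It is offered under the same assumptions noted above (a normally distributed variable and a single cutoff); the function name is hypothetical, and I have not verified that it reproduces the exact values plotted in Figure 6.1.

```python
import math
from statistics import NormalDist


def u_from_proportion_retained(p):
    """Ratio of restricted to unrestricted SD when the top proportion p of a
    standard normal variable is retained (one-sided truncation)."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    std_normal = NormalDist()
    c = std_normal.inv_cdf(1 - p)        # cutoff (z score); cases above c are retained
    lam = std_normal.pdf(c) / p          # mean of the retained cases, in z units
    variance = 1 + c * lam - lam ** 2    # variance of the retained cases
    return math.sqrt(variance)


for p in (0.10, 0.50, 0.90):
    print(f"proportion retained = {p:.2f}, u = {u_from_proportion_retained(p):.3f}")
```

By symmetry, the same u applies when the bottom proportion p is retained instead of the top. Retaining those above the median gives u of roughly .60, and u rises toward 1 as the proportion retained increases.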

In other situations, however, it may be difficult to determine a good esti­mate of the sample variability relative to that of the population. Although a perfect solution likely does not exist, I suggest the following: Select primary studies from all included studies that you believe do not suffer restriction of range (i.e., those that were fully sampled from the population to which you wish to generalize). From these studies, estimate the population (i.e., unrestricted) standard deviation by meta-analytically combining standard deviations (see Chapter 7). Then use this estimate to compute the degree of range restriction (u) among studies in which participants were sampled in a restrictive way.

The artifact adjustment for range restriction is based on this ratio of the sample standard deviation relative to the reference population standard deviation (as noted: u = SD_study/SD_reference) as well as the correlation reported in the study.⁵ Specifically, this adjustment is (Hunter & Schmidt, 2004, p. 37):

$$a_{\text{range}} = \sqrt{u^{2} + r^{2}\left(1 - u^{2}\right)} \qquad (6.7)$$

A unique aspect of this artifact adjustment for range restriction is that it can yield a values greater than 1.0 (in contrast to my earlier statement that these adjustments are always less than 1.0). The situation in which this can occur is when the sample range is greater than the reference population range (i.e., range enhancement). Although range enhancement is probably far less common than range restriction, this situation is possible in studies where individuals with extreme scores were intentionally oversampled.
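A brief sketch may make both cases concrete. It assumes the adjustment takes the form a_range = √(u² + r²(1 − u²)), with r the correlation observed in the study, which is algebraically equivalent to the familiar correction for direct range restriction on one variable. The function name and the observed correlation of .30 are hypothetical; the value u = 10/15 is borrowed from the IQ example above.

```python
import math


def range_restriction_artifact(u, r_observed):
    """Artifact value for direct range restriction on one variable.

    u is SD_study / SD_reference and r_observed is the correlation in the study.
    Assumed form: sqrt(u^2 + r^2 * (1 - u^2)).
    """
    return math.sqrt(u ** 2 + r_observed ** 2 * (1 - u ** 2))


r_observed = 0.30                      # hypothetical observed correlation

# Range restriction: sample SD of 10 against a population SD of 15
a_restricted = range_restriction_artifact(10 / 15, r_observed)
r_adjusted = r_observed / a_restricted

# Range enhancement: sample SD larger than the reference SD (u > 1)
a_enhanced = range_restriction_artifact(1.2, r_observed)

print(f"restriction: a = {a_restricted:.3f}, r_adjusted = {r_adjusted:.3f}")
print(f"enhancement: a = {a_enhanced:.3f}")
```

With u below 1 the artifact value falls below 1.0 and the correlation is adjusted upward; with u above 1 the value exceeds 1.0 and the correlation is adjusted downward, as described for range enhancement.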

A more complex situation is that of indirect range restriction (in contrast to the direct range restriction I have described so far). Here, the variables comprising the effect size (e.g., X and Y) were not used in selecting partici­pants, but rather a third variable (e.g., Z) that is related to one of the variables of interest was used for selection. If (1) the range of Z in the sample is smaller than that in the population, and (2) Z is associated with X or Y, then the effective impact of this selection is that the range of X or Y in the sample is indirectly restricted. Continuing the example I used earlier, imagine that we are interested in the association between IQ (X) and graduate school success (Y). Although students might not be directly selected based on IQ, the third variable GRE (Z) is correlated with IQ, and we therefore have indirect restric­tion in the range of IQ represented in the sample.

This situation of indirect range restriction may be more common than that of direct range restriction that I have previously described (Hunter et al., 2006). It is also more complex to correct. Although I direct readers interested in a full explanation to other sources (Hunter & Schmidt, 2004, Ch. 3; Hunter et al., 2006; Le & Schmidt, 2006), I briefly describe this procedure. First, you need to consider both the sample standard deviation of the indirectly restricted variable (e.g., IQ, if GRE scores are used for selection and are associated with IQ) as well as the reliability of this restricted variable. You then compute an alternative value of u for use in Equation 6.7. Specifically, you compute this alternative ratio, denoted as u_T by Hunter and Schmidt (2004; Hunter et al., 2006, p. 106) using the following formula:
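As a hedged reconstruction, the form of this ratio given in the Hunter and Schmidt sources cited above is sketched below. The symbols are labeling assumptions on my part: u_X is the observed ratio of sample to reference standard deviations for the indirectly restricted variable, and r_XXa is the reliability of that variable in the unrestricted (reference) population. Readers should confirm the exact form and the reliability it requires against those sources before applying it.

```latex
u_T = \sqrt{\frac{u_X^{2} - \left(1 - r_{XX_a}\right)}{r_{XX_a}}}
```

When the variable is perfectly reliable (r_XXa = 1), this expression reduces to u_T = u_X, so the alternative ratio differs from the ordinary one only to the extent that the variable is measured with error.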

As mentioned, this alternative ratio u_T is then applied as u in Equation 6.7.

Another situation of range restriction is that of restriction on both vari­ables comprising the effect size. In the example involving GRE scores and graduate school success, the sample may be restricted in terms of both selec­tion on GRE scores (i.e., only individuals with high scores are accepted into graduate schools) and graduate school success (e.g., those who are unsuccess­ful drop out of graduate programs). This is an example of range restriction on both variables of the effect size, or double-range restriction (also called “dou­bly truncated” by Alexander, Carson, Alliger, & Carr, 1987). Although no exact methods exist for simultaneously correcting range restriction on both variables (Hunter & Schmidt, 2004, p. 40), Alexander et al. (1987) proposed an approximation in which one corrects first for restriction in range of one variable and then for restriction of range on the second variable (using the r corrected for range restriction on the first variable in Equation 6.7). Alexan­der et al. (1987) show that this approximation is generally accurate for most situations meta-analysts are likely to encounter, and Hunter and Schmidt (2004) report that this approximation can be used to correct for either direct or indirect range restriction.

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

Practical Matters: When (and How) to Correct: Conceptual, Methodological, and Disciplinary Considerations in Meta-Analysis

1. General Considerations

As I described earlier, one consideration in deciding whether to correct for artifacts is the expected magnitude of effects these artifacts have on the results. Given the numerous artifact adjustments described in the previous section, you might reasonably choose to correct only for those that seem most pressing within the primary studies being synthesized.

How pressing a particular type of artifact is within a meta-analysis is partly a conceptual question and partly an empirical question. First, you must consider the collection of primary studies in light of your conceptual expertise of the area. Relevant questions include the following: How valid are the measures within this research in relation to the construct I am inter­ested in? How representative are the samples relative to the population about which I want to draw conclusions? Again, there is not a statistical answer to such questions; rather, these questions must be answered based on your understanding of the field.

In addition to conceptual considerations, you might also base conclu­sions on empirical grounds. Specifically, you can consider the data reported in primary studies to draw conclusions about the presence of important arti­facts. For example, I recommend coding the internal consistencies of relevant measures within the primary studies, meta-analyzing these reliabilities (see Chapter 7), and determining (1) whether the collection of studies has gener­ally high or low reliabilities of measures and (2) whether substantial variabil­ity exists across studies in these reliabilities. Similarly, if many studies use similar measures of a variable (i.e., with the same scale), then you could code and evaluate standard deviations across studies (see Chapter 7) to determine whether some studies suffer from restricted ranges. In short, for each of the potential artifacts described in the previous section, you should consider the available empirical evidence to determine whether this artifact is uniformly or inconsistently present in the primary studies being analyzed. If a particu­lar artifact is uniformly present, then correcting for it will yield more accurate overall effect size estimates (among latent constructs). If a particular artifact is present in some studies but not in others (or present in differing degrees across studies), then correcting for this artifact will reduce less interesting (i.e., artifactual) variability across studies and allow for a clearer picture of substantively interesting variability in effect sizes.

2. Disciplinary Considerations

Whereas I view the conceptual and empirical considerations as most important in deciding whether and how to correct for artifacts, the reality is that these corrections are more common in some fields than in others. This means that a meta-analyst working within one field might be expected to correct for certain artifacts, whereas a meta-analyst working within another field might be met with skepticism if certain (or any) corrections were performed. These disciplinary practices are unfortunate, especially because they more often reflect the preferences of those who are influential in a field than consideration of the particular needs of that field. Nevertheless, it is useful to recognize the common practices within your particular field.

Notwithstanding recognition of these disciplinary practices, I want to encourage you to not feel restricted by these practices. In other words, do not base your decision to perform or not perform certain artifact corrections only on common practices within your field. Instead, carefully consider the conceptual and empirical basis for making certain corrections, and then use (or not) these corrections to obtain results that best answer your research questions.

Describing Single Variables in Meta-Analysis

There are relatively few instances of meta-analyzing single variables, yet this information could be valuable. At least three types of information regarding single variables could be important: (1) the mean level of individuals on a continuous variable; (2) the proportion of individuals falling into a particular category of a categorical variable; and (3) the amount of variability (variance or standard deviation) in a continuous variable.

1. Mean Level on Variable

Meta-analysis of reported means on a single variable may have great value. One potential benefit is that meta-analytic combination (see Chapters 8 and 9) allows you to obtain a more precise estimate of this mean than might be obtained in primary studies, especially when those primary studies have small sample sizes. Perhaps more importantly, meta-analytic comparison (see Chapter 10) allows you to identify potential reasons why means differ across studies (e.g., methodological differences such as condition or reporter; sample characteristics such as age or ethnicity). Thus, the meta-analysis of means of single variables has considerable value.

At the same time, there is an important limiting consideration in the meta-analysis of means: the primary studies must typically report this value in the same metric. For example, if one study measures the variable of interest on a 0-4 scale, whereas another uses a 1-100 scale, it usually does not make sense to combine or compare means across these studies. Some exceptions can be considered, however. The first exception arises when the different scales result from primary study authors scoring comparable measures in different ways; it is then usually possible to transform one of the scales to the metric of the other. For example, if two primary studies both use a 6-item scale with items having values from 1 to 5, one study may form a composite by averaging the items, whereas the other forms a composite by summing the items. In this case, it would be possible to transform one of the two means to the scale of the other (i.e., multiplying the average by 6 to obtain the sum, or dividing the sum by 6 to obtain the average), and the means of the two studies could then be combined and compared. A second, more general exception is that it is often possible to transform studies using different scales into a common metric. In the example I provided of one study using a 0-4 scale and the other using a 1-100 scale, it is possible to transform a mean on one scale to an equivalent mean on the other using the following equation:
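
A linear rescaling of this kind can be sketched as follows. This is a minimal illustration, assuming the usual mapping of one scale's end points onto the other's; the function name `rescale_mean` and its arguments are hypothetical and are not taken from the source.

```python
def rescale_mean(mean, old_min, old_max, new_min, new_max):
    """Linearly map a mean from one bounded scale onto another.

    Assumes a linear mapping between the end points of the two scales
    is substantively defensible (e.g., comparable anchors).
    """
    if old_max == old_min:
        raise ValueError("old scale must have a nonzero range")
    # Position of the mean within the old range, from 0 to 1
    fraction = (mean - old_min) / (old_max - old_min)
    return new_min + fraction * (new_max - new_min)


# A mean of 3.0 on a 0-4 scale, expressed on a 1-100 scale
print(rescale_mean(3.0, 0, 4, 1, 100))

# An item average of 3.5 (items scored 1-5) expressed as a 6-item sum (range 6-30)
print(rescale_mean(3.5, 1, 5, 6, 30))
```

The second call reproduces the simpler case of multiplying an item average by the number of items, so the two exceptions described in the text are special cases of the same linear mapping.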

A caution in using different scales is that even if both studies use a com­mon range of scores (e.g., 0-4), it is probably only meaningful to combine and compare means if the studies used the same anchor points (e.g., if one used response options of never, rarely, sometimes, often, and always, whereas the other used 0 times, once, 2-3 times, 4-6 times, and 7 or more times, it would make little sense to combine or compare these studies). This may prove an especially difficult obstacle if you are attempting to combine multiple scales in which scores from one scale are transformed to scores of another using Equation 7.1. This requirement of primary studies reporting the variable on the same—or at least a comparable—metric means that you will often include only studies using the same measure (e.g., a particular measure of depres­sion, such as the Children’s Depression Inventory; Kovacs, 1992) or else very similar measures (e.g., child- and teacher-reported aggression using parallel items and response options). I suspect that this rather restrictive requirement is the primary reason why meta-analysis of means is not more common. If you are using different but similar measures, or transformations to place val­ues of different measures on a common scale, I highly recommend evaluating the measure as a moderator (see Chapter 9).

If you do have a situation in which the combination or comparison of means is feasible, computing this effect size (and its standard error) is straightforward. The equation for computing a mean is well known, but I reproduce it here:

However, it is typically not necessary (or possible) for you to compute this mean, as this is usually reported within the primary study. Therefore, coding the mean, which is an effect size (of the central tendency of a single variable), is usually straightforward.

Occasionally, however, the primary studies will report frequency tables rather than means for variables with a small number of potential options. For example, a primary study might report the number or proportion of individ­uals scoring 0, the number or proportion scoring 1, and so on, on a measure that has possible options of 0, 1, 2, 3, and 4. Here, you can use these frequen­cies of different scores to re-create the raw data and then compute the mean from these data (using Equation 7.2). An easier way to compute this mean is using the following equivalent formula provided by Lipsey and Wilson (2001, p. 176), summing over all potential values of a variable:
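
As a minimal sketch of this frequency-based computation, the mean is the sum of each score multiplied by its frequency, divided by the total frequency. The function name and the example counts below are hypothetical.

```python
def mean_from_frequencies(freq_by_score):
    """Mean of a variable from a {score: frequency} table.

    Frequencies may be counts or proportions; in both cases the mean is
    the frequency-weighted sum of scores divided by the total frequency.
    """
    total = sum(freq_by_score.values())
    if total <= 0:
        raise ValueError("frequencies must sum to a positive value")
    weighted_sum = sum(score * freq for score, freq in freq_by_score.items())
    return weighted_sum / total


# Hypothetical counts of individuals scoring 0 through 4
counts = {0: 10, 1: 20, 2: 40, 3: 20, 4: 10}
print(mean_from_frequencies(counts))
```

Dividing by the total frequency means the same function works whether the study reports counts or proportions.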

Before ending my discussion of calculating the mean as an effect size, it is important to consider the standard error of this estimate of the mean (which is used for weighting in the meta-analysis; see Chapter 8). To compute the standard error of a study’s estimate of the mean, you must obtain the (population estimate of the) standard deviation (s) and sample size (N) from that study, which are then used in the following equation:
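
Written out, this is the familiar standard error of a mean (a reconstruction from the definitions just given; the subscript M for the mean is a notational assumption):

```latex
SE_{M} = \frac{s}{\sqrt{N}}
```

For instance, a study reporting s = 10 with N = 100 would have a standard error of 10 / 10 = 1.0.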

After computing the mean and standard error of the mean for each study, you can then meta-analytically combine and compare results across studies using techniques described later in this book (see Chapters 8-10).

2. Proportion of Individuals in Categories

Whereas the mean is a useful effect size for the typical score (i.e., central tendency) of a single continuous variable, the proportion is a useful effect size for a particular category of a categorical variable. For example, we may be interested in the proportion of children who are aggressive or the proportion of individuals who meet certain criteria for rejected social status, if we believe the meaningful conceptualization of aggression or rejection is categorical. In these cases, we are interested in the prevalence of an affirmative instance of a single dichotomous variable.

This proportion is often either directly reported in primary studies (as either a proportion or percentage, which can be divided by 100 to obtain the proportion), or else can be computed from the reported frequency falling in this category (k) relative to the total sample size (N):

This proportion works well as an effect size in many situations, but is problematic when proportions are far from 0.50. For this reason, it is useful to transform proportions (p) into logits (l) prior to meta-analytic combination or comparison:

This logit has the following standard error dependent on the proportion (p) and sample size (N) (Lipsey & Wilson, 2001, p. 40):

Analyses would then be performed on the logit (l), weighted using its standard error (SEl), as described in Chapters 8 through 10. For reporting, it is useful to back-transform results in logits (e.g., the mean effect size) to proportions (p), using the following equation:
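
The full sequence for proportions (compute p, transform to a logit, obtain its standard error, and back-transform) can be sketched as below. This is an illustration under the standard log-odds definitions; the function names and the example counts are hypothetical.

```python
import math


def proportion_to_logit(p):
    """Logit (log odds) of a proportion strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def logit_standard_error(p, n):
    """Standard error of the logit for proportion p and sample size n."""
    return math.sqrt(1.0 / (n * p) + 1.0 / (n * (1.0 - p)))


def logit_to_proportion(logit):
    """Back-transform a logit to a proportion."""
    return math.exp(logit) / (math.exp(logit) + 1.0)


# Hypothetical study: 30 of 200 children classified as aggressive
k, n = 30, 200
p = k / n
logit = proportion_to_logit(p)
print(p, logit, logit_standard_error(p, n), logit_to_proportion(logit))
```

Note that the logit is undefined when a study reports a proportion of exactly 0 or 1, which is why the sketch rejects those values rather than guessing at a correction.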

3. Variances and Standard Deviations

Few meta-analyses have used variances, or the equivalent standard deviation (the square root of the variance), as effect sizes. However, the magnitude of interindividual difference is a potentially interesting focus, so I offer this brief description of using these as effect sizes for meta-analysis.

The standard deviation, which is the square root of the variance, is cal­culated from raw data as follows:

This equation gives the estimate of the population standard deviation from a sample (the square root of the unbiased variance estimate), as opposed to a description of the sample variability, which would be computed using N rather than N – 1 in the denominator. This is also the statistic commonly reported in primary research. In fact, you will almost never need to calculate this standard deviation, as doing so requires raw data that are typically not available. Fortunately, standard deviations (or variances) are nearly always reported as basic descriptive information in primary studies.

To meta-analytically combine or compare standard deviations (or vari­ances) across studies, you must also compute the standard error used for weighting (see Chapter 8). The standard error of the standard deviation is a function of the standard deviation itself and the sample size (Pigott & Wu, 2008):

The standard error of a variance estimate, as you might expect, is simply Equation 7.10 squared (i.e., SE(s²) = s²/(2N)).
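
A minimal sketch of these two standard errors follows. It assumes that Equation 7.10 is the large-sample form s / sqrt(2N), which is what the squared expression for the variance implies; the function names are hypothetical.

```python
import math


def se_standard_deviation(s, n):
    """Large-sample standard error of a standard deviation: s / sqrt(2N)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return s / math.sqrt(2 * n)


def se_variance_as_stated(s, n):
    """Standard error of the variance as stated in the text:
    the standard error of the standard deviation, squared."""
    return se_standard_deviation(s, n) ** 2


# Hypothetical study: s = 10, N = 50
print(se_standard_deviation(10, 50), se_variance_as_stated(10, 50))
```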

At this point, you may have concluded that meta-analysis of standard deviations (and therefore variances) is straightforward. To a large extent this is true, though three qualifiers should be noted. First, as with the mean, it is necessary that the studies included all use the same measure, or at least measures that can be placed on the same scale. Just as it would make little sense to combine means from studies using incomparable scales, it does not make sense to combine magnitudes of individual difference (i.e., standard deviations) from incomparable scales. Second, standard deviations are not exactly normally distributed, especially with small samples. Following the suggestion of Pigott and Wu (2008), I suggest that you do not attempt to meta-analyze standard deviations if many studies have sample sizes less than 25. A third consideration involves the possibility of diminished standard deviations due to ceiling or floor effects. Ceiling effects occur when most individuals in a study score near the top of the scale, and floor effects occur when most individuals score near the bottom of the scale. In both situations, estimates of standard deviation are lowered because there is less “room” for individuals to vary given the constraints of the scale. For example, if we administered a third-grade math test to graduate students, we would expect that most of them would score near the maximum of the test, and the real individual variability in math skills would not be captured by the observed variability in scores on this test. I suggest two strategies for avoiding this potential biasing effect: (1) visually inspect the means of studies and consider excluding those studies where the mean is close to the bottom or top of the scale, and (2) compute a correlation across studies between means and standard deviations. The presence of an association suggests a potential floor or ceiling effect, whereas the absence of association would suggest this bias is not present.
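
The second strategy, correlating study means with study standard deviations, can be sketched as follows. The data are invented purely for illustration, and the plain Pearson correlation is an assumed choice of index.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical study-level means and standard deviations on a 0-4 scale
study_means = [1.9, 2.1, 2.6, 3.2, 3.7]
study_sds = [1.00, 0.95, 0.90, 0.60, 0.35]
print(pearson_r(study_means, study_sds))
```

In this made-up example the standard deviations shrink as means approach the top of the scale, the pattern that would suggest a ceiling effect.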

When the Metric Is Meaningful: Raw Difference Scores

Paralleling the situation when you might want to meta-analyze means and standard deviations—that is, when included studies share a common (or comparable) scale for variable X—there may also be instances when we are interested in comparing two groups on a variable measured on a common scale across studies. For example, studies may all compare two groups on variable X using a common scale for X. Although Chapter 5 described the value of standardizing mean differences (e.g., g), in this situation of com­mon scales across studies, it may be more meaningful to meta-analytically combine and compare studies on this common scale. In other words, it may sometimes be useful to retain the meaningful metric of the scale on which variables were measured in primary studies (see also Becker, 2003).

There are various circumstances in which you may prefer to compare two groups in terms of raw rather than standardized differences. For exam­ple, gender differences in height may be more meaningful when expressed as inches than as number of standard deviations. Similarly, the effectiveness of a weight-loss program might be more meaningful if expressed in pounds (e.g., the treatment group lost, on average, 10 pounds more than the control group). If you are meta-analytically combining or comparing studies that all use the same measure of the variable of interest (or at least measures that use the same scale), it is straightforward to use these raw, or unstandardized, dif­ferences as effect sizes.

The unstandardized mean difference is simply the raw difference in means between two groups (Lipsey & Wilson, 2001, p. 47):

You probably recognize this equation from Chapter 5, where we dis­cussed the various standardized mean differences (Equations 5.5 to 5.7). In Equation 7.11, however, there is no denominator involving some variant of the standard deviation. This standard deviation denominator of Equations 5.5 to 5.7 served to standardize the mean differences in terms of standard deviation units. Here, where the metric is meaningful, you do not wish to standardize this mean difference, but instead leave it in its unstandardized, or raw score, metric.

To estimate the standard error of this unstandardized mean difference (for weighting in your meta-analysis; see Chapter 8), you use the follow­ing equation (see Bond, Wiitala, & Richard, 2003; Lipsey & Wilson, 2001, p. 47):
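
The raw mean difference and its standard error can be sketched as follows. The sketch assumes the pooled-standard-deviation form of the standard error, s_pooled × sqrt(1/n1 + 1/n2); the function names and example values are hypothetical.

```python
import math


def pooled_sd(s1, n1, s2, n2):
    """Pooled within-group standard deviation of two independent groups."""
    return math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))


def raw_mean_difference(m1, m2):
    """Unstandardized mean difference (group 1 minus group 2)."""
    return m1 - m2


def se_raw_mean_difference(s1, n1, s2, n2):
    """Standard error of the raw mean difference (assumed pooled-SD form)."""
    return pooled_sd(s1, n1, s2, n2) * math.sqrt(1.0 / n1 + 1.0 / n2)


# Hypothetical weight-loss study: pounds lost in treatment vs. control
diff = raw_mean_difference(14.0, 4.0)
se = se_raw_mean_difference(10.0, 50, 10.0, 50)
print(diff, se)
```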

Once you have computed unstandardized mean differences and associated standard errors for each study, it is then possible to meta-analytically combine and contrast these metrically meaningful effect sizes. However, Bond and colleagues (2003) discourage reliance on these traditional techniques and suggest more complicated procedures. I suspect that their cautions are most appropriate when studies have small sample sizes and that the increase in precision from their more advanced techniques diminishes with larger sample sizes. However, further quantitative studies are needed to evaluate this claim. Regardless, their alternative formulas do not affect the computation of a mean effect size, but rather inferences about this effect size and heterogeneity. For now, I encourage you to consider the alternative formulas of Bond et al. (2003) if you are meta-analyzing unstandardized mean differences from studies with small sample sizes and your initial analyses of significance of the mean effect or test of heterogeneity are very close to your chosen level of statistical significance. In other cases, you may find the standard methods described in Chapter 8 more straightforward with little substantive impact on your results.

Regression Coefficients and Similar Multivariate Effect Sizes in Meta-Analysis

1. Regression Coefficients

In many areas of study, researchers are interested in associations of one variable (X) with another variable (Y), controlling for other variables (Zs). For example, education researchers might wish to understand the relation between ethnicity and academic success, controlling for SES. Or a developmental researcher might be interested in whether (and to what extent) children’s use of relational aggression (e.g., gossiping about others, intentionally excluding someone from group activities) is associated with maladjustment, above and beyond their use of overt aggression (e.g., hitting, name calling), which is strongly correlated with relational aggression (see Card et al., 2008). In these cases, the central question involves the magnitude of unique, independent association between X and Y after controlling for Z (or multiple Zs).

In primary research, this situation is handled through multiple regres­sion and similar techniques. Specifically, in these situations you would regress the presumed dependent variable (Y) onto both the predictor of inter­est (X) and other variables that you wish to control (one or more Zs). The well-known equation for this regression is (e.g., Cohen et al., 2003):
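
In conventional notation, with a single covariate Z, this regression can be written as follows (a standard form; the subscripts and the residual term e are notational assumptions):

```latex
Y = B_0 + B_1 X + B_2 Z + e
```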

Here, the regression coefficient of X (B1) is of most interest. However, this value by itself is often less intuitive than several alternative indexes. The first possibility is the standardized regression coefficient, β1, which is interpreted similarly to the unstandardized regression coefficient but is expressed on a scale that usually ranges from 0 (no unique association) to ±1 (perfect unique association). If the X and Y variables are measured on a common scale across studies, then the unstandardized regression coefficient may be meta-analyzed. But the more common situation of X and Y measured on different scales across different studies requires that we rely on the standardized regression coefficient. This standardized regression coefficient will often be reported in primary studies, but when only the component bivariate correlations are reported, you can rely on the following definitional formula (Cohen et al., 2003, p. 68) to compute this coefficient from correlations:
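
For the case of a single covariate Z, the standard two-predictor formula can be sketched as below. The function and argument names are hypothetical, and the correlations in the example are invented.

```python
def standardized_beta(r_yx, r_yz, r_xz):
    """Standardized regression coefficient of X predicting Y,
    controlling for one covariate Z, from the three bivariate correlations."""
    denominator = 1.0 - r_xz ** 2
    if denominator <= 0:
        raise ValueError("X and Z are perfectly collinear")
    return (r_yx - r_yz * r_xz) / denominator


# Hypothetical correlations: Y with X, Y with Z, and X with Z
print(standardized_beta(r_yx=0.50, r_yz=0.40, r_xz=0.60))
```

With more than one covariate the computation requires the full correlation matrix rather than this closed form.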

2. Semipartial Correlations

Another index of the unique association is the semipartial correlation (sr), which is the (directional) square root of the proportion of all of the variance in Y that is shared with the part of X that does not overlap with Z (vs. the partial correlation, pr, which expresses this nonoverlapping part of X relative to the part of Y that does not overlap with Z). Although sr is often reported, you may need to calculate it, or pr, from bivariate correlations using the following definitional formulas (Cohen et al., 2003, pp. 73-74):
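
With one covariate, the standard definitional formulas for the semipartial and partial correlations can be sketched as follows. The names are hypothetical and the example correlations are invented.

```python
import math


def semipartial_r(r_yx, r_yz, r_xz):
    """Semipartial correlation of Y with the part of X independent of Z."""
    return (r_yx - r_yz * r_xz) / math.sqrt(1.0 - r_xz ** 2)


def partial_r(r_yx, r_yz, r_xz):
    """Partial correlation of X and Y, with Z removed from both."""
    return (r_yx - r_yz * r_xz) / math.sqrt((1.0 - r_yz ** 2) * (1.0 - r_xz ** 2))


# Hypothetical correlations: Y with X, Y with Z, and X with Z
sr = semipartial_r(0.50, 0.40, 0.60)
pr = partial_r(0.50, 0.40, 0.60)
print(sr, pr)
```

The two indexes share a numerator and differ only in whether Z is also removed from Y, so the partial correlation is never smaller in absolute value than the semipartial.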

As with r, it is preferable to transform sr or pr using Fisher’s Z transformation (Equation 5.2) before analysis and then back-transform the average Z to sr or pr for reporting.

3. Standard Errors of Multivariate Effect Sizes

Thus far, I have described four potential effect sizes for meta-analysis of associations of X with Y, controlling for Z: unstandardized and standardized regression coefficients and semipartial and partial correlations. As I describe further in Chapter 8, it is also necessary to compute a standard error for each, for potential use in weighting meta-analytic combination and comparison across studies. The following formulas provide the standard errors for these four effect sizes (Cohen et al., 2003):

Having described the computations of independent associations and their standard errors, I need to caution you about their potential use in meta-analysis. A critical limiting factor in using these effect sizes from multiple regression analyses is that every study should include the same covariates (Zs) in analyses from which results are drawn. In other words, it is meaningful to compare the independent association between X and Y only if every study included in our meta-analysis controls for the same Z or set of Zs. If different studies include fewer or more, or simply different, covariates, then it makes no sense to combine the effect sizes of the type described here (i.e., regression coefficients, semipartial or partial correlations) from these studies.

If different studies do use different covariates, then you have two options, both of which require access to basic, bivariate correlations among all relevant variables (Y, X, and all Zs). The first option is to compute the desired effect sizes (i.e., regression coefficients, semipartial or partial correla­tions) from these bivariate correlations for each study and then meta-analyze these now-comparable effect sizes. This requires that all included studies report the necessary bivariate correlations (or you are able to obtain these from the authors). The second option is to meta-analyze the relevant bivari­ate correlations from each study in their bivariate form and then use these meta-analyzed bivariate correlations as sufficient statistics for multivariate analysis. This option is more flexible than the first one in that it can include studies reporting some but not all bivariate correlations. I discuss this latter approach in more detail in Chapter 12.

4. Differential Indices

Differential indices capture the magnitude of difference between two correla­tions within a study. Although these differential indices are rarely used, they do offer some unique opportunities to answer specific research questions. Next, I describe differential indices for both dependent and independent cor­relations.

4.1. Differential Index for Dependent Correlations

Meta-analysis of partial and semipartial correlations answers questions of whether a unique association exists between two variables, controlling for a third variable. For example, I might consider semipartial correlations of the association of relational aggression with rejection, above and beyond overt aggression. A slightly different question would be whether relational or overt aggression was more strongly correlated to rejection (see Card et al., 2008). More generally, the differential index for dependent correlations indexes the direction and magnitude of difference of two variables’ association with a third variable.

This differential index for dependent correlation, ddependent, is computed in a way parallel to the significance test to compare differences between dependent correlations (see Cohen & Cohen, 1983, pp. 56-57). This effect size of differential correlation of two variables (A and B) with a third variable (Y) is computed from the three correlations among these variables (Card et al., 2008):

This differential index will be positive when the correlation of A with Y is greater than the correlation of B with Y, zero when these two correlations are equal, and negative when the correlation of B with Y is larger. This dif­ferential correlation can be meta-analytically combined and compared across studies to draw conclusions regarding the extent (or under what conditions) one association is stronger than the other.

4.2. Differential Index for Independent Correlations

The differential index can also be used to meta-analytically compare differences between independent correlations, that is, correlations drawn from different populations. Independent correlations may emerge within a single primary study when the primary research reports effect sizes for different subgroups. For example, in our example meta-analysis of relational aggression and rejec­tion (Card et al., 2008), we were interested in evaluating gender differences in the magnitude of associations between relational aggression and rejection. This question is really one of moderation (Is the relational aggression with rejection link moderated by gender?), but here we compute the moderating effect within each study and subsequently meta-analyze the effect.

This differential index for independent correlations parallels the sig­nificance test to compare differences between independent correlations (see Cohen & Cohen, 1983, pp. 54-55). Given separately reported correlations for subgroups A and B within a single primary study, we apply Fisher’s transfor­mation to each and then use the following equation to index the differential association for the two subgroups (Card et al., 2008):
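
A minimal sketch of this index follows. It assumes the index is the simple difference between the two Fisher-transformed correlations (subgroup A minus subgroup B), which matches the sign behavior described in the text but should be checked against Card et al. (2008); the function names are hypothetical.

```python
import math


def fisher_z(r):
    """Fisher's Z transformation of a correlation."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return math.atanh(r)


def differential_index_independent(r_a, r_b):
    """Assumed form: difference between Fisher-transformed correlations
    for subgroup A and subgroup B."""
    return fisher_z(r_a) - fisher_z(r_b)


# Hypothetical correlations reported separately for two subgroups
print(differential_index_independent(r_a=0.45, r_b=0.30))
```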

This differential index for independent correlations will be positive when the correlation is more positive (i.e., stronger positive or weaker negative) for subgroup A than B, negative when the correlation is more negative (i.e., weaker positive or stronger negative) for subgroup A than B, and zero when groups A and B have the same correlation. Meta-analytic combination across multiple studies providing data for this index provides evidence of whether (and how strongly) subgroup classification moderates this correlation.

Miscellaneous Effect Sizes in Meta-Analysis

As I hope is becoming increasingly clear, you can include a wide range of options for effect sizes in your meta-analyses. Although this section on miscellaneous effect sizes could include dozens of possibilities, I limit my description to two that seem especially useful: scale internal consistency and longitudinal change scores.

1. Scale Internal Reliability

Internal consistency, or the internal reliability of a scale, indexes the degree to which items of a scale are homogeneous. The most widely used index of this internal consistency is Cronbach’s alpha, α (Cronbach, 1951), which can be computed based on the number of items in a scale (j) and the average correlation among these items (r̄):
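
Using the number of items and the average inter-item correlation, the standardized form of alpha can be sketched as follows. The function name and example values are hypothetical.

```python
def cronbach_alpha(n_items, mean_inter_item_r):
    """Standardized Cronbach's alpha from the number of items (j)
    and the average correlation among the items."""
    if n_items < 1:
        raise ValueError("a scale needs at least one item")
    return (n_items * mean_inter_item_r) / (1.0 + (n_items - 1) * mean_inter_item_r)


# Hypothetical 10-item scale with an average inter-item correlation of .30
print(cronbach_alpha(10, 0.30))
```

The formula makes visible why longer scales tend to show higher alpha even when the average inter-item correlation is unchanged.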

There are two situations in which you might be interested in meta-analyzing internal consistency. One situation was raised in Chapter 6, where I described wishing to correct for unreliability when this estimate is not provided in some studies. In this situation, a meta-analytically derived mean or predicted value (i.e., predicted by characteristics of the study) provides a reasonable estimate of internal consistency to use when correcting for the artifact of unreliability. A second situation is when the internal consistency is itself of interest. For instance, you might be interested in knowing the average internal consistency of a scale across multiple studies (i.e., the mean internal consistency), or you might be interested in the conditions under which internal consistency is higher or lower (i.e., moderator analyses across study characteristics). In both situations, meta-analysis of internal consistency estimates is valuable.

Although various methods of meta-analyzing reliability results have been proposed, I rely on the method described by Rodriguez and Maeda (2006) for Cronbach’s alpha. This approach is relatively simple, and Cronbach’s alpha is reported in most studies. This method relies on a transformation of Cronbach’s alpha as the effect size (Rodriguez & Maeda, 2006):

The standard error of this transformed internal consistency is a function of the number of items on the scale used in the study, the sample size, and the estimate of internal consistency itself (Rodriguez & Maeda, 2006):

After computing the mean transformed internal consistency (as well as confidence interval limits or predicted values at different levels of moderators), you should back-transform results into the more interpretable Cronbach’s alpha:
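
The transform, weight, and back-transform sequence can be sketched as below. The sketch assumes the cube-root (Hakstian-Whalen) transformation, T = (1 − α)^(1/3), and its large-sample variance; both forms are assumptions here and should be checked against Rodriguez and Maeda (2006) before use. The function names are hypothetical.

```python
import math


def transform_alpha(alpha):
    """Assumed cube-root transformation of Cronbach's alpha."""
    if alpha >= 1.0:
        raise ValueError("alpha must be less than 1")
    return (1.0 - alpha) ** (1.0 / 3.0)


def se_transformed_alpha(alpha, n_items, n):
    """Assumed large-sample standard error of the transformed alpha,
    given the number of items and the sample size."""
    if n_items < 2:
        raise ValueError("need at least two items")
    variance = (18.0 * n_items * (n - 1) * (1.0 - alpha) ** (2.0 / 3.0)
                / ((n_items - 1) * (9.0 * n - 11.0) ** 2))
    return math.sqrt(variance)


def back_transform_alpha(t):
    """Invert the cube-root transformation to recover alpha."""
    return 1.0 - t ** 3


# Hypothetical study: alpha = .80 on a 10-item scale with N = 100
t = transform_alpha(0.80)
print(t, se_transformed_alpha(0.80, 10, 100), back_transform_alpha(t))
```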

2. Longitudinal Change Scores

Longitudinal change is of central interest in many areas. In developmental science, much attention is given to change across age, which is often studied using naturalistic longitudinal designs (see Little et al., 2009). Longitudinal change is also relevant to experimental and quasi-experimental research; for instance, you might be interested in changes in some index of functioning from before to after an intervention. Given this empirical interest in longi­tudinal change, it follows that you may be interested in meta-analytically combining and comparing this change across studies.

We can consider longitudinal change scores as indexing a two-variable association between time (X) and the variable that is potentially increasing or decreasing (Y). Because most studies that you might potentially meta-analyze treat time as a categorical variable (e.g., Waves 1 and 2 of a survey, pre- and postintervention scores), you can represent these change scores as either standardized mean change (e.g., g) or unstandardized mean change (if all studies use the same scale for the Y variable). Because it is more likely that you will want to meta-analyze studies using different measures of Y, I focus only on the standardized mean change here (for a description of unstandardized mean change, see Lipsey & Wilson, 2001, pp. 42-44).

The standardized mean change effect size is defined by the following formula (Lipsey & Wilson, 2001, p. 44):
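
A minimal sketch of this computation from descriptive data follows. It assumes that the pooled standard deviation across time is the square root of the average of the two variances, as is appropriate when the same individuals are measured at both times; the function name and example values are hypothetical.

```python
import math


def g_change(mean_t1, sd_t1, mean_t2, sd_t2):
    """Standardized mean change: Time 2 minus Time 1, divided by the
    standard deviation pooled across the two occasions."""
    pooled = math.sqrt((sd_t1 ** 2 + sd_t2 ** 2) / 2.0)
    if pooled == 0:
        raise ValueError("pooled standard deviation is zero")
    return (mean_t2 - mean_t1) / pooled


# Hypothetical pre/post scores from the nonattriting sample
print(g_change(mean_t1=10.0, sd_t1=4.0, mean_t2=12.0, sd_t2=4.0))
```

Subtracting Time 1 from Time 2 means that positive values represent increases over time.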

This equation is identical to that for the standardized mean difference (g) between independent groups shown in Chapter 5, if you recognize that the denominator is simply the pooled standard deviation across time. From this equation, you see that computing gchange from reported descriptive data is straightforward. One caveat is that you should be careful that the reported means and standard deviations at each time come from only the individuals who participated at both times. In other words, you need descriptive data from the nonattriting sample.

Although most research reports will provide these descriptive data, you may find instances where they do not. If the primary study reports only a repeated-measures t-test or ANOVA (F-ratio), along with the correlation between Waves 1 and 2 (i.e., interindividual stability), you can use this information to compute gchange using the following equation (which was also provided in Chapter 5):
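As a sketch of this conversion, assuming equal standard deviations at the two waves, a dependent-samples t relates to the standardized mean change as shown below. Here N is the number of individuals measured at both waves and r is the Wave 1 to Wave 2 correlation; for a two-wave repeated-measures ANOVA, t is the square root of F, with its sign taken from the direction of change:

```latex
g_{change} = t_{dependent} \sqrt{\frac{2\,(1 - r)}{N}}
```

The relation follows because the standard deviation of the difference scores equals s√(2(1 − r)) when both waves share a standard deviation s.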

When using these equations, you should be sure that you are assigning the correct sign to the effect size. I strongly recommend always using positive scores to represent increases in the variable over time and negative scores to represent decreases. These equations also allow us to compute gchange from probability levels—be they exact (e.g., p = .034) or minimum effect sizes from a range (e.g., p < .05). Here, you simply look up the associated t or F value given the reported level of significance and degrees of freedom.

In addition to the effect size gchange, we also need to compute the standard error of this estimate for weighting in our meta-analysis. As you would expect, the standard error is dependent on the sample size; but it is also dependent on the interindividual stability (r) of the variable across time. It is critical to find this information in the research report for accurately computing this standard error; if it is not provided, you should seek to obtain this information from the study authors. The equation for the standard error for gchange is (Lipsey & Wilson, 2001, p. 44):
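The computations in this section can be sketched in a few lines of Python. This is an illustrative sketch rather than code from the book: the function names and the example values are hypothetical, and the standard error is assumed to take the form sqrt(2(1 − r)/n + g²/(2n)), which is how the cited Lipsey and Wilson formula is usually written.

```python
import math


def g_change_from_descriptives(mean_t1, sd_t1, mean_t2, sd_t2):
    """Standardized mean change: (M_T2 - M_T1) / SD pooled across time."""
    pooled_sd = math.sqrt((sd_t1 ** 2 + sd_t2 ** 2) / 2)
    return (mean_t2 - mean_t1) / pooled_sd


def g_change_from_t(t, r, n):
    """g_change from a dependent-samples t, stability r, and sample size n.

    Assumes equal standard deviations at the two waves; a positive t
    should correspond to an increase over time.
    """
    return t * math.sqrt(2 * (1 - r) / n)


def se_g_change(g, r, n):
    """Standard error of g_change (assumed form, see lead-in)."""
    return math.sqrt(2 * (1 - r) / n + g ** 2 / (2 * n))


# Hypothetical study: 50 individuals measured at both waves, stability r = .50.
g = g_change_from_descriptives(mean_t1=10.0, sd_t1=4.0, mean_t2=12.0, sd_t2=4.0)
se = se_g_change(g, r=0.5, n=50)

# The same study summarized only by its dependent-samples t value.
t = 0.5 * math.sqrt(50)
g_from_t = g_change_from_t(t, r=0.5, n=50)

print(f"g_change = {g:.3f}, SE = {se:.3f}, g from t = {g_from_t:.3f}")
```

Both routes give the same g here because the hypothetical standard deviations are equal at the two waves; when they differ, the t-based value is only an approximation.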

Before concluding this section on longitudinal change scores, I want to note that this approach is not limited only to longitudinal designs, even though that is where we are most likely to apply them. Instead, this approach can be used with any data that would typically be analyzed (in primary research) using paired-sample t-tests or two-group repeated-measures ANOVAs. For example, this effect size would be appropriate in treatment studies where individuals are matched into pairs, and then randomly assigned to treatment versus control groups (see e.g., Shadish et al., 2002, p. 118). Similarly, this effect size would be useful when meta-analyzing dyadic data in which individuals are interdependently linked, such as studies considering differences between husbands and wives or between older and younger siblings (see Kenny, Kashy, & Cook, 2006). Although these types of studies are likely less common in most fields, you can keep these possibilities in mind.

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

Practical Matters: The Opportunities and Challenges of Meta-Analyzing Unique Effect Sizes

1. The Challenges of Meta-Analyzing Unique Effect Sizes

Meta-analyzing unique effect sizes carries a number of challenges, which I describe in this section. These challenges apply not only to the effect sizes I have described in this chapter, but also to a nearly unlimited range of other advanced effect sizes that we might consider.

One challenge of using unique effect sizes in meta-analysis is that the primary studies might often fail to report the necessary data. When I described basic effect sizes in Chapter 5, I mentioned that these effect sizes are often reported, or else sufficient information to compute them is typically reported. In this chapter, I have focused attention on some effect sizes that are likely to be commonly reported (e.g., internal consistency), but this information is still less likely to be reported in all relevant studies. If you are using unique effect sizes (either those I have described here or others), it will be important for you to contact authors of studies that could provide relevant data that are not reported. Often, you will need to give explicit instructions to these authors on how to compute these unique effects, which might be less familiar to researchers than more basic effect sizes (e.g., correlations).

A related challenge involves the inconsistencies in analytic methods and reporting of advanced effect sizes. Earlier in this chapter, I described this challenge when using independent effect sizes, such as regression coefficients or semipartial correlations from multiple regression analyses in which different studies include different predictors/covariates. We can also imagine how this inconsistency would pose obstacles to the use of other effect sizes. For example, imagine that you wanted to meta-analytically combine results of exploratory factor analyses, such as factor loadings and communalities. If you looked at the relevant literature, you would find tremendous variability in the use of principal components versus true factor analysis models, methods of extraction, the way authors determined the number of factors to extract or interpret, and methods of rotation. Given this diversity, it would be difficult, if not impossible, to attempt to meta-analytically combine these results. This example illustrates the challenge of meta-analyzing unique effect sizes from studies that might vary in their analytic methods and reporting.

As I will discuss further in Chapter 8, meta-analysis of an effect size involves not only obtaining an estimate of that effect size for each study, but also computing a standard error for each effect size estimate for weighting. In other words, it is not enough to simply be able to find sufficient data in the primary study to compute the effect size, but you must also determine the correct formula and find the necessary information in the study to com­pute the standard error. Some readers might agree that the equations just to compute effect sizes are daunting; the formulas to compute standard errors are usually even more challenging and are typically difficult to find in all but the most advanced texts (and in some cases, there is no consensus on what an appropriate standard error is). Furthermore, you typically need more information to calculate the standard errors than the effect sizes, and this information is more often excluded from reports (and more often puzzling to authors if you request this information). In short, you need to remember that, to use an advanced effect size in a meta-analysis, you must be able to compute both its point estimate and its standard error from primary studies.

2. Balancing the Challenges with the Opportunities of Meta-Analyzing Unique Effect Sizes

Although the use of unique effect sizes in meta-analysis poses several challenges, their use also offers opportunities. Namely, if only unique effect sizes answer the questions you want to answer, then it is worth facing these challenges. How can you weigh the potential reward against the cost of using unique effect sizes? Although this is a difficult question to answer, I offer some thoughts next.

First, I suggest asking yourself whether the question you want to answer in your meta-analysis (see Chapter 2) really requires reliance on unique effect sizes. Can your question be effectively answered using traditional effect sizes such as r, g, or o? Is it possible that the question you are asking is similar to one involving these unique effect sizes? If so (to the last question), you might consider coding both the basic and the unique effect sizes from the studies included; you then can attempt to proceed using the unique effect sizes, but can revert to the basic effect sizes if you have to. One special consideration involves questions where you are truly interested in multivariate effect sizes, such as independent associations from multiple regression-type analyses. In these situations, you may want to read Chapter 12 before proceeding, and decide whether you might better answer these questions through multivari­ate meta-analysis of basic effect sizes rather than through univariate meta­analysis of multivariate effect sizes.

Second, you will want to determine how readily available the necessary information is within the included effect sizes. It is invaluable to examine some of the primary studies that will be included in your planned meta­analysis to get a sense of what sort of information the authors report. When doing so, sample a few studies from different authors or research groups, as their reporting practices likely differ. If you find that the necessary informa­tion is usually reported, then this can be taken as encouragement to proceed with meta-analysis of unique effect sizes. However, if the necessary infor­mation is rarely or inconsistently reported, you need to assess whether you will be able to obtain this information. Consider both your own willingness to solicit this information from authors and the likely response you will get from them. If you think that the availability of this information will be incon­sistent, then consider both (1) the expected total number of studies from which you could get the necessary information, and (2) the degree to which these studies are representative of all studies that have been conducted.

Finally, you need to realistically consider your own expertise with both meta-analysis and the relevant statistical techniques. If this is your first meta­analysis, I recommend against attempting to use unique effect sizes. Perform­ing a good meta-analytic review of basic effect sizes is challenging enough, so I encourage you to get some experience using these before attempting to meta-analyze unique effect sizes (at a minimum, be sure to code both basic and unique effect sizes). If you feel ready to try to meta-analyze unique effect sizes, consider your level of expertise in that particular statistical area (i.e., that regarding the unique effect size). Do you feel you are fluent in computing the effect size from commonly reported information? Are you familiar with the relevant standard errors and believe you can consistently calculate these from reported information? Do you feel comfortable in guiding researchers through the appropriate analyses when you need to request further informa­tion from them?

This section might seem discouraging, but I do not intend it to be. Using unique effect sizes in your meta-analysis can provide exciting opportunities to answer unique research questions. At the same time, it is important that you are realistic about your ability to use these unique effect sizes, and proceed with caution (but do proceed).


The Logic of Weighting

Although the democratic process of giving equal weight to each study has some appeal, the reality is that some studies provide better effect size estimates than others, and therefore should be given more weight than others in aggregating results across studies. In this section, I describe the logic of using different weights based on the precision of the effect size estimates.

The idea of the precision of an effect size estimate is related to the standard errors that you computed when calculating effect sizes (see Chapters 5-7). Consider two hypothetical studies: the first study relied on a sample of 10 individuals, finding a correlation between X and Y of .20 (or a Fisher’s transformation, Zr, of .203); and the second study relied on a sample of 10,000 individuals, finding a correlation between X and Y of .30 (Zr = .310). Before you take a simple average of these two studies to find the typical correlation between X and Y, it is important to consider the precision of these two estimates of effect size. The first study consisted of only 10 participants, and from the equation for the standard error of Zr (SE = 1/√(N − 3); see Chapter 5), I find that the expectable deviation in Zr from studies of this size is .378. The second study consisted of many more participants (10,000), so the parallel standard error is .010. In other words, a small sample gives us a point estimate of effect size (i.e., the best estimate of the population effect size that can be made from that sample), but it is possible that the actual effect size is much higher or lower than what was found. In contrast, a study with a large sample size is likely to be much more precise in estimating the population effect size. More formally, the standard error of an effect size, which is inversely related to sample size, quantifies the amount of imprecision in a particular study’s estimate of the population effect size.
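The figures for the two hypothetical studies can be checked with a short Python sketch. The function names are illustrative, and the standard error applies to Zr without artifact corrections:

```python
import math


def fisher_z(r):
    """Fisher's Zr transformation of a correlation (inverse hyperbolic tangent)."""
    return math.atanh(r)


def se_fisher_z(n):
    """Standard error of an uncorrected Zr: 1 / sqrt(N - 3)."""
    return 1 / math.sqrt(n - 3)


# The two hypothetical studies described in the text.
for n, r in [(10, 0.20), (10_000, 0.30)]:
    print(f"N = {n}: Zr = {fisher_z(r):.3f}, SE = {se_fisher_z(n):.3f}")
```

The smaller study has a standard error almost 38 times as large as that of the larger study, which is the imprecision that the weights described below are meant to reflect.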

Figure 8.1 further illustrates this concept of precision of effect sizes. In this figure, I have represented five studies of varying sample size, and therefore varying precision in their estimates of the population effect sizes. In this figure, I am in the fortunate—if unrealistic—position of knowing the true population effect size, represented as a vertical line in the middle of the figure. Study 1 yielded a point estimate of the effect size (represented as the circle to the right of this study) that was considerably lower than the true effect size, but this study also had a large standard error, and the resulting confidence interval of that study was large (represented as the horizontal arrow around this effect size). If I only had this study to consider, then my best estimate of the population would be too low, and the range of potential effect sizes (i.e., the horizontal range of the confidence interval arrow) would be very large. Note that the confidence interval of this study does include the true population effect size, but this study by itself is of little value in deter­mining where this unknown value lies.

The second study of Figure 8.1 includes a large sample. You can see that the point estimate of the effect size (i.e., the circle to the right of this study) is very close to the true population effect size. You also see that the confi­dence interval of this study is very narrow; this study has a small standard error and therefore high precision in estimating the population effect size. Clearly, the results of this study offer a great deal of information in determin­ing where the true population effect size lies, and I therefore would want to give more weight to these results than to those from Study 1 when trying to determine this population effect size.

The remaining three studies in Figure 8.1 contain sample sizes between those of Studies 1 and 2. Two observations should be noted regarding these studies. First, although none of these studies perfectly estimates the population effect size (i.e., none of the circles fall perfectly on the vertical line), the larger studies tend to come closer. Second, and related, the confidence intervals all contain the true population effect size.

The crucial difference between the hypothetical situation depicted in Figure 8.1 and reality is that you do not know the true population effect size when you are conducting a meta-analysis. In fact, one of the primary purposes of conducting a meta-analysis is to obtain a best estimate of this population effect size. In other words, you want to decide where to draw the vertical line in Figure 8.1. As I hope is clear at this point, it would make sense to draw this line so that it is closer to the effect size estimates from studies with narrow confidence intervals (i.e., small standard errors), and give less emphasis to ensuring that the line is close to those from studies with wide confidence intervals (i.e., large standard errors). In other words, you want to give more weight to some studies (those with small standard errors) than to others (those with large standard errors).

How do you quantify this differential weighting? Although the choices are virtually limitless, the statistically defensible choice is to weight effect sizes by the inverse of their variances in point estimates (i.e., standard errors squared). In other words, you should determine the weight of a particular study i (w_i) from the standard error of the effect size estimate from that study (SE_i) using the following equation:
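Following the verbal definition just given (the inverse of the squared standard error), this weight, referred to below as Equation 8.1, can be written as:

```latex
w_i = \frac{1}{SE_i^{2}}
```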

In all analyses I describe in this chapter, you will use this weight. I suggest that you make a variable in your meta-analytic database representing this weight for each study in your meta-analysis. In the running example of this chapter, shown in Table 8.1, I consider 22 studies providing correlations between relational aggression and peer rejection. In addition to listing the study, I have columns showing the sample size, corrected effect sizes in original r and transformed Zr metrics, and the standard errors (SEzr) of these estimates. Note that these effect sizes have been corrected for two artifacts, unreliability and artificial dichotomization (when relevant), so the standard errors are also adjusted and not directly computable from sample size (for details, see Chapter 6). This table also shows the weight (w) for each study, computed from the standard errors using Equation 8.1.


Measures of Central Tendency in Effect Sizes

1. Choices of Indices of Central Tendency

Turning momentarily away from the topic of weighting, I now consider the ways in which you can represent the central tendency of effect sizes from a series of studies in your meta-analysis. As with representing central tendency within a primary data analysis, you can consider the mode, median, and mean as possible indices.

The mode (the most commonly occurring value) is not a good choice for representing typical effect sizes in a meta-analysis. The problem is that effect sizes computed from multiple studies are likely to fall along such a fine-grained continuum that identifying the most commonly occurring value is meaningless. Although grouping effect sizes into categories might alleviate the problem, such categorizations are arbitrary and likely lead to a loss of information. In short, I view the mode as a poor choice for describing typical effect sizes in a meta-analytic review.

The median (the middle value of a rank-ordered list of values, or the 50th percentile) is a better choice for representing typical effect sizes in a meta-analysis. This value is easy to determine (e.g., in the example data of Table 8.1, the median effect size is r = .35) and is a valuable supplement in many situations because it is less influenced by skew in effect sizes than is the mean. At the same time, the disadvantage of the median, as typically computed, is that it does not differentially weight studies by the precision of their point estimates of effect sizes (see previous section). For this reason, I view the median as at best being a supplement to the mean.

The mean effect size is generally the most important index of central tendency in your meta-analyses. It is widely used in primary research, and therefore well understood by readers. Importantly, it is also possible to differentially weight effect sizes, and therefore give more weight to some studies than to others, by computing a weighted mean effect size. This weighted mean effect size (or the random-effects variant I describe in Chapter 10) represents critical information that will be reported in your meta-analytic reviews.

2. Computing the Weighted Mean Effect Size

The weighted mean effect size across studies is computed from the weights (w_i) and effect sizes (ES_i) from each of the studies i using the following equation:
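Based on the verbal description in the next paragraph (sum of the weighted effect sizes divided by the sum of the weights), this equation, referred to below as Equation 8.2, can be written as follows, with the overbar notation for the mean being an assumption of this sketch:

```latex
\overline{ES} = \frac{\sum_i w_i \, ES_i}{\sum_i w_i}
```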

In other words, the mean effect size is calculated by computing the product of each study’s effect size by its weight (creating a separate variable representing this product in your database), summing these products across studies, and dividing this value by the sum of weights across studies. The logic of this equation is more obvious if you consider using w = 1 for all studies, or giving equal weight to all studies. Here, the mean is simply the sum of effect sizes divided by the number of effect sizes, the traditional formula for the (unweighted) mean.

Equation 8.2 is generic in that it can be used with any of the effect sizes I have described in this book. With those effect sizes that are typically transformed (e.g., r is typically transformed to Zr), this formula is applied to the transformed effect size from each study, and the average effect size is then back-transformed to the more interpretable effect size for reporting.

To illustrate the calculation of this weighted mean effect size, consider again the data in Table 8.1. In this table (just to the right of w), I have added a column showing the product of w and the effect size (here, Zr) for each study. I then summed these values across the 22 studies (easily done within any spreadsheet or basic statistical software package) to obtain the value 2764.36, which is the numerator of Equation 8.2. Also shown at the bottom of Table 8.1 is the sum of weights (w) across the 22 studies, 7152.21, which comprises the denominator of Equation 8.2. I then compute the weighted mean effect size as Z = 2764.36/7152.21 = .387. For reporting, I would transform this mean Zr into mean r using Equation 5.3, yielding r = .368.
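The same calculation can be sketched in Python. Because Table 8.1 is not reproduced here, the three-study data set below is hypothetical (with weights of N − 3, which apply only to uncorrected Zr values); the last two lines instead use the column sums reported in the text for the 22-study example.

```python
import math


def weighted_mean_effect_size(effect_sizes, weights):
    """Sum of (weight x effect size) divided by the sum of weights."""
    numerator = sum(w * es for es, w in zip(effect_sizes, weights))
    return numerator / sum(weights)


# Hypothetical studies (not from Table 8.1): Zr values and sample sizes.
zs = [0.20, 0.40, 0.35]
ns = [50, 200, 120]
weights = [n - 3 for n in ns]  # valid only for Zr without artifact corrections

mean_z = weighted_mean_effect_size(zs, weights)
mean_r = math.tanh(mean_z)  # back-transform the mean Zr to r for reporting
print(f"hypothetical data: mean Zr = {mean_z:.3f}, mean r = {mean_r:.3f}")

# Running example, using the sums reported in the text.
table_mean_z = 2764.36 / 7152.21
table_mean_r = math.tanh(table_mean_z)
print(f"running example: mean Zr = {table_mean_z:.3f}, mean r = {table_mean_r:.3f}")
```

Setting every weight to 1 in `weighted_mean_effect_size` returns the ordinary unweighted mean, which mirrors the logic described above.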

This average effect size is a crucially important result of your meta-analytic review (and it wasn’t nearly as difficult to compute as you might have thought!). After you compute this average effect size, you have valuable information to describe the typical effect sizes in the area of your meta-analysis. However, it is also important to consider the effect size in terms of its statistical significance and confidence intervals, the topic to which I turn next.


Inferential Testing and Confidence Intervals of Average Effect Sizes

The key to making inferences regarding statistical significance about, or computing confidence intervals around, this (weighted) mean effect size is to compute a standard error of estimate. Here, I am referring to the standard error of estimating the overall, average effect size, as opposed to the standard error of effect size estimates from each individual study. The standard error of this estimate of average effect size is computed from the following equation:
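Consistent with the explanation that follows, in which the sum of weights forms the denominator, and with the worked example later in this section, this equation (Equation 8.3) can be written as follows, with the subscript notation being an assumption of this sketch:

```latex
SE_{\overline{ES}} = \sqrt{\frac{1}{\sum_i w_i}}
```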

The logic of this equation is that you want to cumulate the amount of precision across studies to estimate the precision of your estimate of mean effect size. This logic is clear if you consider the Zr effect size (without artifact corrections), in which the standard error for each study is 1/√(N − 3) and the weight for each study is therefore N − 3. If there are many studies with large sample sizes, then the sum of ws (i.e., the denominator in Equation 8.3) will be large, and the standard error of estimate of the mean effect size will be small (i.e., the estimate will be precise). In contrast, if a meta-analysis includes just a few studies with small sample sizes, then the sum of ws is small and the standard error of the estimate of mean effect size will be relatively large. Although the equations for standard errors of other effect sizes are not as straightforward (in that they are not as simply related to sample size), they all follow this logic.

After computing this standard error of the mean effect size, you can use this value to make statistical inferences and to compute confidence intervals. To evaluate statistical significance, one can perform a Wald test, which simply involves dividing a parameter estimate (i.e., the mean effect size) by its standard error:
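Written out, with the same assumed notation for the mean effect size and its standard error, this ratio (Equation 8.4) is:

```latex
Z = \frac{\overline{ES}}{SE_{\overline{ES}}}
```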

This test statistic is evaluated against the standard normal distribution, and the test is sometimes called the Z test (note that this Z is different from Fisher’s Zr transformation). The statistical significance of this test can be obtained by looking up the value of Z in any table of standard normal deviates (where, e.g., |Z| > 1.96 denotes p < .05). This test can also be modified from a test of an effect size of zero in order to test the difference from any other null hypothesis value, ES_0, by changing the numerator to ES − ES_0.

The standard error of the mean effect size can also be used to compute confidence intervals. Specifically, you can compute the lower (ESlb) and upper (ESub) bounds for the effect size using the following equation:
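In line with the worked example later in this section (the mean minus or plus 1.96 standard errors for a 95% interval), the bounds (Equation 8.5) can be written as follows, where the subscript on Z for the chosen confidence level is notation assumed here:

```latex
ES_{LB} = \overline{ES} - Z_{1-\alpha/2} \, SE_{\overline{ES}},
\qquad
ES_{UB} = \overline{ES} + Z_{1-\alpha/2} \, SE_{\overline{ES}}
```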

This equation can be used to compute any level of confidence interval desired, though 95% confidence intervals (i.e., two-tailed α = .05, so Z = 1.96) are typical. If the effect size you are using is one that is transformed (e.g., Zr, ln(o)), you should calculate the mean, lower-bound, and upper-bound effect sizes using these transformed values, and then back-transform each into interpretable effect size metrics (e.g., r, o).

To illustrate these computations using the running example, I refer again to Table 8.1. I have already summed the weights (w) across the 22 studies, so I can apply Equation 8.3 to obtain the standard error of the mean effect size, SE = √(1/7152.21) = .0118. I can use this standard error to evaluate the statistical significance of the average effect size (Zr) using the Wald test of Equation 8.4, Z = .387/.0118 = 32.70, p < .001 (small discrepancies in these figures reflect rounding of intermediate values). I would therefore conclude that this average effect size is significantly greater than zero (i.e., there is a positive association between relational aggression and peer rejection). To create 95% confidence intervals, I would compute the lower-bound value of the effect size using Equation 8.5 (the mean minus 1.96 standard errors) as Zr = .363, which would then be transformed (using Equation 5.3) for reporting to a lower-bound r = .348. Similarly, I would compute the upper-bound value (the mean plus 1.96 standard errors) as Zr = .410, which is converted to upper-bound r = .388 for reporting. To summarize, the mean correlation of this example meta-analysis is .368, which is significantly greater than zero (p < .001), and the 95% confidence interval of this correlation ranges from .348 to .388.
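These steps can be reproduced from the two sums reported for Table 8.1. The following Python sketch is illustrative only (the variable names are hypothetical); it uses the standard library's normal distribution for the critical value and the p value.

```python
import math
from statistics import NormalDist

# Sums reported in the text for the 22-study running example.
sum_w = 7152.21      # sum of weights
sum_w_es = 2764.36   # sum of weight x Zr

mean_z = sum_w_es / sum_w
se_mean = math.sqrt(1 / sum_w)           # standard error of the mean effect size
wald_z = mean_z / se_mean                # Wald (Z) test against zero
p_two_tailed = 2 * (1 - NormalDist().cdf(abs(wald_z)))

z_crit = NormalDist().inv_cdf(0.975)     # about 1.96 for a 95% interval
lower_z = mean_z - z_crit * se_mean
upper_z = mean_z + z_crit * se_mean

# Back-transform the mean and both bounds from Zr to r for reporting.
lower_r, mean_r, upper_r = (math.tanh(v) for v in (lower_z, mean_z, upper_z))

print(f"SE = {se_mean:.4f}, Z = {wald_z:.2f}, p < .001: {p_two_tailed < 0.001}")
print(f"mean r = {mean_r:.3f}, 95% CI [{lower_r:.3f}, {upper_r:.3f}]")
```

Because the inputs are the rounded sums from the table, the Wald statistic may differ from the reported value in the second decimal place.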


Evaluating Heterogeneity among Effect Sizes

In Figure 8.1, all of the studies had confidence intervals that contained the vertical line representing the overall population effect size. This situation is called homogeneity: most of the studies capture a common population effect size, and the differences that do exist among their point estimates of effect sizes (i.e., the circles in Figure 8.1) are no more than expected by random-sampling fluctuations. Although not every study’s confidence interval needs to overlap with a common effect size in order to conclude homogeneity (these are, after all, only probabilistic confidence intervals), most should. More formally, there is an expectable amount of deviation among study effect size estimates, based on their standard errors of estimate, and you can compare whether the actual observed deviation among your study effect sizes exceeds this expected value.

If the deviation among studies does exceed the amount of expectable deviation, you conclude (with some qualifications I describe below) that the effect sizes are heterogeneous. In other words, the single vertical line in Figure 8.1 representing a single common effect size is not adequate. In the situation of heterogeneous effect sizes, you have three options: (1) ignore the heterogeneity and analyze the data as if they were homogeneous (as you might expect, the least justifiable choice); (2) conduct moderator analyses (see Chapter 9), which attempt to predict between-study differences in effect size using the characteristics of studies coded (e.g., methodological features, characteristics of the sample); or (3) fit an alternative model to that of Figure 8.1, in which the population effect size is modeled as a distribution rather than a vertical line (a random-effects model; see Chapter 10).

1. Significance Test of Heterogeneity

The heterogeneity (vs. homogeneity) of effect sizes is frequently evaluated by calculating a statistic Q. This test is called either a homogeneity test or, less commonly, a heterogeneity test; other terms used include simply a Q test or Hedges’s test for homogeneity (or Hedges’s Q test). I prefer the term “heterogeneity test” given that the alternate hypothesis is of heterogeneity, and therefore a statistically significant result implies heterogeneity. This test involves computing a value (Q) that represents the amount of heterogeneity in effect sizes among studies in your meta-analysis using the following equation (Cochran, 1954; Hedges & Olkin, 1985, p. 123; Lipsey & Wilson, 2001, p. 116):
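Following the description in the paragraphs below (a definitional form built from weighted squared deviations, and an algebraically equivalent computational form built from three sums), this equation (Equation 8.6) can be written as follows, with the overbar notation for the mean effect size assumed here:

```latex
Q = \sum_i w_i \left( ES_i - \overline{ES} \right)^{2}
  = \sum_i w_i \, ES_i^{2} - \frac{\left( \sum_i w_i \, ES_i \right)^{2}}{\sum_i w_i}
```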

The left portion of this equation is the definitional formula for Q and is relatively straightforward to understand. One portion of this equation simply computes the (squared) deviation between the effect size from each study and the mean effect size across studies, which is your best estimate of the population effect size, or vertical line of Figure 8.1. This squared deviation is multiplied by the study weight, which you recall (Equation 8.1) is the inverse of the squared standard error of that study. In other words, the equation is essentially the ratio of the (squared) deviations between the effect sizes to the (squared) expected deviation. Therefore, when studies are homogeneous, you expect this ratio to be close to 1.0 for each study, and so the sum of this ratio across all studies is going to be approximately equal to the number of studies (k) minus 1 (the minus 1 is due to the fact that the population effect size is estimated by your mean effect size from your sample of studies). When stud­ies are heterogeneous, you expect the (squared) deviations between studies and mean effect sizes to be larger than the (squared) expected deviations, or standard errors. Therefore, the ratio will be greater than 1.0 for most stud­ies, and the resulting sum of this ratio across studies will be higher than the number of studies minus 1.

Exactly how high should Q be before you conclude heterogeneity? Under the null hypothesis of homogeneity, the Q statistic is distributed as χ² with df = k − 1. Therefore, you can look up the value of Q with a particular df in any chi-square table, such as Table 8.2, to determine whether the effect sizes are more heterogeneous than expected by sampling variability alone.

The equation in the right portion of Equation 8.6 provides the same value of Q as the definitional formula on the left (it is simply an algebraic rearrangement). However, this computational equation is easier to compute from your meta-analytic database. Specifically, you need three variables (or columns in a spreadsheet): the w_i and w_iES_i that you already calculated to compute the mean effect size, and w_iES_i², which can be easily calculated. To illustrate, the rightmost column of Table 8.1 displays this w_iES_i² for each of the 22 studies in the running example, with the sum (Σw_iES_i²) found to be 1359.60 (bottom of table). Given the previously computed (when calculating the mean effect size) sums, Σw_i = 7152.21 and Σw_iES_i = 2764.36, you can compute the heterogeneity statistic in this example using Equation 8.6: Q = 1359.60 − (2764.36)²/7152.21 ≈ 291.2.

You can evaluate this Q value using df = k − 1 = 22 − 1 = 21. From Table 8.2, you see that this value is statistically significant (p < .001). As I describe in the next section, this statistically significant Q leads us to reject the null hypothesis of homogeneity and conclude the alternate hypothesis of heterogeneity.
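The two forms of Q can be compared directly in code. In this Python sketch, the three-study data set is hypothetical and the function names are illustrative; the final call uses the three sums reported in the text for the 22-study example.

```python
def q_definitional(effect_sizes, weights):
    """Q as the weighted sum of squared deviations from the weighted mean."""
    mean_es = sum(w * es for es, w in zip(effect_sizes, weights)) / sum(weights)
    return sum(w * (es - mean_es) ** 2 for es, w in zip(effect_sizes, weights))


def q_computational(sum_w, sum_w_es, sum_w_es2):
    """Q from the three column sums: sum(w), sum(w*ES), and sum(w*ES^2)."""
    return sum_w_es2 - sum_w_es ** 2 / sum_w


# Hypothetical studies (not from Table 8.1): Zr values and weights.
zs = [0.20, 0.40, 0.35]
weights = [47, 197, 117]

q_def = q_definitional(zs, weights)
q_comp = q_computational(
    sum(weights),
    sum(w * z for z, w in zip(zs, weights)),
    sum(w * z ** 2 for z, w in zip(zs, weights)),
)
print(f"hypothetical data: definitional Q = {q_def:.4f}, computational Q = {q_comp:.4f}")

# Running example, using the sums reported in the text (k = 22, so df = 21).
q_example = q_computational(7152.21, 2764.36, 1359.60)
print(f"running example: Q = {q_example:.2f} on df = {22 - 1}")
```

The computational form subtracts two large, similar numbers, so it is worth carrying full precision in the column sums rather than rounding them early.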

2. Interpreting the Test of Heterogeneity

The Q statistic is used to evaluate the null hypothesis of homogeneity versus the alternate hypothesis of heterogeneity. If the Q exceeds the critical χ² value given the df and level of statistical significance chosen (see Table 8.2), then you conclude that the effect sizes are heterogeneous. That is, you would conclude that the effect sizes are not all estimates of a single population value, but rather, multiple population values. If Q does not exceed this value, then you fail to reject the null hypothesis of homogeneity.

This description makes clear that evaluation of Q (i.e., of heterogeneity vs. homogeneity) is a statistical hypothesis test. This observation implies two cautions in interpreting findings regarding Q. First, you need to be aware that this test of heterogeneity provides us information about the likelihood of results being homogeneous versus heterogeneous, but does not tell us the magnitude of heterogeneity if it exists (a consideration you should be par­ticularly sensitive to, given your attention to effect sizes as a meta-analyst). I describe an alternative way to quantify the magnitude of heterogeneity in the next Section (8.4.3). Second, you need to consider the statistical power of this heterogeneity test—if you have inadequate power, then you should be very cautious in interpreting a nonsignificant result as evidence of homogeneity (the null hypothesis). I describe the statistical power of this test in Section

3. An Alternative Representation of Heterogeneity

Whereas the Q statistic and associated significance test for heterogeneity can be useful in drawing conclusions about whether a set of effect sizes in your meta-analysis are heterogeneous versus homogeneous, they do not indicate how heterogeneous the effect sizes are (with heterogeneity of zero representing homogeneity). One useful index of heterogeneity in your meta-analysis is the I² index. This index is interpreted as the percentage of variability among effect sizes that exists between studies relative to the total variability among effect sizes (Higgins & Thompson, 2002; Huedo-Medina, Sanchez-Meca, Marin-Martinez, & Botella, 2006). The I² index is computed using the following equation (Higgins & Thompson, 2002; Huedo-Medina et al., 2006):

The left portion of this equation uses terms that I will not describe until Chapter 10, so I defer discussion of this portion for now. The right portion of the equation uses the previously computed test statistic for heterogeneity (Q) and the number of studies in the meta-analysis (k). The right portion of the equation actually contains a logical statement, whereby I² is bounded at zero when Q is less than expected under the null hypothesis of homogeneity (lower possibility), but the more common situation is the upper possibility. Here, the denominator consists of Q, which can roughly be considered the total heterogeneity among effect sizes, whereas the numerator consists of what can roughly be considered the total heterogeneity minus the expected heterogeneity given only sampling fluctuations. In other words, the ratio is roughly the between-study variability (total minus within-study sampling variability) relative to total variability, put onto a percentage (i.e., 0 to 100%) scale.

I² is therefore a readily interpretable index of the magnitude of heterogeneity among studies in your meta-analysis, and it is also useful in comparing heterogeneity across different meta-analyses. Unfortunately, because it is rather new, it has not been frequently used in meta-analyses, and it is therefore difficult to offer suggestions about what constitutes small, medium, or large amounts of heterogeneity. In the absence of better guidelines, I offer the suggestions of Huedo-Medina and colleagues (2006) that I² ≈ 25% is a small amount of heterogeneity, I² ≈ 50% is a medium amount of heterogeneity, and I² ≈ 75% is a large amount of heterogeneity (as mentioned, I² ≈ 0% represents homogeneity). In the example meta-analysis of relational aggression with peer rejection I described earlier, I² = 92.8%.
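The right-hand form of the I² equation described above (Q and k only, bounded at zero) is short enough to sketch in code. The function name `i_squared` is hypothetical; the inputs are the Q = 291.17 and k = 22 of the running example.

```python
def i_squared(q, k):
    """I-squared as a percentage, from the heterogeneity statistic Q and
    the number of studies k (right-hand form of the equation, floored at 0)."""
    df = k - 1
    if q <= df:
        return 0.0          # Q no larger than expected under homogeneity
    return 100.0 * (q - df) / q

print(round(i_squared(291.17, 22), 1))   # running example
```

With these inputs the result rounds to the 92.8% given in the text.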

4. Statistical Power in Testing Heterogeneity

Although the Q test of heterogeneity is a statistical significance test, many meta-analysts make conclusions of homogeneity when they fail to reject the null hypothesis. This practice is counter to the well-known caution in pri­mary data analysis that you cannot accept the null hypothesis (rather, you simply fail to reject it). On the other hand, if there is adequate statistical power to detect heterogeneity and the results of the Q statistic are not sig­nificant, then perhaps conclusions of homogeneity—or at least the absence of substantial heterogeneity—can be reasonably made. The extent to which this argument is tenable depends on the statistical power of your heterogene­ity test.

Computing the statistical power of a heterogeneity test is extremely complex, as it is determined by the number of studies, the standard errors of effect size estimates for these studies (which is largely determined by sample size), the magnitude of heterogeneity, the theoretical distribution of effect sizes around a population mean (e.g., the extent to which an effect size index is normally [e.g., Z is approximately normally distributed] vs. non-normally [e.g., r is skewed, especially at values far from zero] distributed), and the extent to which assumptions of the effect size estimates from each study are violated (e.g., assuming equal variance between two groups when this is not true) (see, e.g., Alexander, Scozzaro, & Borodkin, 1989; Harwell, 1997). In this regard, computing the statistical power of the heterogeneity test for your particular meta-analysis is very difficult, and likely precisely possible only with complex computer simulations.

Given this complexity, I propose a less precise but much simpler approach to evaluating whether your meta-analysis has adequate statistical power to detect heterogeneity. First, you should determine a value of I² (see previous subsection) that represents the minimum magnitude of heterogeneity that you believe is important (or, conversely, the maximum amount of heterogeneity that you consider inconsequential enough to ignore). Then, consult Figure 8.2 to determine whether a meta-analysis with your number of studies can detect your specified amount of heterogeneity (I²). This figure displays the minimum level of I² that will result in a statistically significant value of Q for a given number of studies, based on p = .05. If the figure indicates that the number of studies in your meta-analysis could detect a smaller level of I² than what you specified, it is reasonable to conclude that the statistical power of the test of heterogeneity in your meta-analysis is adequate. I stress that this is only a rough guide, which I offer only as a simpler alternative to more complex power analyses; however, I feel that it is likely adequate for most meta-analyses. Accepting Figure 8.2 as a rough method of determining whether tests of heterogeneity have adequate statistical power, it becomes clear that this test is generally quite powerful. Based on the suggestions of I² ≈ 25%, 50%, and 75% representing small, medium, and large amounts of heterogeneity, respectively, you see that meta-analyses consisting of at least 56 studies can detect small heterogeneity, those with as few as 9 studies can detect medium heterogeneity, and all meta-analyses (i.e., combinations of two or more studies) can detect large heterogeneity.
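The quantity plotted in Figure 8.2 can be approximated without the figure. The sketch below assumes (my reading of the description, not a documented construction) that the curve is obtained by setting Q equal to the critical chi-square value at p = .05 with df = k – 1 and converting that Q to I². All function names are hypothetical, and the chi-square routine is a generic standard-library implementation.

```python
import math

def chi2_cdf(x, df):
    """P(chi-square(df) <= x) via the series for the lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while term > total * 1e-15:
        n += 1
        term *= z / (a + n)
        total += term
    return min(1.0, total * math.exp(-z + a * math.log(z) - math.lgamma(a)))

def chi2_critical(df, alpha=0.05):
    """Critical chi-square value found by bisection on the CDF."""
    lo, hi = 0.0, df + 20.0 * math.sqrt(2.0 * df) + 20.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def min_detectable_i2(k, alpha=0.05):
    """Smallest I-squared (in percent) that yields a significant Q with k studies."""
    df = k - 1
    q_crit = chi2_critical(df, alpha)
    return 100.0 * (q_crit - df) / q_crit

for k in (2, 9, 56):
    print(k, round(min_detectable_i2(k), 1))
```

Under this assumption the thresholds fall just below 75%, 50%, and 25% for 2, 9, and 56 studies, which agrees with the study counts quoted in the paragraph above.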

Before concluding that the test of heterogeneity is typically high in statistical power, you should consider that the I² index is the percentage ratio of between-study variance to total variance, with total variance made up of both between- and within-study variance. Given the same dispersion of effect sizes from a collection of studies with large standard errors (small samples) rather than small standard errors (large samples), the within-study variance will be larger and the I² will therefore be smaller (because this larger within-study variance goes into the denominator of Equation 8.7). Given these situations of large standard errors (small sample sizes) among studies, the test of heterogeneity can actually have low power because the I² is smaller than expected (see Harwell, 1997, for a demonstration of situations in which the test has low statistical power). For this reason, it is important to carefully consider what values of I² are meaningful given the situation of your own meta-analysis and those in similar situations, more so than relying too heavily on guidelines such as those I have provided.

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

Practical Matters: Nonindependence among Effect Sizes

An important qualifier to the analyses I have described in this chapter (and those I will describe in subsequent chapters) is that they should be per­formed with a set of independent effect sizes. In primary data analysis, it is well known that a critical assumption is of independent observations; that each case (e.g., person) is a random sample from the population independent of the likelihood of another participant being selected. In meta-analysis, this assumption is that each effect size in your analysis is independent from oth­ers; this assumption is usually considered satisfied if each study of a particu­lar sample of individuals provides one effect size to your meta-analysis.

As you will quickly learn when coding effect sizes, this assumption is often violated—single studies often provide multiple effect sizes. This mul­titude of effect sizes from single studies creates nonindependence in meta­analytic datasets in that effect sizes from the same study (i.e., the same sam­ple of individuals) cannot be considered independent.

These multiple effect sizes arise for various reasons, and the reason impacts how you handle these situations. The end goal of handling each type of nonindependence is to obtain one single effect size from each study for any particular analysis.

1. Multiple Effect Sizes from Multiple Measures

One potential source of multiple effect sizes from a single study is that the authors report multiple effect sizes based on different measures. For example, the study by Rys and Bear (1997) in the example meta-analysis of Table 8.1 provided effect sizes of the association between relational aggression and peer rejection based on peer-report (corrected r = .556) and teacher-report (corrected r = .338) measures of relational aggression. Or a single study might examine an association at two distinct time points. For example, Werner and Crick (2004) studied children in second through fourth grades and then readministered measures to these same children approximately one year later, finding concurrent correlations between relational aggression and rejection of r = .479 and .458 at the first and second occasions, respectively.

In these situations, you have two options for obtaining a single effect size. The first option is to determine if one effect size is more central to your interests and to use only that effect size. This decision should be made in consultation with your study inclusion/exclusion criteria (see Chapter 3), and you should only reach this decision if it is clear that one effect size should be included whereas the other should not. Using the two example studies men­tioned, I might choose one of the two measurement approaches of Rys and Bear (1997) if I had a priori decided that peer reports of relational aggression were more important than teacher reports (or vice versa). Or I might decide to use only the first measurement occasion of the study by Werner and Crick (2004) if something occurred after this first data collection so as to make the subsequent results less relevant for my meta-analysis (e.g., if they had imple­mented an intervention and I was only interested in the association between relational aggression and rejection in normative situations). These decisions should not be based on which effect size estimate best fits your hypotheses (i.e., do not simply choose the largest effect size); it is best if you can make this decision without looking at the value of the effect size.

The second, and likely more common, option is to average these multiple effect sizes. Here, you should compute the average effect size (see Equation 8.2) among these multiple effect sizes and use this average as your single effect size estimate for the study (if the effect size is one that is typically transformed, such as Zr or ln(o), then you should average the transformed effect sizes). To illustrate, I combined the two effect sizes from Rys and Bear (1997) by converting both correlations (.556 and .338 for peer and teacher reports) to Zr (.627 and .352) and then averaged these values to yield the Zr = .489 shown in Table 8.1; I back-transformed this value to r = .454 for summary in this table. Similarly, I converted the correlations at times 1 and 2 from Werner and Crick (2004), r = .479 and .458, to Zr = .522 and .495, and computed the average of these two, which is shown in Table 8.1 as Zr = .509 (and the parallel r = .469). If Rys and Bear (1997) had more than two measurement approaches, or if Werner and Crick (2004) had more than two measurement occasions, I could compute the average of these three or more effect sizes in the same way to yield a single effect size per study.
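The averaging of transformed correlations described above can be sketched as follows. The function name `average_correlations` is hypothetical; the inputs are the correlations quoted in the text. Because this sketch averages unrounded Zr values, the third decimal can differ slightly from the rounded values in Table 8.1.

```python
import math

def average_correlations(rs):
    """Average correlations on the Fisher Zr scale.

    Returns (mean Zr, back-transformed r).
    """
    zs = [math.atanh(r) for r in rs]      # Fisher's r-to-Zr transformation
    z_mean = sum(zs) / len(zs)            # simple (unweighted) average
    return z_mean, math.tanh(z_mean)      # tanh converts Zr back to r

rys_bear = average_correlations([0.556, 0.338])       # peer, teacher reports
werner_crick = average_correlations([0.479, 0.458])   # time 1, time 2

print([round(v, 3) for v in rys_bear])
print([round(v, 3) for v in werner_crick])
```

The same function accepts three or more correlations, matching the remark that additional measures or occasions are handled in the same way.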

2. Multiple Effect Sizes from Subsets of Participants

A second potential source of multiple effect sizes from a single study is that the effect sizes are separately reported for subgroups of the sample. For example, effect sizes might be reported separately by gender, ethnicity, or multiple treatment groups. If each of these groups should be included in your meta-analysis given your inclusion/exclusion criteria, then your goal is to compute an average effect size for these multiple groups. Two considerations distinguish this situation from that of the previous subsection, however. First, if you average effect sizes across multiple subgroups, your effective sample size for the study (used in computing the standard error for the study) is now the sum of the multiple combined groups. Second, the average in this situation should be a weighted average so that larger subgroups have greater contribution to the average than smaller subgroups.

To illustrate, a study by Hawley et al. (2007) used data from 407 boys and 522 girls, reporting information to compute effect sizes for boys (cor­rected r = .210 and Zr = .214) and girls (corrected r = .122 and Zr = .122), but not for the overall sample. To obtain one common effect size for this sample, I computed the weighted average effect size using Equation 8.2 to obtain the value Zr = .162 (and r = .161) shown in Table 8.1. The standard error of this effect size is based on the total sample size, combining the sizes of the multiple subgroups (here, 407 + 522 = 929). It is important to note that this computed effect size is different from what would have been obtained if you could simply compute the effect size from the raw data. Specifically, this effect size from combined subgroups represents the association between the variables of interest controlling for the variable on which subgroups were created (in this example, gender). If you expect that this covariate control will—or even could—change the effect sizes (typically reduce them), then it would be useful to create a dichotomous variable for studies in which this method of combining subgroups was used for evaluation as a potential moderator (see Chapter 9).
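A minimal sketch of this subgroup combination follows. It assumes the usual inverse-variance weight for Zr, w = n – 3, and a standard error of 1/√(N – 3) based on the combined sample; both are assumptions on my part, since weights for artifact-corrected correlations may differ. The function name `combine_subgroups` is hypothetical.

```python
import math

def combine_subgroups(groups):
    """Combine subgroup effect sizes into one study-level effect size.

    groups: list of (n, Zr) tuples, one per subgroup.
    Returns (weighted mean Zr, total n, standard error of Zr).
    """
    weights = [n - 3 for n, _ in groups]               # assumed weight for Zr
    z_mean = sum(w * z for w, (_, z) in zip(weights, groups)) / sum(weights)
    total_n = sum(n for n, _ in groups)                # effective sample size
    se = 1.0 / math.sqrt(total_n - 3)                  # assumed SE formula
    return z_mean, total_n, se

# Boys (n = 407, Zr = .214) and girls (n = 522, Zr = .122) from the text
z_mean, total_n, se = combine_subgroups([(407, 0.214), (522, 0.122)])
print(round(z_mean, 3), round(math.tanh(z_mean), 3), total_n)
```

With these inputs the weighted mean reproduces the Zr = .162 (r = .161) and total N = 929 reported above.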

It is also possible that some studies will report multiple effect sizes for multiple subgroups. In fact, the Rys and Bear (1997) study I described earlier actually reported effect sizes separately by measure of aggression and gender, so that the coded data consisted of correlations of peer-reported relational aggression with rejection for 132 boys (corrected r = .590, Zr = .678) and 134 girls (corrected r = .520, Zr = .577) and correlations of teacher-reported rela­tional aggression with rejection for these boys (corrected r = .270, Zr = .277) and girls (corrected r = .402, Zr = .427). In this type of situation, I suggest a two-step process in which you average effect sizes first within groups and then across groups (summing the sample size in the second round of averag­ing). For this example of the Rys and Bear (1997) study, I would first average the effect sizes from peer and teacher reports within the 132 boys (yielding Zr = .478), and then compute this same average within the 134 girls (yielding Zr = .502). I would then compute the weighted average of these effect sizes across boys and girls, which produces the Zr = .489 (and transformation to r = .454) shown in Table 8.1. You could also reverse the steps of this two- step process—in this example, first computing a weighted average effect size across gender for each of the two measures, and then averaging across the two measures (the order I took to produce the effect sizes described earlier)—to obtain the same results.
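The two-step process, and the claim that the order of the steps does not matter, can be checked with the four Zr values quoted above. This is a sketch under the assumption that subgroup weights are n – 3; the helper names are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

n = {"boys": 132, "girls": 134}
zr = {
    "boys": {"peer": 0.678, "teacher": 0.277},
    "girls": {"peer": 0.577, "teacher": 0.427},
}
weights = [n[g] - 3 for g in n]   # assumed weight for Zr within each subgroup

# Order A: average the two measures within each gender, then weight across genders
within_gender = [mean(list(zr[g].values())) for g in n]
order_a = weighted_mean(within_gender, weights)

# Order B: weight across genders for each measure, then average the two measures
per_measure = [weighted_mean([zr[g][m] for g in n], weights)
               for m in ("peer", "teacher")]
order_b = mean(per_measure)

print(round(order_a, 3), round(order_b, 3))
```

Both orders give the same value because each step is a linear average; the result agrees with the tabled Zr = .489 to within rounding of the inputs.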

3. Effect Sizes from Multiple Reports of the Same Study

A third potential source of nonindependence is when data from the same study are disseminated in multiple reports (e.g., multiple publications, a dis­sertation that is later published). It is important to keep in mind that when I refer to a single effect size per study, I mean one effect size per sample of participants. Therefore, the multiple reports that might arise from a single primary dataset should be treated as a single study. If the two reports pro­vide different effect size estimates (presumably due to analysis of different measures, rather than a miscalculation in one or the other report), then you should average these as I described earlier. If the two reports provide some overlapping effect size estimates (e.g., the two reports both provide the cor­relation between relational aggression and rejection; both reports provide a Time 1 correlation but the second report also contains the Time 2 correla­tion), these repetitive values should be omitted.

Unfortunately, the uncertainty that arises from this sort of multiple reporting is greater than I have described here. Often, it is unclear if authors of separate reports are using the same dataset. In this situation, I recommend comparing the descriptions of methods carefully and contacting the authors if you are still uncertain. Similarly, authors might report results that seem to come from the full sample in one report and only a subset in another. Here, I suggest selecting values from the full sample when effect sizes are identical. Having made these suggestions, I recognize that every meta-analyst is likely to come across unique situations. As with much of my previous advice on these difficult issues, I strongly suggest contacting the authors of the reports to obtain further information.

Categorical Moderators in Meta-Analysis

1. Evaluating the Significance of a Categorical Moderator

The logic of evaluating categorical moderators in meta-analysis parallels the use of ANOVA in primary data analysis. Whereas ANOVA partitions variability in scores across individuals (or other units of analysis) into variability existing between and within groups, categorical moderator analysis in meta-analysis partitions between-study heterogeneity into that between and within groups of studies (Hedges, 1982; Lipsey & Wilson, 2001, pp. 120-121). In other words, testing categorical moderators in meta-analysis involves comparing groups of studies classified by their status on some categorical moderator.

Given this logic of partitioning heterogeneity, it makes sense to start with the heterogeneity equation (Equation 8.6) from Chapter 8, reproduced here for convenience:

You might have noticed that I have changed the notation of this equation slightly, now giving the subscript “total” to this Q statistic. The reason for this subscript is to make it explicit that this is the total, overall heterogeneity among all effect sizes. The logic of testing categorical moderators is based on the ability to separate this total heterogeneity (Qtotal) into two components, the between-group heterogeneity (Qbetween) and the within-group heteroge­neity (Qwithin), such that:

The key question when evaluating categorical moderators is whether there is greater-than-expectable between-group heterogeneity. If there is, then this implies that the groups based on the categorical study character­istic differ and that the categorical moderator is therefore reliably related to effect sizes found in the studies. If the groups do not differ, then this implies that the categorical moderator is not related to effect sizes (or, in the language of null hypothesis significance testing, that you have failed to find evidence for this moderation).

The most straightforward way to compute the between-group heterogeneity (Qbetween) is to rearrange Equation 9.2, so that Qbetween = Qtotal – Qwithin. Because you have already computed the total heterogeneity (Qtotal; Equation 9.1), you only need to compute and subtract the within-group heterogeneity (Qwithin) to obtain the desired Qbetween. To compute the heterogeneity within each group, you apply a formula similar to that for total heterogeneity to just the studies in that group:

That is, you compute the heterogeneity within each group (g) using the same equation as for computing total heterogeneity, restricting the included studies to only those studies within group g. After computing the within-group heterogeneity (Qg) for each of the groups, you compute the overall within-group heterogeneity (Qwithin) simply by summing the heterogeneities (Qg) from all groups. More formally:
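The equations referred to in this passage did not survive in this copy. Based on the surrounding definitions, they can be reconstructed as below. The index notation (i for studies, g for groups, G for the number of groups) is an assumption, and the last expression carries no equation number because the text does not give one.

```latex
\begin{align}
Q_{total} &= \sum_i w_i ES_i^2 - \frac{\left(\sum_i w_i ES_i\right)^2}{\sum_i w_i}
  && \text{(Equation 9.1)} \\
Q_{total} &= Q_{between} + Q_{within}
  && \text{(Equation 9.2)} \\
Q_g &= \sum_{i \in g} w_i ES_i^2 - \frac{\left(\sum_{i \in g} w_i ES_i\right)^2}{\sum_{i \in g} w_i}
  && \text{(Equation 9.3)} \\
Q_{within} &= \sum_{g=1}^{G} Q_g
\end{align}
```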

As mentioned, after computing the total heterogeneity (Qtotal) and the within-group heterogeneity (Qwithin), you compute the between-group heterogeneity by subtracting the within-group heterogeneity from the total heterogeneity (i.e., Qbetween = Qtotal – Qwithin; see Equation 9.2). The statistical significance of this between-group heterogeneity is evaluated by considering the value of Qbetween relative to dfbetween, with dfbetween = G – 1. Under the null hypothesis, Qbetween is distributed as χ² with dfbetween, so you can consult a chi-square table (such as Table 8.2; or use functions such as Microsoft Excel's "chiinv" as described in footnote 6 of Chapter 8) to evaluate the statistical significance to make inferences about moderation.
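The partitioning logic can be sketched in a few lines. The data below are hypothetical (two invented groups of studies) and the function names are my own; only the computational form of Q and the subtraction Qbetween = Qtotal – Qwithin come from the text.

```python
def q_statistic(effect_sizes, weights):
    """Heterogeneity Q from the three sums: w, w*ES, and w*ES^2."""
    sum_w = sum(weights)
    sum_w_es = sum(w * es for w, es in zip(weights, effect_sizes))
    sum_w_es2 = sum(w * es ** 2 for w, es in zip(weights, effect_sizes))
    return sum_w_es2 - sum_w_es ** 2 / sum_w

def partition_heterogeneity(groups):
    """groups: dict mapping a group label to a list of (ES, w) tuples.

    Returns (Q_total, Q_within, Q_between, df_between).
    """
    all_studies = [s for studies in groups.values() for s in studies]
    q_total = q_statistic(*zip(*all_studies))
    q_within = sum(q_statistic(*zip(*studies)) for studies in groups.values())
    return q_total, q_within, q_total - q_within, len(groups) - 1

# Hypothetical effect sizes (Zr) and inverse-variance weights
groups = {
    "A": [(0.30, 50.0), (0.40, 80.0)],
    "B": [(0.10, 60.0), (0.20, 40.0), (0.15, 70.0)],
}
q_total, q_within, q_between, df_between = partition_heterogeneity(groups)
print(round(q_total, 3), round(q_within, 3), round(q_between, 3), df_between)
```

A group containing a single study contributes Q = 0 to Qwithin, which is why such groups add nothing to the within-group sum.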

To illustrate this test of categorical moderators, consider again the example meta-analysis of 22 studies reporting associations between children's and adolescents' relational aggression and rejection by peers. As shown in Chapter 8, these studies yield a mean effect size Zr = .387 (r = .368), but there was significant heterogeneity among these studies around this mean effect size, Q(21) = 291.17, p < .001. This result suggests the value of explaining the heterogeneity through moderator analysis, and I hypothesized that one source of this heterogeneity might be the use of different reporters to assess relational aggression. As shown in Table 9.1, these studies variously used observations, parent reports, peer reports, and teacher reports to assess relational aggression, and this test of moderation evaluates whether associations between relational aggression and rejection systematically differ across these four methods of assessing aggression.

I have arranged these 27 effect sizes (note that these come from 22 independent studies; I am using effect sizes involving different methods from the same study as separate effect sizes) into four groups based on the method of assessing aggression. To compute Qtotal, I use the three sums across all 27 effect sizes (shown at the bottom of Table 9.1) within Equation 9.1:

I then compute the heterogeneity within each of the groups using the sums from each group within Equation 9.3. For the three observational stud­ies, this within-group heterogeneity is

Using the same equation, I also compute within-group heterogeneities of Qwithin_parent = 0.00 (there is no heterogeneity in a group of one study), Qwithin_peer = 243.16, and Qwithin_teacher = 40.73. Summing these values yields Qwithin = 1.68 + 0.00 + 243.16 + 40.73 = 285.57. Given that Qbetween = Qtotal – Qwithin, the between-group heterogeneity is Qbetween = 350.71 – 285.57 = 65.14. This Qbetween is distributed as chi-square with df = G – 1 = 4 – 1 = 3 under the null hypothesis of no moderation (i.e., no larger-than-expected between-group differences). The value of Qbetween in this example is large enough (p < .001; see Table 8.2 or any chi-square table) that I can reject this null hypothesis and accept the alternate hypothesis that the groups differ in their effect sizes. In other words, moderator analysis of the effect sizes in Table 9.1 indicates that method of assessing aggression moderates the association between relational aggression and peer rejection.
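The arithmetic of this example is easy to verify. The sketch below uses only the values quoted in the paragraph above; the critical value 16.27 is the standard chi-square table entry for p = .001 with 3 df, and the variable names are hypothetical.

```python
q_total = 350.71
# The four within-group heterogeneities, in the order given in the text
q_within_by_group = [1.68, 0.00, 243.16, 40.73]

q_within = sum(q_within_by_group)
q_between = q_total - q_within
df_between = len(q_within_by_group) - 1      # G - 1

CHI2_CRIT_P001_DF3 = 16.27   # tabled critical value, p = .001, df = 3
significant = q_between > CHI2_CRIT_P001_DF3

print(round(q_within, 2), round(q_between, 2), df_between, significant)
```

Because Qbetween far exceeds the tabled critical value, the conclusion of p < .001 follows directly.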

2. Follow-Up Analyses to a Categorical Moderator

If you are evaluating a categorical moderator consisting of two levels (in other words, a dichotomous moderator variable), then interpretation is simple. Here, you just conclude whether the between-group heterogeneity is significant, then inspect the within-group mean effect sizes (i.e., weighted means computed using studies from each group separately). The decision and interpretation is then straightforward as to which group of studies yields stronger effect sizes.

The situation is more complex when the categorical moderator has three or more levels—that is, when the moderator test is an omnibus compari­son. Here, the significant between-group heterogeneity indicates that at least some groups differ from others, but exactly where those differences lie is unclear. This situation is akin to follow-up analyses conducted with a three or more level ANOVA, and decisions of how to handle these situations in meta-analysis are as thorny as they are for ANOVAs used in primary studies. However, the variety of possibilities that exist for ANOVA follow-up analyses have not been translated into a meta-analytic framework. Therefore, the two choices are between an overly liberal and an overly conservative approach.

2.1. The Liberal Approach

This approach is liberal in that one makes no attempt to control cumulative (a.k.a. family-wise) type I errors when following up a finding of significant between-group heterogeneity. Instead, you just perform a series of all possible two-group comparisons to identify which groups differ in the magnitudes of their effect sizes. To perform these comparisons, you would use the same logic described in the previous subsection for testing between-group heterogeneity, but would (1) restrict the calculation of total heterogeneity (Qtotal) to studies from the two groups, (2) sum the within-group heterogeneity (Qwithin) only from these two groups, and (3) evaluate the resultant between-group heterogeneity (Qbetween) as a 1 df χ² test (because G = 2 in this comparison, so dfbetween = 2 – 1 = 1). You would then repeat this two-group comparison for all possible combinations among the groups of the categorical moderator (the total number of comparisons is G(G – 1)/2).
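To make the bookkeeping concrete, the sketch below enumerates the two-group comparisons for the four assessment methods of the running example and gives a p value function for a 1 df test. The function name is hypothetical; the identity used (the upper tail of a 1 df chi-square equals erfc(√(Q/2))) is a standard result, not something taken from the text.

```python
import math
from itertools import combinations

def p_value_1df(q_between):
    """Upper-tail p value for a chi-square statistic with 1 df."""
    return math.erfc(math.sqrt(q_between / 2.0))

groups = ["observation", "parent", "peer", "teacher"]
pairs = list(combinations(groups, 2))        # all two-group comparisons

G = len(groups)
print(len(pairs), G * (G - 1) // 2)          # both counts should agree
for a, b in pairs:
    print(f"{a} vs. {b}")
```

Each pair would then get its own Qtotal, Qwithin, and Qbetween computed from only the studies in those two groups, with `p_value_1df` applied to the resulting Qbetween.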

This approach parallels Fisher's Least Significant Difference test in ANOVA (see e.g., Keppel, 1991, p. 171). Like this test in ANOVA, the obvious problem with using this approach in categorical moderator analyses in meta-analysis is that it allows for higher-than-desired rates of type I error in the follow-up comparisons (i.e., it does not control for cumulative, or family-wise, type I error). A second problem with this approach occurs when different groups have different effective sample sizes (i.e., many studies with large samples vs. few studies with small samples) or amounts of within-group heterogeneity. In these situations, this approach can yield surprising results, in which groups that appear to have quite different average effect sizes are not found to differ (because the groups have small effective sample sizes or large heterogeneity), whereas groups that seem to have more similar average effect sizes are found to differ (because the groups have large effective sample sizes or small heterogeneity).

2.2. The Conservative Approach

A conservative approach to multiple follow-up comparisons of a significant omnibus moderator result parallels the approach in ANOVA commonly called Bonferroni correction (a.k.a. Dunn test; see Keppel, 1991, p. 167). Using this approach, you make the same series of comparisons between all possible two-group combinations as in the liberal approach, but the resultant Qbetween values are evaluated using an adjusted level of statistical significance (i.e., some value smaller than the chosen type I error rate, e.g., α = .05). Specifically, you divide the desired type I error rate (e.g., α = .05) by the number of comparisons made (i.e., by G(G – 1)/2). This Bonferroni-adjusted level of significance (αB) is then used as the basis for making inferences about whether the between-group heterogeneity statistics (Qbetween) provide evidence to reject the null hypotheses (i.e., concluding that groups differ).
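The adjustment itself is a one-line calculation. The sketch below applies it to a moderator with four groups, as in the running example; the function name and the illustrative p values are hypothetical.

```python
def bonferroni_alpha(alpha, n_groups):
    """Return (adjusted alpha, number of two-group comparisons)."""
    n_comparisons = n_groups * (n_groups - 1) // 2
    return alpha / n_comparisons, n_comparisons

alpha_adj, n_comparisons = bonferroni_alpha(0.05, 4)
print(n_comparisons, round(alpha_adj, 4))

# Hypothetical p values from three of the pairwise Qbetween tests
example_p_values = {"A vs. B": 0.004, "A vs. C": 0.020, "B vs. C": 0.300}
for label, p in example_p_values.items():
    print(label, "differs" if p < alpha_adj else "no evidence of difference")
```

Note that a comparison with p = .020 would count as significant under the liberal approach at α = .05 but not under the adjusted level, which illustrates the loss of power discussed next.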

There are two limitations to this approach. First, like this approach used in ANOVAs in primary studies, it is overly conservative and leads to diminished statistical power (i.e., higher type II error rates). The extent to which this limitation is problematic will depend on the sample sizes and numbers of studies in the groups you wish to compare. If all groups of the categorical moderator contain a large number of studies with large sample sizes (i.e., there is high statistical power), then the cost of this overly conser­vative approach might be minimal. However, if even some of the groups have a small number of studies or small sample sizes, then the loss of statistical power is problematic. The second limitation of this conservative approach is similar to that of the liberal approach—that seemingly larger differences in group mean effect sizes might not be significantly different, whereas seem­ingly small differences are found to be different.

2.3. Conclusions Regarding Follow-Up Analyses

The choice between an overly liberal and an overly conservative approach is not an easy one to make. In weighing between these approaches, I sug­gest that you consider (1) the relative cost of type I (erroneously concluding differences) versus type II (failing to detect differences) errors, and (2) the expectable power of your meta-analysis (meta-analyses with many studies with large sample sizes tend to have high power). Alternatively, you might avoid this problem by specifying meaningful planned contrasts that can be evaluated within a regression framework (see below).

Continuous Moderators in Meta-Analysis

Continuous moderators in meta-analysis are coded study variables that can be considered to vary along a continuum of possible values. For example, mean characteristics of the sample (age, SES, percentage of ethnic minorities, percentage male or percentage female) or methodology (e.g., dose of a drug, number of therapy sessions in intervention) might be evaluated as continu­ous moderators. Just as the evaluation of categorical moderators relied on an adaptation of ANOVA, the evaluation of continuous moderators relies on an adaptation of regression. Specifically, test of continuous moderation involves (weighted) regression of the effect sizes (dependent variable) onto the con­tinuous moderator (independent variable, or predictor). Significant predic­tion indicates that the effect sizes vary in a linear manner with the continu­ous moderator; in other words, this moderator systematically relates to the association between X and Y.

The adaptation of standard regression of effect sizes onto a continuous predictor that is key to meta-analytic moderator analysis is the "weighted" I parenthetically stated. Here, the regression analysis is weighted by the inverse variance weight, w (see Chapter 8). This weighting has three implications. First, as is desirable (see Chapter 8), studies with more precise effect size estimates will be given more weight in the analysis than those with less precise estimates. Second, the sum of squares of the regression (standard output, often in an ANOVA table, of all standard statistical packages such as SPSS or SAS) represents the heterogeneity among the effect sizes that is accounted for by the linear prediction of the continuous moderator. You use this value to evaluate the statistical significance of the regression model. Third, this weighting impacts the standard errors of the regression coefficients. Although the regression coefficients themselves are accurate and directly interpretable (e.g., are effect sizes larger or smaller when values of the moderator are greater?), the standard errors of the regression coefficients are not correct and need to be hand calculated (which, fortunately, is simple).

Because this weighted regression approach to testing continuous moderators is most clearly illustrated through example, let me return to the sample meta-analysis of associations between relational aggression and peer rejection. As shown in Table 9.2, I coded the mean age (in years) of the samples for these 22 studies, and I want to evaluate whether age moderates the associations between relational aggression and rejection. To do so, I regress the effect sizes (Fisher's transformation of the correlation between relational aggression and rejection, Zr) onto the hypothesized continuous moderator age, using the familiar regression equation, Zr = B0 + B1(Age) + e, with w as a weight. To do this, I use a standard statistical software package such as SPSS or SAS. In SPSS, I would specify Zr as the dependent variable, age as the independent variable, and w as the WLS (weighted least squares) weight.
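For readers working outside SPSS or SAS, the weighted regression just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five data points are made up (they are not the values in Table 9.2), and the function and variable names are hypothetical. The sketch returns the same quantities a WLS routine would print, including the standard error as a program would report it, before any adjustment.

```python
import math

# Hypothetical toy data (NOT the 22 studies of Table 9.2).
zr = [0.55, 0.42, 0.38, 0.30, 0.25]     # Fisher's Zr per study
w = [40.0, 97.0, 150.0, 62.0, 210.0]    # inverse variance weights
age = [5.0, 8.0, 10.0, 13.0, 15.0]      # mean sample age in years


def weighted_regression(y, x, w):
    """Weighted least squares regression of y on a single predictor x."""
    sum_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sum_w   # weighted means
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx                       # slope
    b0 = y_bar - b1 * x_bar              # intercept
    ss_total = sum(wi * (yi - y_bar) ** 2 for wi, yi in zip(w, y))
    ss_residual = sum(wi * (yi - b0 - b1 * xi) ** 2
                      for wi, xi, yi in zip(w, x, y))
    ss_regression = ss_total - ss_residual
    ms_residual = ss_residual / (len(y) - 2)
    # Standard error of the slope as an ordinary WLS program would print it.
    se_program = math.sqrt(ms_residual / sxx)
    return {"b0": b0, "b1": b1, "sxx": sxx, "ss_total": ss_total,
            "ss_regression": ss_regression, "ss_residual": ss_residual,
            "ms_residual": ms_residual, "se_program": se_program}


result = weighted_regression(zr, age, w)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

The `ss_regression` entry plays the role of Qregression, and `se_program` is the value that still needs the hand adjustment discussed in this section.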

The results give six pieces of information of interest: from an ANOVA table, (1) the sum of squares of the regression model (SSregression or SSmodel) = 9.312; (2) the residual sum of squares (SSresidual or SSerror) = 281.983; and (3) the residual mean square (MSresidual or MSerror) = 14.099; and from a table of coefficients, (4) the unstandardized regression coefficient (B1) = -.0112 with (5) an associated standard error = .0138; and (6) the intercept (B0) = .496. The SSregression is the heterogeneity accounted for by the linear regression model; it is often reported in published meta-analyses as Qregression and is evaluated for statistical significance by comparing the value to a χ² distribution (Table 8.2 or using calculators such as Excel's "chiinv" function) with df = number of predictors (here, df = 1). In this example, the value of 9.312 is considered statistically significant by standard criteria (p = .0023), so I conclude that there is moderation of the association between relational aggression and rejection by age.

Because this analysis included only one predictor, the statistical significance of the model informs the statistical significance of the single predictor. However, when including multiple predictors (see next section), it is useful to also evaluate statistical significance by examining the regression coefficients and their standard errors. In this example, the unstandardized regression coefficient was -.0112, and its standard error, as computed by the statistical analysis program, was .0138. However, this standard error is inaccurate and must be adjusted. This adjustment is to divide the standard error from the output by the square root of the residual mean square:

$$SE_{adjusted} = \frac{SE_{output}}{\sqrt{MS_{residual}}} = \frac{.0138}{\sqrt{14.099}} = .00368$$

I then evaluate the statistical significance of this predictor by dividing the regression coefficient (B1) by this adjusted standard error, Z = -.0112/.00368 = -3.05, considering this Z value according to the standard normal deviate (i.e., Z) distribution to yield a two-tailed p (here, p = .0023). Note that in this example with a single predictor, the statistical significance of the regression model and of the single regression coefficient are identical, given that Z² = χ²(df = 1) (i.e., (-3.05)² ≈ 9.31).
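The arithmetic of this adjustment and the resulting significance tests can be checked with a short Python sketch using only the values reported above. The variable names are illustrative assumptions. Because the code carries unrounded intermediate values, its results can differ from the hand-rounded figures in the third decimal place.

```python
import math

# Values reported in the regression output described above.
b1 = -0.0112           # unstandardized regression coefficient for age
se_output = 0.0138     # standard error as printed by the program
ms_residual = 14.099   # residual mean square from the ANOVA table
q_regression = 9.312   # SSregression, evaluated with df = 1

# Adjust the standard error, then form the Z test.
se_adjusted = se_output / math.sqrt(ms_residual)
z = b1 / se_adjusted

# Two-tailed p from the standard normal distribution.
p_from_z = math.erfc(abs(z) / math.sqrt(2))

# Upper-tail p for a chi-square with df = 1 (this shortcut holds only for
# 1 df, where the chi-square is the square of a standard normal deviate).
p_from_q = math.erfc(math.sqrt(q_regression / 2))

print(f"adjusted SE = {se_adjusted:.5f}")
print(f"Z = {z:.2f}, two-tailed p = {p_from_z:.4f}")
print(f"Q = {q_regression}, p = {p_from_q:.4f}")
```

Both p values round to .0023, which illustrates the equivalence of the model test and the coefficient test in the single-predictor case.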

To interpret this moderation, it is useful to compute implied effect sizes at different levels of the continuous moderator. Given the intercept (B0 = .496) and regression coefficient of age (B1 = -.0112), I can compute the predicted effect sizes at various ages using the equation Zr = B0 + B1(Age) = .496 - .0112(Age). For illustration of this moderation, I would choose representative values of the moderator (age) that fall within the range observed among these studies and make some conceptual sense; in this example, I might choose the ages of 5, 10, and 15 years. I then successively insert these values for age into the prediction equation, yielding implied Zrs = .440, .384, and .328, respectively. I then back-transform these implied Zrs (or any other transformed effect sizes) into their meaningful metric for reporting: implied rs = .41, .37, and .32, respectively.
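The implied effect sizes can be reproduced with the following minimal Python sketch, which uses the intercept of .496 from the regression output and the reported slope. The function name is a hypothetical choice; the back-transformation from Zr to r is the hyperbolic tangent, the inverse of Fisher's transformation.

```python
import math

B0 = 0.496     # intercept from the weighted regression output
B1 = -0.0112   # regression coefficient for age


def implied_effects(ages):
    """Return (age, implied Zr, implied r) for each age supplied."""
    rows = []
    for age in ages:
        zr = B0 + B1 * age     # prediction equation
        r = math.tanh(zr)      # back-transform Fisher's Zr to r
        rows.append((age, round(zr, 3), round(r, 2)))
    return rows


for age, zr, r in implied_effects([5, 10, 15]):
    print(f"age {age:>2}: Zr = {zr:.3f}, r = {r:.2f}")
```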

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

A General Multiple Regression Framework for Moderation in Meta-Analysis

After considering the regression approach to analyzing continuous moderators (previous section), you are probably wondering whether this approach allows for evaluation of multiple moderators; it does. However, before considering inclusion of multiple moderators, I think it is useful to take a step back to consider how a regression approach can serve as a general approach to evaluating moderators in meta-analysis (in this context, the analyses are sometimes referred to as meta-regression). In this section, I describe how an empty (intercept-only) model accomplishes basic tests of mean effect size and heterogeneity (9.3.1), how you can evaluate categorical moderators in this framework through the use of dummy codes (9.3.2), and how this flexible approach can be used to consider unique moderation of a wide range of coded study characteristics (9.3.3). I will then draw general conclusions about this framework and suggest some more complex possibilities. I write this section with the assumption that you have a solid grounding in multiple regression; if not, you can read this section trying to obtain the "gist" of the ideas (for thorough instruction in multiple regression, see Cohen et al., 2003).

1. The Empty Model for Computing Average Effect Size and Heterogeneity

An empty model in regression is one in which the dependent variable is regressed against no predictors, but only a constant (i.e., the value of 1 for all cases). This is represented in the following equation, which includes only an intercept (constant) as a predictor:

$$Z_r = B_0(\text{constant}) + e$$

Performing a weighted regression of effect sizes predicted only by a constant will yield information about the weighted mean effect size and the heterogeneity, and therefore might serve as a useful initial analysis that is less tedious than the hand or spreadsheet calculations I described in Chapter 8. Considering the example of 22 studies of relational aggression and peer rejection summarized in Table 9.2, I perform the following steps: First, I place the effect sizes (Zrs) and inverse variance weights (w) into a statistical software package (e.g., SPSS or SAS). Second, I create a variable in which every study has the value 1 (the constant). Finally, I regress effect sizes onto this constant, weighted by w, specifying no intercept (i.e., having the program not automatically include the constant in the model, as I am using the constant as a predictor). The unstandardized regression coefficient is .387, which represents the mean effect size (as Zr). The standard error of this regression coefficient from the program is adjusted as described above,

$$SE_{adjusted} = \frac{SE_{output}}{\sqrt{MS_{residual}}},$$

to yield the standard error of this mean effect size for use in significance testing or estimating confidence intervals. Finally, the residual sum of squares (SSresidual or SSerror) = 291.17 is the heterogeneity (Q) statistic, evaluated as χ² with 21 (number of studies - 1) degrees of freedom. These results are identical to those reported in Chapter 8 and illustrate how the empty model can be used to compute the mean effect size, make inferences about this mean, and evaluate the heterogeneity of effect sizes across studies.
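To make the logic of the empty model concrete, here is a minimal Python sketch, assuming made-up effect sizes and weights rather than the 22 studies of Table 9.2 (all names are illustrative). Regressing on a constant with weights reduces to a weighted mean, the residual sum of squares is Q, and the adjusted standard error works out to one over the square root of the summed weights.

```python
import math

# Hypothetical toy data (NOT the values in Table 9.2).
zr = [0.55, 0.42, 0.38, 0.30, 0.25]     # Fisher's Zr per study
w = [40.0, 97.0, 150.0, 62.0, 210.0]    # inverse variance weights


def empty_model(y, w):
    """Weighted regression of y on a constant only (no other predictors)."""
    k = len(y)
    sum_w = sum(w)
    # The coefficient for the constant is the weighted mean effect size.
    mean_es = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    # The residual sum of squares is the heterogeneity statistic Q.
    q = sum(wi * (yi - mean_es) ** 2 for wi, yi in zip(w, y))
    df = k - 1
    ms_residual = q / df
    se_program = math.sqrt(ms_residual / sum_w)         # as a program prints it
    se_adjusted = se_program / math.sqrt(ms_residual)   # hand adjustment
    return mean_es, se_adjusted, q, df


mean_es, se_adjusted, q, df = empty_model(zr, w)
print(f"mean Zr = {mean_es:.3f}, adjusted SE = {se_adjusted:.4f}")
print(f"Q = {q:.3f} with df = {df}")
```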

2. Evaluating Categorical Moderators

In primary data analysis, it has long been recognized that ANOVA is simply a special case of multiple regression (e.g., Cohen, 1968). The same comparability applies to meta-analytic evaluation of categorical moderators. As with translation of ANOVA into multiple regression in primary analysis, the "trick" is to create a series of dichotomous variables that fully capture the different groups. The most common approach is through the use of dummy variables.

To illustrate the use of dummy variables in analyzing categorical moderators, consider the data from Table 9.3, which consist of 27 effect sizes (from 22 studies, as in Table 9.2) using four methods of measuring relational aggression (previously summarized in Table 9.1). As with the ANOVA approach, I want to evaluate whether the method of assessing aggression moderates the associations between relational aggression and rejection. To perform this same evaluation in a regression framework, I need to compute three dummy codes (number of groups minus 1) to represent group membership. If I selected observational methods as my reference group, then I would assign the value 0 for all three dummy codes for studies using observational methods. I could make the first dummy code represent parent report (vs. observation) and assign values of 1 to this variable for all studies using this method and values of 0 for all studies that do not. Similarly, I could make dummy variable 2 represent peer report and dummy variable 3 represent teacher report. These dummy codes are displayed in Table 9.3 as DV1, DV2, and DV3 (for now, ignore the column labeled "Age" and everything to the right of it; I will use these data below).

To evaluate moderation by reporter within a multiple regression framework, I regress effect sizes onto the dummy variables representing group membership (in this case, three dummy variables), weighted by the inverse variance weight, w. This is expressed in the following equation:

$$Z_r = B_0 + B_1(DV1) + B_2(DV2) + B_3(DV3) + e$$

Using a statistical software package (e.g., SPSS or SAS), I regress the effect sizes (Zrs) onto the three dummy variables (DV1, DV2, and DV3), weighted by the inverse variance weight (w) (here, I am requesting that the program include the constant in the model because I have not used the constant as a predictor). The output from the ANOVA table of this regression parallels the results from the ANOVA I described in Section 9.1: The total sum of squares (SStotal) provides the total heterogeneity (Qtotal) = 350.71; the residual or error sum of squares (SSresidual or SSerror) provides the within-group heterogeneity (Qwithin) = 285.57; and the regression or model sum of squares (SSregression or SSmodel) provides the between-group heterogeneity (Qbetween) = 65.14. This last value is compared to a χ² distribution (e.g., Table 8.2) to evaluate whether the categorical moderator is significant. This regression analysis also yields coefficients and their (incorrect) standard errors. If I adjust these standard errors, I can evaluate the statistical significance of the regression coefficients as indicative of whether each group differs from the reference group. To illustrate: In this example, in which I coded observational methods as the reference group, I could consider the regression coefficient of DV2 (denoting use of peer reports) = .301 by dividing it by the corrected standard error,

$$SE_{adjusted} = \frac{SE_{output}}{\sqrt{MS_{residual}}} = .0597,$$

to yield Z = .301 / .0597 = 5.05. I would thus conclude that studies using peer-report methods yield larger effect sizes than studies using observational methods. More generally, I could compute the implied values of each of the four methods via the prediction equation composed of the intercept and regression coefficients for the dummy variables:

$$Z_r = B_0 + B_1(DV1) + B_2(DV2) + B_3(DV3) = .097 + .486(DV1) + .301(DV2) + B_3(DV3)$$

Because I used observational methods as my reference group, the implied mean effect size for this group is Zr = .097. For studies using parent reports (the first dummy variable), the implied effect size is Zr = .583 (.097 + .486); for studies using peer reports (the second dummy variable), the implied effect size is Zr = .398 (.097 + .301); and so on. When using transformed effect sizes such as Zr, you should transform these implied values back to the more intuitive metric (e.g., r) for reporting.
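The dummy coding and the implied group values can be sketched in Python as follows. This is an illustration only: the method labels and function names are assumptions, and the coefficient for teacher reports (DV3) is left out because it is not reported in this passage, so the sketch raises an error for that group instead of guessing a value.

```python
import math

# Order of the non-reference groups matches DV1, DV2, DV3 in the text.
METHODS = ["observation", "parent", "peer", "teacher"]
REFERENCE = "observation"

# Values reported in the text; the DV3 (teacher) coefficient is not given.
INTERCEPT = 0.097
COEFFICIENTS = {"parent": 0.486, "peer": 0.301}


def dummy_codes(method):
    """Return [DV1, DV2, DV3] for one study's measurement method."""
    return [1 if method == m else 0 for m in METHODS if m != REFERENCE]


def implied_zr(method):
    """Implied mean Zr for a method; KeyError if its coefficient is unknown."""
    if method == REFERENCE:
        return INTERCEPT
    return INTERCEPT + COEFFICIENTS[method]


for method in ("observation", "parent", "peer"):
    zr = implied_zr(method)
    print(f"{method:<12} codes={dummy_codes(method)} "
          f"Zr={zr:.3f} r={math.tanh(zr):.2f}")
```

Back-transforming with `math.tanh` gives the implied values in the r metric for reporting.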

As in primary analysis (see Cohen & Cohen, 1983), dummy variables represent just one of several options for coding group membership in meta-analytic tests of categorical moderators. Dummy variables have the advantage of explicitly comparing all groups to a reference group, which might be of central interest in some analyses. However, dummy variables have the disadvantages of not allowing for easy comparisons between groups that are not the reference group (e.g., between peer and teacher reports in the example just presented) and of not being centered around 0 (a consideration I describe below). Effects coding (see Cohen & Cohen, 1983, p. 198) still relies on a reference group, but centers the independent variables. For example, effects coding for four groups would use three effects codes, which might be -A, -A, and -A for the reference group; A, 0, and 0 for the second group; 0, A, and 0 for the third group; and 0, 0, and A for the fourth group. Another alternative is contrast coding (Cohen & Cohen, 1983, p. 204), which allows for flexibility in creating specific planned comparisons among subsets of groups.

3. Evaluating Multiple Moderators

Having considered the regression framework for analyzing mean effect sizes, categorical moderators, and a single continuous moderator, you have likely inferred that this multiple regression approach can be used to evaluate multiple moderators. Doing so is no more complex than entering multiple categorical (represented with one or more dummy variables, effects codes, or contrast codes) or continuous predictors in this meta-analytic multiple regression.

However, one important consideration is that of centering (i.e., subtracting the mean value of a predictor from the values of this predictor). Although the statistical significance of either the overall model or individual predictors will not be influenced by whether or not you center, centering does offer two advantages. First, it permits more intuitive interpretation of the intercept as the mean effect size across studies. Second, it removes nonessential collinearity when evaluating interaction or power polynomial terms. To appropriately center predictors for this type of regression, you perform two steps. First, you compute the weighted (by inverse variance weights, w) average value of each predictor. Second, you compute a centered predictor variable by subtracting this weighted mean from scores on the original (uncentered) variable for each study. This process works for either continuous variables or dichotomous variables (this method of centering dummy variables converts them to effects codes).
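The two centering steps can be written compactly in Python. The sketch below is a minimal illustration with made-up values (not those of Table 9.3) and a hypothetical function name; it applies the same routine to a continuous predictor and to a dummy variable.

```python
def weighted_center(values, weights):
    """Center a predictor around its inverse-variance-weighted mean."""
    # Step 1: weighted average of the predictor.
    weighted_mean = (sum(wi * vi for wi, vi in zip(weights, values))
                     / sum(weights))
    # Step 2: subtract that weighted mean from every study's score.
    centered = [vi - weighted_mean for vi in values]
    return weighted_mean, centered


# Hypothetical toy data (NOT the values in Table 9.3).
w = [40.0, 97.0, 150.0, 62.0, 210.0]    # inverse variance weights
age = [5.0, 8.0, 10.0, 13.0, 15.0]      # continuous predictor
dv1 = [1, 0, 0, 1, 0]                   # a dummy-coded predictor

mean_age, c_age = weighted_center(age, w)
mean_dv1, ec1 = weighted_center(dv1, w)

print(f"weighted mean age = {mean_age:.3f}")
print("centered age:", [round(v, 3) for v in c_age])
print(f"weighted mean of DV1 = {mean_dv1:.4f}")
print("centered DV1:", [round(v, 4) for v in ec1])
```

After centering, the weighted mean of each new variable is zero, which is what allows the intercept to be read as the mean effect size.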

To illustrate centering and evaluation of multiple moderators, I turn again to the example meta-analysis of the associations between relational aggression and rejection. In this illustration, I want to evaluate moderation both by method of measuring aggression and by age. Specifically, I want to evaluate whether either uniquely moderates these effect sizes (controlling for any overlap between method and age that may exist among these studies; see Section 9.4). Table 9.3 displays these 27 effect sizes (from 22 studies), as well as values for the two predictors: three dummy variables denoting the four categorical levels of method, and the continuous variable age. To create the centered variables, I first computed the weighted mean for each of the three dummy variables and age; these values were .0265, .8206, .1155, and 9.533, respectively. I then subtracted these values from scores on each of these four variables, resulting in the four centered variables shown on the right side of Table 9.3. I have labeled the three centered dummy codes as effects codes (EC1, EC2, and EC3), and the centered age variable "C_Age."

When I then regress the effect size (Zr) onto these four predictors (EC1, EC2, EC3, and C_Age), weighted by w, I obtain SSregression = 93.46. Evaluating this amount of heterogeneity explained by the model (Qregression) as a χ² with 4 df (df = number of predictors), I conclude that this model explains a significant amount of heterogeneity in these effect sizes. Further, each of the four regression coefficients is statistically significant: EC1 = .581 (adjusted SE = .092, Z = 6.29, p < .001), EC2 = .415 (adjusted SE = .063, Z = 6.54, p < .001), EC3 = .152 (adjusted SE = .068, Z = 2.23, p < .05), and centered Age = -.020 (adjusted SE = .004, Z = -5.32, p < .001). Inspection of the regression coefficient (with corrected standard errors) allows me to evaluate whether age is a significant unique moderator (i.e., above and beyond moderation by method), but I cannot directly evaluate the unique moderation of method beyond age because this categorical variable is represented with three effects codes (though in this example the answer is obvious, given that each effects code is statistically significant). To evaluate the unique prediction by this categorical variable (or any other multiple variable block), I can perform a hierarchical (weighted) multiple regression in which centered age is entered at step 1 and the three effects codes are entered at step 2. Running this analysis yields SSregression = 3.56 at step 1 and SSregression = 93.46 at step 2. I conclude that the unique heterogeneity predicted by the set of effects codes representing the categorical method moderator is significant (Q(df = 3) = 93.46 - 3.56 = 89.90, p < .001). I could similarly re-analyze these data with the three effects codes at step 1 and centered age at step 2 to evaluate the unique prediction by age. This is equivalent to inspecting the regression weight relative to its corrected standard error in the final model (as is the case for the unique moderation of any single variable predictor).

Two additional findings from this weighted multiple regression analysis merit attention. First, the intercept estimate (B0) is .368, with a corrected standard error of .0113; these values are identical to those obtained by fitting an empty model to these 27 effect sizes. This means that I can still interpret the mean effect size and its statistical significance and confidence intervals within the moderator analysis, demonstrating the value of centering these predictors. Second, the residual sum of squares should be noted (SSresidual or SSerror = 257.24), as this represents the heterogeneity among effect sizes left unexplained by this model (Qresidual, which can be evaluated for statistical significance according to a chi-square distribution with df = k - number of predictors - 1). As I elaborate below, the size of this residual, or unexplained, heterogeneity is one consideration in evaluating the adequacy of the moderation model.
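The chi-square evaluations in this multiple-moderator example can be checked without a printed table. The Python sketch below is an illustration, not part of the original analysis: `chi2_upper_tail` is a hypothetical helper that uses the closed-form upper-tail probability available for integer degrees of freedom. It takes the model sum of squares as 93.46 (the value reported for the four-predictor model) and the unique contribution of method as a 3 df test, one df per effects code.

```python
import math


def chi2_upper_tail(q, df):
    """Upper-tail probability of a chi-square statistic, integer df >= 1."""
    t = q / 2.0
    if df % 2 == 0:
        a, p = 1.0, math.exp(-t)              # exact result for df = 2
    else:
        a, p = 0.5, math.erfc(math.sqrt(t))   # exact result for df = 1
    # Each pass of the recurrence adds two degrees of freedom.
    while a < df / 2.0:
        p += t ** a * math.exp(-t) / math.gamma(a + 1.0)
        a += 1.0
    return p


k = 27                      # number of effect sizes
q_model = 93.46             # SSregression, all four predictors
q_step1 = 3.56              # SSregression, centered age only
q_method = q_model - q_step1    # unique heterogeneity explained by method
q_residual = 257.24         # SSresidual of the full model

tests = [
    ("Q model", q_model, 4),
    ("Q method (unique)", q_method, 3),
    ("Q residual", q_residual, k - 4 - 1),
]
for label, q, df in tests:
    print(f"{label}: Q({df}) = {q:.2f}, p = {chi2_upper_tail(q, df):.3g}")
```

All three statistics are significant at p < .001, so the model explains reliable heterogeneity while leaving a significant amount unexplained.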

4. Conclusions and Extensions of Multiple Regression Models

As I hope is clear, this weighted multiple regression framework for analyzing moderators in meta-analysis is a flexible approach. Extending from an empty model in which mean effect sizes and heterogeneity are estimated, this framework can accommodate any combination of multiple categorical or continuous moderator variables as predictors. This general approach also allows for the evaluation of more complex moderation hypotheses. For example, one can test interactive combinations of moderators by creating product terms. Similarly, one can evaluate nonlinear moderation by the creation of power polynomial terms. These possibilities represent just a sample of many that are conceivable (if conceptually warranted) within this regression framework.
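As a brief illustration of these extensions, product and power polynomial terms are simply new columns built from predictors that have already been centered. The values and names in this Python sketch are hypothetical; the resulting columns would be entered as additional predictors in the weighted regression.

```python
def product_term(a, b):
    """Interaction predictor: elementwise product of two centered predictors."""
    return [x * y for x, y in zip(a, b)]


def power_term(a, power):
    """Polynomial predictor: a centered predictor raised to a power."""
    return [x ** power for x in a]


# Hypothetical, already-centered predictors for five studies.
c_age = [-4.0, -1.5, 0.0, 2.0, 3.5]
ec1 = [0.8, -0.2, -0.2, 0.8, -0.2]

age_by_method = product_term(c_age, ec1)   # tests an age x method interaction
age_squared = power_term(c_age, 2)         # tests quadratic moderation by age

print("age x EC1:", [round(v, 2) for v in age_by_method])
print("age squared:", age_squared)
```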
