An Alternative SEM Approach for Moderation in Meta-Analysis

Cheung (2008) described an approach to meta-analysis within an SEM framework that can be used for moderator analyses as described in this chapter, as well as for estimating fixed-effects means as described in Chapter 8 and more complex models (random- and mixed-effects models) described in Chapter 10. You should be aware that this is not SEM in the sense of multivariate, latent variable analyses (such as described in Chapter 12); instead, it uses the flexibility of the SEM approach and software (e.g., the ability to place model constraints) to fit meta-analytic models of a single effect size with coded study characteristics as predictors. In the context of moderator analyses, this approach also has an advantage over the regression approach I described earlier: it can use the advanced missing data methods of SEM when some studies do not report values for the characteristics you wish to evaluate as moderators.

Although this alternative SEM approach is flexible, it does require an understanding of SEM as well as the use of specialized software. Given this restriction, I will write this section with the assumption that you are familiar with SEM (if you are not, I recommend Kline, 2010, as an accessible introduction). Next, I describe the data transformation central to this approach, how this model can be used to estimate (fixed-effects) mean effect sizes (Chapter 8), and how this model can be used for moderator analyses. I consider this model again in Chapter 10 when I describe how it can be used for random- and mixed-effects models.

1. Transformations to Produce Equal Errors across Studies

As you recall, different studies in a meta-analysis are believed to have different sampling variances (i.e., squared standard errors), which provide the basis for differentially weighting the studies (see Chapter 8). The initial “key” to this SEM approach to meta-analysis is to rescale effect sizes and their predictors for each study so that the studies have equal sampling errors. This allows you to treat each study as an equally weighted case in the analyses because the weighting is accounted for by a transformation of study effect sizes and their predictors (i.e., study characteristics). This transformation factor is the square root of the weight you would normally use for a fixed-effects analysis (i.e., $w_j = 1 / SE_j^2$). You apply this transformation factor by multiplying it by the effect sizes and predictors (including the intercept) (Cheung, 2008, p. 186):

$$ES_j^* = \sqrt{w_j}\,ES_j, \qquad X_j^* = \sqrt{w_j}\,X_j \tag{9.7}$$

Once these transformed effect sizes and predictors are created, the anal­yses within an SEM context do not require additional weighting, so each study is treated as an equally weighted case (to be clear, studies are still dif­ferentially weighted, but this occurs in the transformation rather than in the analyses). Next, I describe and illustrate how this approach can be used to estimate (fixed-effects) mean effect sizes and to evaluate moderators. This presentation follows closely that of Cheung (2008), but I use the example meta-analysis of relational aggression and peer rejection to illustrate these analyses.
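The transformation can be sketched in a few lines of Python. This is only an illustration of multiplying by the square root of the weight; the function name `transform_study` is hypothetical, and the standard error of the first study is back-computed here from its reported weight of 208.12.

```python
import math

def transform_study(effect_size, standard_error, predictors):
    """Rescale one study's effect size and predictors by sqrt(weight).

    `predictors` should already include the intercept (a constant 1.0).
    """
    weight = 1.0 / standard_error ** 2      # fixed-effects weight w_j
    factor = math.sqrt(weight)              # transformation factor sqrt(w_j)
    return factor * effect_size, [factor * x for x in predictors]

# First study in the text: Zr = .583 and w = 208.12, so SE = 1 / sqrt(208.12).
se_first = 1.0 / math.sqrt(208.12)
es_star, x_star = transform_study(0.583, se_first, [1.0])
print(round(es_star, 3), round(x_star[0], 3))
```

For this study the sketch gives a transformed effect size of about 8.411 and a transformed intercept of about 14.426.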

2. Estimating Mean Effect Sizes

Although you already know how to estimate mean effect sizes (Chapter 8), it is useful to revisit these issues within this SEM approach. To evaluate a mean effect size, a model is fit in which the transformed effect size ($ES^*$) is regressed onto the transformed intercept ($X_0^*$). The intercept is just a constant 1.0 (literally, a variable with the value of 1 for each study) that is then transformed using Equation 9.7. Although the model is simple and could otherwise be performed using traditional software for regression, there are two important constraints you need to place on this model that require SEM software: (1) you fix the variance of $ES^*$ to 1.0, and (2) you fix the indicator intercept of $ES^*$ to 0. Given these constraints, the mean effect size is represented as the regression coefficient from the transformed intercept.

I demonstrate this SEM representation by estimating the mean of the relational aggression with rejection association among the 22 studies shown in Table 9.2. To illustrate the computations of Equation 9.7, consider the first study (Blachman, 2003), which had an effect size $Z_r = .583$ and weight $w = 208.12$. Using Equation 9.7, I find that the transformed effect size, $Z_r^*$, is equal to $.583 \times \sqrt{208.12} = 8.411$. The predictor in this model is the transformed intercept, that is, the constant 1.0 multiplied by the same factor: $1.0 \times \sqrt{208.12} = 14.426$. I also apply these transformations of effect sizes and intercept to the other 21 studies in Table 9.2.

The path diagram representing this analysis, as well as Mplus syntax, is shown in Figure 9.1. From this figure, you see that the transformed effect size (Fisher’s $Z_r$, subjected to the transformation of Equation 9.7 to obtain $Z_r^*$) is regressed onto the transformed intercept (the constant 1.0 transformed with Equation 9.7 to obtain $X_0^*$). The regression coefficient ($b_0$) in this example is estimated to be .386, which is identical (within rounding error) to the mean $Z_r$ from these studies using the methods I described in Chapter 8. The standard error of this estimate is .012, which is also identical to the standard error of the mean effect size computed in Chapter 8. Therefore, the statistical significance (Z = .386/.012 = 32.68, p < .001) is also identical (within rounding) to the previously obtained results. In short, this approach yields values identical to those obtained with the methods described in Chapter 8.
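The equivalence between this constrained regression and the weighted mean of Chapter 8 can be checked numerically. The sketch below is not the Mplus analysis of Figure 9.1; it uses plain Python and a small set of hypothetical effect sizes and standard errors (not the Table 9.2 data) to show that a regression through the origin on the transformed variables, with residual variance fixed at 1.0, reproduces the fixed-effects mean and its standard error.

```python
import math

# Hypothetical Fisher's Zr values and standard errors (not the Table 9.2 data).
effect_sizes = [0.58, 0.10, 0.45, 0.30]
standard_errors = [0.07, 0.05, 0.10, 0.04]

weights = [1.0 / se ** 2 for se in standard_errors]
es_star = [math.sqrt(w) * es for w, es in zip(weights, effect_sizes)]
x0_star = [math.sqrt(w) for w in weights]   # transformed intercept (1.0 * sqrt(w))

# Regression through the origin of ES* on X0*: b0 = sum(X0* ES*) / sum(X0*^2).
sum_x_sq = sum(x * x for x in x0_star)
b0 = sum(x * y for x, y in zip(x0_star, es_star)) / sum_x_sq
# With the residual variance fixed at 1.0, SE(b0) = 1 / sqrt(sum(X0*^2)).
se_b0 = 1.0 / math.sqrt(sum_x_sq)

# Conventional fixed-effects weighted mean and standard error for comparison.
weighted_mean = sum(w * es for w, es in zip(weights, effect_sizes)) / sum(weights)
se_mean = math.sqrt(1.0 / sum(weights))

print(b0, weighted_mean)
print(se_b0, se_mean)
```

Each printed pair should agree to floating-point precision, because the squared transformed intercept is simply the weight.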

3. Evaluating Moderators

From here, it is straightforward to add predictors to evaluate (categorical or continuous) moderators of this effect size. You simply make the same trans­formation described in Equation 9.7 (i.e., multiplying by the square root of the weight) to these predictors, and then add them to the predictive model.

I illustrate this analysis using the meta-analysis summarized in Table 9.3, in which I want to evaluate age (a continuous variable) and method of measuring aggression (three dummy coded variables) as potential modera­tors of the relational aggression with peer rejection association. As I did ear­lier using the multiple regression approach, I center these variables to assist interpretation.

Considering the first study, the effect size and intercept are transformed as already described. For this study, the centered values of the moderators (i.e., the values of Table 9.3 minus their means) are C_Age = -0.33 (= 9.2 – 9.53), EC1 = 0.97 (= 1 – .03), EC2 = -0.82, and EC3 = -0.12. When these four predictors are transformed (Equation 9.7) by multiplying by the square root of the study weight, the transformed values are C_Age* = -4.76, EC1* = 13.99, EC2* = -11.83, and EC3* = -1.73.

Figure 9.2 shows the path diagram and Mplus script for adding centered age and the three effects codes representing measurement type, to evaluate these coded study characteristics as moderators of the association between relational aggression and rejection. You evaluate moderator effects by inspecting the regression coefficients of the transformed moderators predicting transformed effect size. In this example, as when performed within a regression context, each of the three effects codes (EC1: $b_1$ = 0.582, SE = .092, Z = 6.30, p < .001; EC2: $b_2$ = 0.415, SE = .063, Z = 6.55, p < .001; EC3: $b_3$ = 0.152, SE = .068, Z = 2.24, p < .05) as well as centered age ($b_4$ = -0.020, SE = .004, Z = 5.33, p < .001) were significant moderators. Further, the intercept was significant and represents the overall mean $Z_r$ ($b_0$ = 0.370, SE = .011, Z = 32.78, p < .001). All of these values are identical (within rounding error) to those found through regression analyses. However, the key advantage of this SEM approach is that it could have accommodated all studies even if some had missing values for the study characteristics age or method of assessing aggression.
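As a rough check on the logic of adding transformed moderators, the following NumPy sketch fits the transformed model by ordinary least squares and takes standard errors from $(X^{*\prime} X^*)^{-1}$, which corresponds to fixing the residual variance at 1.0. The data are hypothetical (six studies, centered age, and a single effects code) rather than the values of Table 9.3, and the sketch handles complete data only, so it does not show the missing data advantage of the SEM software.

```python
import numpy as np

# Hypothetical data: effect sizes, standard errors, centered age, one effects code.
es = np.array([0.58, 0.10, 0.45, 0.30, 0.25, 0.40])
se = np.array([0.07, 0.05, 0.10, 0.04, 0.08, 0.06])
c_age = np.array([-0.3, 2.1, -1.5, 0.4, 3.0, -3.7])
ec1 = np.array([1.0, 0.0, 1.0, 0.0, -1.0, -1.0])

w = 1.0 / se ** 2
root_w = np.sqrt(w)

# Transform the outcome and every predictor, including the intercept column.
X = np.column_stack([np.ones_like(es), c_age, ec1])
X_star = X * root_w[:, None]
es_star = es * root_w

# Ordinary least squares on the transformed data (no additional intercept).
b, *_ = np.linalg.lstsq(X_star, es_star, rcond=None)
# Residual variance fixed at 1.0, so Var(b) = (X*' X*)^-1.
se_b = np.sqrt(np.diag(np.linalg.inv(X_star.T @ X_star)))
z = b / se_b

for name, coef, s, zval in zip(["intercept", "c_age", "ec1"], b, se_b, z):
    print(f"{name:10s} b = {coef: .4f}  SE = {s:.4f}  Z = {zval: .2f}")
```

The coefficients match those of a weighted least squares regression with weights $w_j$, which is why the SEM and regression approaches agree when no moderator values are missing.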

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

Practical Matters: The Limits of Interpreting Moderators in Meta-Analysis

Notwithstanding the considerable flexibility of a regression framework and the SEM approach for moderator analysis in meta-analysis, you should con­sider three potential limits when drawing conclusions from moderator analy­ses.

1. Empirically Confounded Moderators

Just as you want to avoid highly correlated predictors in a multiple regression analysis of primary data, it is important to ensure that the moderator vari­ables (i.e., predictors) are not too highly correlated in meta-analysis. If they are, then two problems can emerge. First, it might be difficult to detect the unique association of a moderator above and beyond the other highly corre­lated moderators. Second, if they are extremely highly correlated, you can get inaccurate regression estimates that have large standard errors (the so-called bouncing beta problem).

Fortunately, it is easy (though somewhat time-consuming) to evaluate multicollinearity in meta-analytic moderator analyses. To do so, you regress each moderator (predictor) onto the set of all other moderators, weighted by the same weights (i.e., inverse variances of effect size estimates) as you have used in the moderator analyses. To illustrate using the example data shown in Table 9.3, I would regress age onto the three dummy variables representing the four categorical methods of assessing aggression. Here, $R^2 = .41$, far less than the .90 that is often considered too high (e.g., Cohen et al., 2003, p. 424). I would then repeat the process for the other moderator variables, successively regressing each (weighted by w) on the remaining moderator variables.
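One way to carry out this check is sketched below with NumPy. The helper `weighted_r_squared` is a hypothetical name, the weighted $R^2$ is computed around the weighted mean of the target moderator (an assumption about how the statistic is defined), and the data are made up rather than taken from Table 9.3.

```python
import numpy as np

def weighted_r_squared(target, others, weights):
    """R^2 from a weighted regression of one moderator on the other moderators."""
    target = np.asarray(target, dtype=float)
    weights = np.asarray(weights, dtype=float)
    X = np.column_stack([np.ones(len(target)), np.asarray(others, dtype=float)])
    root_w = np.sqrt(weights)
    # Weighted least squares via OLS on sqrt(w)-scaled variables.
    coef, *_ = np.linalg.lstsq(X * root_w[:, None], target * root_w, rcond=None)
    residuals = target - X @ coef
    weighted_mean = np.average(target, weights=weights)
    ss_res = np.sum(weights * residuals ** 2)
    ss_tot = np.sum(weights * (target - weighted_mean) ** 2)
    return 1.0 - ss_res / ss_tot

# Hypothetical studies: mean age, three dummy codes for four methods, and weights.
age = np.array([9.2, 14.5, 6.0, 8.4, 15.1, 4.5, 10.0, 7.3])
dummies = np.array([[1, 0, 0], [0, 0, 0], [0, 1, 0], [1, 0, 0],
                    [0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1]])
weights = np.array([208.12, 150.0, 90.0, 300.0, 120.0, 60.0, 180.0, 75.0])

r2_age = weighted_r_squared(age, dummies, weights)
print(f"R^2 of age on method dummies: {r2_age:.2f}")
```

You would repeat the call with each moderator in turn as the target and compare every $R^2$ against the threshold you consider too high.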

2. Conceptually Confounded (Proxy) Moderators

A more difficult situation is that of uncoded confounded moderators. These include a large range of other study characteristics that might be correlated across studies with the variables you have coded. For example, studying a particular type of sample (e.g., adolescents vs. young children) might be associated with particular methodological features (e.g., using self-reports vs. observations; if I had failed to code this methodology, then this feature would potentially be an uncoded confounded moderator). Here, results indi­cating moderation by the sample characteristics might actually be due to moderation by methodology. Put differently, the moderator in my analysis is only a proxy for the true moderator. Moreover, because the actual moderator (type of measure) is conceptually very different from the moderator I actu­ally tested (age), my conclusion would be seriously compromised if I failed to consider this possibility.

There is no way to entirely avoid this problem of conceptually con­founded, or proxy, moderators. But you can reduce the threat it presents by coding as many alternative moderator variables as possible (see Chapter 5). If you find evidence of moderation after controlling for a plausible alternative moderator, then you have greater confidence that you have found the true moderator (whereas if you did not code the alternative moderator, you could not empirically evaluate this possibility). At the same time, a large number of alternative possibilities might be argued to be the true moderator, of which the predictor you have considered is just a proxy, and it is impossible to anticipate and code all of these possibilities. For this reason, some argue that findings of moderation in meta-analysis are merely suggestive of moderation, but require replication in primary studies where confounding variables could arguably be better controlled. I do not think there is a universal answer for how informative moderator results from meta-analysis are; I think it depends on the conceptual arguments that can be made for the analyzed moderator versus other, unanalyzed moderators, as well as the diversity of the existing studies in using the analyzed moderator across a range of samples, meth­odologies, and measures. Despite the ambiguities inherent in meta-analytic moderator effects, assessing conceptually reasonable moderators is a worth­while goal in most meta-analyses in which effect sizes are heterogeneous (see Chapter 8).

3. Ensuring Adequate Coverage in Moderator Analyses

When examining and interpreting moderators, an important consideration is the coverage, or the extent to which numerous studies represent the range of potential moderator values considered. The literature on meta-analysis has not provided clear guidance on what constitutes adequate coverage, so this evaluation is more subjective than might be desired. Nevertheless, I try to offer my advice and suggestions based on my own experience.

As a first step, I suggest creating simple tables or plots showing the number of studies at various levels of the moderator variables. If you are testing only the main effects of the moderators, it is adequate to look at just the univariate distributions. For example, in the meta-analysis of Table 9.3, I might create frequency tables or bar charts of the methods of assessing aggression, and similar charts of the continuous variable age categorized into some meaningful units (e.g., early childhood, middle childhood, early adolescence, and middle adolescence; or simply into, e.g., 2-year bins). Whether or not you report these tables or charts in a manuscript, they are extremely useful in helping you to evaluate the extent of coverage. Considering the method of assessing aggression, I see that these data contained a reasonable number of effect sizes from peer- (k = 17) and teacher- (k = 6) report methods, but fewer from observations (k = 3) and only one using parent reports. Similarly, examining the distribution of age among these effect sizes suggests a gap in the early adolescence range (i.e., no studies between 9.5 and 14.5 years).
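A frequency table of this kind takes only a few lines to produce. The sketch below uses hypothetical coded values (not the Table 9.3 data) and a hypothetical helper, `age_bin`, that assigns each mean age to a 2-year bin.

```python
from collections import Counter

# Hypothetical coded study characteristics (one entry per effect size).
methods = ["peer", "peer", "teacher", "observation", "peer", "parent", "teacher"]
mean_ages = [9.2, 15.0, 4.5, 5.1, 8.3, 14.8, 6.7]

def age_bin(age, width=2):
    """Label the bin of the given width that contains `age`, e.g. '8-10'."""
    lower = int(age // width) * width
    return f"{lower}-{lower + width}"

method_counts = Counter(methods)
age_counts = Counter(age_bin(a) for a in mean_ages)

print("Method of assessing aggression")
for label, k in sorted(method_counts.items()):
    print(f"  {label:12s} k = {k}")

print("Mean age (2-year bins)")
for label, k in sorted(age_counts.items(), key=lambda item: int(item[0].split("-")[0])):
    print(f"  {label:12s} k = {k}")
```

Bins that never appear in the output are the gaps in coverage; with these made-up ages, for example, nothing falls between 10 and 14 years.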

What constitutes adequate coverage? Unfortunately, there are no clear answers to this question, as it depends on the overall size of your meta-analysis, the correlations among moderators, the similarity of your included studies on characteristics not coded, and the conceptual certainty that the moderator considered is the true moderator rather than a proxy. At an extreme, one study representing a level of a moderator (e.g., the single study using parent report in this example) or one study in a broad area of a continuous moderator (e.g., if there was only one study during early childhood) is not adequate coverage, as it is impossible to know what other features of that study are also different from those of the rest of the studies. Conversely, five studies covering an area of a moderator probably constitute adequate coverage for most purposes (again, I base this recommendation on my own experience; I do not think that any studies have more formally evaluated this claim). Beyond these general points of reference, the best advice I can provide is to carefully consider these studies: Do they all provide similar effect sizes? Do they vary among one another in other characteristics (which might point to the generalizability of these studies for this region of the moderator)? Are the studies comparable to the studies at other levels of the moderator (if not, then it becomes impossible to determine whether the presumed moderator is a true or proxy moderator)?


Differences among Fixed-, Random-, and Mixed-Effects Models

It is easiest to begin with the simple case in which you are interested only in the mean effect size among a set of studies, both in identifying the mean effect size and in computing its standard error for inferential testing or for confidence intervals. Even in this simple case, there are a number of conceptual, analytic, and interpretive differences between fixed- and random-effects meta-analytic models (see also Hedges & Vevea, 1998; Kisamore & Brannick, 2008).

1. Conceptual Differences

The conceptual differences between fixed- and random-effects models can be illustrated through Figure 8.1, which I have reproduced in the top of Figure 10.1. As you recall, the top of Figure 10.1 displays effect sizes from five studies, all (or at least most) of which have confidence intervals that overlap with a single population effect size, now denoted with $\theta$ using traditional symbol conventions (e.g., Hedges & Vevea, 1998). This overlap with a single population effect size, with deviations of study effect sizes due to only sampling fluctuations (i.e., study-specific confidence intervals), represents the fixed-effects model of meta-analysis.

The bottom portion of Figure 10.1 displays the random-effects model. Here, the confidence intervals of the individual study effect sizes do not necessarily overlap with a single population effect size. Instead, they overlap with a distribution of population effect sizes. In other words, random-effects models conceptualize a population distribution of effect sizes, rather than a single effect size as in the fixed-effects model. In a random-effects model, you estimate not a single population effect size ($\theta$), but rather a distribution of population effect sizes represented by a central tendency ($\mu$) and standard deviation ($\tau$).

2. Analytic Differences

These conceptual differences in fixed- versus random-effects models can also be expressed in equation form. These equations help us understand the com­putational differences between these two models, described in Section 10.2.

Equation 10.1 expresses this fixed-effects model of study effect sizes being a function of a population effect size and sampling error:

$$ES_j = \theta + \varepsilon_j \tag{10.1}$$

In this fixed-effects model, the effect sizes for each study ($ES_j$) are assumed to be a function of two components: a single population effect size ($\theta$) and the deviation of this study from this population effect size ($\varepsilon_j$). The population effect size is unknown but is estimated as the weighted average of effect sizes across studies (this is often one of the key values you want to obtain in your meta-analysis). The deviation of any one study’s effect size from this population effect size ($\varepsilon_j$) is unknown and unknowable, but the distribution of these deviations across studies can be inferred from the standard errors of the studies. The test of heterogeneity (Chapter 8) is a test of the null hypothesis that this variability in deviations is no more than what you expect given sampling fluctuations alone (i.e., homogeneity), whereas the alternative hypothesis is that these deviations are more than would be expected by sampling fluctuations alone (i.e., heterogeneity).

I indicated in Chapter 9 that the presence of significant heterogeneity might prompt us to evaluate moderators to systematically explain this heterogeneity. An alternative approach would be to model this heterogeneity within a random-effects model. Conceptually, this approach involves estimating not only a mean population effect size, but also the variability in study effect sizes due to the population variability in effect sizes. These two estimates are shown in the bottom of Figure 10.1 as $\mu$ (mean population effect size) and $\tau$ (population variability in effect sizes). In equation form, this means that you would conceptualize each study effect size arising from three sources:

$$ES_j = \mu + \xi_j + \varepsilon_j \tag{10.2}$$

As shown by comparing the equations for fixed- versus random-effects models (Equation 10.1 vs. Equation 10.2, respectively), the critical difference is that the single parameter of the fixed-effects model, the single population effect size ($\theta$), is decomposed into two parameters (the central tendency and study deviation, $\mu$ and $\xi_j$) in the random-effects model. As I describe in more detail in Section 10.2, the central tendency of this distribution of population effect sizes is best estimated by the weighted mean of effect sizes from the studies (though with a different weight than used in a fixed-effects model). The challenge of the random-effects model is to determine how much of the variability in each study’s deviation from this mean is due to the distribution of population effect sizes (the $\xi_j$s, sometimes called the random-effects variance; e.g., Raudenbush, 1994) versus sampling fluctuations (the $\varepsilon_j$s, sometimes called the estimation variance). Although this cannot be determined for any single study, random-effects models allow you to partition this variability across the collection of studies in your meta-analysis. I describe these computations in Section 10.2.

3. Interpretive Differences

Before turning to these analyses, however, it is useful to think of the differ­ent interpretations that are justified when using fixed- versus random-effect models. Meta-analysts using fixed-effects models are only justified in drawing conclusions about the specific set of studies included in their meta-analysis (what are sometimes termed conditional inferences; e.g., Hedges & Vevea, 1998). In other words, if you use a fixed-effects model, you should limit your conclusions to statements of the “these studies find . . . ” type.

The use of random-effects models justifies inferences that generalize beyond the particular set of studies included in the meta-analysis to a population of potential studies of which those included are representative (what are sometimes termed unconditional inferences; Hedges & Vevea, 1998). In other words, random-effects models allow for more generalized statements of the “the literature finds . . . ” or even “there is this magnitude of association between X and Y” type (note the absence of any “these studies” qualifier). Although meta-analysts generally strive to be comprehensive in their inclusion of relevant studies in their meta-analyses (see Chapter 3), the truth is that there will almost always be excluded studies about which you still might wish to draw conclusions. These excluded studies include not only those that exist that you were not able to locate, but also similar studies that might be conducted in the future or even studies that contain unique permutations of methodology, sample, and measures that are similar to your sampled studies but simply have not been conducted.

I believe that most meta-analysts wish to make the latter, generalized statements (unconditional inferences) most of the time, so random-effects models are more appropriate. In fact, I often read meta-analyses in which the authors try to make these conclusions even when they used fixed-effects models; such conclusions are inappropriate. I recommend that you frame your conclusions carefully in ways that are appropriate given your statistical model (i.e., fixed- vs. random-effects), and consider the conclusions you wish to make when deciding between these models. I return to this and other con­siderations in selecting between fixed- and random-effects models in Section 10.5.


Analyses of Random-Effects Models

A random-effects model in meta-analysis can be estimated in four general steps: (1) estimating the heterogeneity among effect sizes, (2) estimating population variability in effect sizes, (3) using this estimate of population variability to provide random-effects weights of study effect sizes, and (4) using these random-effects weights to estimate a random-effects mean effect size and the standard error of this estimate (for significance testing and confidence intervals). I illustrate each of these steps using the example meta-analysis dataset of 22 studies providing associations between relational aggression and peer rejection. These studies, together with the variables computed to estimate the random-effects model, are summarized in Table 10.1.

1. Estimating Heterogeneity

The first step is to estimate the heterogeneity, indexed by Q, in the same way as described in Chapter 8. As you recall, the heterogeneity (Q) is computed using Equation 8.6, reproduced here:

$$Q = \sum w_j \left(ES_j - \overline{ES}\right)^2 = \sum \left(w_j ES_j^2\right) - \frac{\left(\sum w_j ES_j\right)^2}{\sum w_j} \tag{8.6}$$

As in Chapter 8, I estimate Q in the example meta-analysis by creating three columns (variables), $w$, $wES$, and $wES^2$, shown in Table 10.1. This yields Q = 291.17, which is high enough (relative to a $\chi^2$ distribution with 21 df) to reject the null hypothesis of homogeneity and accept the alternative hypothesis of heterogeneity. Put another way, I conclude that the observed variability in effect sizes across these 22 studies is greater than would be expected from sampling fluctuation alone. This conclusion, along with other considerations described in Section 10.5, might lead me to use a random-effects model in which I estimate a distribution, rather than a single point, of population effect sizes.
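The three-column computation of Q translates directly into code. The sketch below is a minimal illustration with a hypothetical function name, `heterogeneity_q`, and hypothetical effect sizes and weights rather than the 22 studies of Table 10.1.

```python
def heterogeneity_q(effect_sizes, weights):
    """Q from the column sums of w, w*ES, and w*ES^2."""
    sum_w = sum(weights)
    sum_w_es = sum(w * es for w, es in zip(weights, effect_sizes))
    sum_w_es2 = sum(w * es ** 2 for w, es in zip(weights, effect_sizes))
    return sum_w_es2 - sum_w_es ** 2 / sum_w

# Hypothetical Fisher's Zr values and fixed-effects weights.
effect_sizes = [0.58, 0.10, 0.45, 0.30]
weights = [204.1, 400.0, 100.0, 625.0]

q = heterogeneity_q(effect_sizes, weights)
df = len(effect_sizes) - 1
print(f"Q = {q:.2f} on {df} df")
```

You would then compare Q against a chi-square distribution with k - 1 degrees of freedom, using a table or a statistics library.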

2. Estimating Population Variability

To estimate population variability, you partition the observed heterogeneity into that expected from sampling fluctuations and that representing true deviations in population effect sizes. Although you can never know the extent to which one particular study’s deviation from the central tendency is due to sampling fluctuation versus its place in the distribution of population effect sizes, you can make an estimate of the magnitude of population variability based on the observed heterogeneity (total variability) and that which is expected given the study standard errors. Specifically, you estimate population variability in effect sizes ($\tau^2$) using the following equation:

$$\tau^2 = \frac{Q - (k - 1)}{\sum w_j - \dfrac{\sum w_j^2}{\sum w_j}} \tag{10.4}$$

Although the denominator of this equation is not intuitive, you can understand this equation well enough by considering the numerator. Because the expected value of Q under the null hypothesis of homogeneity is equal to the degrees of freedom ($k - 1$), a homogeneous set of studies will result in a numerator equal to zero, and therefore the population variance in effect sizes is estimated to be zero. In contrast, when there is considerable heterogeneity, Q is larger than the degrees of freedom ($k - 1$), and this heterogeneity beyond that expected by sampling fluctuation results in a large estimate of the population variance, $\tau^2$ (recalling that Q is a significance test based on the number of studies and total sample size in the meta-analysis, the denominator adjusts for the sums of weights in a way that makes the estimate of population variance similar for small and large meta-analyses).
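The estimate of population variance can be sketched as follows. The function name `tau_squared` is hypothetical, the weights are made up, and truncating negative estimates at zero is an assumption added here (a common convention when Q falls below its degrees of freedom), not something stated in the passage above.

```python
def tau_squared(q, weights):
    """Estimate population variance from Q and the fixed-effects weights."""
    k = len(weights)
    sum_w = sum(weights)
    sum_w2 = sum(w * w for w in weights)
    estimate = (q - (k - 1)) / (sum_w - sum_w2 / sum_w)
    # Assumption: a negative estimate (Q < k - 1) is truncated to zero.
    return max(0.0, estimate)

# Hypothetical example: three studies with equal weights of 100.
weights = [100.0, 100.0, 100.0]
print(tau_squared(12.0, weights))   # heterogeneous: Q well above k - 1 = 2
print(tau_squared(2.0, weights))    # Q equal to k - 1
```

With these made-up numbers, the denominator is 300 - 30000/300 = 200, so Q = 12 gives (12 - 2)/200 = .05, and Q = 2 gives zero.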

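To estimate the population variance in the example meta-analysis shown in Table 10.1, I compute a new variable (column), $w^2$. I then apply Equation 10.4, taking the sums of $w$ and $w^2$ from Table 10.1, to obtain

$$\tau^2 = \frac{291.17 - (22 - 1)}{\sum w_j - \dfrac{\sum w_j^2}{\sum w_j}} = .0408$$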

3. Computing Random-Effects Weights

Having estimated the population variability in effect sizes, the next step is to compute new, random-effects weights for each study. Before describing this computation, it is useful to consider the logic of these random-effects weights. As shown in Chapter 8, the reason for weighting effect sizes in a meta-analysis is to account for the imprecision of effect sizes, so as to give more weight to studies providing more precise effect size estimates than to those providing less precise estimates. In the fixed-effects model described in Chapter 8, imprecision in study effect sizes was assumed to be due only to the standard error of that particular effect size. This can be seen in Equation 10.1, which shows that each study’s effect size is conceptualized as a function of the single population effect size and sampling deviation from that value. As seen in Equation 10.2, random-effects models consider two sources of deviation of effect sizes around a mean: population variance ($\xi_j$, which has an estimated variance of $\tau^2$) and sampling fluctuation ($\varepsilon_j$). In other words, random-effects models consider two sources of imprecision in effect size estimates: population variability and sampling fluctuation.

To account for these two sources of imprecision, random-effects weights combine an overall estimated population variance ($\tau^2$) and a study-specific standard error ($SE_j$) for sampling fluctuation. Specifically, random-effects weights ($w_j^*$) are computed using the following equation:

$$w_j^* = \frac{1}{\tau^2 + SE_j^2} \tag{10.5}$$

To illustrate this computation, consider the first study in Table 10.1 (Blachman, 2003). This study had a weight of 208.12 in the fixed-effects model (based on $w = 1/.0693^2$, allowing for rounding error). In the random-effects model, I compute a new weight as a function of the estimated population variance ($\tau^2 = .0408$) and the study-specific standard error (SE = .0693, to yield a study-specific sampling variance $SE^2 = .0048$):

$$w^* = \frac{1}{.0408 + .0048} = 21.93$$

The random-effects weights of all 22 studies are shown in Table 10.1 (second column from right). You should make two observations from these weights. First, these random-effects weights are smaller (much smaller in this example) than the fixed-effects weights. The implication of these smaller weights is that the sum of weights across studies will be smaller, and the standard error of the mean effect size will therefore be larger, in the random- relative to the fixed-effects model. Second, although the studies still have the same relative ranking of weights using random- or fixed-effects models (i.e., studies with the largest weights for one have the largest weights for the other), the discrepancies in weights across studies are smaller for random- than for fixed-effects models. This fact reduces the relative influence of studies that are extremely large (outliers in sample size). I further discuss these and other differences between fixed- and random-effects models in Section 10.5.
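Both observations can be seen in a short sketch. The first standard error and the value of $\tau^2$ are those reported for the example; the second standard error is hypothetical, added only to show how the spread of weights narrows, and the function name is an illustrative choice.

```python
def random_effects_weights(standard_errors, tau2):
    """Return w*_j = 1 / (tau^2 + SE_j^2) for each study."""
    return [1.0 / (tau2 + se ** 2) for se in standard_errors]

tau2 = 0.0408
# First SE is from the example (Blachman, 2003); the second is hypothetical.
standard_errors = [0.0693, 0.15]

fixed_w = [1.0 / se ** 2 for se in standard_errors]
random_w = random_effects_weights(standard_errors, tau2)

# Ratio of the largest to the smallest weight under each model.
fixed_ratio = fixed_w[0] / fixed_w[1]
random_ratio = random_w[0] / random_w[1]

print([round(w, 2) for w in fixed_w], round(fixed_ratio, 2))
print([round(w, 2) for w in random_w], round(random_ratio, 2))
```

The random-effects weight for the first study comes out near 21.93, far below its fixed-effects weight, and the ratio between the two studies' weights shrinks because the same $\tau^2$ is added to every study's sampling variance.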

4. Estimating and Drawing Inferences about Random-Effects Means

The final step of the random-effects analysis is to estimate the mean effect size and to make inferences about it (through significance testing or computing confidence intervals). These computations parallel those for fixed-effects models described in Chapter 8, except that the ws of the fixed-effects models are replaced with the random-effects weights, $w^*$. To illustrate using this example of 22 studies (see the rightmost columns of Table 10.1), I compute the random-effects mean effect size (see Equation 8.2):

$$\overline{Z_r} = \frac{\sum w_j^* ES_j}{\sum w_j^*} = .338$$

(which I would transform to report as the random-effects mean correlation, r = .326). Note that the random-effects mean is not identical to that of the fixed-effects mean computed in Chapter 8 (Zr = .387, r = .367), though in this example they are reasonably close.

The standard error of this mean effect size is computed as (see Equation 8.3):

$$SE_{\overline{ES}} = \sqrt{\frac{1}{\sum w_j^*}} = .0458$$

This standard error can then be used for significance testing (Z = .338 / .0458 = 7.38, p < .001) or computing confidence intervals (95% confidence interval of $Z_r$ is .249 to .428, translating to a confidence interval for r of .244 to .404). Note that the standard error from the random-effects model is considerably larger than that computed in Chapter 8 under the fixed-effects model (.0118), resulting in lower Z values of the significance test (7.38 vs. 32.70 for the fixed-effects model) and wider confidence intervals (vs. 95% confidence interval for r of .348 to .388).
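The final step can be sketched as a small function that takes effect sizes and random-effects weights and returns the mean, its standard error, the Z test, and a 95% confidence interval. The function name and the two-study input are hypothetical; the last lines simply rerun the arithmetic on the mean and standard error reported above.

```python
import math

def random_effects_summary(effect_sizes, random_weights):
    """Weighted mean, standard error, Z, and 95% CI using random-effects weights."""
    sum_w = sum(random_weights)
    mean = sum(w * es for w, es in zip(random_weights, effect_sizes)) / sum_w
    se = math.sqrt(1.0 / sum_w)
    z = mean / se
    ci = (mean - 1.96 * se, mean + 1.96 * se)
    return mean, se, z, ci

# Hypothetical two-study illustration with equal random-effects weights.
mean, se, z, ci = random_effects_summary([0.2, 0.4], [10.0, 10.0])
print(round(mean, 3), round(se, 4), round(z, 2))

# Arithmetic check on the values reported for the 22 studies.
reported_mean, reported_se = 0.338, 0.0458
reported_z = reported_mean / reported_se
reported_r = math.tanh(reported_mean)   # back-transform Fisher's Zr to r
print(round(reported_z, 2), round(reported_r, 3))
```

Back-transforming with the hyperbolic tangent is the standard inverse of Fisher's transformation, which is how a mean $Z_r$ of .338 corresponds to r of about .326.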


Mixed-Effects Models

Mixed-effects models, sometimes called conditionally random models, combine the (fixed-effects) moderator analyses of Chapter 9 with the estimation of variance in population effect sizes (random-effects) described earlier in this chapter. These models are useful when you want to evaluate moderators in meta-analysis and either (1) you want the generalizability provided by random-effects models, or (2) fixed-effects moderator analyses (as described in Chapter 9) indicate significant residual heterogeneity (i.e., $Q_{within}$ in an ANOVA framework or $Q_{residual}$ in a regression framework).

Mixed-effects models follow the logic of moderator analyses within a general regression framework (see Section 9.3). However, these models include additional terms representing population variability in effect sizes, above and beyond systematic variability accounted for by moderators as well as sampling fluctuations. The general equation for mixed-effects models can be represented by the following equation:

$$ES_j = B_0 + B_1 x_{j1} + B_2 x_{j2} + \dots + \xi_j + \varepsilon_j \tag{10.6}$$

Unfortunately, estimating mixed-effects models requires intensive, fairly complex methods. Specifically, estimating mixed-effects models requires iterative matrix algebra (or analysis within an SEM framework, which I present in the next section). I describe and illustrate this estimation using the example meta-analysis (Table 10.1) of 22 studies next, evaluating sample age as a moderator in the context of between-study heterogeneity. However, I forewarn you that the material presented in the remainder of this section is complex.

Before describing the estimation of mixed-effects models, however, it is useful to begin by describing the analysis of a moderator variable within a fixed-effects framework using matrix algebra. After describing this fixed-effects framework, I will describe and illustrate the estimation of mixed-effects models through iterative matrix algebra.

1. Matrix Algebra of Fixed-effects Moderator Analysis

The general regression framework for analyzing moderators within the fixed-effects context (Section 9.3) can be solved using matrix algebra given the following equation (from Overton, 1998):
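Written out, the weighted least squares solution implied by this description takes the following form. The symbol names are assumptions, chosen to agree with the matrix that this section uses for the variances of the estimates:

```latex
\mathbf{B} = \left(\mathbf{X}'\mathbf{V}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}'\mathbf{V}^{-1}\mathbf{Y}
```

In this form, Y is the k × 1 vector of study effect sizes, X is the k × m matrix of predictors (a first column of 1s for the intercept, followed by the coded moderators), V is the k × k diagonal matrix holding each study's squared standard error, and B is the m × 1 vector of the intercept and regression coefficients.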

To illustrate this computation with the example meta-analysis of 22 studies summarized in Table 9.4, in which I am interested in whether age moderates the association between relational aggression and peer rejection, the following matrices are created:

Working through the matrix algebra to solve Equation 10.7 (using any basic matrix algebra calculator) yields the following matrix:

The value in the first row (.4957) represents the intercept, and the value in the second row (-.0112) represents the regression coefficient of the first predictor, age (additional rows would contain additional regression coeffi­cients if I had included additional predictors).

Variances of these estimates of the regression coefficients are obtained from the diagonal of the m × m matrix (X′V⁻¹X)⁻¹. In this example,

Standard errors of these estimates can be computed as the square roots of these values. In this example, the standard error of the estimate of the intercept is .0378 (√.00143), and the standard error of the regression coefficient (i.e., moderation by age) is .0037 (√.000013). Note that these values are identical to those reported in Chapter 9.
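The matrix computations in this subsection can be sketched with NumPy. The data below are hypothetical values invented for illustration (they are not the 22 studies of the example), and the function name is likewise an assumption; the code simply applies weighted least squares with the inverse squared standard errors as weights.

```python
import numpy as np

def fixed_effects_moderator(y, X, se):
    """Weighted least squares: B = (X'V^-1 X)^-1 X'V^-1 Y."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    v_inv = np.diag(1.0 / np.asarray(se, dtype=float) ** 2)  # V^-1
    cov_b = np.linalg.inv(X.T @ v_inv @ X)   # m x m covariance matrix of B
    b = cov_b @ X.T @ v_inv @ y              # intercept and slope(s)
    return b, np.sqrt(np.diag(cov_b))        # estimates and standard errors

# Hypothetical data: effect size (Zr), mean sample age, standard error.
zr = np.array([0.45, 0.38, 0.30, 0.52, 0.25, 0.41])
age = np.array([5.0, 8.0, 11.0, 6.0, 14.0, 9.0])
se = np.array([0.07, 0.05, 0.10, 0.08, 0.04, 0.06])
X = np.column_stack([np.ones_like(age), age])  # constant plus moderator

b, se_b = fixed_effects_moderator(zr, X, se)
print("B:", b)
print("SE:", se_b)
print("Z:", b / se_b)
```

The first element of each output refers to the intercept and the second to the moderator, mirroring the row order described above.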

2. Estimation of Mixed-Effects Models

Mixed-effects models are estimated iteratively (see simulation by Overton, 1998)—that is, through a series of estimations of B using V, recomputing the weights in this new solution to yield a new set of values for V, and then using these new values of V to reestimate B, with the process repeating itself until a certain standard of convergence is reached (see Overton, 1998).

2.1. Iteration I

The fixed-effects estimation of B serves as the first iteration. Here, the matrix of weights (V) assumes that τ² = 0. From this solution, you compute the model predicted values of the effect size for each study using the following matrix equation (Overton, 1998):

To illustrate using the example meta-analysis of 22 studies:

You then consider the discrepancies between the actual (observed) effect sizes of the studies and these predicted (by the intercept and any moderators) values. Specifically, you compute a matrix, D, representing k squared deviations that serve as estimates of the population conditional variance (τ²):

To illustrate with the example meta-analysis, the D from the first itera­tion (i.e., fixed-effects model) is:

You then take the weighted average of these k estimates (22 in this example) in D to provide a single estimate of the population conditional variance (τ²) using the following equation:

Applying this equation to the example data of 22 studies yields an estimated τ² = .0240.

2.2. Subsequent Iterations

This estimated τ² is now added to the squared standard error of each study (its sampling variance), such that Vj* = τ² + SEj². For example, the first study in the example dataset would receive the value V1* = .0240 + (.0693)² = .0288. These k values of Vj* would be entered in the diagonal of the new matrix V* for iteration 2. Equation 10.7 is then recomputed using V* to yield a new set of estimated regression coefficients. In the example data, these values at the second iteration are B0 = .2700 and B1 = .0112.

These regression coefficients are used to compute new predicted scores using Equation 10.8, new discrepancy scores are estimated, and a new D is computed using Equation 10.9 (note that at this step, the original V is used because you want to subtract out only the sampling variance). The τ² is then reestimated using Equation 10.10 (using V*). This process continues until the estimated τ² changes minimally between successive iterations. Although the convergence criteria have not been well studied, Overton (1998, citing Erez et al., 1996) suggested that a change in τ² (Δτ²) of less than 10⁻¹⁰ is adequate and usually achieved by the seventh iteration. Using the example meta-analysis of 22 studies, I achieved this level of convergence in six iterations.

Overton (1998) has shown that a small correction for τ² following the final iteration improves the estimation of mixed-effects models. This correction multiplies the obtained τ² by k/(k – m), where k = the number of studies and m = number of predictors (including constant). Applying this correction within the example meta-analysis yields the final estimates of τ² = .0499, with regression weights estimated as B0 = .2548 (intercept) and B1 = .0128 (moderating effect of age).
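The iterative procedure of this section can be sketched as follows. This is a reading of the prose above under stated assumptions, not a reproduction of Overton's (1998) equations: the discrepancies are taken to be the squared residuals minus the original sampling variances, the weighted average uses the inverse of the current V* as weights, negative variance estimates are truncated at zero, and the coefficients are reestimated once more after the k/(k – m) correction. The data and function names are hypothetical.

```python
import numpy as np

def wls(y, X, v):
    """Solve B = (X'V^-1 X)^-1 X'V^-1 Y for a diagonal V given as vector v."""
    w = 1.0 / v
    return np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))

def mixed_effects(y, X, se, tol=1e-10, max_iter=100):
    y, X = np.asarray(y, float), np.asarray(X, float)
    v = np.asarray(se, float) ** 2        # sampling variances (original V)
    k, m = X.shape
    tau2 = 0.0                            # iteration 1 is the fixed-effects fit
    for iteration in range(1, max_iter + 1):
        v_star = v + tau2                 # diagonal of V*
        b = wls(y, X, v_star)
        d = (y - X @ b) ** 2 - v          # subtract only the sampling variance
        w_star = 1.0 / v_star
        # Assumption: truncate at zero so the variance estimate is never negative.
        new_tau2 = max(0.0, float(np.sum(w_star * d) / np.sum(w_star)))
        converged = abs(new_tau2 - tau2) < tol
        tau2 = new_tau2
        if converged:
            break
    tau2 *= k / (k - m)                   # correction after the final iteration
    b = wls(y, X, v + tau2)               # assumption: refit with corrected tau^2
    return b, tau2, iteration

# Hypothetical data: effect size (Zr), mean sample age, standard error.
zr = np.array([0.10, 0.50, 0.20, 0.60, 0.30, 0.70])
age = np.array([5.0, 7.0, 9.0, 11.0, 13.0, 15.0])
se = np.sqrt(np.array([0.002, 0.004, 0.001, 0.003, 0.002, 0.005]))
X = np.column_stack([np.ones_like(age), age])

b, tau2, n_iter = mixed_effects(zr, X, se)
print("B:", b, "tau^2:", tau2, "iterations:", n_iter)

# The worked value for the first study in the text: .0240 + (.0693)^2
print(round(0.0240 + 0.0693 ** 2, 4))
```

The convergence tolerance defaults to the 10⁻¹⁰ criterion mentioned above; the cap on the number of iterations is an added safeguard rather than part of the described method.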

A Structural Equation Modeling Approach to Random- and Mixed-Effects Models

In Chapter 9, I introduced an alternative approach to meta-analysis based on Cheung’s (2008) description of meta-analysis within the context of structural equation modeling. Here, I extend the logic of this approach to describe how it can be used to estimate random- and mixed-effects models (following closely the presentation by Cheung, 2008). As when I introduced this approach in Chapter 9, I should caution you that this material requires a fairly in-depth understanding of SEM, and you might consider skipping this section if you do not have this background. If you do have a solid background in SEM, however, this perspective may be advantageous in two ways. First, if you are familiar with SEM programs that can estimate random slopes (e.g., Mplus, MX; I elaborate on this requirement below), then you might find it easier to use this approach than the matrix algebra required for the mixed-effects model that I described earlier. Second, as I mentioned in Chapter 9, this approach uses the FIML method of missing data management of SEM, which allows you to retain studies that have missing values of study characteristics that you wish to evaluate as moderators.

Next, I describe how this SEM representation of meta-analysis can be used to estimate random- and mixed-effects models. To illustrate these approaches, I consider the 22 studies reporting correlations between rela­tional aggression with rejection shown in Table 10.1.

1. Estimating Random-Effects Models

The SEM representation of random-effects meta-analysis (Cheung, 2008) parallels the fixed-effects model I described in Chapter 9 (see Figure 9.2) but models the effect size predicted by intercept path as a random slope (see, e.g., Bauer, 2003; Curran, 2003; Mehta & Neale, 2005; Muthen, 1994). In other words, this path varies across studies, which captures the between-study variance of a random-effects meta-analysis. Importantly, this SEM representation can only estimate these models using software that performs random slope analyses.

One path diagram convention for denoting randomly varying slopes is shown in Figure 10.2. This path diagram contains the same representation of regressing the transformed effect size onto the transformed intercept as does the fixed-effects model of Chapter 9 (see Figure 9.1). However, there is a small circle on this path, which indicates that this path can vary randomly across cases (studies). The label u next to this circle denotes that the newly added piece to the path diagram (the latent construct labeled u) represents the random effect. The regression path (b0) from the constant (i.e., the triangle with “1” in the middle) to this construct captures the random-effects mean. The variance of this construct (m, using Cheung’s 2008 notation) is the estimated between-study variance of the effect size (what I had previously called τ²).

To illustrate, I fit the data from 22 studies shown in Table 10.1 under an SEM representation of a random-effects model. As I described in Chap­ter 9, the effect sizes (Zr) and intercepts (the constant 1) of each study are transformed by multiplying these values by the square root of the study’s weight (Equation 9.7). This allows each study to be represented as an equally weighted case in the analysis, as the weighting is accomplished through these transformations.
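The transformation step can be checked numerically. The sketch below uses hypothetical effect sizes and standard errors, and it illustrates only the fixed-effects form of the transformation (not the random slope itself): an unweighted regression through the origin of the transformed effect size on the transformed intercept reproduces the usual weighted mean.

```python
import numpy as np

# Hypothetical effect sizes (Zr) and standard errors, for illustration only.
zr = np.array([0.45, 0.38, 0.30, 0.52, 0.25])
se = np.array([0.07, 0.05, 0.10, 0.08, 0.04])

w = 1.0 / se ** 2                 # fixed-effects weights
root_w = np.sqrt(w)               # transformation factor for each study

zr_star = zr * root_w             # transformed effect sizes
int_star = 1.0 * root_w           # transformed intercept (the constant 1)

# Unweighted regression through the origin of zr_star on int_star.
b0 = np.sum(int_star * zr_star) / np.sum(int_star ** 2)

weighted_mean = np.sum(w * zr) / np.sum(w)
print(b0, weighted_mean)          # the two values agree
```

After the transformation every case has a sampling variance of 1 (SE² × w = 1), which is why the cases can be treated as equally weighted.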

The Mplus syntax shown in Figure 10.2 specifies that this is a random-slopes analysis by inserting the “TYPE=RANDOM” command, specifying that U represents the random effect with estimated mean and variance. The mean of U is the random-effects mean of this meta-analysis; here, the value was estimated to be .369 with a standard error of .049. This indicates that the random-effects mean Zr is .369 (equivalent r = .353) and statistically significant (Z = .369/.049 = 7.53, p < .01; alternatively, I could compute confidence intervals). The between-study variance (τ²) is estimated as the variance of U; here, the value is .047.

The random-effects mean and estimated between-study variance obtained using this SEM representation are similar to those I reported earlier (Section 10.2). However, they are not identical (and the differences are not due solely to rounding imprecision). The differences in these values are due to the dif­ference in estimation methods used by these two approaches; the previously described version used least squares criteria, whereas the SEM representa­tion used maximum likelihood (the most common estimation criterion for SEM). To my knowledge, there has been no comprehensive comparison of which estimation method is preferable for meta-analysis (or—more likely— under what conditions one estimator is preferable to the other). Although I encourage you to watch for future research on this topic, it seems reasonable to conclude for now that results should be similar, though not identical, for either approach.
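To see why two estimation criteria give similar but not identical answers, the sketch below compares a moment-based estimate of the between-study variance with a maximum likelihood estimate on hypothetical data. The moment estimator shown is of the DerSimonian-Laird type and is assumed here to stand in for the least squares approach; the maximum likelihood estimate uses a standard fixed-point iteration and is not the Mplus implementation. All names and data are illustrative.

```python
import numpy as np

def tau2_moments(y, v):
    """Moment-based (DerSimonian-Laird type) between-study variance."""
    w = 1.0 / v
    mean_fe = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - mean_fe) ** 2)             # heterogeneity statistic Q
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    return max(0.0, float((q - (len(y) - 1)) / c))

def tau2_ml(y, v, tol=1e-10, max_iter=500):
    """Maximum likelihood between-study variance via fixed-point iteration."""
    tau2 = 0.0
    for _ in range(max_iter):
        w = 1.0 / (v + tau2)
        mu = np.sum(w * y) / np.sum(w)
        new = max(0.0, float(np.sum(w ** 2 * ((y - mu) ** 2 - v)) / np.sum(w ** 2)))
        done = abs(new - tau2) < tol
        tau2 = new
        if done:
            break
    return tau2

def random_effects_mean(y, v, tau2):
    """Random-effects mean and its standard error for a given tau^2."""
    w = 1.0 / (v + tau2)
    return float(np.sum(w * y) / np.sum(w)), float(np.sqrt(1.0 / np.sum(w)))

# Hypothetical effect sizes (Zr) and sampling variances (SE^2).
zr = np.array([0.10, 0.45, 0.30, 0.62, 0.25, 0.51])
v = np.array([0.004, 0.006, 0.002, 0.010, 0.003, 0.005])

for label, estimator in (("moments", tau2_moments), ("ML", tau2_ml)):
    t2 = estimator(zr, v)
    print(label, "tau^2:", t2, "mean and SE:", random_effects_mean(zr, v, t2))
```

With equal sampling variances the two estimators differ only in dividing the sum of squared deviations by k – 1 versus k, which gives a sense of why the discrepancy tends to be modest but does not vanish with few studies.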

2. Estimating Mixed-Effects Models

As you might anticipate, this SEM approach (if you have followed the mate­rial so far) can be rather easily extended to estimate mixed-effects models, in which fixed-effects moderators are evaluated in the context of random between-study heterogeneity. To evaluate mixed-effects models in an SEM framework, you simply build on the random-effects model (in which the transformed intercept predicting transformed effect size slope randomly var­ies across studies) by adding transformed study characteristics (moderators) as fixed predictors of the effect size.

I demonstrate this analysis using the 22 studies from Table 10.1, in which I evaluate moderation by sample age while also modeling between-study variance (paralleling analyses in Section 10.3). This model is graphically shown in Figure 10.3, with accompanying Mplus syntax. As a reminder, the effect size and all predictors (e.g., age and intercept) are transformed for each study by multiplying by the square root of the study weight (Equation 9.7). To evaluate the moderator, you evaluate the predictive path between the coded study characteristic (age) and the effect size. In this example, the value was estimated as b1 = .013, with a standard error of .012, so it was not statistically significant (Z = .013/.012 = 1.06, p = .29). These results are similar to those obtained using the iterative matrix algebra approach I described in Section 10.3, though they will not necessarily be identical given different estimator criteria.

3. Conclusions Regarding SEM Representations

As with fixed-effects moderator analyses, the major advantage of estimating mixed-effects meta-analytic models in the SEM framework (Cheung, 2008) is the ability to retain studies with missing predictors (i.e., coded study characteristics) in the analyses. If you are fluent with SEM, you may even find it easier to estimate models within this framework than using the other approaches.

You should, however, keep in mind several cautions that arise from the novelty of this approach. First, it is likely that few (if any) readers of your meta-analysis will be familiar with this approach, so the burden falls on you to describe it to the reader. Second, the novelty of this approach also means that some fundamental issues have yet to be evaluated in quantitative research. For instance, the relative advantages of maximum likelihood versus least squares criteria, as well as modifications that may be needed under certain conditions (e.g., restricted maximum likelihood or other estimators with small numbers of studies), represent fundamental statistical underpinnings of this approach that have not been fully explored (see Cheung, 2008). Nevertheless, this representation of meta-analysis within SEM has the potential to merge two analytic approaches with long histories, and there are many opportunities to apply the extensive tools from the SEM field in your meta-analyses. For these reasons, I view the SEM representation as a valuable approach to consider, and I encourage you to watch the literature for further advances in this approach.

Practical Matters: Which Model Should I Use?

In Sections 10.1 and 10.2, I have presented the random-effects model for estimating mean effect sizes, which can be contrasted with the fixed-effects model I described in Chapter 8. I have also described (Section 10.3) mixed- effects models, in which (fixed) moderators are evaluated in the context of conditional random heterogeneity; this section can be contrasted with the fixed-effects moderator analyses of Chapter 9. An important question to ask now is which of these models you should use in a particular meta-analysis. At least five considerations are relevant: the types of conclusions you wish to draw, the presence of unexplained heterogeneity among the effect sizes in your meta-analysis, statistical power, the presence of outliers, and the com­plexity of performing these analyses. I have arranged these in order from most to least important, and I elaborate on each consideration next. I con­clude this section by describing the consequences of using an inappropriate model; these consequences serve as a further set of considerations in select­ing a model.

Perhaps the most important consideration in deciding between a fixed- versus random-effects model, or between a fixed-effects model with modera­tors versus a mixed-effects model, is the types of conclusions you wish to draw. As I described earlier, conclusions from fixed-effects models are lim­ited to only the sample of studies included in your meta-analysis (i.e., “these studies show . . . ” type conclusions), whereas random- and mixed-effects models allow more generalizable conclusions (i.e., “the research shows . . . ” or “there is…” type of conclusions). Given that the last-named type of conclu­sions are more satisfying (because they are more generalizable), this consid­eration typically favors the random- or mixed-effects models. Regardless of which type of model you select, however, it is important that you frame your conclusions in a way consistent with your model.

A second consideration is based on the empirical evidence of unexplained heterogeneity. By unexplained heterogeneity, I mean two things. First, in the absence of moderator analysis (i.e., if just estimating the mean effect size), finding a significant heterogeneity (Q) test (see Chapter 8) indicates that the heterogeneity among effect sizes cannot be explained by sampling fluctuation alone. Second, if you are conducting fixed-effects moderator analysis, you should examine the within-group heterogeneity (Qwithin; for ANOVA analogue tests) or residual heterogeneity (Qresidual; for regression analogue tests). If these are significant, you conclude that there exists heterogeneity among effect sizes not systematically explained by the moderators. In both situations, you might use the absence versus presence of unexplained heterogeneity to inform your choice between fixed- versus random- or mixed-effects models (respectively). Many meta-analysts take this approach. However, I urge you not to make this your only consideration because the heterogeneity (i.e., Q) test is an inferential test that can vary in statistical power. In meta-analyses with many studies that have large sample sizes, you might find a significant residual heterogeneity that is trivial, whereas a meta-analysis with few studies having small sample sizes might fail to detect potentially meaningful heterogeneity. For this reason, I recommend against basing your model decision only on empirical findings of unexplained heterogeneity.

A third consideration is the relative statistical power of fixed- versus random-effects models (or fixed-effects with moderators versus mixed-effects models). The statistical power of a meta-analysis depends on many factors: number of studies, sample sizes of studies, degree to which effect sizes must be corrected for artifacts, magnitude of population variance in effect size, and of course true mean population effect size. Therefore, it is not a straightforward computation (see, e.g., Cohn & Becker, 2003; Field, 2001; Hedges & Pigott, 2001, 2004). However, to illustrate this difference in power between fixed- and random-effects models, I have graphed some results of a simulation by Field (2001), shown in Figure 10.4. These plots make clear the greater statistical power of fixed-effects versus random-effects models. More generally, fixed-effects analyses will always provide as high (when τ² = 0) or higher (when τ² > 0) statistical power than random-effects models. This makes sense in light of my earlier observation that the random-effects weights are always smaller than the fixed-effects weights; therefore, the sum of weights is smaller and the standard error of the average effect size is larger for random- than for fixed-effects models. Similarly, analysis of moderators in fixed-effects models will provide statistical power that is as high as or higher than that of mixed-effects models. For these reasons, it may seem that this consideration would always favor fixed-effects models. However, this conclusion must be tempered by the inappropriate precision associated with high statistical power when a fixed-effects model is used inappropriately in the presence of substantial variance in population effect sizes (see below). Nevertheless, statistical power is one important consideration in deciding among models: If you have questionable statistical power (small number of studies and/or small sample sizes) to detect the effects you are interested in, and you are comfortable with the other considerations, then you might choose a fixed-effects model.

The presence of studies that are outliers in terms of either their effect sizes or their standard errors (e.g., sample sizes) is better managed in random- than fixed-effects models. Outliers consisting of studies that have extreme effect sizes have more influence on the estimated mean effect size in fixed-effects analysis because the analyses (to anthropomorphize) must “move the mean” substantially to fall within the confidence interval of the extreme effect size (see top of Figure 10.1). In contrast, studies with extreme effect sizes impact the population variance (τ²) more so than the estimated mean effect size in random-effects models. Considering the bottom of Figure 10.1, you can imagine that an extreme effect size can be accommodated by widening the spread of the population effect size distribution (i.e., increasing the estimate of τ) in a random-effects model.

A second type of outlier consists of studies that are extreme in their sample sizes, especially those with much larger sample sizes than other studies. Because sample size is strongly connected to the standard error of the study’s effect size, and these standard errors in turn form the weight in fixed-effects models (see Chapter 8), you can imagine that a study with an extremely large sample could be weighted much more heavily than other studies. For example, in the 22-study meta-analysis I have presented (see Table 10.1), four studies with large samples (Hawley et al., 2007; Henington, 1996; Pakaslahti and Keltikangas-Jarvinen, 1998; Werner, 2000) comprise 44% of the total weight in the fixed-effects analysis (despite being only 18% of the studies) and are given 13 to 16 times the weight of the smallest study (Ostrov, Woods, Jansen, Casas, & Crick, 2004). Although I justified the use of weights in Chapter 8, this degree of weighting some studies far more than others might be too undemocratic (and I have seen meta-analyses with even more extreme weighting, with single studies having more weight than all other studies combined). As I have mentioned, random-effects models reduce these discrepancies in weighting. Specifically, because a common estimate of τ² is added to the squared standard error for each study, the weights become more equal across studies as τ² becomes larger. This can be seen by inspecting the random-effects weights (w*) in Table 10.1: Here the largest study is only weighted 1.4 times the smallest study. In sum, random-effects models, to the extent that τ² is large, use weights that are less extreme, and therefore random- (or mixed-) effects models might be favored in the presence of sample size outliers.
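The equalizing effect of the between-study variance on the weights can be illustrated with a short sketch. The standard errors below are hypothetical (they are not taken from Table 10.1), and the function name is an assumption.

```python
import numpy as np

# Hypothetical standard errors, ordered from a very large study to a small one.
se = np.array([0.02, 0.05, 0.08, 0.12])

def weight_ratio(se, tau2):
    """Ratio of largest to smallest study weight, w* = 1 / (SE^2 + tau^2)."""
    w = 1.0 / (se ** 2 + tau2)
    return w.max() / w.min()

for tau2 in (0.0, 0.01, 0.05):
    print(f"tau^2 = {tau2:.2f}: largest/smallest weight = {weight_ratio(se, tau2):.2f}")
```

With a between-study variance of zero the weights are the fixed-effects weights, so the first line of output shows the fixed-effects ratio; the ratio moves toward 1 as the variance grows.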

Perhaps the least convincing consideration is the complexity of the models (the argument is so unconvincing that I would not even raise it if it were not so commonly put forward). The argument is that fixed-effects models, whether for only computing mean effect sizes (Chapter 8) or for evaluating moderators (Chapter 9), are far simpler than random- and mixed-effects models. Although simplicity is not a compelling rationale for a model (and a rationale that will not go far in the publication process), I acknowledge that you should be realistic in considering how complex a model you can use and report. I suspect that most readers will be able to perform computations for random-effects models, so if you are not analyzing moderators and the other considerations point you toward this model, I encourage you to use it. Mixed-effects models, in contrast, are more complex and might not be tractable for many readers. Because less-than-optimal answers are better than no answers at all, I do think it is reasonable to analyze moderators within a fixed-effects model if this is all that you feel you can do, with the caveat that you should recognize the limitations of this model. Even better, however, is for you to enlist the assistance of an experienced meta-analyst who can help you with more complex (and more appropriate) models.

At this point, you might see some advantages and disadvantages to each type of model, and you might still feel uncertain about which model to choose. I think this decision can be aided by considering the consequences of choosing the “wrong” model. By “wrong” model, I mean that you choose (1) a random- or mixed-effects model when there is no population variability among effect sizes, or (2) a fixed-effects model when there really exists substantial population variability among effect sizes. In the first situation, using random-effects models in the absence of population variability, there is little negative consequence other than a little extra work. Random- and mixed-effects models will yield results similar to those of fixed-effects models when there is little population variability in effect sizes (e.g., because estimated τ² is close to zero, Equation 10.2 functionally reduces to Equation 10.1). If you decide on a random- (or mixed-) effects model only to find little population variability in effect sizes, you still have the advantage of being able to make generalizable conclusions (see the first consideration above). In contrast, the second type of inappropriate decision (using a fixed-effects model in the presence of unexplained population variability) is problematic. Here, the failure to model this population variability leads to conclusions that are inappropriately precise: in other words, artificially high significance tests and overly narrow confidence intervals.

In conclusion, random-effects models offer more advantages than fixed-effects models, and there are no disadvantages to using random-effects models in the absence of population variability in effect sizes. For this reason, I generally recommend random-effects models when the primary goal is estimating and drawing conclusions about mean effect sizes. When the focus of your meta-analysis is on evaluating moderators, then my recommendations are more ambivalent. Here, mixed-effects models provide optimal results, but the complexity of estimating them might not always be worth the effort unless you are able to enlist help from an experienced meta-analyst. For moderator analyses, I do view fixed-effects models as acceptable, provided you examine unexplained (residual) heterogeneity and are able to show that it is either not significant or small in magnitude.

The Problem of Publication Bias of Meta-Analysis

Publication bias refers to the possibility that studies finding null (absence of statistically significant effect) or negative (statistically significant effect in the direction opposite to that expected) results are less likely to be published than studies finding positive effects (statistically significant effects in the expected direction). This bias is likely due both to researchers being less motivated to submit null or negative results for publication and to journals (editors and reviewers) being less likely to accept manuscripts reporting these results (Cooper, DeNeve, & Charlton, 1997; Coursol & Wagner, 1986; Greenwald, 1975; Olson et al., 2002).

The impact of this publication bias is that the published literature might not be representative of the studies that have been conducted on a topic, in that the available results likely show a stronger overall effect size than if all studies were considered. This impact is illustrated in Figure 11.1, which is a reproduction of Figure 3.2. The top portion of this figure shows a distribution of effect sizes from a hypothetical population of studies. The effect sizes from these studies center around a hypothetical mean effect size (about 0.20), but have a certain distribution of effect sizes found due to random-sampling error and, potentially, population-level between-study variance (i.e., heterogeneity; see Chapters 8 and 9). Among those studies that happen to find small effect sizes, results are less likely to be statistically significant (in this hypothetical figure, I have denoted this area where studies find effect sizes less than ±0.10, with the exact range depending on the study sample sizes and effect size considered). Below this population of effect sizes of all studies conducted, I have drawn downward arrows of different thicknesses to represent the different likelihoods of the study being published, with thicker arrows denoting higher likelihood of publication. Consistent with the notion of publication bias, the hypothetical studies that fail to find significant effects are less likely to be published than those that do. This differential publication rate results in the distribution of published studies shown in the lower part of Figure 11.1. It can be seen that this distribution is shifted to the right, such that the mean effect size is now approximately 0.30. If the meta-analysis only includes this biased sample of published studies, then the estimate of the mean effect size is going to be considerably higher (around 0.30) than that in the true population of studies conducted. Clearly, this has serious implications for a meta-analysis that does not consider publication bias.

This publication bias is sometimes referred to by alternative names. Some have referred to it as the “file-drawer problem” (Rosenthal, 1979), conjuring images of researchers’ file drawers containing manuscripts reporting null or negative (i.e., in the direction opposite to that expected) results that will never be seen by the meta-analyst (or anyone else in the research community). Another term proposed is “dissemination bias” (see Rothstein, Sutton, & Borenstein, 2005a). This latter term is more accurate in describing the broad scope of this problem, although the term “publication bias” is the more commonly used one (Rothstein et al., 2005a). Regardless of terminology used, the breadth of this bias is not limited just to significant results being published and nonsignificant results not being published (even in a probabilistic rather than absolute sense). One source of breadth of the bias is the existence of “gray literature,” research that is between the file drawer and publication, such as in the format of conference presentations, technical reports, or obscure publication outlets (Conn, Valentine, Cooper, & Rantz, 2003; Hopewell, Clarke, & Mallett, 2005; also referred to as “fugitive literature” by, e.g., M. C. Rosenthal, 1994). There is evidence that null findings are more likely to be reported only in these more obscure outlets than are positive findings (see Dickersin, 2005; Hopewell et al., 2005). If the literature search is less exhaustive, these reports are less likely to be found and included in the meta-analysis than reports published in more prominent outlets.

Another source of breadth in publication bias may be in the underem­phasis of null or negative results. For example, researchers are likely to make significant findings the centerpiece of an empirical report and only report nonsignificant findings in a table. Such publications, though containing the effect size of interest, might not be detected in key word searches or in brows­ing the titles of published works. Similarly, null or counterintuitive findings that are published may be less likely to be cited by others; thus, backward searches are less likely to find these studies.

Finally, an additional source of breadth in considering publication bias is due to the time lag of publication. There is evidence, at least in some fields, that significant results are published more quickly than null or negative results (see Dickersin, 2005). The impact on meta-analyses, especially those focusing on topics with a more recently created empirical basis, is that the currently published results are going to overrepresent significant positive findings, whereas null or negative results are more likely to be published after the meta-analysis is performed.

Recognizing the impact and breadth of publication bias is important but does not provide guidance in managing it. Ideally, the scientific process would change so that researchers are obligated to report the results of study findings. In clinical research, the establishment of clinical trial registries (in which researchers must register a trial before beginning the study, with some journals motivating registration by only considering registered trials for publication) represents a step in helping to identify studies, although there are some concerns that registries are incomplete and that the researchers of registered trials may be unwilling to share unexpected results (Berlin & Ghersi, 2005). However, unless you are in the position to mandate research and reporting practices within your field, you must deal with publication bias without being able to prevent it or even fully know of its existence. Nevertheless, you do have several methods of evaluating the likely impact publication bias has on your meta-analytic results.

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

Managing Publication Bias of Meta-Analysis

In this section, I describe six approaches to managing publication bias within meta-analysis. I also illustrate some of these approaches through the example meta-analysis I have used throughout this book: a review of 22 studies report­ing associations between relational aggression and peer rejection among chil­dren and adolescents. In Chapter 8, I presented results of a fixed-effects3 analysis of these studies indicating a mean r = .368 (SE = .0118; Z = 32.70, p < .001; 95% confidence interval = .348 to .388). When using this example in this section, I evaluate the extent to which this conclusion about the mean association is threatened by potential publication bias.

Table 11.1 displays these 22 studies. The first five columns of this table are the citation, sample size, untransformed effect size (r), transformed effect size (Zr), and standard error of the transformed effect size (SE). The remain­ing columns contain information that I explain when using these data to illustrate methods of evaluating publication bias.

1. Moderator Analyses

One of the best methods to evaluate the potential impact of publication bias is to include unpublished studies in the meta-analysis and empirically evaluate whether these studies yield smaller effect sizes than published studies. In the simplest case, this involves evaluating the moderation of effect sizes (Chapter 9) by the dichotomous variable, published versus unpublished study. Two caveats to this approach merit consideration. First, it is necessary to make sure that the meta-analysis includes a sufficient number of unpublished studies to draw reliable conclusions about potential differences. Second, it is important to consider other features on which published versus unpublished studies might differ, such as the quality of the methodology (e.g., internal validity of an experimental design) and measures (e.g., use of reliable vs. unreliable scales). You should control for such differences when comparing published and unpublished studies.

A more elaborate variant of this sort of moderator analysis is to code more detailed variables regarding publication status. For instance, you might code a more continuous publication quality variable (e.g., distinguishing unpublished data, dissertations, conference presentations, low-tier journal articles, and top-tier journal articles, if this captures a meaningful contin­uum within your field). You might also code whether the effect size of inter­est is a central versus peripheral result in the study; for instance, Card et al. (2008) considered whether terms such as “gender” or “sex” appeared in titles of works reporting gender differences in childhood aggression.

Regardless of which variables you code, the key question is whether these variables are related to the effect sizes found in the studies (i.e., whether these act as moderators). If you find no differences between published and unpub­lished studies (or absence of moderating effects of other variables such as publication quality and centrality), and there is adequate power to detect such moderation, then it is safe to conclude that there is no evidence of publication bias within this area. If differences do exist, you have the choice of either (1) interpreting results of published and unpublished studies separately, or (2) performing corrections for publication bias described below (Section 11.3).

To illustrate this approach to evaluating publication bias, I consider one approach I have described: moderation by the categorical moderator “published.” This categorical variable is shown in the sixth column of Table 11.1 and is coded as 1 for studies that were published (k = 15) and 0 for unpublished studies (k = 7). Notice that this comparison is possible only because I included unpublished studies in this meta-analysis and the search was thorough enough to obtain a sufficient number of unpublished studies for comparison. Moderator analyses (Chapter 9) indicated a significant difference between published and unpublished studies (Qbetween(df = 1) = 77.47, p < .001). In the absence of publication bias, I would not expect this moderator effect, so its presence is worrisome. When I inspect the mean effect sizes within each group, I find that the unpublished studies yield higher associations (r = .51) than the published studies (r = .31). This runs counter to the possibility that nonsignificant/low effect size studies are less likely to be published; if there is a bias, it appears that studies finding large effect sizes are less likely to be published and therefore any publication bias might serve to diminish the effect size I find in this meta-analysis. However, based on my knowledge of the field (and conversations with other experts about this finding), I see no apparent reason why there would be a bias against publishing studies finding strong positive correlations. I consider this finding further in light of my other findings regarding potential publication bias below.
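The between-group test used in this comparison can be sketched in a few lines of Python. This is a minimal illustration only, assuming fixed-effects inverse-variance weights (1 / SE²); the function name `q_between` and the five data points are hypothetical and are not the values in Table 11.1.

```python
import math

def q_between(effects, ses, groups):
    """Fixed-effects Q_between for a categorical moderator.

    effects: study effect sizes (e.g., Fisher's Zr)
    ses:     their standard errors
    groups:  group label per study (e.g., 1 = published, 0 = unpublished)
    Returns (Q_between, degrees of freedom, dict of weighted group means).
    """
    weights = [1.0 / se ** 2 for se in ses]
    grand_mean = sum(w * es for w, es in zip(weights, effects)) / sum(weights)
    q = 0.0
    group_means = {}
    for g in sorted(set(groups)):
        w_g = [w for w, grp in zip(weights, groups) if grp == g]
        es_g = [es for es, grp in zip(effects, groups) if grp == g]
        mean_g = sum(w * es for w, es in zip(w_g, es_g)) / sum(w_g)
        group_means[g] = mean_g
        # Each group contributes its summed weight times its squared
        # deviation from the overall weighted mean.
        q += sum(w_g) * (mean_g - grand_mean) ** 2
    return q, len(group_means) - 1, group_means

# Hypothetical data: two unpublished (0) and three published (1) studies.
zr = [0.55, 0.50, 0.30, 0.35, 0.28]
se = [0.10, 0.08, 0.05, 0.06, 0.04]
published = [0, 0, 1, 1, 1]

q, df, means = q_between(zr, se, published)
# With df = 1, the chi-square p value equals erfc(sqrt(Q / 2)).
p_value = math.erfc(math.sqrt(q / 2.0))
print(f"Q_between({df}) = {q:.2f}, p = {p_value:.4f}, group means = {means}")
```

The closed-form p value in the last step holds only for a two-group comparison (df = 1); with more groups you would refer Q to a chi-square distribution with the appropriate degrees of freedom.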

2. Funnel Plots

Funnel plots represent a graphical way to evaluate publication bias (Light & Pillemer, 1984; see Sterne, Becker, & Egger, 2005). A funnel plot is simply a scatterplot of the effect sizes found in studies relative to their sample sizes (with some variants on this general pattern); that is, you plot a point for each study denoting its effect size relative to its sample size. Figure 11.2 shows a hypothetical outline of a funnel plot, with the effect size Zr on the y-axis and sample size (N) on the x-axis.4 Specifically, the solid lines within this figure represent the 95% confidence interval of effect sizes centered around r = .30 (Zr = .31; see below) at various sample sizes5; if you plot study effect sizes and sample sizes from a sample with this mean effect size, then most (95%) of the points should fall within the area between these solid lines. On the left, you can see that there is considerably larger expectable variability in effect sizes with small sample sizes; conceptually, you expect that studies with small samples will yield a wider range of effect sizes due to random-sampling variability. In contrast, as sample sizes increase, the expectable variability in effect sizes becomes smaller (i.e., the standard errors become smaller), and so the funnel plot shows a narrower distribution of effect sizes at the right of Figure 11.2. Evaluation of publication bias using funnel plots involves visually inspecting these plots to ensure symmetry and this general triangular shape.

Let’s now consider how publication bias would affect the shape of this funnel plot. Note the dashed line passing through the funnel plot. This line represents the magnitude of effect size needed to achieve statistical significance (p < .05) at various sample sizes. The area above and to the right of this dashed line would denote studies finding significant effects, whereas the area below and to the left of it would contain studies that do not yield significant effects. If publication bias exists, then you would see few points (i.e., few studies in your meta-analysis) that fall within this nonsignificant region. This would cause your funnel plot to look asymmetric, with small sample studies finding large effects present but small studies finding small effects absent.
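The solid and dashed lines of such a funnel plot can be computed directly. The following sketch assumes the effect size is Fisher's Zr, whose standard error is 1 / √(N − 3), and that the significance line uses a two-tailed criterion of 1.96; the function name `funnel_bounds` and the chosen sample sizes are hypothetical.

```python
import math

def funnel_bounds(n, mean_zr=0.31, z_crit=1.96):
    """Return (lower, upper, significance_threshold) for sample size n.

    lower/upper: the 95% band around the assumed mean effect size.
    significance_threshold: the smallest Zr that reaches p < .05
    (two-tailed, an assumption) at this sample size.
    """
    se = 1.0 / math.sqrt(n - 3)  # standard error of Fisher's Zr
    lower = mean_zr - z_crit * se
    upper = mean_zr + z_crit * se
    threshold = z_crit * se
    return lower, upper, threshold

for n in (20, 50, 100, 400):
    lo, hi, thr = funnel_bounds(n)
    print(f"N = {n:3d}: 95% band {lo:.3f} to {hi:.3f}; significant if Zr > {thr:.3f}")
```

Plotting the band and the threshold against N reproduces the funnel outline: both narrow toward the mean as N grows, and the region below the threshold at small N is where suppressed studies would be expected to fall.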

Publication bias is not the only possible cause of asymmetric funnel plots. If studies with smaller samples are expected to yield stronger effect sizes (e.g., studies of intervention effectiveness might be able to devote more resources to a smaller number of participants), then this asymmetry may not be due to publication bias. In these situations, you would ideally code the presumed difference between small and large sample studies (e.g., amount of time or resources devoted to participants) and control for this6 before creat­ing the funnel plot.

Several variants of the axes used for funnel plots exist. You might consider alternative choices of scale on the effect size axis. I recommend relying on effect sizes that are roughly normally distributed around a population effect size, such as Fisher’s transformation of the correlation (Zr), Hedges’s g, or the natural log of the odds ratio (see Chapter 5). Using normally distributed effect sizes, as opposed to non-normal effect sizes (e.g., r or the odds ratio, o), allows for better examination of the symmetry of funnel plots. Similarly, you have choices of how to scale the sample size axis. Here, you might consider using the natural log of sample size, which aids interpretation if some studies use extremely large samples that compress the rest of the studies into a narrow range of the funnel plot. Other choices include choosing standard errors, their inverse (precision), or weights (1 / SE²) on this axis; this option is recommended when you are analyzing log odds ratios (Sterne et al., 2005) and might also be useful when the standard error is not perfectly related to sample size (e.g., when you correct for artifacts). I see no problem with examining multiple funnel plots when evaluating publication bias. Given that the examination of funnel plots is somewhat subjective, I believe that examining these plots from several perspectives (i.e., several different choices of axis scaling) is valuable in obtaining a complete picture about the possibility of publication bias.

To illustrate the use of funnel plots—as well as the major challenges in their use—I have plotted the 22 studies from Table 11.1 in Figure 11.3.

I created this plot simply by constructing a scatterplot with transformed effect size (fourth column in Table 11.1) on the vertical axis and sample size (second column in Table 11.1) on the horizontal axis. My inspection of this plot leads me to conclude that there is no noticeable asymmetry or sparse representation of studies in the low effect size—low sample size area (i.e., where results would be nonsignificant). I also perceive that the effect sizes tend to become less discrepant with larger sample sizes—that is, that the plot becomes more vertically narrow from the left to the right. However, you might not agree with these conclusions. This raises the challenge of using funnel plots—that the interpretations you take from these plots are neces­sarily subjective. This subjectivity is especially prominent when the number of studies in your meta-analysis is small; in my example with just 22 studies, it is extremely difficult to draw indisputable conclusions.

3. Regression Analysis

Extending the logic of funnel plots, you can more formally test for asymme­try by regressing effect sizes onto sample sizes. The presence of an associa­tion between effect sizes and sample sizes is similar to an asymmetric funnel plot in suggesting publication bias. In the case of a positive mean effect size, publication bias will be evident when studies with small sample sizes yield larger effect size estimates than studies with larger samples; this situation would produce a negative association between sample size and effect size. In contrast, when the mean effect size is negative, then publication bias will be indicated by a positive association (because studies with small samples yield stronger negative effect size estimates than studies with larger samples). The absence of an association, given adequate statistical power to detect one, par­allels the symmetry of the funnel plot in suggesting an absence of publication bias.

Despite the conceptual simplicity of this approach, recommended practices (see Sterne & Egger, 2005) build on it in ways that make it somewhat more complex. Specifically, two variants of this regression approach are commonly employed. The first involves considering an adjusted rank correlation between studies’ effect sizes and standard errors (for details, see Begg, 1994; Sterne & Egger, 2005). To perform this analysis, you use the following two equations to compute, for each study i, the variance of the study’s effect size from the mean effect size (v*) and the standardized effect size for the study (ES*):

\[ v_i^* = SE_i^2 - \frac{1}{\sum_{j} 1 / SE_j^2} \tag{11.1} \]

\[ ES_i^* = \frac{ES_i - \overline{ES}}{\sqrt{v_i^*}} \tag{11.2} \]

Here SE_i is the standard error of the effect size from study i, and \overline{ES} is the fixed-effects (inverse-variance weighted) mean effect size.

After computing these variables, you then estimate Kendall’s rank cor­relation between v* and ES*. A significant correlation indicates funnel plot asymmetry, which may suggest publication bias. An absence of correlation contraindicates publication bias, if power is adequate.
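A minimal sketch of this adjusted rank correlation follows. It assumes the standard Begg formulation, in which v* is each study's sampling variance minus the variance of the fixed-effects mean and ES* is the study's deviation from that mean divided by the square root of v*. The function names and the data are hypothetical, and the simple tau-a statistic used here ignores ties.

```python
import math

def begg_variables(effects, ses):
    """Return (v_star, es_star) for each study."""
    variances = [se ** 2 for se in ses]
    weights = [1.0 / v for v in variances]
    mean_es = sum(w * es for w, es in zip(weights, effects)) / sum(weights)
    var_of_mean = 1.0 / sum(weights)  # sampling variance of the weighted mean
    v_star = [v - var_of_mean for v in variances]
    es_star = [(es - mean_es) / math.sqrt(vs) for es, vs in zip(effects, v_star)]
    return v_star, es_star

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant pairs) / total pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                score += 1
            elif product < 0:
                score -= 1
    return score / (n * (n - 1) / 2)

# Hypothetical studies in which less precise studies report larger effects.
zr = [0.60, 0.50, 0.40, 0.35, 0.30]
se = [0.20, 0.15, 0.10, 0.08, 0.05]

v_star, es_star = begg_variables(zr, se)
tau = kendall_tau_a(v_star, es_star)
print(f"Kendall's tau between v* and ES* = {tau:.2f}")
```

In practice you would obtain the significance of tau from statistical software, which also handles tied ranks.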

A comparable approach is Egger’s linear regression, in which you regress the standard normal deviate of the effect size of each study from zero (i.e., for study i, $Z_i = ES_i / SE_i$) onto the precision (the inverse of the SE, or $1 / SE_i$): $Z_i = B_0 + B_1\,\mathrm{precision}_i + e_i$. Somewhat counterintuitively, the slope ($B_1$) represents the average effect size (because both the DV and predictor have SE in their denominator, this is similar to regressing the ES onto a constant, which yields the mean ES). The intercept ($B_0$, which is similar to regressing the ES onto the SE) is expected to be zero, and a nonzero intercept (the significance of which can be evaluated using common statistical software) indicates asymmetry in the funnel plot, or the possibility of publication bias.
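Egger's regression is an ordinary least squares fit, so it can be sketched without specialized software. The example below is a minimal illustration under that assumption; the function name `egger_regression` and the data are hypothetical. It returns the intercept, the slope, and the standard error of the intercept, leaving the p value to be looked up from a t distribution with k − 2 degrees of freedom.

```python
import math

def egger_regression(effects, ses):
    """Regress Z_i = ES_i / SE_i onto precision_i = 1 / SE_i by OLS.

    Returns (intercept, slope, standard error of the intercept).
    """
    z = [es / se for es, se in zip(effects, ses)]
    precision = [1.0 / se for se in ses]
    n = len(z)
    mean_x = sum(precision) / n
    mean_y = sum(z) / n
    sxx = sum((x - mean_x) ** 2 for x in precision)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(precision, z))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance with n - 2 degrees of freedom.
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(precision, z))
    se_intercept = math.sqrt(sse / (n - 2) * (1.0 / n + mean_x ** 2 / sxx))
    return intercept, slope, se_intercept

# Hypothetical studies in which less precise studies report larger effects.
zr = [0.60, 0.50, 0.40, 0.35, 0.30]
se = [0.20, 0.15, 0.10, 0.08, 0.05]

b0, b1, se_b0 = egger_regression(zr, se)
print(f"intercept = {b0:.3f} (SE = {se_b0:.3f}), slope = {b1:.3f}")
```

When every study has exactly the same effect size, the fit returns an intercept of zero and a slope equal to that effect size, which is the pattern expected in the absence of asymmetry.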

These regression approaches to evaluating funnel plot asymmetry (which can be indicative of publication bias) are advantageous over visual inspection of funnel plots in that they reduce subjectivity by providing results that can be evaluated in terms of statistical significance. However, these regression approaches depend on the absence of statistically significant results to conclude an absence of publication bias (which is typically what you hope to demonstrate). Therefore, their utility depends on adequate statistical power to detect asymmetry. Although the number of simulation studies is limited (for a review, see Sterne & Egger, 2005), preliminary guidelines for the number of studies needed to ensure adequate power can be provided. When publication bias is severe (and targeting 80% power), at least 17 studies are needed with Egger’s linear regression approach, and you should have at least 40 studies for the rank correlation method (note that this is an extrapolation from previous simulations and should be interpreted with caution). When publication bias is moderate, you should have at least 50 to 60 studies for Egger’s linear regression and at least 150 studies for the rank correlation approach. However, I emphasize again that these numbers are extrapolated well beyond previous studies and should be viewed with extreme caution until further studies investigate the statistical power of these approaches.

Considering again the 22 studies of my example meta-analysis, I eval­uated the association between effect size and sample size using both the adjusted rank correlation approach and Egger’s linear regression approach. Columns seven and eight in Table 11.1 show the two transformed variables for the former approach, and computation of Kendall’s rank correlation yielded a nonsignificant value of -.07 (p = .67). Similarly, Egger’s regression of the val­ues in the ninth column onto those in the tenth was nonsignificant (p = .62). I would interpret both results as failing to indicate evidence of publication bias. However, I should be aware that my use of just 22 studies means that I only have adequate power to detect severe publication bias using Egger’s linear regression approach, and I do not have adequate power with the rank correlation method.

4. Failsafe N

4.1. Definition and Computation

Failsafe numbers (failsafe N) help us evaluate the robustness of a meta­analytic finding to the existence of excluded studies. Specifically, the failsafe number is the number of excluded studies, all averaging an effect size of zero, that would have to exist for their inclusion in the meta-analysis to lower the average effect size to a nonsignificant level.7 This number, introduced by Rosenthal (1979) as an approach to dealing with the “file drawer problem,” also can be thought of as the number of excluded studies (all with average effect sizes equal to zero) that would have to be filed away before you would conclude that no effect actually exists (if the meta-analyst had been able to analyze results from all studies conducted). If this number is large enough, you conclude that it is unlikely that you could have missed so many stud­ies (that researchers’ file drawers are unlikely to be filled with so many null results), and therefore that this conclusion of the meta-analysis is robust to this threat.

The computation of a failsafe number begins with the logic of an older method of combining research results, known as Stouffer’s or the sum-of-Zs method (for an overview of these earlier methods of combining results, see Rosenthal, 1978). This method involves computing the significance level of the effect from each study (the one-tailed p value), converting this to a standard normal deviate (Zi, with positive values denoting effects in the expected direction), and then combining these Zs across the k studies to obtain an overall combined significance (given by the standard normal deviate, Zc):

\[ Z_c = \frac{\sum_{i=1}^{k} Z_i}{\sqrt{k}} \tag{11.3} \]

Failsafe N extends this approach by asking the question, How many studies could be added to those in the meta-analysis (going from k to k + N in the denominator term of Equation 11.3), all with zero effect sizes (Zs = 0, so the numerator term does not change), before the significance level drops to some threshold value (e.g., Zc = Zα = 1.645 for one-tailed p = .05)? The equation can be rearranged to yield the computation formula for failsafe N:

\[ N = k \left( \frac{Z_c}{Z_\alpha} \right)^2 - k \tag{11.4} \]

Examination of these equations makes clear the two factors that impact the failsafe N. The first is the level of statistical significance (Zc) yielded from the included results (which is a function of the effect sizes and sample sizes of the studies); the larger this value, the larger the failsafe N. The second factor is the number of included studies, k. Increasing numbers of included studies result in increasing failsafe Ns (because the first occurrence of k in Equation 11.4 is multiplied by a ratio greater than 1 [when results are significant], this offsets the subtraction by k). This makes intuitive sense: Meta-analyses finding a low p value (e.g., far below .05) from results from a large number of studies need more excluded, null results to threaten the findings, whereas meta-analyses with results closer to what can be considered a “critical” p value (e.g., just below .05) from a small number of studies could be threatened by a small number of excluded studies.

How large should failsafe N be before you conclude that results are robust to the file drawer problem? Despite the widespread use of this approach over about 30 years, no one has provided a statistically well-founded answer. Rosenthal’s (1979) initial suggestion was for a tolerance level (i.e., adequately high failsafe N) equal to 5k + 10, and this initial suggestion seems to have been the standard most commonly applied since. Rosenthal (1979) noted that what counts as a plausible number of studies filed away likely depends on the area of research, but no one has further investigated this speculation. At the moment, 5k + 10 is a reasonable standard, though I hope that future work will improve on (or at least provide more justification for) this value.
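The computation and the tolerance check can be sketched as follows. This is an illustration only, assuming Stouffer's combined Zc = ΣZ / √k and the rearranged failsafe formula N = k(Zc / Zα)² − k with a one-tailed Zα of 1.645; the function names and the four Z values are hypothetical.

```python
import math

def stouffer_z(zs):
    """Combined significance across studies (sum of Zs over sqrt(k))."""
    return sum(zs) / math.sqrt(len(zs))

def failsafe_n(zs, z_alpha=1.645):
    """Number of zero-effect studies needed to pull Zc down to z_alpha."""
    k = len(zs)
    return k * (stouffer_z(zs) / z_alpha) ** 2 - k

def rosenthal_tolerance(k):
    """Rosenthal's (1979) suggested tolerance level, 5k + 10."""
    return 5 * k + 10

# Hypothetical standard normal deviates from four studies.
zs = [2.0, 2.5, 3.0, 1.5]

n_fs = failsafe_n(zs)
tolerance = rosenthal_tolerance(len(zs))
robust = n_fs > tolerance
print(f"Zc = {stouffer_z(zs):.2f}, failsafe N = {n_fs:.1f}, tolerance = {tolerance}")
print("robust to the file drawer problem" if robust else "not robust by the 5k + 10 standard")
```

With only four studies the failsafe N falls short of the tolerance level even though the combined result is clearly significant, which illustrates how strongly the number of included studies drives this index.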

4.2. Criticisms

Despite their widespread use, failsafe numbers have been criticized in sev­eral ways (see Becker, 2005). Although these criticisms are valuable in point­ing out the limits of using failsafe N exclusively, I do not believe that they imply that you should not use this approach. Next, I briefly outline the major criticisms against failsafe N and suggest considerations that temper these critiques.

One criticism is of the premise of computing the number of studies with null results. The critics argue that other possibilities could be considered, such as studies in the opposite direction as those found in the meta-analysis. It is true that any alternative effect size could be chosen; but it seems that selection of null results (i.e., those with effect sizes close to 0), which are the studies expected to be suppressed under most conceptualizations of publica­tion bias, represents the most appropriate single value to consider.

A second criticism of the failsafe number is that it does not consider the sample sizes of the excluded studies. Sample sizes of included studies are indirectly considered in that larger sample sizes yield larger Zs than smaller studies, given the same effect size. In contrast, excluded studies are assumed to have effect sizes of zero and therefore Zs equal to zero regardless of sample size. So, the failsafe number would not differentiate between excluded studies with sample sizes equal to 10 versus 10,000. I believe this is a fair critique of failsafe N, though the impact depends on excluded studies with zero effect sizes (on average) having larger sample sizes than included studies. This seems unlikely given the previous consideration of publication bias and funnel plots. If there is a bias, I would expect that excluded studies are primarily those with small (e.g., near zero) effect sizes and small sample sizes (however, I acknowledge that I am unaware of empirical support for this expectation).

A third criticism involves the failure of failsafe N to model heterogeneity among obtained results. In other words, the Stouffer method of obtaining the overall significance (Zc) among included studies, which is then used in the computation of failsafe N, makes no allowance for whether these studies are homogeneous (centered around a mean effect size with no more deviation than expected by sampling error) or heterogeneous (deviation around mean effect size is greater than expected by sampling error alone; see Chapter 10). This is a valid criticism that should be kept in mind when interpreting failsafe N. I especially recommend against using failsafe N when heterogeneity necessitates the use of random-effects models (Chapter 10).

A final criticism of Rosenthal’s (1979) failsafe N is the focus on statisti­cal significance. As I have discussed throughout this book, one advantage of meta-analysis is a focus on effect sizes; a number that indicates the number of excluded studies that would reduce your results to nonsignificance does not tell you how these might affect your results in terms of effect size. For this, an alternative failsafe number can be considered, which I describe next.

4.3. An Effect Size Failsafe N

An alternative approach that focuses on effect sizes was proposed by Orwin (1983; see also Becker, 2005). Using this approach, you select an effect size8 (smaller than that obtained in the meta-analysis of sampled studies) that represents the smallest meaningful effect size (either from guidelines such as r = ±.10 for a small effect size, or preferably an effect size that is meaningful in the context of the research). This value is denoted as ESmin.9 You then compute a failsafe number (Nes) from the meta-analytically combined average effect size (ESm) from the k included studies using:

\[ N_{ES} = k \, \frac{ES_m - ES_{min}}{ES_{min} - ES_{excluded}} \tag{11.5} \]

The denominator of this equation introduces an additional term that I have not yet described, ESexcluded. This represents the expected (i.e., specified by the meta-analyst) average effect size of excluded studies. A reasonable choice, paralleling the assumption of Rosenthal’s (1979) approach, might be zero. In this case, the failsafe number (Nes) would tell us how many excluded studies with an average effect size of zero would have to exist before the true effect size would be reduced to the smallest meaningful effect size (ESmin). Although this is likely a good choice for many situations, the flexibility to specify alternative effect sizes of excluded studies addresses the first criticism of traditional approaches described above.

Although this approach to failsafe N based on minimum effect size alle­viates two critiques of Rosenthal’s original approach, it is still subject to the other two criticisms. First, this approach still assumes that the excluded studies have the same average sample size as included studies. I believe that this results in a conservative bias in most situations; if the excluded studies tend to have smaller samples than the included studies, then the failsafe N is smaller than necessary. Second, this approach also does not model het­erogeneity, and therefore is not informative when you find significant het­erogeneity and rely on random-effects models. A third limitation, unique to this approach, is that there do not exist solid guidelines for determining how large the failsafe number should be before you conclude that results are robust to the file drawer problem. I suspect that this number is smaller than Rosenthal’s (1979) 5k + 10 rule, but more precise numbers have not been developed.

To illustrate computation of these failsafe numbers, I again consider the example meta-analysis of Table 11.1. Summing the Zs (not the Zrs) across the 22 studies yields 127.93, from which I compute Zc = 127.93 / √22 = 27.27. To compute Rosenthal’s (1979) failsafe number of studies with effect sizes of zero needed to reduce the relational aggression with rejection association to nonsignificance, I apply Equation 11.4:

\[ N = k \left( \frac{Z_c}{Z_\alpha} \right)^2 - k = \left( \frac{127.93}{1.645} \right)^2 - 22 \approx 6{,}026 \]

This means that there could exist up to 6,026 studies, with an average correlation of 0, before my conclusion of a significant association is threatened. This is greater than the value recommended by Rosenthal (1979) (i.e., 5k + 10 = 5 × 22 + 10 = 120), so I would consider my conclusion of an association between relational aggression and rejection to be robust to the file drawer problem. However, it is more satisfying to discuss the robustness of the magnitude, rather than just the significance of this association, so I also use Orwin’s (1983) approach of Equation 11.5. Under the assumption that excluded studies have effect sizes of 0 (i.e., in Equation 11.5, ESexcluded = 0), I find that 5 excluded studies could reduce the average correlation to .30, 19 could reduce it to .20, and 59 would be needed to reduce it to .10. Although there are no established guidelines for Orwin’s failsafe numbers, it seems reasonable to conclude that it is plausible that the effect size could be less than a medium correlation (i.e., less than the standard of r = ±.30) but perhaps implausible that the effect size could be less than a small correlation (i.e., less than the standard of r = ±.10).
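These effect size failsafe numbers can be checked with a short script. The sketch assumes Orwin's formula in its usual form, N = k(ESm − ESmin) / (ESmin − ESexcluded), applied to the mean r = .368 from the k = 22 studies; the function name is hypothetical, and rounding up to the next whole study is my assumption about how the reported figures were obtained.

```python
import math

def orwin_failsafe_n(k, es_mean, es_min, es_excluded=0.0):
    """Studies averaging es_excluded needed to pull es_mean down to es_min."""
    if es_min <= es_excluded:
        raise ValueError("es_min must exceed es_excluded")
    return k * (es_mean - es_min) / (es_min - es_excluded)

k, mean_r = 22, 0.368
for target in (0.30, 0.20, 0.10):
    n_es = orwin_failsafe_n(k, mean_r, target)
    print(f"to reach r = {target:.2f}: {n_es:.2f}, rounded up to {math.ceil(n_es)}")

# A different assumption about excluded studies (hypothetical value of -.10)
# shows how the result depends on ES_excluded.
print(f"{orwin_failsafe_n(k, mean_r, 0.30, es_excluded=-0.10):.2f}")
```

Rounded up, the three targets give 5, 19, and 59 studies, matching the values reported in the text.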

5. Trim and Fill

The trim and fill method is a method of correcting for publication bias (see Duval, 2005) that involves a two-step iterative procedure. The conceptual rationale for this method is illustrated by considering the implications of publication bias on funnel plots (recall that the corner of the funnel denoting studies with small sample sizes and small effect sizes is underrepresented), which causes bias in estimating both the mean effect size and the heteroge­neity around this effect size. The trim and fill approach uses a two-step cor­rection that attempts to provide more accurate estimates of both mean and spread in effect sizes.

The first step of this approach is to temporarily “trim” studies contributing to funnel plot asymmetry. Considering Figure 11.2 (in which the funnel plot is expected to be asymmetric in having more studies in the upper left section than the lower left section when there is publication bias), this trimming involves temporarily removing studies until you obtain a symmetric funnel plot (often shaped like a bar in the vertical middle of this plot). You then estimate an unbiased mean effect size from the remaining studies for use in the second step.

The second step involves reinstating the previously trimmed studies (resulting in the original asymmetric funnel plot) and then imputing studies in the underrepresented section (lower left of Figure 11.2) until you obtain a symmetric funnel plot. This symmetric funnel plot allows for accurate esti­mation of both the mean and heterogeneity (or between-study variance) of effect sizes. This two-step process is repeated several times until you reach a convergence criterion (in which trimming and filling produce little change to estimates).

As you might expect, this is not an approach performed by hand, and the exact statistical details of trimming and filling are more complex than I have presented here (for details, see Duval, 2005). Fortunately, this approach is included in some software packages for meta-analysis (this represents an exception to my general statement that meta-analysis can be conducted by hand or with a simple spreadsheet program, though you could likely program this approach into traditional software packages). There also exist variations depending on modeling of between-study variability beyond sampling fluctu­ation (random- versus fixed-effects) and choice of estimation method. These methods have not yet been fully resolved.

Despite the need for specialized software and some unresolved statisti­cal issues, the trim and fill method represents a useful way to correct for potential publication bias. Importantly, this method is not to be used as the primary reporting of results of a meta-analysis. In other words, you should not impute study values, analyze the resulting dataset including these values, and report the results as if this was what was “found” in the meta-analysis. Instead, you should compute results using the trim and fill method for com­parison to those found from the studies actually obtained. If the estimates are comparable, then you conclude that the original results are robust to publica­tion bias, whereas discrepancies suggest that the obtained studies produced biased results.

6. Weighted Selection Approaches

An additional method of managing publication bias is through selection method approaches (Hedges & Vevea, 2005), also called weighted distribution theory corrections (Begg, 1994). These methods are complex, and I do not attempt to describe them fully here (see Hedges & Vevea, 2005). The central concept of these approaches is to construct a distribution of inclusion likelihood (i.e., a selection model) that is used for weighting the results obtained. Specifically, studies with characteristics that are related to lower likelihood of inclusion are given more weight than studies with characteristics related to higher likelihood of inclusion.

This distribution of inclusion likelihood is based on characteristics of studies that are believed to be related to inclusion in the meta-analysis. For example, you might expect the likelihood to be related to the level of statistical significance, such that studies finding significant results are more likely to be included than those that do not. Because it is usually difficult to empirically derive values for this likelihood distribution, the most common practice is to base these on a priori models. A variety of models have been suggested (see Begg, 1994; Hedges & Vevea, 2005), including models that propose equal likelihood for studies with p < .05 and then a gradually declining likelihood, as well as models that consist of steps corresponding to diminished likelihood at ps that are psychologically salient (e.g., ps = .01, .05, .10). Other models focus more on effect sizes, sometimes in combination with standard errors. Your choice of one of these models should be guided by the underlying selection process that you believe is operating, though this decision can be difficult to make in the absence of field-specific information. It is also necessary for these approaches to be applied within a meta-analysis with a large number of studies. In sum, this weighted selection approach appears promising, but some important practical issues need to be resolved before it can be widely used.
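The central idea of giving more weight to studies that were less likely to be included can be illustrated with a deliberately simplified sketch. This is not the estimation procedure of Hedges and Vevea (2005), which works through likelihood methods. Here the usual inverse-variance weight is simply divided by an a priori inclusion likelihood. The step values (1.0, 0.9, 0.7, 0.5), the function names, and the data are all hypothetical.

```python
import math

def one_tailed_p(effect, se):
    """One-tailed p value for an effect in the expected (positive) direction."""
    z = effect / se
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def step_inclusion_likelihood(p):
    """A priori step function with drops at p = .01, .05, and .10.

    The likelihood values are hypothetical placeholders.
    """
    if p < 0.01:
        return 1.0
    if p < 0.05:
        return 0.9
    if p < 0.10:
        return 0.7
    return 0.5

def weighted_means(effects, ses):
    """Return (unadjusted mean, selection-adjusted mean)."""
    num = den = adj_num = adj_den = 0.0
    for es, se in zip(effects, ses):
        w = 1.0 / se ** 2
        likelihood = step_inclusion_likelihood(one_tailed_p(es, se))
        num += w * es
        den += w
        # Studies less likely to be included count for more.
        adj_num += (w / likelihood) * es
        adj_den += w / likelihood
    return num / den, adj_num / adj_den

# Hypothetical studies: two clearly significant, two nonsignificant.
es = [0.45, 0.40, 0.10, 0.05]
se = [0.10, 0.10, 0.10, 0.10]

unadjusted, adjusted = weighted_means(es, se)
print(f"unadjusted mean = {unadjusted:.3f}, adjusted mean = {adjusted:.3f}")
```

Because the nonsignificant studies are assumed to be half as likely to appear, each stands in for two studies, and the adjusted mean moves toward their smaller effect sizes.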


Practical Matters: What Impact Do Sampling Biases Have on Meta-Analytic Conclusions?

The short answer to the question, “What impact do sampling biases have on the conclusions of a meta-analysis?” is “I don’t know.” As a meta-analyst you do not know, your readers do not know, and it is not possible to know unless you could obtain every study that has ever been conducted on the topic of the meta-analysis. Because obtaining every study is almost never possible (and if you did, there is by definition no bias because you have obtained the popula­tion of studies), this question is impossible to answer.

The magnitude of sampling bias likely varies considerably from field to field and even from one meta-analysis to another. So, it is appropriate to always be concerned about the extent to which publication bias impacts the findings of a meta-analysis. Does this mean every meta-analysis should be viewed as untrustworthy and uninformative? Absolutely not. You should remember that the available literature is all that we as scientists have, so if you dismiss this literature as not valuable, then we have nothing on which to base our empirical sciences. Moreover, it is important to remember that a meta-analytic review is no more subject to sampling bias than other literature reviews. In fact, meta-analysis offers two advantages over traditional approaches to literature review that allow us to face the challenge of sampling bias. First, meta-analysts typically are more exhaustive in searching the literature than those performing narrative reviews, and the search procedures are made transparent in the reporting of meta-analyses. Second, only meta-analysis allows you to evaluate and potentially correct for publication/sampling bias. Although there is no guarantee that these methods will perfectly fix the problem, they are far better than simply ignoring it.


Meta-Analysis to Obtain Sufficient Statistics

1. Sufficient Statistics for Multivariate Analyses

As you may recall (fondly or not) from your multivariate statistics courses, nearly all multivariate analyses do not require the raw data. Instead, you can perform these analyses using sufficient statistics: summary information from your data that can be inserted into matrix equations to provide estimates of multivariate parameters. Typically, the sufficient statistics are the variances and covariances among the variables in your multivariate analysis, along with some index of sample size for computing standard errors of these parameter estimates. For some analyses, you can instead use correlations to obtain standardized multivariate parameter estimates. Although the analysis of correlation matrices, rather than variance/covariance matrices, is often less than optimal, a focus on correlation matrices is advantageous in the context of multivariate meta-analysis for the same reason that correlations are generally preferable to covariances in meta-analysis (see Chapter 5). I next briefly summarize how correlation matrices can be used in multivariate analyses, focusing on multiple regression, exploratory factor analysis, and confirmatory factor analysis. Although these represent only a small sampling of possible multivariate analyses, this focus should highlight the wide range of possibilities of using multivariate meta-analysis.

1.1. Multiple Regression

Multiple regression models fit linear equations between a set of predictors (independent variables) and a dependent variable. Of interest are both the unique prediction each independent variable has to the dependent variable above and beyond the other predictors in the model (i.e., the regression coefficient, B) and the overall prediction of the set (i.e., the variance in the dependent variable explained, R²). Both the standardized regression coefficients of each predictor and overall variance explained (i.e., squared multiple correlation, R²) can be estimated from (1) the correlations among the independent variables (a square matrix, Rii, with the number of rows and columns equal to the number of predictors), and (2) the correlations of each independent variable with the dependent variable (a column vector, Riy, with the number of rows equal to the number of predictors), using the following equations¹ (Tabachnick & Fidell, 1996, p. 142):
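In matrix form these relations are conventionally written as B = Rii⁻¹ Riy and R² = Ryi B, where Ryi is the transpose of Riy. The following is a minimal Python sketch under that assumption. The function name is hypothetical, and the input values are borrowed from the weighted mean correlations reported later in this chapter purely as illustrative input, not as a reproduction of any published regression result.

```python
import numpy as np

def regression_from_correlations(r_ii, r_iy):
    """Standardized coefficients and R^2 computed from correlations alone.

    r_ii: correlations among the predictors (square matrix)
    r_iy: correlations of each predictor with the outcome (vector)
    """
    r_ii = np.asarray(r_ii, dtype=float)
    r_iy = np.asarray(r_iy, dtype=float)
    # B = Rii^-1 Riy, solved without forming the inverse explicitly
    beta = np.linalg.solve(r_ii, r_iy)
    # R^2 = Ryi B
    r_squared = float(r_iy @ beta)
    return beta, r_squared

# Illustrative input: two predictors correlated .565 with each other and
# .318 and .330 with the outcome.
beta, r_squared = regression_from_correlations(
    [[1.0, 0.565], [0.565, 1.0]], [0.318, 0.330]
)
print(np.round(beta, 3), round(r_squared, 3))
```

No sample size enters this computation, which is why standard errors need additional information beyond the correlations.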

1.2. Exploratory Factor Analysis

Exploratory factor analysis (EFA) is used to extract a parsimonious set of fac­tors that explain associations among a larger set of variables. This approach is commonly used to determine (1) how many factors account for the associa­tions among variables, (2) the strengths of associations of each variable on a factor (i.e., the factor loadings), and (3) the associations among the factors (assuming oblique rotation). For each of these goals, exploratory factor anal­ysis is preferred to principal components analysis (PCA; see, e.g., Widaman, 1993, 2007), so I describe EFA only. I should note that my description here is brief and does not delve into the many complexities of EFA; I am being brief because I seek only to remind you of the basic steps of EFA without providing a complete overview (for more complete coverage, see Cudeck & MacCallum, 2007).

Although the matrix algebra of EFA can be a little daunting, all that is initially required is the correlation matrix (R) among the variables, which is a square matrix of p rows and columns (where p is the number of variables). From this correlation matrix, it is possible to compute a matrix of eigenvectors, V, which has p rows and m columns (where m is the number of factors).²

To determine the number of factors that can be extracted, you extract the maximum number of factors³ and then examine the resulting eigenvalues contained in the m × m diagonal matrix L:

You decide on the number of factors to retain based on the magnitudes of the eigenvalues contained in L. A minimum (i.e., necessary but not sufficient) threshold is known as Kaiser’s (1970) criterion, which requires that the eigenvalue of a retained factor be greater than 1.0. Beyond this criterion, it is common to rely on a scree plot, sometimes with parallel analysis, as well as considering the interpretability of rival solutions, to reach a final determination of the number of factors to retain.

The analysis then proceeds with a specified number of factors (i.e., some fixed value of m that is less than p). Here, the correlation matrix (R) is expressed in terms of a matrix of unrotated factor loadings (A), which are themselves calculated from the matrices of eigenvectors (V) and eigenvalues (L):
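Assuming the usual relations L = V'RV for the eigenvalues and A = V√L for the unrotated loadings, the extraction steps can be sketched with NumPy. This simplified illustration decomposes R directly. EFA software typically iterates on communality estimates, so treat it as a sketch of the matrix logic only. The function name and the example correlation matrix are hypothetical.

```python
import numpy as np

def unrotated_loadings(r, n_factors=None):
    """Eigenvalues of R and unrotated loadings A = V * sqrt(L)."""
    r = np.asarray(r, dtype=float)
    # eigh is meant for symmetric matrices and returns ascending eigenvalues
    eigenvalues, eigenvectors = np.linalg.eigh(r)
    order = np.argsort(eigenvalues)[::-1]  # re-sort from largest to smallest
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    if n_factors is None:
        # Kaiser's criterion used only as a minimum threshold (eigenvalue > 1.0)
        n_factors = int(np.sum(eigenvalues > 1.0))
    v = eigenvectors[:, :n_factors]
    l_matrix = np.diag(eigenvalues[:n_factors])
    loadings = v @ np.sqrt(l_matrix)
    return eigenvalues, loadings

# Hypothetical correlations among four variables forming two clusters
example_r = np.array([
    [1.0, 0.6, 0.1, 0.1],
    [0.6, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.6],
    [0.1, 0.1, 0.6, 1.0],
])
eigenvalues, loadings = unrotated_loadings(example_r)
print(np.round(eigenvalues, 2))
print(loadings.shape)
```

With all p factors retained, the product of the loading matrix and its transpose returns R exactly. With fewer factors it only approximates R, which is the sense in which the retained factors "account for" the correlations.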

In order to improve the interpretability of factor loadings (contained in the matrix A), you typically apply a rotation of some sort. Numerous rotations exist, with the major distinction being between orthogonal rotations, in which the correlations among factors are constrained to be zero, versus oblique rotations, in which nonzero correlations among factors are estimated. Oblique rotations are generally preferable, given that it is rare in social sciences for factors to be truly orthogonal. However, oblique rotations are also more computationally intensive (though this is rarely problematic with modern computers) and can yield various solutions using different criteria, given that you are attempting to estimate both factor loadings and factor intercorrelations simultaneously. I avoid the extensive consideration of alternative estimation procedures by simply stating that the goal of each approach is to produce a reproduced (i.e., model-implied) correlation matrix that closely corresponds (by some criterion) to the actual correlation matrix (R). This reproduced matrix is a function of (1) the pattern matrix (A), which here (with oblique rotation) represents the unique relations of variables with factors (controlling for associations among factors), and (2) the factor correlation matrix (Φ), which represents the correlations among the factors⁴:

When the reproduced correlation matrix (R̂) adequately reproduces the observed correlation matrix (R), the analysis is completed. You then interpret the values within the pattern matrix (A) and matrix of factor correlations (Φ) to address the second and third goals of EFA described above.
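To make the comparison of reproduced and observed correlations concrete, here is a small sketch assuming the reproduced matrix is the pattern matrix times the factor correlation matrix times the transposed pattern matrix. All numeric values and function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def reproduced_correlations(pattern, phi):
    """Model-implied correlations: pattern @ phi @ pattern'."""
    pattern = np.asarray(pattern, dtype=float)
    phi = np.asarray(phi, dtype=float)
    return pattern @ phi @ pattern.T

def max_offdiag_residual(observed, reproduced):
    """Largest absolute off-diagonal gap between observed and reproduced."""
    observed = np.asarray(observed, dtype=float)
    mask = ~np.eye(observed.shape[0], dtype=bool)  # ignore the diagonal
    return float(np.abs((observed - reproduced)[mask]).max())

# Hypothetical oblique solution: four variables, two correlated factors
pattern = np.array([[0.8, 0.0], [0.7, 0.0], [0.0, 0.8], [0.0, 0.7]])
phi = np.array([[1.0, 0.3], [0.3, 1.0]])
observed = np.array([
    [1.00, 0.55, 0.20, 0.17],
    [0.55, 1.00, 0.16, 0.15],
    [0.20, 0.16, 1.00, 0.57],
    [0.17, 0.15, 0.57, 1.00],
])
reproduced = reproduced_correlations(pattern, phi)
print(np.round(reproduced, 3))
print(round(max_offdiag_residual(observed, reproduced), 3))
```

The diagonal of the reproduced matrix holds communalities rather than 1.0, so only the off-diagonal residuals are compared here.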

1.3. Confirmatory Factor Analysis

In many cases, it may be more appropriate to rely on a confirmatory, rather than an exploratory, factor analysis. A confirmatory factor analysis (CFA) is estimated by fitting the data to a specified model in which some factor loadings (or other parameters, such as residual covariances among variables) are specified as fixed to zero versus freely estimated. Such a model is often a more realistic representation of your expected factor structure than is the EFA.⁵

Like the EFA, the CFA estimates associations among factors (typically called “constructs” or “latent variables” in CFA) as well as strengths of associations between variables (often called “indicators” or “manifest variables” in CFA) and constructs. These parameters are estimated as part of the general CFA matrix equation⁶:

To estimate a CFA, you place certain constraints on the model to set the scale of latent constructs (see Little, Slegers, & Card, 2006) and ensure identification (see Kline, 2010, Ch. 6). For example, you might specify that there is no factor loading of a particular indicator on a particular construct (vs. an EFA, in which this would be estimated even if you expected the value to be small). Using Equation 12.5, a software program (e.g., LISREL, EQS, Mplus) is used to compute values of factor loadings (values within the Λ matrix), latent variances and covariances (values within the Ψ matrix), and residual variances (and sometimes residual covariances; values within the Θ matrix) that yield a model-implied variance/covariance matrix, Σ. The values are selected so that this model-implied matrix closely matches the observed (i.e., from the data) variance/covariance matrix (S) according to some criterion (most commonly, the maximum likelihood criterion minimizing a fit function). For CFA of primary data, the sufficient statistics are therefore the variances and covariances comprising S; however, it is also possible to use correlation coefficients such as would be available from meta-analysis to fit CFAs (see Kline, 2010, Ch. 7).⁷
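As a rough illustration of what the software computes at each step, the sketch below assumes the conventional CFA expression in which the model-implied matrix equals the loading matrix times the latent variance/covariance matrix times the transposed loading matrix, plus the residual matrix. The parameter values are hypothetical, with cross-loadings fixed to zero to mimic the constraints described above. No estimation is performed here; the sketch only evaluates the implied matrix for one candidate set of parameter values.

```python
import numpy as np

def implied_covariance(loadings, latent_cov, residual_cov):
    """Model-implied matrix: loadings @ latent_cov @ loadings' + residual_cov."""
    loadings = np.asarray(loadings, dtype=float)
    latent_cov = np.asarray(latent_cov, dtype=float)
    residual_cov = np.asarray(residual_cov, dtype=float)
    return loadings @ latent_cov @ loadings.T + residual_cov

# Hypothetical values: four indicators, two constructs, zero cross-loadings
lam = np.array([[0.8, 0.0], [0.7, 0.0], [0.0, 0.7], [0.0, 0.6]])
psi = np.array([[1.0, 0.4], [0.4, 1.0]])       # latent variances fixed to 1
theta = np.diag([0.36, 0.51, 0.51, 0.64])      # residual variances
sigma = implied_covariance(lam, psi, theta)
print(np.round(sigma, 3))
```

Because the latent variances are set to 1 and each residual variance equals 1 minus the squared loading, the implied matrix has a unit diagonal and can be compared directly with a correlation matrix.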

2. The Logic of Meta-Analytically Deriving Sufficient Statistics

The purpose of the previous section was not to fully describe the matrix equations of multiple regression, EFA, and CFA. Instead, I simply wish to illustrate that a range of multivariate analyses can be performed using only correlations. Other multivariate analyses are possible, including canonical correlations, multivariate analysis of variance or covariance, and structural equation modeling. In short, any analysis that can be performed using a cor­relation matrix as sufficient information can be used as a multivariate model for meta-analysis.

The “key” of multivariate meta-analysis then is to use the techniques of meta-analysis described throughout this book to obtain average correlations from multiple studies. Your goal is to compute a meta-analytic mean correlation for each of the correlations in a matrix of p variables. Therefore, your task in a multivariate meta-analysis is not simply to perform one meta-analysis to obtain one correlation, but to perform multiple meta-analyses to obtain all possible correlations among a set of variables. Specifically, the number of correlations in a matrix of p variables is equal to p(p − 1)/2. This correlation matrix (R) of these mean correlations is then used in one of the multivariate analyses described above.

3. The Challenges of Using Meta-Analytically Derived Sufficient Statistics

Although the logic of this approach is straightforward, several complications arise (see Cheung & Chan, 2005a). The first is that it is unlikely that every study that provides information on one correlation will provide information on all correlations in the matrix. Consider a simple situation in which you wish to perform some multivariate analysis of variables X, Y, and Z. Study 1 might provide all three correlations (rxy, rxz, and ryz). However, Study 2 did not measure Z, so it only provides one correlation (rxy); Study 3 failed to measure Y and so also provides only one correlation (rxz); and so on. In other words, multivariate meta-analysis will almost always derive different average correlations from different subsets of studies.

This situation poses two problems. First, it is possible that different correlations from very different sets of studies could yield a correlation matrix that is nonpositive definite. For example, imagine that three studies reporting rxy yield an average value of .80 and four studies reporting rxz yield an average value of .70. However, the correlation between Y and Z is reported in three different studies, and the meta-analytic average is −.50. It is not logically possible for there to exist, within the population, a strong positive correlation between X and Y, a strong positive correlation between X and Z, but a strong negative correlation between Y and Z.⁸ Most multivariate analyses cannot use such nonpositive definite matrices. Therefore, the possibility that such nonpositive definite matrices can occur if different subsets of studies inform different correlations within the matrix represents a challenge to multivariate meta-analysis.
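One way to check a pooled matrix for this problem is to inspect its eigenvalues: a correlation matrix is positive definite only if all of them are greater than zero. The sketch below applies that check to the three average correlations from the example above (.80, .70, and −.50). The function name and tolerance are assumptions made for illustration.

```python
import numpy as np

def is_positive_definite(r, tol=1e-10):
    """Return (flag, eigenvalues); flag is True only if every eigenvalue > tol."""
    eigenvalues = np.linalg.eigvalsh(np.asarray(r, dtype=float))
    return bool(eigenvalues.min() > tol), eigenvalues

# Variables ordered X, Y, Z: rxy = .80, rxz = .70, ryz = -.50
pooled = [
    [1.0, 0.8, 0.7],
    [0.8, 1.0, -0.5],
    [0.7, -0.5, 1.0],
]
ok, eigenvalues = is_positive_definite(pooled)
print(ok, np.round(eigenvalues, 3))
```

For this matrix the determinant is negative (1 − .56 − .64 − .49 − .25 = −.94), so at least one eigenvalue must be negative and the check fails.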

Another challenge that arises from the meta-analytic combination of different studies for different correlations within the matrix has to do with uncertainty about the effective sample size. Although many multivariate analyses can provide parameter estimates from correlations alone, the stan­dard errors of these estimates (for significance testing or constructing confi­dence intervals) require knowledge of the sample size. When the correlations are meta-analytically combined from different subsets of studies, it is unclear what sample size should be used (e.g., the smallest sum of participants among studies for one of the correlations; the largest sum; or some average?).

A final challenge of multivariate meta-analysis is how we manage heterogeneity among studies. By computing a matrix of average correlations, we are implicitly assuming that one value adequately represents the populations of effect sizes. However, as I discussed earlier, it is more appropriate to test this homogeneity (vs. heterogeneity; see Chapter 8) and to model this population heterogeneity in a random-effects model if it exists (see Chapter 10). Only one of the two approaches I describe next can model between-study variances in a random-effects model.


Two Approaches to Multivariate Meta-Analysis

Given the challenges I described in the previous section, multivariate meta-analysis is considerably more complex than simply synthesizing several correlations to serve as input for a multivariate analysis. The development of models that can manage these challenges is an active area of research, and the field has not yet resolved which approach is best. In this section, I describe two approaches that have received the most attention: the meta-analytic structural equation modeling (MASEM) approach described by Cheung and Chan (2005a) and the generalized least squares (GLS) approach by Becker (e.g., 2009). I describe both for two reasons. First, you might read meta-analyses using either approach, so it is useful to be familiar with both. Second, given that research on both approaches is active, it is difficult for me to predict which approach might emerge as superior (or, more likely, superior in certain situations). However, as the state of the field currently stands, the GLS approach is more flexible in that it can estimate either fixed- or random-effects mean correlations (whereas the MASEM approach is limited to fixed-effects models⁹). For this reason, I provide considerably greater coverage of the GLS approach.

To illustrate these approaches, I expand on the example described earlier in the book. Table 12.1 summarizes 38 studies that provide correlations among relational aggression (e.g., gossiping), overt aggression (e.g., hitting), and peer rejection.¹⁰ Here, 16 studies provide all three correlations among these variables, 6 provide correlations of both relational and overt aggression to peer rejection, and 16 provide the correlation between overt and relational aggression. This particular example is somewhat artificial, in that (1) a selection criterion for studies in this review was that results be reported for both relational and overt forms of aggression (otherwise, there would not be perfect overlap in the correlations of these two forms with peer rejection), and (2) for simplicity of presentation, I selected only the first 16 studies, out of 82 studies in the full meta-analysis, that provided only the overt with relational aggression correlation. Nevertheless, the example is realistic in that the three correlations come from different subsets of studies, and contain different numbers of studies and participants (for the relational-overt correlation, k = 32, N = 11,642; for the relational-rejection and overt-rejection correlations, k = 22, N = 8,081). I next use this example to illustrate how each approach would be used to fit a multiple regression of both forms of aggression predicting peer rejection.

1. The MASEM Approach

One broad approach to multivariate meta-analysis is the MASEM approach described by Cheung and Chan (2005a). This approach relies on SEM meth­odology, so you must be familiar with this technique to use this approach. Given this restriction, I write this section with the assumption that you are at least somewhat familiar with SEM (if you are not, I highly recommend Kline, 2010, as an accessible introduction).

In this approach, you treat the correlation matrix from each study as sufficient statistics for a group in a multigroup SEM. In other words, each study is treated as a group, and the correlations obtained from each study are entered as the data for that group. Although the multigroup approach is relatively straightforward if all studies provided all correlations, this is typically not the case. The MASEM approach accounts for situations in which some studies do not include some variables, by not estimating the parameters involving those variables for that “group.” However, the parameter estimates are constrained equal across groups, so identification is ensured (assuming that the overall model is identified). Note that this approach considers the completeness of studies in terms of variables rather than correlations (in contrast to the GLS approach described in Section 12.2.2). In other words, this approach assumes that if a variable is present in a study, the correlations of that variable with all other variables in the study are present. To illustrate using the example, if a study measured relational aggression, overt aggression, and peer rejection, then this approach requires that you obtain all three correlations among these variables. If a study measured all three variables, but failed to report the correlation between overt aggression and rejection (and you could not obtain this correlation), then you would be forced to treat the study as if it failed to measure either overt aggression or rejection (i.e., you would ignore either the relational-overt or the relational-rejection correlation).

The major challenge to this approach comes from the equality constraints on all parameters across groups. These constraints necessarily imply that the studies are homogeneous. For this reason, Cheung and Chan (2005a) recommended that the initial step in this approach be to evaluate the homogeneity versus heterogeneity of the correlation matrices. They propose a method in which you evaluate heterogeneity through nested-model comparison of an unrestricted model in which the correlations are freely estimated across studies (groups) versus a restricted model in which they are constrained equal.¹¹ If the change in model fit is nonsignificant (i.e., the null hypothesis of homogeneity is retained), then you use the correlations (which are constrained equal across studies) and their asymptotic covariance matrix as sufficient statistics for your multivariate model (e.g., multiple regression in my example or, as described by Cheung & Chan, 2005a, within an SEM). However, if the change is significant (i.e., the alternative hypothesis of heterogeneity is supported), then it is not appropriate to leave the equality constraints in place. In this situation of heterogeneity, this original MASEM approach cannot be used to evaluate models for the entire set of studies (but see footnote 9). Cheung and Chan (2005a) offer two recommendations to overcome this problem. First, you might divide studies based on coded study characteristics until you achieve within-group homogeneity. If you take this approach, then you must focus on moderator analyses rather than make overall conclusions. Second, if the coded study characteristics do not fully account for the heterogeneity, you can perform the equivalent of a cluster analysis that will empirically classify studies into more homogeneous subgroups (Cheung & Chan, 2005b). However, the model results from these multiple empirically identified groups might be difficult to interpret.

Given the requirement of homogeneity of correlations, this approach might be limited if your goal is to evaluate an overall model across studies. In the illustrative example, I found significant heterogeneity (i.e., increase in model misfit when equality constraints across studies were imposed). I suspect that this heterogeneity is likely more common than homogeneity. Furthermore, I was not able to remove this heterogeneity through coded study characteristics. To use this approach, I would have needed to empirically classify studies into more homogeneous subgroups (Cheung & Chan, 2005b); however, I was dissatisfied with this approach because it would have provided multiple sets of results without a clear conceptual explanation. Although this MASEM approach might be modified in the future to accommodate heterogeneity (look especially for work by Mike Cheung), it did not fit my needs within this illustrative meta-analysis of relational aggression, overt aggression, and peer rejection. As I show next, the GLS approach was more tractable in this example, which illustrates its greater flexibility.

2. The GLS Approach

Becker (1992; see 2009 for a comprehensive overview) has described a GLS approach to multivariate meta-analysis. This approach can be explained in seven steps; I next summarize these steps as described in Becker (2009) and provide results for the illustration of relational and overt aggression predict­ing peer rejection.

2.1. Data Management

The first step is to arrange the data in a way that information from each study is summarized in two matrices. The first matrix is a column vector of the Fisher’s transformed correlations (Zr) from each study i, denoted as zi. The number of rows of this matrix for each study will be equal to the number of correlations provided; for example, from the data in Table 12.1, this matrix will have one row for the Andreou (2006) study, three rows for the Blachman (2003) study, and two rows for the Ostrov et al. (2004) study. The second matrix for each study is an indicator matrix (Xi) that denotes which correlations are represented in each study. The number of columns in this matrix will be constant across studies (the total number of correlations in the meta-analysis), but the number of rows will be equal to the number of correlations in the particular study. To illustrate these matrices, consider the 33rd study in Table 12.1, that by Rys and Bear (1997); the matrices (note that the z matrix contains Fisher’s transformations of rs shown in the table) for this study are:
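The construction of these two matrices can be sketched in a few lines of Python. The helper name and the dictionary-based input format are assumptions made for illustration. The Zr values used are those quoted later in this section for Blachman (2003) and for Zalecki and Hinshaw (2004).

```python
# Column order must be the same for every study
CORRELATION_ORDER = ("relational-overt", "relational-rejection", "overt-rejection")

def study_matrices(zr_by_pair):
    """Build (z_i, X_i) for one study from the Zr values it reports.

    zr_by_pair maps a correlation label to its Fisher's Zr; correlations the
    study does not report are simply absent from the dict.
    """
    z_i, x_i = [], []
    for column, pair in enumerate(CORRELATION_ORDER):
        if pair in zr_by_pair:
            z_i.append(zr_by_pair[pair])
            # One row per reported correlation, with a 1 in that correlation's column
            x_i.append([1 if j == column else 0 for j in range(len(CORRELATION_ORDER))])
    return z_i, x_i

# A study reporting all three correlations
z_full, x_full = study_matrices({
    "relational-overt": 0.472,
    "relational-rejection": 0.527,
    "overt-rejection": 0.681,
})
# A study reporting only the second and third correlations
z_part, x_part = study_matrices({
    "relational-rejection": 0.571,
    "overt-rejection": 0.635,
})
print(x_full)
print(x_part)
```

A study that reports two of the three correlations therefore contributes a two-row vector and a 2 × 3 indicator matrix, while a study reporting all three contributes a 3 × 3 identity matrix.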

Note that this study, which provides two of the three correlations, is represented with matrices of two rows. The indicator matrix (X33) specifies that these two correlations are the second and third correlations under con­sideration (the order is arbitrary, but needs to be consistent across studies; here, I have followed the order shown in Table 12.1).

2.2. Estimating Variances and Covariances of Study Effect Size Estimates

Just as it was necessary to compute the standard errors of study effect size estimates in all meta-analyses (see Chapters 5 and 8), we must do so in this approach to multivariate meta-analysis. Here, I describe the variances of estimates of effect sizes, which are simply the squared standard errors: Var(Zr) = SEZr². So the variance of each Zr effect size is simply 1 / (Ni − 3). However, for a multivariate meta-analysis, in which multiple effect sizes are considered, you must consider not only the variance of estimate of each effect size, but also the covariances among these estimates (i.e., the uncertainty of estimation of one effect size is associated with the uncertainty of estimation of another effect size within the same study). The covariance of the estimate of the Fisher’s transformed correlation between variables s and t with the estimate of the transformed correlation between variables u and v (where u or v could equal s or t) from Study i is computed from the following equation (Becker, 1992, p. 343; Beretvas & Furlow, 2006, p. 161)¹²:

In this equation, the covariances of estimates are based on two types of information: (1) the sample size, contained in the denominator, which is known for each study; and (2) the population correlations, which are unknown. Although these population correlations are study-specific (in the sense of assuming a population distribution of effect sizes consistent with a random-effects model; see Chapter 10), simulation studies (Furlow & Beretvas, 2005) have shown that the mean correlation across the studies of your meta-analysis is a reasonable estimate of the population correlations for use in this equation. Becker (2009) demonstrates the use of simple sample-size-weighted mean correlations as estimates of these population correlations; that is,
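A sample-size-weighted mean correlation is the sum of each study's correlation multiplied by its sample size, divided by the total sample size across the studies reporting that correlation. A minimal sketch, with a hypothetical function name and made-up study values:

```python
def weighted_mean_r(correlations, sample_sizes):
    """Sample-size-weighted mean of the correlations from the reporting studies."""
    total_n = sum(sample_sizes)
    return sum(n * r for r, n in zip(correlations, sample_sizes)) / total_n

# Hypothetical studies that all report the same correlation
rs = [0.50, 0.60, 0.40]
ns = [100, 200, 100]
print(round(weighted_mean_r(rs, ns), 3))
```

Each of the correlations in the matrix gets its own weighted mean, computed only from the subset of studies that report it.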

From the ongoing example of data shown in Table 12.1, I find sample-size-weighted mean correlations of .565, .318, and .330 for the relational-overt, relational-rejection, and overt-rejection associations, respectively. Inserting these mean correlations into Equation 12.6, I can then compute the variances and covariances of estimates for each study based on the study’s sample size. For instance, the fourth study (Blachman, 2003), which had a sample size (N4) of 228, has the following matrix of variances and covariances of estimates:
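The computation can be sketched as follows. The formula in the code is an assumption: it is the standard large-sample covariance between two correlations, rescaled to the Fisher's Zr metric with N − 3 in the denominator so that the variance of a single Zr reduces to 1 / (N − 3), as stated above. It may differ in form from the equation cited in the text. The mean correlations and the sample size of 228 come from the text, and all names are hypothetical.

```python
MEAN_R = {
    frozenset(("relational", "overt")): 0.565,
    frozenset(("relational", "rejection")): 0.318,
    frozenset(("overt", "rejection")): 0.330,
}

def rho(a, b):
    """Mean correlation used in place of the unknown population value."""
    return 1.0 if a == b else MEAN_R[frozenset((a, b))]

def cov_fisher_z(pair_st, pair_uv, n):
    """Assumed large-sample covariance between Zr(s,t) and Zr(u,v) in one study."""
    s, t = pair_st
    u, v = pair_uv
    r_st, r_uv = rho(s, t), rho(u, v)
    numerator = (
        0.5 * r_st * r_uv
        * (rho(s, u) ** 2 + rho(s, v) ** 2 + rho(t, u) ** 2 + rho(t, v) ** 2)
        + rho(s, u) * rho(t, v)
        + rho(s, v) * rho(t, u)
        - (
            r_st * rho(s, u) * rho(s, v)
            + r_st * rho(t, u) * rho(t, v)
            + rho(s, u) * rho(t, u) * r_uv
            + rho(s, v) * rho(t, v) * r_uv
        )
    )
    return numerator / ((n - 3) * (1 - r_st ** 2) * (1 - r_uv ** 2))

PAIRS = [("relational", "overt"), ("relational", "rejection"), ("overt", "rejection")]
# Variance/covariance matrix of the three Zr estimates for a study with N = 228
cov_z4 = [[cov_fisher_z(a, b, 228) for b in PAIRS] for a in PAIRS]
for row in cov_z4:
    print([round(value, 4) for value in row])
```

Under this formula every diagonal element equals 1 / (228 − 3), and the off-diagonal elements depend only on the mean correlations and the study's sample size, not on the study's own correlations.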

Studies that do not report all three correlations will have matrices that are smaller; specifically, their matrices will have numbers of columns and rows equal to the number of reported effect sizes.

2.3. Estimating a Fixed-Effects Mean Correlation Matrix

After computing effect size (zi), indicator (Xi), and estimation variance/covariance (Cov(zi)) matrices for each study (i), you then create three large matrices that combine these matrices across the individual studies. The first of these is z, which is a column vector of all of the individual effect size vectors from the studies (the zi vectors) stacked. In the example from Table 12.1, this vector would be:

The first three values of this vector (.512, .881, and .617) are the Zrs from the single effect sizes provided by the first three studies (Andreou, 2006; Arnold, 1998; and Berdugo-Arstark, 2002 from Table 12.1). The next three values (.472, .527, and .681) are the three Zrs from the fourth study (Blachman, 2003), which provided three effect sizes. The next value (.448) is the single Zr from the fifth study (Brendgen et al., 2005). I have omitted the values of this matrix until the last (38th) study (Zalecki & Hinshaw, 2004), which provided two effect sizes (Zrs = .571 and .635). In total, this z vector has 76 rows (i.e., a 76 × 1 matrix) that contain the 76 effect sizes from these 38 studies.

The second large matrix is X, which is a stacked matrix of the indicator matrices of the individual studies (Xi). Because all of the study indicator matrices had three columns, this matrix also has three columns. However, each study provides a number of rows to this matrix equal to the number of effect sizes; therefore there will be 76 rows in the X matrix in the example. Specifically, this matrix will look as follows:

The first three rows indicate that the first three studies provide effect sizes for the relational-overt association (the first column, as in Table 12.1). Rows four to six are from the fourth study (Blachman, 2003), indicating that the three effect sizes (corresponding to values in the z vector) are Fisher’s trans­formations of the relational-overt, relational-rejection, and overt-rejection correlations, respectively. Row seven indicates that the fifth study (Brend­gen et al., 2005) contributed an effect size of the relational-overt (i.e., first) association. Again, I have omitted further values of this matrix until the last (38th) study (Zalecki & Hinshaw, 2004), which has two rows in this matrix indicating that it provided effect sizes for the second (relational-rejection) and third (overt-rejection) associations. In total, this X matrix has a number of rows equal to the total number of effect sizes (76 in this example) and a number of columns equal to the number of correlations you are considering (3 in this example).

The final combined matrix is F, which contains the variances/covariances of estimates from the individual studies. Specifically, this matrix is a block-diagonal matrix in which the estimate variance/covariance matrix from each study i is placed along the diagonal, and all other values are 0. This is probably most easily understood by considering this matrix in the context of my ongoing example:

Here, the first three elements along the diagonal represent the variances of the estimates of the single effect sizes provided by these three studies. The next study is represented in the square starting in cell 4, 4 (fourth row, fourth column) to cell 6, 6. These values represent the variances and covari­ances among estimates of the three effect sizes from this study, which were shown above as Cov(z4). The variance of the single effect size of study 5 is shown next along the diagonal. I have again omitted the remaining values until the last (38th) study (Zalecki & Hinshaw, 2004). This study provided two effect sizes, and the variances (both .0035) and covariance (.0019) of these estimates are shown as a square matrix around the diagonal. Note that all other values in this matrix are 0. In total, this F matrix is a square, sym­metric matrix with 76 (total number of effect sizes in this example) rows and columns.

These three matrices, z, X, and F, are then used to estimate (via generalized least squares methods) fixed-effects mean effect sizes, which are contained in a column vector with one row per correlation of interest. The equation to do so looks somewhat daunting, but is a relatively simple matter of matrix algebra (Becker, 2009, p. 389):
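Assuming the standard generalized least squares estimator, the mean vector equals (X'C⁻¹X)⁻¹ X'C⁻¹z, where C stands for the block-diagonal matrix of estimate variances/covariances, and the matrix (X'C⁻¹X)⁻¹ holds the sampling variances of those means on its diagonal. The sketch below uses three hypothetical studies and two correlations of interest. The function name and all numeric values are made up for illustration.

```python
import numpy as np

def gls_fixed_effects(z, x, cov_blocks):
    """GLS fixed-effects means and their variance/covariance matrix.

    z: stacked effect sizes; x: stacked indicator matrix;
    cov_blocks: one variance/covariance matrix per study, in the same order.
    """
    z = np.asarray(z, dtype=float)
    x = np.asarray(x, dtype=float)
    big_cov = np.zeros((len(z), len(z)))
    start = 0
    for block in cov_blocks:
        block = np.atleast_2d(np.asarray(block, dtype=float))
        size = block.shape[0]
        # Place each study's matrix along the diagonal; everything else stays 0
        big_cov[start:start + size, start:start + size] = block
        start += size
    cov_inv = np.linalg.inv(big_cov)
    estimate_cov = np.linalg.inv(x.T @ cov_inv @ x)
    means = estimate_cov @ x.T @ cov_inv @ z
    return means, estimate_cov

# Hypothetical data: study 1 reports both correlations, studies 2 and 3 one each
z = [0.50, 0.30, 0.60, 0.20]
x = [[1, 0], [0, 1], [1, 0], [0, 1]]
blocks = [[[0.010, 0.004], [0.004, 0.010]], [[0.020]], [[0.020]]]
means, estimate_cov = gls_fixed_effects(z, x, blocks)
print(np.round(means, 3))
print(np.round(np.sqrt(np.diag(estimate_cov)), 3))  # standard errors
```

When every study contributes a single effect size and there is one correlation of interest, this reduces to the familiar inverse-variance weighted mean.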

In the ongoing example, working through the matrix algebra yields the following:

These findings indicate that the fixed-effects mean Zrs are .66, .33, and .34 for the relational-overt, relational-rejection, and overt-rejection associa­tions, respectively. Back-transforming these values to the more interpretable r yields .58, .32, and .33. If these fixed-effects values are of interest (see Sec­tion 12.2.2.d on evaluating heterogeneity), then you are likely interested in drawing inference about these mean effect sizes. Variances of the estimates of the mean Zrs (i.e., the squared standard errors) are found on the diagonal of the matrix obtained using the following equation (Becker, 2009, p. 389):
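The back-transformation from Zr to r is the hyperbolic tangent, so the values reported above can be checked directly. The confidence-interval helper is an illustrative addition that builds the interval on the Zr scale before back-transforming. Its name and the variance passed to it are hypothetical.

```python
import math

def z_to_r(z):
    """Back-transform Fisher's Zr to r."""
    return math.tanh(z)

def r_confidence_interval(mean_z, variance, critical=1.96):
    """Interval built on the Zr scale, then back-transformed to r."""
    se = math.sqrt(variance)
    return z_to_r(mean_z - critical * se), z_to_r(mean_z + critical * se)

# Mean Zrs reported in the text
mean_rs = [round(z_to_r(z), 2) for z in (0.66, 0.33, 0.34)]
print(mean_rs)

# Hypothetical squared standard error, for illustration only
lower, upper = r_confidence_interval(0.66, 0.0004)
print(round(lower, 3), round(upper, 3))
```

The first printed list reproduces the .58, .32, and .33 given in the text.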

Just as when you are analyzing a single effect size, the appropriateness of a fixed-effects model depends on whether effect sizes are homogeneous versus heterogeneous. If they are heterogeneous, then you should use a random-effects model (see Chapter 10), which precludes the MASEM approach (Cheung & Chan, 2005a). The test of heterogeneity in the multivariate case is an omnibus test of whether any of the effect sizes significantly vary across studies (more than expected by sampling fluctuation alone; see Chapter 8). Becker (2009) described a significance test that relies on a Q value as in the univariate case, but here this value must be obtained through matrix algebra using the following equation (Becker, 2009, p. 389):

This Q value is evaluated as a χ² distribution, with df equal to the number of effect sizes reported across all studies minus the number of effect sizes of interest.
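One standard way to write the multivariate homogeneity statistic, assumed here to correspond to the form Becker gives, is Q = z′[Ψ⁻¹ − Ψ⁻¹X(X′Ψ⁻¹X)⁻¹X′Ψ⁻¹]z. It is algebraically equal to the weighted sum of squared residuals around the GLS means. The sketch below computes it along with its df; NumPy is assumed, the function name `q_statistic` is illustrative, and the toy data are hypothetical.

```python
import numpy as np

def q_statistic(z, X, psi):
    """Return the multivariate homogeneity statistic Q and its degrees of freedom."""
    psi_inv = np.linalg.inv(psi)
    middle = np.linalg.inv(X.T @ psi_inv @ X)
    M = psi_inv - psi_inv @ X @ middle @ X.T @ psi_inv
    q = float(z @ M @ z)
    # df = number of reported effect sizes minus number of effect sizes of interest.
    df = X.shape[0] - X.shape[1]
    return q, df

# HYPOTHETICAL toy data: four reported effect sizes, three associations of interest.
z = np.array([0.70, 0.62, 0.30, 0.35])
X = np.array([[1, 0, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]], dtype=float)
psi = np.array([[0.0040, 0.0,    0.0,    0.0],
                [0.0,    0.0020, 0.0008, 0.0],
                [0.0,    0.0008, 0.0035, 0.0],
                [0.0,    0.0,    0.0,    0.0030]])

q, df = q_statistic(z, X, psi)
print(round(q, 3), df)
```

The p value can then be read from the upper tail of a χ² distribution with `df` degrees of freedom, for example with `scipy.stats.chi2.sf(q, df)` if SciPy is installed.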

In the example meta-analysis of studies in Table 12.1, Q = 1450.90. Evaluated as a χ² value with 73 df (i.e., 76 reported effect sizes minus 3 effect sizes of interest), this value is statistically significant (p < .001). This significant heterogeneity indicates (1) the need to rely on a random-effects model to obtain mean effect sizes, or (2) the potential to identify moderators of the heterogeneity in effect sizes.

2.4. Estimating a Random-Effects Mean Correlation Matrix

As you recall from Chapter 10, one method of dealing with between-study heterogeneity of a single effect size is to estimate the between-study variance (τ²), and then account for this variance as uncertainty in the weights applied to studies when computing the (random-effects) mean effect size. The same logic applies here, except now you must estimate and account for several between-study variances, one for each effect size in your multivariate model.

The first step, then, is to estimate between-study variances. Although there likely also exists population-level (i.e., beyond sampling fluctuation) covariation in effect sizes across studies, Becker (2009) stated that in practice these covariances are intractable to estimate and that accounting only for between-study variance appears adequate.13 Therefore, you simply estimate the between-study variance (τ²) for each effect size of interest (as described in Chapter 10). In the ongoing example, the estimated between-study variances are .0372, .0357, and .0296 for the relational-overt, relational-rejection, and overt-rejection effect sizes.

As you recall from Chapter 10, the estimated between-study variance for a single effect size (τ²) is added to the study-specific sampling variance (SEj²) to represent the total uncertainty of the study’s point estimate of the effect size, and the random-effects weight is the inverse of this uncertainty: w* = 1/(τ² + SEj²). In this GLS approach, we modify the previously described matrix of variances/covariances of estimates of studies (Ψ) by adding the appropriate between-study variance estimate to the variances (i.e., diagonal elements) to produce a random-effects matrix, ΨRE. To illustrate using the ongoing example:

Comparison of the values in this matrix relative to those in the fixed-effects Ψ is useful. Here we see that the first value on the diagonal (for the first study, Andreou, 2006) is .0384, which is the sum of τ² (.0372) for the effect size indexed by this value (relational-overt) and the study-specific variance of this estimate from the fixed-effects matrix (.0011) (note that rounding error might produce small discrepancies). Similarly, the second value on the diagonal is for the second study, which also provided an effect size of the relational-overt association, and this value (.0414) is the sum of the same τ² (.0372) as Study 1 (because they both report relational-overt effect sizes) plus that study’s sample-size specific variance of sampling error from the fixed-effects matrix (.0042). Consider next the fourth through sixth values on the diagonal, which are for the three effect sizes from Study 4 (Blachman, 2003). The first value (.0393) is for the relational-overt effect size estimate, which is the sum of the τ² for that effect size (.0372) plus the sampling variance for this study and this effect size (.0020). The second value for this study is .0392, which is the sum of the τ² for the relational-rejection effect size (.0357) and the sampling variance for this study and this effect size found in the parallel cell of the fixed-effects matrix (Ψ), .0035. The third value for this study (.0331) is similarly the sum of the τ² for the overt-rejection effect size (.0296) and sampling variance (.0035). Note that the off-diagonal elements (covariances of effect size estimates) do not change in this approach because we have assumed no between-study covariance of population effect sizes.
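The construction of the random-effects matrix can be checked with a short sketch. It assumes NumPy is available and covers only the rows for Study 1, Study 2, and Study 4, using the sampling variances and τ² values quoted above; the within-study covariances for Study 4 are hypothetical placeholders, and the variable names are illustrative.

```python
import numpy as np

# Between-study variances quoted in the text:
# index 0 = relational-overt, 1 = relational-rejection, 2 = overt-rejection.
tau2 = np.array([0.0372, 0.0357, 0.0296])

# Which association each row of the matrix refers to
# (Study 1, Study 2, then the three effect sizes of Study 4).
effect_index = np.array([0, 0, 0, 1, 2])

# Fixed-effects matrix for these rows; the 0.0010 covariances are HYPOTHETICAL.
psi_fixed = np.zeros((5, 5))
psi_fixed[0, 0] = 0.0011
psi_fixed[1, 1] = 0.0042
psi_fixed[2:5, 2:5] = [[0.0020, 0.0010, 0.0010],
                       [0.0010, 0.0035, 0.0010],
                       [0.0010, 0.0010, 0.0035]]

# Add the matching tau^2 to each diagonal element; off-diagonals are untouched.
psi_re = psi_fixed + np.diag(tau2[effect_index])

print(np.round(np.diag(psi_re), 4))
```

Summing the rounded inputs gives .0383, .0414, .0392, .0392, and .0331 on the diagonal. The first and third differ in the last digit from the .0384 and .0393 reported above, which is the rounding discrepancy the text mentions.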

After computing this matrix of random-effects variances and covariances of estimates (ΨRE), it is relatively straightforward to estimate a matrix of random-effects mean effect sizes. You simply use Equation 12.7, but insert ΨRE rather than Ψ. Standard errors of these random-effects mean effect sizes can be estimated using Equation 12.8. In the ongoing illustration, the random-effects mean correlations (back-transformed from Zrs) are .59 (95% confidence interval = .54 to .63) for relational-overt, .32 (95% confidence interval = .24 to .39) for relational-rejection, and .36 (95% confidence interval = .29 to .43) for overt-rejection.

2.5. Fitting a Multivariate Model to the Matrix of Average Correlations

After obtaining the meta-analytically derived matrix of average correlations, it is now possible to fit a variety of multivariate models. Considering the ongoing example, I am interested in fitting a multiple regression model in which relational and overt aggression are predictors of rejection. Recalling that multiple regression analyses partition the correlation matrix into dependent and independent variables (see Equation 12.1), it is useful to display the results of the random-effects mean correlations (which I express more precisely here) as follows:

This overall correlation matrix is then partitioned into matrices of (1) the correlations of the dependent variable (rejection) with the predictors (relational and overt aggression), Riy; and (2) the correlations among the predictors, Rii:

Applying these matrices within Equation 12.1 yields regression coefficients of .16 for relational aggression and .27 for overt aggression. These two predictors explained 14.9% of the variance in the dependent variable in this model (i.e., R² = .149).
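This computation can be sketched as follows, assuming that Equation 12.1 (not reproduced in this section) takes the usual form B = Rii⁻¹Riy with R² = Riy′B, and that NumPy is available. The inputs are the two-decimal random-effects mean correlations reported earlier, so the output differs slightly from the coefficients above, which rest on more precise values.

```python
import numpy as np

# Rounded random-effects mean correlations reported in the text.
r_relational_overt = 0.59
r_relational_rejection = 0.32
r_overt_rejection = 0.36

# Correlations among the predictors (relational and overt aggression).
R_ii = np.array([[1.0, r_relational_overt],
                 [r_relational_overt, 1.0]])
# Correlations of each predictor with the dependent variable (rejection).
R_iy = np.array([r_relational_rejection, r_overt_rejection])

# Standardized regression coefficients: solve R_ii @ B = R_iy.
B = np.linalg.solve(R_ii, R_iy)
# Proportion of variance in rejection explained by the two predictors.
r_squared = float(R_iy @ B)

print(np.round(B, 3), round(r_squared, 3))
```

With these rounded inputs the coefficients come out near .17 and .26 with R² near .147, close to the .16, .27, and .149 reported from the more precise matrix.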

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

Practical Matters: The Interplay between Meta-Analytic Models and Theory

As with any data-analytic approach, meta-analytic techniques are most valu­able when applied in the service of theories relevant to the content of your review. I place this discussion on the interplay between meta-analysis and theory in this chapter on multivariate meta-analysis because many of our theories are multivariate and therefore benefit from multivariate analyses. However, consideration of theory is important for any meta-analysis— univariate or multivariate—just as it is for any form of data analysis in pri­mary research.

A full philosophical consideration of what constitutes a “theory” lies far beyond the scope of this book. Instead, I next frame my discussion of the interplay between theories and meta-analytic results in terms of the metaphor of a “nomological net” (called a “nomological network” by Cronbach & Meehl, 1955). In this metaphor, the knots of the net represent constructs, and the webbing among the knots represents associations among the constructs. The coverage of the net represents the scope of the theory in terms of the phenomena the theory attempts to explain. Theory specifies expectations for this net in terms of what the knots are (i.e., what constructs are relevant); the webbing among the knots (i.e., what directions and magnitudes of associations among the constructs are expected); and the coverage of the net (i.e., what, when, and for whom the theory is applicable). Different theories may specify nets that differ in terms of their knots, webbing, and coverage; in fact, potentially infinite nets (theories) could be specified.14 Thus, theory informs your meta-analysis in the very fundamental ways of specifying the constructs you consider (i.e., your definition of constructs of interest), the associations you investigate (i.e., the effect sizes you meta-analyze), and the scope (i.e., breadth of samples and designs included) you include in your meta-analysis (i.e., the inclusion criteria; see Chapter 3).15

Having described how theory guides your meta-analysis, I next turn to how your meta-analysis can evaluate theories. I organize this consideration around the three pieces of the nomological net metaphor: constructs (knots), associations (webbing), and scope (coverage). Following this consideration of how meta-analysis can evaluate theories, I then turn to the topic of model evaluation and building with multivariate meta-analysis.

1. Evaluating Variables and Constructs to Inform Concepts

It is useful to consider the indirect way by which theories inform measurement in science (for more in-depth treatments, see, e.g., Britt, 1997; Jaccard & Jacoby, 2010). When theories describe things, the things that they describe are concepts. Concepts are the most abstract representation of something: the ideas we hold in our minds that a thing exists. For example, any layperson will have a concept of what aggression is. Well-articulated theories go further than abstract concepts to articulate constructs, which are more specifically defined instances of the concept. For example, an aggression scholar might define the construct of aggression “as behavior that is aimed at harming or injuring another person or persons” (Parke & Slaby, 1983, p. 550). Such a definition of a construct is explicit in terms of what lies within and outside of the boundaries (e.g., an accident that injures someone is not aggression because that was not the “aim”). Constructs might be hierarchically organized; for instance, the construct of “aggression” might encompass more specific constructs such as “relational aggression” and “overt aggression” such as I consider in the illustrative example of this chapter. Theories may differ in terms of whether they focus on separable lower-order constructs (within the nomological net metaphor: multiple knots) or singular higher-order constructs (a single, larger knot in the net).

Despite their specificity, constructs cannot be directly studied. Instead, a primary research study must use variables, which are rules for assigning numbers that we think reasonably capture the level of the construct. These variables might be single items (e.g., frequency of punching) or the aggrega­tion of multiple items (frequency of punching, calling names, and spreading rumors). They may have either meaningful (e.g., number of times observed in a week) or arbitrary (e.g., a 5-point Likert-type scale) metrics. They may have different levels of measurement, ranging from continuous (e.g., number of times a child is observed enacting aggression), to ordinal (e.g., a child’s aver­age score among multiple Likert-type items), to dichotomous (e.g., the pres­ence versus absence of a field note recording a child’s aggression). Regard­less, variables are the researcher’s rule-bound system of assigning values to represent constructs. However, there are an infinite number of variables (i.e., ways of assigning values) that could represent a construct, and every primary study will need to select a limited subset of these variables.

Meta-analysis is a powerful tool to evaluate variables and constructs to inform theoretical concepts. As mentioned, any single primary study must select a limited subset of variables; however, the collection of studies likely contains a wider range of variables. Meta-analytic combination of these mul­tiple studies—each containing a subset of variables representing the con­struct—will provide a more comprehensive statement of the construct itself. This is especially true if (1) the individual studies use a small subset of vari­ables, but the collection of studies contains many subsets with low overlap so as to provide coverage of many ways to measure the construct; and (2) you correct for artifacts so as to eliminate less interesting heterogeneity across methods of measurement (e.g., correcting for unreliability). Tests of modera­tion across approaches to measuring variables can also inform whether some approaches are better representations of the construct than others.

Furthermore, meta-analysis can clarify the hierarchical relations among constructs by informing the magnitude of association among constructs that might be theoretically separable (or not). For example, I provided the example of a hierarchical organization of the construct of aggression, which might be separated into relational and overt forms (i.e., two lower-order constructs) on theoretical grounds. Meta-analysis can inform whether the constructs are indeed separate by combining correlations from studies containing variables representing these constructs. If the correlation is not different from 1.0 (or −1.0 for constructs that might be conceptualized as opposite ends of a single continuum), then differentiation of the constructs is not supported; however, if the confidence intervals of the correlation do not include ±1.0, then this is evidence supporting their differentiation.16 For instance, in the full, artifact-corrected meta-analysis of 98 studies reporting associations between relational and overt aggression (this differs from the limited illustrative example above; see Card et al., 2008), we found an average correlation of .76 with a 95% confidence interval ranging from .72 to .79, supporting the separate nature of these two constructs.

2. Evaluating Associations

As I mentioned in Chapter 5, the most common effect sizes used in meta-analyses are two-variable associations, which can be considered between two continuous variables (e.g., r), between a dichotomous grouping variable and a continuous variable (e.g., g), or between two dichotomous variables (e.g., o). These associations represent the webbing of the nomological net.

If well-articulated, theories should offer hypotheses about the presence, direction, and strength of various associations among constructs. These hypotheses can directly be tested in a meta-analysis by combining all available empirical evidence. Meta-analytic synthesis provides an authoritative (in that it includes all available empirical evidence) and usually precise (if a large number of studies or studies with large samples are included) estimate of the presence, direction, and magnitude of these associations, and thus plays a key role in evaluating hypothesized associations derived from a theory. If you correct for artifacts (see Chapter 6), then it is possible to summarize and evaluate associations among constructs, which are more closely linked to theoretically derived hypotheses than potentially imperfectly measured variables, as I described earlier.

A focus on associations can also help inform the structure of constructs specified by theories. I described in the previous section how meta-analysis can be used to evaluate whether lower-order constructs can be separated (i.e., the correlation between them is smaller than ±1.0). Meta-analysis can also tell us if it is useful to separate constructs by evaluating whether they dif­ferentially relate to other constructs. If there is no evidence supporting dif­ferential relations to relevant constructs,17 then the separation is not useful even if it is possible (i.e., even if the correlation between the constructs is not ±1.0), whereas differential associations would indicate that the separation of the constructs is both possible and useful. In the meta-analysis of relational and overt aggression, my colleagues and I evaluated associations with six constructs, finding differential relations for each and thus supporting the usefulness of separating these constructs.

Most meta-analyses will only evaluate one or a small number of these hypotheses. Because most useful theories will specify numerous associations (typically more than could be evaluated in a single meta-analysis), a single meta-analysis is unlikely to definitively confirm or refute a theory. Through many separate meta-analyses evaluating different sections of the webbing of the net, however, meta-analysis provides a cumulative approach to gathering evidence for or against a theory.

3. Evaluating Scope

In the metaphor of the nomological net, the coverage (size and location) of the net represents the scope of phenomena the theory attempts to explain. As I mentioned in Section 12.3.2, a series of meta-analyses can inform empirical support for a theory across this scope, thus showing which sections of the net are sound versus in need of repair.

Meta-analysis can also inform the scope of a theory through moderator analyses. As you recall from Chapter 9, moderator analyses tell us whether the strength, presence, or even direction of associations differs across differ­ent types of samples and methodologies used by studies. Theories predicting universal associations would lead to expectations that associations (i.e., the webbing in the net) are consistent across a wide sampling or methodological scope, and therefore moderation is not expected.18 If moderation is found through meta-analysis, then the theory might need to be limited or modified to account for this nuance in scope. In contrast, some theories explicitly pre­dict changes in associations.19 Evaluating moderation within a meta-analysis, in which studies may vary more in their sample or methodological features than is often possible in a single study, provides a powerful evaluation of the scope of theories. However, you should still be aware of the samples and methodologies represented among the studies of your meta-analysis in order to accurately describe the scope that you can evaluate versus that which is still uncertain.

4. Model Building and Evaluation

Perhaps the most powerful approach to comparing competing theories is to evaluate multivariate models predicted by these theories. Models are portray­als of how multiple constructs relate to one another in often complex ways. Within the metaphor of the nomological net, associations can be said to be small pieces consisting of a piece of webbing between two knots, whereas models are larger pieces of the net (though usually still just a piece of the net) consisting of several knots and the webbing among them. Because virtu­ally all contemporary theorists have knowledge of a similar body of existing empirical research, different theories will often agree on the presence, direc­tion, and approximate magnitude of a single association.20 However, theories often disagree as to the relative importance or proximity of causation among the constructs.

These disagreements can often be explicated as competing models, which can then be empirically tested. After specifying these competing models, you then use the methods of multivariate meta-analysis to synthesize the available evidence as sufficient data to fit these competing models (as described earlier in this chapter). Within these models, it is possible to compare relative strengths of association to evaluate which constructs are stronger predictors of others and to pit competing mediational models against one another to evaluate which constructs are more proximal predictors than others. Such model comparisons can empirically evaluate the predictions of competing theories, thus providing relative support for one or another. However, you should also keep in mind that your goal might be less about supporting one theory over the other than about reconciling discrepancies. Toward this goal, meta-analytic moderator analyses can be used to evaluate under what conditions (of samples, methodology, or time) the models derived from each theory are supported. Such conclusions would serve the function of integrating the competing theories into a broader, more encompassing theory.

In the structural equation modeling literature, it is well known that a large number of equivalent models can fit the data equally well (e.g., MacCallum, Wegener, Uchino, & Fabrigar, 1993). In other words, you can evaluate the extent to which a particular model explains the meta-analytically derived associations, and even compare multiple models in this regard, but you cannot conclude that this is the only model that explains the associations. Because multivariate meta-analytic synthesis provides a rich set of associations among multiple constructs (perhaps a set not available in any one of the primary studies), these data can be a valuable tool in evaluating alternate models. Although I discourage entirely exploratory data mining, it is useful to explore alternate models that are plausible even if not theoretically derived (as long as you are transparent about the exploratory nature of this endeavor). Such efforts have the potential to yield unexpected models that might suggest new theories. In this regard, meta-analysis is not limited to only evaluating existing theories, but can serve as the beginning of an inductive theory to be evaluated in future research.

Dimensions of Literature Reviews, Revisited in writing Meta-Analytic Results

Before I turn to specific recommendations for writing the results of your meta-analysis, it is important for you to recognize that there is no single “right” way to write these results. As I described in Chapter 1 (see also Coo­per, 1988), literature reviews vary along several dimensions. Before you begin to write the results of your meta-analysis, you should have a clear under­standing of the goals, organization, and audience for this report.

1. Goals of Meta-Analysis

You began the meta-analysis with some goal in mind, and it is important that you keep this goal in mind as you write your report. As I described in Chap­ter 2 (see also Cooper, 1988), the goal of conducting a meta-analytic review (indeed, most literature reviews) is usually that of integration. However, this general goal of integration entails at least two subgoals (see Cooper, 1988).

One aspect of integration is generalization from specific instances. For example, the example meta-analysis I have described throughout this book (involving the association between relational aggression and peer rejection) relied on a number of studies, each specific in terms of age of the sample, method of measuring relational aggression, and a number of other features. By combining results (Chapters 8 and 10) across these specific instances (studies), it is possible to make statements that are more generalized, albeit within the bounds of the population defined by the studies represented in the meta-analysis. This generalization is not made uncritically, however. Through the comparison of studies (i.e., moderator analyses; Chapter 9) that differ on conceptually relevant characteristics, it is possible to empirically evaluate where findings can (absence of moderation) and cannot (presence of moderation) be generalized.

A second aspect of integration involves the resolution of conflicting findings or conclusions. Often, conflicting conclusions come from only seemingly conflicting findings from the Null Hypothesis Significance Testing (NHST) Framework, as I illustrated in Chapter 5. In these cases, meta-analysis, which focuses on effect sizes across studies rather than conclusions regarding statistical significance, typically provides considerable clarity. In other cases, conflicting findings (and resulting conflicting conclusions) might not really be conflicting, but simply due to sampling fluctuations. Here, formal tests of heterogeneity of effect sizes (Chapter 8) will provide clearer conclusions about whether findings are truly conflicting. Finally, results might truly be conflicting (effect sizes are heterogeneous); here, meta-analytic results still have much to offer. One approach would be to accept this conflicting evidence (i.e., heterogeneous effect sizes), yet still offer the best generalizable answer through random-effects models (Chapter 10). Alternatively, you might use meta-analytic approaches to go beyond the existence of conflicting findings (i.e., reporting the random-effects mean) to evaluate the sources of conflicting findings through moderator analyses (Chapter 9).1

Although the goal of your meta-analysis likely involves one or both of these aspects of integration, this does not have to be your only goal in writing the results of your review. Other goals of literature reviews include (1) critiquing the body of research that you have reviewed and (2) identifying key directions for future conceptual, methodological, and empirical work (see Chapter 2 and Cooper, 1988). Although neither of these goals is directly met by the techniques of meta-analysis, they are goals that you, the author (and the person who has just carefully studied the available literature), can certainly address in your writing.

2. Organization of the Meta-Analysis

The results of simple meta-analyses (i.e., those reporting only mean effect sizes and a limited number of moderator analyses from a single sample of studies) have less flexibility as to how they can be organized. However, more complex meta-analytic reviews (i.e., those with many moderator analyses or those comprised of several discrete meta-analyses of different samples of empirical literature) can be organized in various ways. Cooper (1988) stated that literature reviews are commonly organized in three ways: historically (i.e., studies reviewing the progress of a field of study across time), conceptually (i.e., studies addressing a common idea or question are organized together), or methodologically (i.e., studies with similar methodological or measurement approaches are organized together). Although each of these organizational approaches is an option, you are most likely to organize the results of your meta-analytic review either conceptually or methodologically. To illustrate a conceptual organization, the manuscript containing the example meta-analytic review I have used throughout this book (Card et al., 2008) reported results of eight separate meta-analyses: one meta-analysis investigating gender differences in relational aggression, a second meta-analysis investigating the association of relational aggression with overt forms of aggression, and six smaller meta-analyses investigating associations of relational aggression with six distinct adjustment correlates. To illustrate a methodological organization, a meta-analysis might separately report results of concurrent naturalistic, longitudinal naturalistic, and experimental studies of a particular effect.

3. Audience for the Meta-Analysis

Given that I have characterized the writing of your meta-analysis as “pre­senting the results to the world,” it makes sense that you would want to have in mind who is in that world—in other words, your intended audience. The potential audience for meta-analyses varies in terms of both their knowledge of the topic you have focused on and their familiarity with meta-analytic tech­niques. Scientists specializing in the area of your review are likely familiar with the terminology and theoretical perspectives, so they typically need less introduction and guidance in these areas (though you should not neglect this entirely). However, they may be unfamiliar with meta-analytic techniques, depending on the prevalence of meta-analyses in your particular field. Scien­tists outside of your specialized area will need more introduction to the topic area and may or may not be familiar with meta-analytic techniques. Practi­tioners, policymakers, and educated laypeople will almost universally need more didactic explanation of your topic and meta-analytic techniques.

Complicating matters even further, it is likely that your presentation will reach multiple audiences. If you decide that the only readers you care to inform are specialists in your field who are familiar with meta-analysis—and you write your report only for this audience—you should realize that you are probably targeting a very small audience, and the likelihood that your report will be published in a widely read outlet is small. Even if you decide to target a broader range of scientists within your field, you should recognize that others (e.g., educators, practitioners, policymakers) may read your report. Therefore, you are diminishing the potential impact of your review if it is not accessible to a broader audience of readers.

Conversely, you should be aware that some of the details that can be confusing and intimidating to readers unfamiliar with meta-analysis would be the very details that some readers (those very familiar with meta-analysis) will expect to see. The challenge, then, is to effect a balance between (1) providing enough technical details for content experts familiar with meta­analysis to evaluate your work, versus (2) not overwhelming other readers with too much technical detail. Although this can be a difficult line to walk, and it is likely that you cannot make 100% of readers 100% happy, I do think the following principles can help achieve this balance.

First, ask yourself what you find more discouraging when you read a report: (1) when you simply cannot understand what the authors have done, or (2) when the authors provide what seems to be excessive detail of what they have done. My own reaction, and I suspect the reaction of many of you, is that it is better to be bored by too much detail than confused by too little. Following this principle, my suggestion is that it is better to report a potentially important piece of information than to omit it.

My recommendation that you err on the side of reporting too much rather than too little comes with a corollary: You do not have to report everything in the narrative text of your manuscript. Depending on the editorial style of your publication outlet, it may be preferable to place some details in tables, footnotes, appendices, or supplemental online documents. Doing so allows interested readers to evaluate these details, but does not distract attention for other readers. If space restrictions at your publication outlet preclude these options, then noting that full results are available upon request (and then providing them upon request) is an option.

My third recommendation is to write at multiple levels. What I mean by writing at multiple levels is that your text has pieces that make it understandable to audiences with a broad range of background in your topic and in meta-analysis. You accomplish this by providing a clear, jargon-free statement that is understandable to a broad audience in tandem with more technical details. For example, technical details can be placed in parentheses, as in the following: “Associations between relational aggression and peer rejection are stronger among studies using peer reports of relational aggression than those using observations (mean rs = .34 versus .09, χ²(df = 1) = 21.05, p < .001).” Similarly, you might ensure that each paragraph containing technical information consists of (1) a clear first sentence of what you evaluated, (2) one or more sentences reporting the detailed (technical) results, and (3) a clear final sentence or two stating what you found in jargon-free terms. I do not intend these to be absolute rules; rather they are my own suggestions for accomplishing the difficult task of writing at multiple levels.

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

What to Report and Where to Report It

In this section, I discuss the basic structural sections of a manuscript and special considerations in reporting meta-analytic results within these sec­tions. Two caveats are in order here. First, I expect that you are aware of the ways that manuscripts (whether primary studies or meta-analyses) are struc­tured within your field, in terms of what the goals of each section are, expec­tations about typical length, and writing conventions (e.g., as described in the American Psychological Association, 2009, Publication Manual). Second, I want to point out that in many ways, reports of meta-analyses are not differ­ent from reports of primary research. Your goal is still to provide an empiri­cally grounded exposition that adds meaningful knowledge to your field, and the manuscript reporting your meta-analysis should make this exposition in a similar way as you would when reporting results of a primary research study.

I next outline sections of a manuscript following a structure commonly found in social science research reports: the title, introduction, method, results, discussion, references, and appendices sections. Even if your field typically uses a different structure for reporting empirical findings, I believe that these suggestions will still be useful to consider and adapt to the report­ing practices in your field.

1. Title

As with any manuscript, the title of your meta-analysis should be an accurate and concise statement of your research goals, questions, or findings. Your title should therefore reflect the substantive focus of your review, which is defined by the constructs comprising the effect sizes included in your meta-analysis. I think it is also preferable to indicate that your manuscript is a meta-analysis (or to use a similar term such as “meta-analytic review,” “quantitative review,” or “quantitative research synthesis”; see Chapter 1). Clearly denoting this is likely to draw the reader’s attention.

2. Introduction

The introduction section of a report of a meta-analysis tries to accomplish the same goals as the introduction of any empirical paper: to provide a background in theory, methods, prior findings, or unresolved questions that orients readers to the goals, research questions, or hypotheses of your meta-analytic review. In presenting this case for a meta-analytic review, it is important to provide support for all aspects of your study selection and analyses. In terms of study selection, your introduction should make a clear case for why the population of studies that you defined in your meta-analysis (in terms of sample, measurement, design, and source characteristics) is important to study. Similarly, your introduction should provide a rationale for all of the analyses you report in the results section. For instance, providing evidence for a range of research findings could be useful in building the case for the uncertainty of typical findings and the need to combine these results in a meta-analysis to obtain a clearer understanding of these typical findings. If there is considerable variability in findings, as noted by previous scholars in your field and later in the findings of significant heterogeneity in your meta-analysis, then this is often motivation to perform moderator analyses (though see Chapters 8 and 9 for cautions). Of course, when you planned your meta-analysis, you made decisions about what study characteristics to code and eventually consider as moderators; you should describe the conceptual rationale for these potential moderators in your manuscript to ground and support the decisions to evaluate these moderators. In short, every decision you made in defining a population of studies and choosing analyses should be supported with a rationale in the introduction section of your manuscript.

3. Method

The method section of your manuscript is where reporting practices become somewhat unique for meta-analyses versus primary research. Nevertheless, the same goals apply: to explain your research process in explicit enough detail that a reader fully understands what you have done to the point where he or she could, in principle, perfectly replicate your study (meta-analysis) based solely on what you have written. Next, I describe four general aspects of your methodology that you should report.

3.1. Literature Search Procedures

As I described in Chapter 3, the quality of a meta-analysis is substantially impacted by the extent to which the included studies adequately represent the population about which you wish to draw conclusions. The adequacy of this representation is in turn determined by the quality of your litera­ture search. For this reason, it is important to explicitly describe your lit­erature search procedures. For example, if you used electronic databases as one search strategy (and virtually every modern meta-analysis will), then it is important to detail the databases searched, the key words used (including wildcard characters), any logical operations (e.g., “and,” “or”), and the date of your last searches of these databases. You should provide similarly detailed descriptions of other search strategies (e.g., journals or conference programs searched and time span considered). Of course, it is preferable to provide brief rationales for these searches (e.g., “In order to identify unpublished studies . . . ”) rather than merely list your search strategies.

3.2. Study Inclusion and Exclusion Criteria

I mentioned in the previous subsection that the quality of a meta-analysis is impacted by whether the studies represent a population. This statement implies that the reader needs to have a clear idea of what the population is, which is defined by the inclusion and exclusion criteria you have speci­fied. Therefore, it is critical that you clearly state your inclusion criteria that define the population of interest, as well as exclusion criteria that delineate the outer boundaries of what your population does not include. In Chapter 3, I suggested that, before searching the literature, you specify a set of inclu­sion and exclusion criteria. I also indicated that these criteria may need to be modified as you search the literature and begin coding studies as unexpected situations arise. In the method section of your report, you should fully detail these inclusion and exclusion criteria, specifying which criteria you speci­fied a priori (before searching and coding) and which you specified post hoc (while searching and coding). I note here that these inclusion and exclu­sion criteria explicate the intended sampling frame of your meta-analysis (see Chapter 3); it will also be important to address how well the studies actually covered this sampling frame in the results section (see Section 13.2.4.a).

3.3. Coding of Study Characteristics and Effect Sizes

As you know by this point in your efforts, many decisions must be made while coding the studies that comprise your meta-analysis. It is important that you fully describe this coding process for readers. Three general aspects of the coding process that you should describe are the coding of study char­acteristics, the coding of effect sizes, and evidence of the reliability of your coding decisions.

As I described in Chapter 4, you could potentially code for a wide range of study characteristics in your meta-analysis. Whereas you have (or should have) provided a rationale for these study characteristics in the introduction section, here in the method section your task is to explicitly operationalize the characteristics you have coded. At a minimum, you should list the characteristics you coded, defining each term as necessary given the background of your audience and defining each of the possible values for each characteristic. For some characteristics (usually the “low-inference codes”; Chapter 4, Cooper, 2009a), this description can be very brief. For example, in describing “age” in the example meta-analysis I have described throughout this book, I might write “Age was coded as the mean age in years of the sample.” For other characteristics (especially “high-inference codes”; Chapter 4, Cooper, 2009a), the description may need to be considerably more extensive. For example, in describing the study characteristic “source of information” in this meta-analysis, it might (depending on the audience’s familiarity with these measurement practices) be necessary for me to write a sentence or two for each of the possible codes (e.g., “Self-reports were defined as any scale in which the child provided information about his or her own frequency of relational aggression, including paper-and-pencil questionnaires, responses to online surveys, and individual interviews”). Coding of even higher inference characteristics, such as “study quality” (see Chapter 4), might require multiple paragraphs. With many coded study characteristics, especially those requiring extensive descriptions, full description of all of these characteristics could take considerable space. Depending on the audience’s knowledge of your field and the space available in your publication outlet, it may be useful to present some of these details in a table or an appendix, or make them available upon request. However, the suggestion I offered earlier might be useful: When in doubt, err on the side of reporting too much rather than too little.

You should also describe your coding of effect sizes (Chapter 5) and any artifact corrections you perform (Chapter 6). In terms of describing your coding of effect sizes, you should be sure to answer three key questions. First, how do the signs of the effect size represent directions of results? For instance, in a meta-analysis of gender differences, it is important to specify whether positive effect sizes denote females or males scoring higher. Second, what effect size did you use and why? If you used a standard effect size (i.e., r, g or d, or the odds ratio, o), then it is usually sufficient to just state this (though you should keep the audience in mind). However, if you use an advanced or unique effect size (Chapter 7), you will usually need to further justify and describe this effect size. The third question you should be sure to answer is: How did you manage the various methods of reporting effects in the literature to obtain a common effect size? If you are writing to an audience that is somewhat familiar with meta-analysis, you can likely refer them to an external source (such as this book) for details of most computations. However, you should be especially clear about how you handled situations in which studies provided inadequate information. For example, did you assume the lower-bound effect size for studies reporting only that an effect was significant, and did you assume effect sizes of zero (or 1 for odds ratios) for studies reporting that an effect was nonsignificant? In these latter cases, it may be useful to report the percentage of effect sizes for which you made such assumptions to give the reader a sense of the potential biasing effects.

Finally, you should provide evidence of the reliability of your coding, following the guidelines I offered in Chapter 4. Specifically, report how you determined reliability (intercoder and/or intracoder; number of studies dou­bly coded), and the results of these reliability evaluations. If reliabilities of coding decisions were very consistent across codes (i.e., various study char­acteristics and effect sizes), then it is acceptable to report a range; however, if there was variability, you should report reliabilities for each of your codes separately. If initial reliability estimates were poor and led to modification of your coding protocol, you should transparently report this fact. Finally, you should offer some evaluation of whether or not you believe the reliability of coding was adequate (if it was not, then it will be useful to address this limi­tation in the discussion section of your report).
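As an illustration of the kind of intercoder reliability evidence you might report, the following is a minimal sketch in Python. It assumes two coders independently assigned a categorical code (here, “source of information”) to the same doubly coded studies. The agreement rate and Cohen’s kappa are common choices but are my assumption here, not necessarily the indices recommended in Chapter 4, and the codes shown are hypothetical.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Proportion of doubly coded studies on which the two coders agree."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two equal-length, non-empty lists of codes")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)


def cohens_kappa(coder_a, coder_b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    p_observed = percent_agreement(coder_a, coder_b)
    n = len(coder_a)
    counts_a = Counter(coder_a)
    counts_b = Counter(coder_b)
    # Chance agreement: product of each coder's marginal proportions per code.
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_chance == 1.0:
        return float("nan")  # undefined when both coders used one code only
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical codes for eight doubly coded studies (illustration only).
coder_a = ["self", "peer", "peer", "teacher",
           "observation", "peer", "self", "teacher"]
coder_b = ["self", "peer", "teacher", "teacher",
           "observation", "peer", "self", "peer"]

agreement = percent_agreement(coder_a, coder_b)
kappa = cohens_kappa(coder_a, coder_b)
print(f"Studies doubly coded: {len(coder_a)}")
print(f"Agreement rate: {agreement:.2f}; Cohen's kappa: {kappa:.2f}")
```

Running the same functions separately for each coded characteristic would let you decide whether a single range suffices or whether reliabilities need to be reported code by code.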

3.4. Data-Analytic Strategy

Because meta-analytic techniques are unfamiliar to many readers in many fields, and because there are differences in analytic practices among differ­ent meta-analysts, it is important that you clearly state your data-analytic strategies. If extensive description is needed, I prefer to describe these strat­egies as a distinct subsection of the manuscript, usually at the end of the method section, but sometimes at the beginning of the results section (you should read some articles in your field that use meta-analytic techniques, or other advanced techniques that require description, to see where this mate­rial is typically placed). Alternatively, if you can adequately describe your techniques concisely, and many readers in your field are at least somewhat familiar with meta-analysis, then you might decide to omit this section and instead provide these details throughout the results section before you pre­sent the results of each analysis.

There are at least five key elements of your data-analytic strategy that you should specify. First, you should describe how you managed multiple effect sizes from studies (see Chapter 8). Second, you should specify which weights you used for studies in your meta-analysis (e.g., inverse squared standard errors; Chapter 8). If your audience is entirely unfamiliar with meta-analysis, you might also provide justification for these weights (see Chapter 8). Third, you should describe the process of analyzing the central tendencies of effect sizes. For instance, did you base your decision to use a fixed- versus random-effects model on the results of an initial heterogeneity test, or did you make an a priori decision to use one or the other (see Chapter 10)? Fourth, you should describe your process and method of moderator analyses. Specifically, you should describe (1) whether your decision to pursue moderator analyses was guided by initial findings of heterogeneity; (2) the order in which you evaluated multiple moderators (e.g., one at a time, all at once, or some conceptually based sequence); (3) if you followed a sequence of moderator analyses, whether you used residual heterogeneity tests along the way to decide to continue or to stop; and (4) what approach to moderator analysis you used (e.g., ANOVA- or regression-based). Finally, you should make clear how you evaluated potential publication bias (see Chapter 11).

4. Results

As you might expect, the results section of the report contains some information that differs from that in the results section of a primary study. At the same time, the underlying goal is the same in both: to accurately and clearly report the findings of your analyses so as to illuminate the research questions/hypotheses that motivated the study/meta-analysis. In this section, I describe four pieces of information that will generally be present in your results. I do not necessarily intend to suggest how you should organize your results section; for a single, relatively simple meta-analysis, this organization might be useful, but for a more complex meta-analysis or a review with several meta-analyses, you will likely follow a more conceptual or methodological organization as I described earlier.

4.1. Descriptive Information

An important set of results, yet one that is often overlooked, is simply the description of the sample of studies that comprised your meta-analytic review. This information can often be summarized in a table, but the impor­tance of this information merits at least a paragraph, if not an entire sub­section, near the beginning of your results section. If your report includes multiple meta-analyses, it might be useful to report this descriptive infor­mation for both the overall collection of studies (i.e., all studies included in any of your meta-analyses) and the subsets of studies that comprised each meta-analysis.

Necessary descriptive information to report includes the number of studies (usually denoted by k), as well as the total number of participants in these studies (N, which is the sum of the Ns across the studies). I also strongly advise that you report the number of studies at different levels of coded study characteristics used in moderator analyses. For categorical char­acteristics, this is simply the number of studies with each value, whereas for continuous characteristics, you might report the means, standard deviations, and ranges. If your initial coding protocol included study characteristics that you ultimately did not use as moderators because of a lack of variability in values across studies, I suggest also reporting this information.
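The descriptive quantities just listed can be tabulated directly from your coding sheet. The following is a minimal sketch in Python, assuming each study is stored as a record with a sample size, one categorical characteristic, and one continuous characteristic. The field names and all values are hypothetical and serve only to show the computation.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical coded studies (illustration only, not real data).
studies = [
    {"id": "Study A", "n": 120, "reporter": "peer", "age": 9.5},
    {"id": "Study B", "n": 85, "reporter": "self", "age": 14.0},
    {"id": "Study C", "n": 240, "reporter": "peer", "age": 11.2},
    {"id": "Study D", "n": 60, "reporter": "observation", "age": 5.1},
    {"id": "Study E", "n": 310, "reporter": "teacher", "age": 8.0},
]


def describe_studies(records):
    """Return k, total N, category counts, and summaries of a continuous code."""
    ages = [s["age"] for s in records]
    return {
        "k": len(records),                       # number of studies
        "total_n": sum(s["n"] for s in records), # sum of Ns across studies
        "reporter_counts": Counter(s["reporter"] for s in records),
        "age_mean": mean(ages),
        "age_sd": stdev(ages),                   # sample standard deviation
        "age_range": (min(ages), max(ages)),
    }


summary = describe_studies(studies)
print(f"k = {summary['k']}, N = {summary['total_n']}")
for level, count in sorted(summary["reporter_counts"].items()):
    print(f"  reporter = {level}: {count} studies")
print(f"age: M = {summary['age_mean']:.2f}, SD = {summary['age_sd']:.2f}, "
      f"range = {summary['age_range'][0]} to {summary['age_range'][1]}")
```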

In addition to reporting this descriptive information, it is worth writ­ing some comments about these data, as they describe both the sample for your meta-analysis and the state of the empirical literature in your field. For instance, it is useful to note if some values of your moderators are under­represented in the existing literature (e.g., few studies have sampled certain types of individuals, few studies have used a particular methodology), or if certain combinations of moderators (e.g., particular methodologies with certain types of individuals) are underrepresented. It is also useful to com­ment on study characteristics that did not vary, and potentially to discuss the implications of this homogeneity in the discussion. In short, it is useful to describe the nature of the sample of studies (and by implication, the field of your meta-analysis), and to point out the sampling, measurement, and meth­odological strengths and shortcomings of this body of research.

4.2. Central Tendencies and Heterogeneity

Turning to the analytic results, most reports describe the results of central tendency and heterogeneity tests before the results of moderator analyses. Regarding central tendency, or (usually) mean effect sizes, you should clearly state whether the mean was obtained through fixed- or random-effects models, the standard error of this mean effect size, and the (typically 95%) confidence interval of this mean. Although the confidence interval generally suffices for significance testing, you might also choose to report the statistical significance of this effect size. In reporting these results, be sure to provide “words” that help readers make sense of the “numbers.” Put differently, avoid simply listing means, confidence intervals, and the like; instead, provide narrative descriptions of them. For instance, it might be useful to some readers to have the direction of association described (e.g., to interpret a positive mean correlation: “Higher levels of relational aggression are associated with higher peer rejection”), and it is usually useful to characterize the magnitude of effect sizes according to standards in your field or else commonly applied guidelines (e.g., Cohen’s, 1969, characterization of rs ≈ ±.10, ±.30, and ±.50 as small, medium, and large, respectively).

In addition to the mean effect size, it is important to describe the heterogeneity of effect sizes to give readers a sense of the consistency versus variability as well as the range of findings. Although you will almost certainly report the results of the heterogeneity test, the Q statistic described in Chapter 8 (Section 8.4), you should bear in mind the limits of this statistic given that it is a statistical significance test (i.e., it can have very high or low statistical power). For this reason, it may be useful to supplement reporting of the Q statistic with a description of the magnitude of heterogeneity. One possibility is to describe this magnitude quantitatively by reporting the I² index. Another possibility is to display the heterogeneity visually using one of the figures I describe in Section 13.3. With either approach, it is important to describe (again, using words) this homogeneity or heterogeneity, and how this information was used in decisions regarding other analyses (e.g., to use random-effects models, to perform moderator analyses).
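To make concrete the quantities this subsection asks you to report, the following is a minimal sketch in Python of a fixed-effects summary: the weighted mean effect size, its standard error, a 95% confidence interval, the Q statistic with its degrees of freedom, and I². It assumes inverse squared standard errors as weights (as in Chapter 8) and the common definition I² = 100 × (Q − df) / Q, truncated at zero. The effect sizes and standard errors are hypothetical, and a random-effects analysis would require different weights.

```python
import math


def fixed_effects_summary(effect_sizes, standard_errors):
    """Fixed-effects mean, SE, 95% CI, Q (with df), and I-squared."""
    if len(effect_sizes) != len(standard_errors) or len(effect_sizes) < 2:
        raise ValueError("need at least two studies with matching lists")
    if any(se <= 0 for se in standard_errors):
        raise ValueError("standard errors must be positive")

    weights = [1.0 / se ** 2 for se in standard_errors]  # w = 1 / SE^2
    sum_w = sum(weights)
    mean_es = sum(w * es for w, es in zip(weights, effect_sizes)) / sum_w
    se_mean = math.sqrt(1.0 / sum_w)
    ci = (mean_es - 1.96 * se_mean, mean_es + 1.96 * se_mean)

    # Q: weighted sum of squared deviations from the mean effect size,
    # evaluated against a chi-square distribution with k - 1 df.
    q = sum(w * (es - mean_es) ** 2 for w, es in zip(weights, effect_sizes))
    df = len(effect_sizes) - 1
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    return {"mean": mean_es, "se": se_mean, "ci95": ci,
            "Q": q, "df": df, "I2": i_squared}


# Hypothetical effect sizes and standard errors (illustration only).
result = fixed_effects_summary([0.35, 0.10, 0.42, 0.28],
                               [0.08, 0.12, 0.10, 0.06])
print(f"Mean ES = {result['mean']:.3f} (SE = {result['se']:.3f}), "
      f"95% CI [{result['ci95'][0]:.3f}, {result['ci95'][1]:.3f}]")
print(f"Q({result['df']}) = {result['Q']:.2f}, I2 = {result['I2']:.1f}%")
```

Whatever software produces these numbers, the narrative should still state in words what the mean and the degree of heterogeneity imply.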

4.3. Moderator Analyses

If moderator analyses are conducted in your meta-analysis (and most meta-analyses will involve some moderator analyses), then it is important to fully report these results. Specifically, you should report the Q statistic, degrees of freedom, and significance level for each moderator analysis you perform (whether performed within an ANOVA or a regression framework; see Chapter 9). It is also common to report the within-group or residual heterogeneity (Q) remaining after accounting for this moderator or set of moderators. For categorical moderators with more than two levels, it is also necessary to report results of follow-up comparisons (see Chapter 9).

You should not stop at reporting only the significance tests of your mod­erator analyses; it is also important to report the numbers of studies and the typical effect sizes at various levels of the moderators. For a single categorical moderator this is straightforward: You simply report the numbers of studies and mean effect sizes within each of the levels of the moderator. For multiple categorical moderators, you should report the numbers of studies and mean effect sizes within the various combinations across the multiple moderator variables. For continuous moderators, it is not advisable to artificially catego­rize the continuous moderator variable and then report information (num­bers of studies and mean effect sizes) within these artificial groups, though this practice is sometimes followed. Instead, I suggest using the intercept and regression coefficient(s) of your regression-based moderator analysis to com­pute predicted effect sizes at different levels of the moderator, and then report these predicted effect sizes across a range of the moderator variable values well-covered by the studies in your meta-analysis. In Chapter 9 (Section 9.2), I presented an example in which effect sizes of the association between rela­tional aggression and peer rejection were predicted by (i.e., moderated by) the mean ages of the samples, and I computed the expected effect sizes for the ages 5, 10, and 15 years (intuitive values that represented the span of most studies in the meta-analysis).
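The predicted effect sizes described above follow directly from the regression equation. The following is a minimal sketch in Python, assuming a single continuous moderator (mean sample age). The intercept and slope are hypothetical values chosen for illustration, not the estimates from Chapter 9. The back-transformation applies only if your moderator analysis was conducted on Fisher’s Z-transformed correlations, which is an assumption here.

```python
import math


def predicted_effect_size(intercept, slope, moderator_value):
    """Predicted effect size at a given moderator value: b0 + b1 * x."""
    return intercept + slope * moderator_value


def fisher_z_to_r(z):
    """Back-transform Fisher's Z to r (only if the analysis used Z)."""
    return math.tanh(z)


# Hypothetical coefficients from a regression-based moderator analysis.
intercept = 0.20   # predicted effect size when age = 0
slope = 0.015      # change in effect size per year of mean sample age

# Report predictions only across moderator values well covered by the studies.
ages = (5, 10, 15)
predictions = {age: predicted_effect_size(intercept, slope, age) for age in ages}

for age, z in predictions.items():
    print(f"age {age:>2}: predicted Zr = {z:.3f}, r = {fisher_z_to_r(z):.3f}")
```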

Before concluding my suggestions for reporting moderator analysis results, I want to remind you of a key threat to moderator analysis in meta-analytic reviews: that the variable you have identified as the moderator is not the “true” moderator, in that it is only associated with or serves as a proxy for the true moderator. If alternative potential moderators are study characteristics that you have coded, then it is important to report results either (1) ruling out these alternative explanations, or (2) showing that the variable you believe is the true moderator is predictive of effect sizes after controlling for the alternative moderator variables (see Section 9.4). You should report these findings in the results section. However, it is also worth considering that you can never definitively determine whether the moderator variable you have identified is the true moderator, or whether it simply serves as a proxy for another, uncoded study characteristic that is the true moderator. This is a limitation that should be considered in the discussion section of your report.

4.4. Diagnostic Analyses

Earlier (Chapters 2, 11) I described the widely known threat to meta-analyses (and all other literature reviews) posed by publication bias. Given that this threat is both widely known and potentially severely biasing to results of a meta-analysis, it is important to report evidence evaluating this threat. Specifically, you should report your efforts (1) to evaluate the presence of this threat, such as moderator analyses, funnel plots, or regression analyses; (2) to show how plausible it is that enough missed literature with zero results could exist to invalidate your conclusions (i.e., various failsafe numbers); and (3) to correct for this potential bias (e.g., trim and fill, weighted selection), detailing the approaches you used (see Chapter 11). After providing all available evidence regarding potential publication bias, you should offer the reader a clear statement of how likely it is that publication bias has affected your findings.

5. Discussion

The discussion section of your report should place the findings of your meta-analytic review in the context of your field. Although it is tempting to let the numbers speak for themselves, do not assume that they speak to the reader. The discussion section likely allows the most liberty in terms of writing (you can think of it as your opportunity to add the “qualitative finesse” that some critics have charged is absent from meta-analyses; see Chapter 2), but you should consider including at least four components in this section. I discuss each of these components next in an order in which they commonly (though not necessarily) appear in discussion sections of meta-analytic reports.

5.1. Review of Findings

Although you should be careful to avoid extensive repetition of results in the discussion section, it is sometimes useful to provide a brief overview of key findings, especially if the results section was long, technical, or complex. It is useful to highlight the findings that you will most extensively discuss in this section, though you should certainly not omit findings that were unex­pected or contradictory to your hypotheses (these are typically important to consider further).

5.2. Explanations and Implications of Findings

You should remember that the main purpose of your meta-analytic review was to answer some research questions, which presumably are important to your field in some way. The majority of your efforts in the discussion sec­tion should be directed to describing how your results provide these answers (when they do) and how these answers increase understanding within your field. For instance, do the findings of your review provide answers that sup­port existing theory, support one theory over another, or suggest the need for refinement of existing theories in your field? Do the answers inform policy or practice in your field?

While providing answers to these questions is useful, you should also recognize the limits to the information provided by the existing research that comprised your review. This recognition can guide where more primary empirical research is needed, and it is important for your review to identify this need. For example, if you could not reach reasonably definitive conclu­sions to some of your research questions due to low statistical power (result­ing from few studies or studies with small sample sizes), then you should state the need for further research to inform this question. Your descriptive summary of study characteristics also speaks to the types of studies that have not been performed (e.g., specific sample characteristics, measurement characteristics, etc., and combinations of these characteristics). Conversely, if you find that a large number of studies (or a number of studies with large samples) using very similar samples, measures, and the like, have been per­formed, and that the results are homogeneous and provide a very precise estimate of this effect size, then it is also valuable to state that more studies of this type are not needed (better that future research invest efforts toward providing new information). In short, I encourage you to remember that you have just spent months carefully studying and meta-analyzing nearly all of the work in the area of your meta-analysis, so you are in a very informed posi­tion to say where the field needs to go; it is a valuable contribution for you to make clear statements that guide these future efforts.

5.3. Limitations

As when you are reporting the results of any empirical study, it is important for you to acknowledge the limitations of your meta-analytic review. Some of these limitations may be the shortcomings of the available empirical basis, and I have already encouraged you to make clear statements of what these limitations are. Other limitations are particular to literature reviews (including meta-analyses), such as the limitations of drawing conclusions about moderator variables and potential publication bias. You should also make clear the limitations on what can be inferred from the types of studies and effect sizes you have included in your meta-analysis. For instance, you should describe the limitations to inferring causality from effect sizes from concurrent naturalistic studies (see Chapter 2). For every limitation you identify, I encourage you to provide a rationale for why this limitation is more or less threatening to your conclusions, and how future research might resolve these issues (this piece of advice is relevant for any research report, not just those using meta-analyses).

5.4. Conclusions

Given the often high impact and broad readership of reports of meta-analyses, it is critical that your text conclude with a clear statement of how your meta-analytic review advances understanding, and why this advancement is important.

6. References

As with any other scholarly report, your meta-analytic review will include a list of references. Although typical practices vary across disciplines, I note two practices that are common in the field of psychology (as described in the American Psychological Association, 2009, Publication Manual) and in many other areas of social science. First, all of the studies included in your meta-analysis should be included in your reference list. Second, the first line of your reference section (after the “Reference” heading but before the first reference) should contain a statement such as “Studies preceded by an asterisk were included in the meta-analysis”; you should then place an asterisk before the reference of each study that was included in your meta-analytic review.

7. Appendices

Different journals have different standards and preferences for material being included in the main body of the text, in appendices printed at the end of the article, or (more recently) in appendices available through the journal’s web­site. Depending on the practices of your targeted journal, however, it might be useful to consider using appendices for some of the lengthier information that is important to report yet not of interest to many readers. For instance, tables summarizing the coding of all studies included in your meta-analysis (see Section 13.3.2) are important because they allow readers to judge the completeness of your review and your coding practices; however, such tables are lengthy and often of peripheral interest to many readers. These tables might ideally be placed in an appendix rather than in the text proper.

Using Tables and Figures in Reporting Meta-Analyses

Tables and figures, if used effectively, can provide a large amount of data in an informative way, as well as reduce the burden of describing all of this information within the text (though you should not omit key findings from the text just because they are also displayed in tables or figures). In this section, I describe some approaches to presenting meta-analytic results in tables and in figures. I supplement my description of each approach by considering how frequently it was used according to a survey by Borman and Grigg (2009) of meta-analyses published from 2000 to 2005.2

1. Tables

There are two general types of tables used to summarize results of meta-analytic reviews: tables presenting summary information such as mean effect sizes, and tables summarizing coded aspects of the individual studies included in the meta-analysis. Both types of table are common; in the survey by Borman and Grigg (2009), these tables were included in 74% and 89%, respectively, of published meta-analyses.

1.1. Summary Tables

Summary tables can be used to report aggregate information obtained from meta-analytic combination and comparison of multiple studies. This can include information about the central tendency of effect sizes (e.g., mean, median), the distribution of these effect sizes (range, heterogeneity tests, indices of heterogeneity such as I²), and results of moderator analyses. If your review contains a single meta-analysis (i.e., all studies included in a single meta-analysis), this table will likely be rather narrow, so you should consider whether such a table is worth the space beyond summarizing such information in the text. However, if your review includes several meta-analyses (i.e., a series of meta-analyses of separate effect sizes), this table will be wider and contain a wealth of information more concisely summarized than can be done in text. Summary tables are especially useful in these latter situations.

Table 13.1 illustrates one of many ways you might organize a summary table. Here, I summarize results for the ongoing example meta-analysis used throughout this book involving associations between relational aggression and peer rejection. The first two rows display results of the heterogeneity test and its significance (denoted by asterisks) and the I² index to quantify the magnitude of heterogeneity. The next two rows display the random-effects mean effect size (with significance level) and confidence intervals around these means. The remaining rows display the results of two moderator analyses: the categorical moderator “reporter” and the continuous moderator age. After reporting the omnibus test (Qbetween) of the categorical moderator, I report the mean associations within each group (type of reporter),3 denoting significant differences between groups with alphabetic subscripts. For the continuous moderator variable (age), I report its significance (Qregression), followed by the unstandardized regression coefficient and predicted correlations at various meaningful values of the moderator. This table could be expanded in various ways, such as by including additional rows to report the results of more moderator analyses (including comparisons of published vs. unpublished studies to evaluate publication bias), or by adding additional columns to report the results of other meta-analyses (e.g., Card et al., 2008, also reported results involving associations of overt forms of aggression with peer rejection, as well as associations of relational aggression with various other aspects of adjustment).
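As a minimal sketch of where the first rows of such a summary table come from, the following computes a heterogeneity test (Q), the I² index, and a weighted mean correlation with its 95% confidence interval. It assumes correlations are analyzed as Fisher's Zr with weights of N − 3, and for brevity it uses a fixed-effects weighted mean rather than the random-effects mean reported in the table. The function name and the input values are hypothetical and chosen only for illustration.

```python
import math

def summarize_correlations(studies):
    """Summarize (r, n) pairs: weighted mean r, 95% CI, Q, df, and I-squared.

    Assumes Fisher's Z transformation with weights w = n - 3
    (a fixed-effects simplification, used here only for illustration).
    """
    z = [math.atanh(r) for r, _ in studies]
    w = [n - 3 for _, n in studies]
    sum_w = sum(w)
    mean_z = sum(wi * zi for wi, zi in zip(w, z)) / sum_w
    se = math.sqrt(1 / sum_w)
    q = sum(wi * (zi - mean_z) ** 2 for wi, zi in zip(w, z))
    df = len(studies) - 1
    # I-squared: share of total variability beyond sampling error, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {
        "mean_r": math.tanh(mean_z),
        "ci": (math.tanh(mean_z - 1.96 * se), math.tanh(mean_z + 1.96 * se)),
        "q": q,
        "df": df,
        "i2": i2,
    }

# Hypothetical (r, n) pairs, not values from the example meta-analysis.
summary = summarize_correlations([(0.10, 103), (0.50, 103), (0.35, 203)])
print(f"Q({summary['df']}) = {summary['q']:.2f}, I2 = {summary['i2']:.1f}%")
print(f"mean r = {summary['mean_r']:.2f}, 95% CI = "
      f"[{summary['ci'][0]:.2f}, {summary['ci'][1]:.2f}]")
```

Each printed line corresponds to one row of a summary table; rows for moderator analyses would be added in the same way.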

1.2. Tables of Individual Studies

It is very useful—and arguably even necessary—to provide a detailed listing of the values you coded for each of the studies included in your meta-analytic review. This sort of table should report, for each of the studies included in your meta-analytic review, basic citation information for the study (e.g., authors, year), sample size, your coding for all of the study characteristics used in your review (for either descriptive purposes or in moderator analy­ses), and effect sizes. If you performed any artifact adjustments (see Chap­ter 6), the artifact information (e.g., reliability estimates, dichotomizations) should also be reported.

The most common order of studies within this type of table is to list studies either alphabetically by author names or else chronologically by year of publication. Although such ordering is useful for readers to find a par­ticular study or to see if any studies were excluded, it is not necessarily the most informative approach (Borman & Grigg, 2009). A preferable way to organize these tables is likely according to some important characteristics of studies, such as by moderator variables found to be important in your meta­analyses.

To illustrate this sort of table, Table 13.2 presents coded details of the studies used in the meta-analysis on relational aggression and peer rejec­tion. Here, I have organized studies first by reporter (one of the moderator variables) and then by age (another moderator variable). You can see that this table contains a row for each study,4 columns for each study characteristic coded, and the coded effect size.
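The ordering described above (first by reporter, then by age) can be sketched as a two-key sort. The study labels, column names, and values below are placeholders invented for illustration; they are not the studies in Table 13.2.

```python
from operator import itemgetter

# Hypothetical coded studies; names and values are placeholders, not real data.
coded_studies = [
    {"study": "Study A", "n": 210, "reporter": "teacher", "age": 12.0, "r": 0.25},
    {"study": "Study B", "n": 95, "reporter": "peer", "age": 10.0, "r": 0.48},
    {"study": "Study C", "n": 150, "reporter": "self", "age": 8.0, "r": 0.12},
    {"study": "Study D", "n": 330, "reporter": "peer", "age": 7.5, "r": 0.39},
]

def order_for_table(rows, keys=("reporter", "age")):
    """Sort study rows by the given moderator columns, first key first."""
    return sorted(rows, key=itemgetter(*keys))

def format_rows(rows):
    """Render a header plus one fixed-width text line per study."""
    lines = [f"{'Study':<10}{'N':>5}  {'Reporter':<9}{'Age':>5}{'r':>6}"]
    for row in rows:
        lines.append(
            f"{row['study']:<10}{row['n']:>5}  {row['reporter']:<9}"
            f"{row['age']:>5.1f}{row['r']:>6.2f}"
        )
    return lines

print("\n".join(format_rows(order_for_table(coded_studies))))
```

Changing the `keys` argument reorders the same rows by a different moderator, which makes it easy to try several organizations before settling on the most informative one.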

2. Figures

The statement “a picture is worth a thousand words” is a cliche but neverthe­less, it is true: Thoughtful use of figures to present meta-analytic results is an efficient way to present a large amount of information, including informa­tion about central tendency and variability in effect sizes, moderator effects, publication bias, and potential outlier studies. I next describe three types of figures that you can consider in presenting results of your meta-analysis, considering the type of information that is conveyed in each type of figure.

2.1. Forest Plots

These plots are rarely used in social sciences, though they are common in research syntheses of medical trials (Borman & Grigg, 2009). These plots, such as those illustrated in Figure 13.1, are formed by listing the studies included in the meta-analysis down the left side of the figure. The area to the right of each study displays information about the mean (filled circles) and 95% confidence intervals (horizontal lines) for each study in the meta­analysis. The thick vertical line represents the (weighted) mean of these effect sizes. Although it is not done in every instance, I have also included a vertical (dashed) line to indicate the null result of r = .00 to illustrate which studies yield significant effect sizes.

Forest plots portray a range of information. First, they present information regarding both the point estimate and uncertainty of effect sizes from every study in your meta-analysis, serving a useful summary function similar to tables of individual studies. Second, the vertical line for the mean effect size makes the central tendency apparent. Third, this plot provides visual information regarding the heterogeneity of studies. Observing that several (more than the approximately 1 in 20 expectable by chance) of the study confidence intervals do not contain the common mean effect size (vertical line) serves as visual evidence of significant heterogeneity, and the range of these study-specific effect sizes around this vertical line provides some indication of the variability in these effect sizes. Although not apparent in Figure 13.1, a forest plot is also useful for detecting studies with extreme effect sizes (far to the left or right of other studies with confidence intervals not approaching the rest of the studies).

The basic forest plot such as the one I have shown in Figure 13.1 can be extended in several ways (see Borman & Grigg, 2009). For instance, the studies could be ordered in some meaningful way rather than alphabetically, such as by a key study characteristic (i.e., moderator). If order is by a cat­egorical moderator, then you might consider adding multiple vertical lines to denote different mean values within moderator groups. The sizes of the circles for study effect sizes could be larger or smaller to denote, for instance, their relative weighting. It would also be possible to change the shapes or other characteristics (e.g., color, if presenting in color) of these study-specific effect sizes to indicate other features, such as values on a second moderator of interest. Finally, you might consider merging the information of a table of individual studies (e.g., sample sizes, coded scores on various moderators) and the forest plot by creating a hybrid table and figure. This would display a tremendous amount of information, though it might be rather large if you have a large (in terms of numbers of studies and coded study characteristics) meta-analysis.
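To make the construction of a basic forest plot concrete, here is a rough text-only sketch: each row shows a study's correlation (`o`), its 95% confidence interval (dashes), and the null value of r = .00 (`:`). It assumes confidence intervals are computed through Fisher's Zr with a standard error of 1/√(N − 3); the function names, the 41-column axis running from −1 to +1, and the study values are all hypothetical. A publication-quality figure would normally be drawn with plotting software instead.

```python
import math

def ci_for_r(r, n, z_crit=1.96):
    """95% confidence interval for a correlation via Fisher's Z."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

def forest_row(label, r, n, width=41):
    """One text row of a forest plot on an axis from r = -1 to r = +1."""
    lo, hi = ci_for_r(r, n)

    def col(value):
        # Map a correlation in [-1, 1] to a column index in [0, width - 1].
        return round((value + 1) / 2 * (width - 1))

    cells = [" "] * width
    for c in range(col(lo), col(hi) + 1):
        cells[c] = "-"          # confidence interval
    cells[col(0.0)] = ":"       # null result, r = .00
    cells[col(r)] = "o"         # point estimate
    return f"{label:<10}{''.join(cells)}  {r:+.2f} [{lo:+.2f}, {hi:+.2f}]"

# Hypothetical studies: (label, r, n).
for label, r, n in [("Study A", 0.30, 103), ("Study B", 0.05, 60), ("Study C", 0.55, 250)]:
    print(forest_row(label, r, n))
```

A row whose dashes pass through the `:` column has a confidence interval that includes zero, which is the same visual check the dashed null line provides in a drawn forest plot.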

2.2. Stem-and-Leaf Plots

These plots are commonly used and convey considerable information, including information about central tendency, variability, and distributional form (e.g., skewness, modality) of a set of effect sizes, as well as pointing to potential outlier studies with extreme effect sizes. Stem-and-leaf plots consist of two parts. The “stem” is the vertical array of “bins” of possible effect sizes (e.g., correlations between .70 and .79, between .60 and .69, etc.), and each “leaf” is a single-digit number representing the effect size from a single study. These effect sizes can be either in the original metric (e.g., r, o) or in a transformed metric (e.g., Zr, ln(o)); the original metric is more intuitive for readers, but the transformed metric is more useful for assessing potential skew in the distribution of effect sizes.5 Figure 13.2 presents a stem-and-leaf plot for the 22 studies in the example meta-analysis. The numbers to the left of the vertical line comprise the stem, scaled in intervals of .1 with a range one value higher and one value lower than the most extreme effect sizes found among these 22 studies. To the right of the vertical line are the leaves, with each digit representing a single study. For example, the highest leaf is the value 2 connected to the stem at .6 to represent a study that found a correlation of .62 between relational aggression and peer rejection. The five digits (leaves) connected to the .5 stem denote five studies finding associations between .50 and .59 (specifically, .53, .55, .56, .57, and another .57; note that the leaves are arranged from lowest to highest values moving away from the stem).

Visual inspection of this figure provides a variety of information. First, the visual spacing of the leaves provides information about the number of studies finding effect sizes of approximate values (e.g., you can see that more studies find correlations in the .50 to .59 range than the .60 to .69 range; note that it is preferable to use a font that is uniform in width for all values so that the size of the rightward-extending bar represents the number of studies on that stem). Second, this sort of plot gives an approximate, though not precise, idea about central tendency. Recalling that the weighted mean r = .37 among these studies, you can see that this is near an approximate “balancing point” of the distribution of these effect sizes (though be aware that visual inspection of the stem-and-leaf plot does not take into account differential weighting of studies). Third, stem-and-leaf plots visually display the heterogeneity of effect sizes across studies. In this example, there is considerable dispersion among the effect sizes, which is consistent with quantitative findings of significant heterogeneity and a large I². Fourth, stem-and-leaf plots provide visual information about the distribution of effect sizes. In this example, it appears that the effect sizes are somewhat skewed to have a longer tail toward the lower/negative values. Finally, it can be useful to inspect stem-and-leaf plots for studies with outlying effect sizes; in this example, no study dramatically departs from the others.

Stem-and-leaf plots are commonly used in reports of meta-analytic find­ings (about 30% of reports surveyed by Borman & Grigg, 2009). You can also extend the basic stem-and-leaf plot to provide more sophisticated informa­tion. For example, it is possible to provide multiple sets of leaves to represent studies with different study characteristics (i.e., different values of categori­cal moderators). By orienting these multiple sets of leaves side by side, scaled along a common vertical axis (i.e., stem), readers can gain an appreciation for the differences in central tendency, variability, distributional form, and pos­sible outlier studies within each group of studies.
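The construction of a basic stem-and-leaf plot can be sketched in a few lines. This is a minimal illustration that assumes correlations reported to two decimals with absolute values below 1; the function name is hypothetical, and unlike Figure 13.2 it does not add an extra empty stem above and below the most extreme values. The only input values used are the six correlations quoted in the text (.62, .53, .55, .56, .57, .57), not the full set of 22 studies.

```python
def stem_and_leaf(correlations):
    """Return text lines of a stem-and-leaf plot for two-decimal correlations."""
    leaves = {}
    for r in correlations:
        hundredths = round(abs(r) * 100)
        negative = r < 0 and hundredths > 0
        # Signed stem index: .0 -> 0, .1 -> 1, ...; -.0 -> -1, -.1 -> -2, ...
        index = -(hundredths // 10 + 1) if negative else hundredths // 10
        leaves.setdefault(index, []).append(hundredths % 10)

    lines = []
    # List stems from highest to lowest, keeping empty stems in between.
    for index in range(max(leaves), min(leaves) - 1, -1):
        label = f" .{index}" if index >= 0 else f"-.{-index - 1}"
        digits = "".join(str(d) for d in sorted(leaves.get(index, [])))
        lines.append(f"{label} | {digits}")
    return lines

# The six correlations quoted in the text for the .6 and .5 stems.
print("\n".join(stem_and_leaf([0.62, 0.53, 0.55, 0.56, 0.57, 0.57])))
# prints:
#  .6 | 2
#  .5 | 35677
```

Printing the result in a fixed-width font preserves the property noted earlier, that the length of each row of leaves reflects the number of studies on that stem.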

2.3. Funnel Plots

As seen in Chapter 11, funnel plots are a way of graphically evaluating poten­tial publication bias (or other biases leading to censoring of nonsignificant results). As you recall, these plots are scatterplots of the studies in your meta­analysis, with one axis representing some function of sample sizes (or stan­dard errors) and the other representing effect sizes. Because I described these plots in Chapter 11, I will not discuss them here.

The main purpose of funnel plots is to identify potential publication bias. However, these plots also display information about the mean effect size (which can be shown as a line through the scatterplot) and about heterogeneity (i.e., the width of the funnel). According to the survey of published meta-analyses by Borman and Grigg (2009), these plots are used with modest frequency (12.5% of meta-analyses considered). My own impression is that the value of funnel plots lies primarily in detecting publication bias. For other information (e.g., mean effect sizes, heterogeneity), other figures are more effective, or equally effective while using less space.

2.4. Other Figures

My consideration of forest plots, stem-and-leaf plots, and funnel plots only touches on the many options available. These other options include schematic plots (a.k.a. box-and-whisker plots, which provide clear information about means, heterogeneity, and outliers); normal quantile plots (which are useful in evaluating publication bias); and radial plots (which are fairly technical plots of studies’ mean effect sizes and precisions) described by Borman and Grigg (2009). This variety of potential graphical displays is valuable in pro­viding a wide range of tools for presenting the results of your meta-analysis. When choosing a method of displaying your results, however, you should always keep in mind what information is most important to convey.

Practical Matters: Avoiding Common Problems in Reporting Results of Meta-Analyses

In this section, I identify 10 problems that I perceive to be common in report­ing results of meta-analytic reviews. More importantly, I offer concrete sug­gestions for how you can avoid each. Although following these suggestions will not guarantee that your meta-analytic report will be successful (whether defined by publication in a top-outlet, high-impact, or any other criterion), doing so will help you avoid some of the most common obstacles.

  1. Disconnecting conceptual rationale and data analyses. One of the more common problems with written reports of meta-analyses (and probably most empirical papers) is a disconnect between the conceptual rationale for the review in the introduction and the analyses and results actually presented. Every analysis performed should be performed for a reason, and this reason should be described in the Introduction of your paper. Even if some analyses were entirely exploratory, it is better to state as much rather than have read­ers guess why you performed a particular analysis. A good way to avoid this problem is simply to compile a list of analyses presented in your results sec­tion, and then identify the section in your introduction in which you justify this analysis.
  2. Providing insufficient details of methodology. I have tried to emphasize the importance of describing your meta-analytic method in sufficient detail so that a reader could, at least in principle, replicate your review. This level of detail requires extensive description of your search strategies, inclusion and exclusion criteria, practices of coding both study characteristics and effect sizes, and the data-analytic strategy you used. Because it is easier to know what you did than to describe it, I strongly recommend that you ask a colleague familiar with meta-analytic techniques to review a draft of your description to determine if he or she could replicate your methodology based only on what you wrote.
  3. Writing a phone book. Phone books contain a lot of information, but you probably do not consider them terribly exciting to read. When presenting results of your meta-analysis, you have a tremendous amount of information to potentially present: results of many individual studies, a potentially vast array of summary statistics about central tendency and heterogeneity of effect sizes, likely a wide range of nuanced results of moderator analyses, analyses addressing publication bias, and so on. Although it is valuable to report most or all of these results (that is one of the main purposes of sharing your work with others), this reporting should not be an uninformative listing of numbers that fails to tell a coherent story. Instead, it is critical that the numbers are embedded within an understandable story. To test whether your report achieves this, try the following exercise: (1) Take what you believe is a near-complete draft of your results section, and delete every clause that contains a statistic from your meta-analysis or any variant of “statistical significance”; (2) read this text and see if what remains provides an understandable narrative that accurately (if not precisely) describes your results. If it does not, then this should highlight to you places where you should better guide readers through your findings.
  4. Allowing technical complexity to detract from message. Robert Rosenthal once wrote, “I have never seen a meta-analysis that was ‘too simple’ ” (Rosenthal, 1995, p. 183). Given that Rosenthal was one of the originators of meta-analytic techniques (see Chapter 1) and has probably read far more meta-analytic reviews than you or I ever will, his insight is important. Although complex meta-analytic techniques can be useful to answer some complex research questions, you should keep in mind that many important questions can be answered using relatively simple techniques. I encourage you to use techniques that are as complex as needed to adequately answer your research questions, but no more complex than needed. With greater complexity of your techniques comes a greater chance (1) of making mistakes that you may fail to detect, and (2) of confusing your readers. Even if you feel confident in your ability to avoid mistakes, the cost of confusing readers is high in that they are less likely to understand and, in some cases, to trust your conclusions. The acronym KISS (Keep It Simple, Stupid) is worth bearing in mind. To test whether you have achieved adequate simplicity, I suggest that you (1) have a colleague (or multiple colleagues) who is unfamiliar with meta-analysis but is otherwise a regular reader of your targeted publication outlet read your report; then (2) ask this colleague or colleagues to describe your findings to you. If there are any aspects that your colleague is unable to understand or that lead to inaccurate conclusions, then you should edit those sections to be understandable to readers not familiar with meta-analysis.
  5. Forgetting why you performed the meta-analysis. Although I doubt that many meta-analysts really forget why they performed a meta-analysis, the written reports often seem to indicate that they have. This is most evident in the discussion section, where too many writers neglect to make clear state­ments about how the results of their meta-analysis answer the research ques­tions posed and advance understanding in their field. Extending my earlier recommendation (problem 1 above) for ensuring connections between the rationale and the analyses performed, you should be sure that items on your list of analyses and conceptual rationales are addressed in the discussion section of your report. Specifically, be sure that you have clearly stated (1) the answers to your research questions, or why your findings did not provide answers, and (2) why these answers are important to understanding the phe­nomenon or guiding application (e.g., intervention, policy).
  6. Failing to consider the limits of your sample of studies. Every meta­analysis, no matter how ambitious the literature search or how liberal the inclusion criteria, necessarily involves a finite—and therefore potentially limited—sample of studies. It is important for you to state—or at least speculate—where these limits lie and how they qualify your conclusions. You should typically report at least some results evaluating publication bias (see Chapter 11), and comment on these in the discussion section. Evidence of publication bias does not constitute a fatal flaw of your meta-analysis if your literature search and retrieval strategies were as extensive as can be reasonably expected, but you should certainly be clear about the threat of publication bias. Similarly, you should clearly articulate the boundaries of your sample as determined by either inclusion/exclusion criteria (Chapter 3) or characteristics of the empirical literature performed (elucidated by your reporting of descriptive information about your sample of studies). Descrip­tion of the boundaries of your sample should be followed with speculation regarding the limits of generalizability of your findings.
  7. Failing to provide (and consider) descriptive features of studies. Problem 4 (allowing technical complexity to detract from your message) and problem 6 (failing to consider the limits of your sample) too often converge in the form of this problem: failing to provide basic descriptive information about the studies that comprise your meta-analysis. As mentioned, reporting this information is important for describing the sample from which you draw conclusions, as well as describing the state of the field and making recom­mendations for further avenues of research. The best way to ensure that you provide this information is to include a section (or at least a paragraph or two) at the beginning of your results section that provides this information.
  8. Using fixed-effects models in the presence of heterogeneity. This is a rather specific problem but one that merits special attention. As you recall from Chapter 10, fixed-effects models assume a single population effect size (any variability among effect sizes across studies is due to sampling error), whereas random-effects models allow for a distribution of population effect sizes. If you use a fixed-effects model to calculate a mean effect size across studies in the presence of substantial heterogeneity, then the failure to model this heterogeneity provides standard errors (and resulting confidence intervals) that are smaller than is appropriate. To avoid this problem, you should always evaluate heterogeneity via the heterogeneity significance test (Q; see Chapter 8) as well as some index that is not impacted by the size of your sample (such as I²; see Chapter 8). If there is evidence of statistically significant or substantial heterogeneity, then you are much more justified in using a random- rather than a fixed-effects model (see Chapter 10 for considerations). A related problem to avoid is making inappropriately generalized conclusions from fixed-effects models; you should be careful to frame your conclusions according to the model you used to estimate mean effect sizes in your meta-analysis (see Chapter 10).
  9. Failing to consider the limits of meta-analytic moderator analyses. I have mentioned that the results of moderator analyses are often the most important findings of a meta-analytic review. However, you should keep in mind that findings of moderation in meta-analyses are necessarily correlational: certain study characteristics covary with larger or smaller effect sizes. This awareness should remind us that findings of moderation in meta-analyses (or any nonexperimental study) cannot definitively establish that the presumed moderator is not just a proxy for another moderator (i.e., another study characteristic). You should certainly acknowledge this limitation in describing moderator results from your meta-analysis, and you should consider alternative explanations. Of course, the extent to which you can empirically rule out other moderators (through multiple regression moderator analyses controlling for them; see Chapter 10) diminishes the range of competing explanations, and you should note this as well. To ensure that you avoid the problem of overinterpreting moderator results, I encourage you to jot down (separate from your manuscript) at least three alternative explanations for each moderator result, and write about those that seem most plausible.
  10. Believing there is a “right way” to perform and report a meta-analysis. Although this chapter (and other works; e.g., Clarke, 2009; Rosenthal, 1995) provides concrete recommendations for reporting your meta-analysis, you should remember that these are recommendations rather than absolute prescriptions. There are contexts when it is necessary to follow predetermined formats for reporting the results of a meta-analysis (e.g., when writing a commissioned review as part of the Campbell [www.campbellcollaboration.org] or Cochrane [www.cochrane.org] Collaborations), but these are exceptions to the latitude typically available in presenting the results of your review. This does not mean that you should present your work deceptively, but rather that you should consider the myriad possibilities for presenting your results, keeping in mind the goals of your review, how you think the findings are best organized, the audience for your review, and the space limitations of your report. I believe that the suggestions I have made in this chapter, and throughout the book, are useful if you are just beginning to use meta-analytic techniques. But as you gain experience and consider how to best present your findings, you are likely to find instances where I have written “should” that are better replaced with “should usually, but . . . ”. I encourage this use of my (and others’) recommendations as jumping-off points for your efforts in presenting your findings.
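Returning to problem 8 in the list above, a small numerical sketch can show why a fixed-effects standard error is too small when effect sizes are heterogeneous. It assumes inverse-variance weights and, as one common choice that is my assumption rather than something specified in this section, a method-of-moments estimate of the between-study variance (often attributed to DerSimonian and Laird). The function name and input values are hypothetical.

```python
import math

def fixed_and_random_se(effects, variances):
    """Compare standard errors of the mean effect under the two models.

    effects: study effect sizes; variances: their sampling variances (SE squared).
    tau2 is a method-of-moments estimate of between-study variance
    (an assumed estimator, used here only for illustration).
    """
    w = [1 / v for v in variances]
    sum_w = sum(w)
    mean_fixed = sum(wi * es for wi, es in zip(w, effects)) / sum_w
    q = sum(wi * (es - mean_fixed) ** 2 for wi, es in zip(w, effects))
    df = len(effects) - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Random-effects weights add tau2 to each study's sampling variance.
    w_star = [1 / (v + tau2) for v in variances]
    return {
        "q": q,
        "tau2": tau2,
        "se_fixed": math.sqrt(1 / sum_w),
        "se_random": math.sqrt(1 / sum(w_star)),
    }

# Hypothetical effect sizes with equal sampling variances but wide dispersion.
result = fixed_and_random_se([0.1, 0.5, 0.9], [0.01, 0.01, 0.01])
print(f"Q = {result['q']:.1f}, tau2 = {result['tau2']:.2f}")
print(f"SE fixed = {result['se_fixed']:.3f}, SE random = {result['se_random']:.3f}")
```

With these illustrative inputs the random-effects standard error is several times larger than the fixed-effects one; when the effect sizes are homogeneous, the estimated between-study variance is zero and the two standard errors coincide.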
