Meta-analysis in management research

What is a meta-analysis, and why perform one?

“Meta-analysis is the statistical combination of results from two or more separate studies” (Deeks et al., 2019, chapter 10). When the treatment effect (or effect size) is consistent from one study to the next, meta-analysis can be used to identify this common effect. When the effect varies from one study to the next, meta-analysis may be used to identify the reason for the variation.

Decisions about the utility of an intervention or the validity of a hypothesis cannot be based on the results of a single study, because results typically vary from one study to the next. Rather, a mechanism is needed to synthesize data across studies. Narrative reviews had been used for this purpose, but the narrative review is largely subjective (different experts can come to different conclusions) and becomes impossibly difficult when there are more than a few studies involved. Meta-analysis, by contrast, applies objective formulas (much as one would apply statistics to data within a single study), and can be used with any number of studies.

Pharmaceutical companies use meta-analysis to gain approval for new drugs, with regulatory agencies sometimes requiring a meta-analysis as part of the approval process. Clinicians and applied researchers in medicine, education, psychology, criminal justice, and a host of other fields use meta-analysis to determine which interventions work, and which ones work best. Meta-analysis is also widely used in basic research to evaluate the evidence in areas as diverse as sociology, social psychology, sex differences, finance and economics, political science, marketing, ecology, and genetics, among others.


Where does meta-analysis fit in the research process?

Publications

Many journals encourage researchers to submit systematic reviews and meta-analyses that summarize the body of evidence on a specific question, and this approach is replacing the traditional narrative review. Meta-analyses also play supporting roles in other papers. For example, a paper that reports results for a new primary study might include a meta-analysis in the introduction to synthesize prior data and help to place the new study in context.

Planning new studies

Meta-analyses can play a key role in planning new studies. The meta-analysis can help identify which questions have already been answered and which remain to be answered, which outcome measures or populations are most likely to yield significant results, and which variants of the planned intervention are likely to be most powerful.

Grant applications

Meta-analyses are used in grant applications to justify the need for a new study. The meta-analysis serves to put the available data in context and to show the potential utility of the planned study. The graphical elements of the meta-analysis, such as the forest plot, provide a mechanism for presenting the data clearly, and for capturing the attention of the reviewers. Some funding agencies now require a meta-analysis of existing research as part of the grant application to fund new research.

The Need for Research Synthesis in the Social Sciences

Isaac Newton is known to have humbly explained his success: “If I have seen further it is by standing upon the shoulders of giants” (1675; from Columbia World of Quotations, 1996). Although the history of science suggests that Newton may have been as likely to kick his fellow scientists down as he was to collaboratively stand on their shoulders (e.g., Boorstin, 1983, Chs. 52-53; Gribbin, 2002, Ch. 5), this statement does eloquently portray a central principle in science: that the advancement of scientific knowledge is based on systematic building of one study on top of a foundation of prior studies, the accumulation of which takes our understanding to ever increasing heights. A closely related tenet is replication: that findings of studies are confirmed (or not) through repetition by other scientists.

Together, the principles of orderly accumulation and replication of empirical research suggest that scientific knowledge should steadily progress. However, it is reasonable to ask if this is really the case. One obstacle to this progression is that scientists are humans with finite abilities to retain, organize, and synthesize empirical findings. In most areas of research, studies are being conducted at an increasing rate, making it difficult for scholars to stay informed of research in all but the narrowest areas of specialization. I argue that many areas of social science research are in less need of further research than they are in need of organization of the existing research. A second obstacle is that studies are rarely exact replications of one another, but instead commonly use slightly different methods, measures, and/or samples. This imperfect replication makes it difficult (1) to separate meaningful differences in results from expectable sampling fluctuations, and (2) if there are meaningful differences in results across studies, to determine which of the several differences in studies account for the differences in results.

An apparent solution to these obstacles is that scientists systematically review results from the numerous studies, synthesizing results to draw conclusions regarding typical findings and sources of variability across studies. One method of conducting such systematic syntheses of the empirical literature is through meta-analysis, which is a methodological and statistical approach to drawing conclusions from empirical literature. As I hope to demonstrate in this book, meta-analysis is a particularly powerful tool in drawing these sorts of conclusions from the existing empirical literature. Before describing this tool in the remainder of the book, in this chapter I introduce some terminology of this approach, provide a brief history of meta-analysis, further describe the process of research synthesis as a scientific endeavor, and then provide a more detailed preview of the remainder of this book.

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

Basic Terminology in Meta-Analysis

Before further discussing meta-analysis, it is useful to clarify some relevant terminology. One clarification involves the distinction of meta-analysis from primary or secondary analysis. The second clarification involves terminology of meta-analysis within the superordinate category of a literature review.

1. Meta-Analysis versus Primary or Secondary Analysis

The first piece of terminology to clarify is the difference among the terms “meta-analysis,” “primary analysis,” and “secondary analysis” (Glass, 1976). The term “primary analysis” refers to what we typically think of as data analysis: a researcher collects data from individual persons, companies, and so on, and then analyzes these data to provide answers to the research questions that motivated the study. The term “secondary analysis” refers to re-analysis of these data, often to answer different research questions or to answer research questions in a different way (e.g., using alternative analytic approaches that were not available when the data were originally analyzed). This secondary data analysis can be performed either by the original researchers or by others if they are able to obtain the raw data from the researchers. Both primary and secondary data analysis require access to the full, raw data as collected in the study.

In contrast, meta-analysis involves the statistical analysis of the results from more than one study. Two points of this definition merit consideration in differentiating meta-analysis from either primary or secondary analysis. First, meta-analysis involves the results of studies as the unit of analysis, specifically results in the form of effect sizes. Obtaining these effect sizes does not require having access to the raw data (which are all too often unavailable), as it is usually possible to compute these effect sizes from the data reported in papers resulting from the original, primary or secondary, analysis. Second, meta-analysis is the analysis of results from multiple studies, in which individual studies are the unit of analysis. The number of studies can range from as few as two to as many as several hundred (or more, limited only by the availability of relevant studies). Therefore, a meta-analysis involves drawing inferences from a sample of studies, in contrast to primary and secondary analyses that involve drawing inferences from a sample of individuals. Given this goal, meta-analysis can be considered a form of literature review, as I elaborate next.
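To illustrate how effect sizes can be recovered from reported summaries rather than raw data, here is a minimal Python sketch of two standard conversion formulas: a standardized mean difference from group means and standard deviations, and a correlation from a reported t statistic. The function names and all numbers are hypothetical and do not come from the source. The formulas are the usual textbook ones and may differ in detail from those presented later in the book.

```python
import math


def cohens_d(mean_1, mean_2, sd_1, sd_2, n_1, n_2):
    """Standardized mean difference, using the pooled standard deviation."""
    pooled_var = ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2)
    return (mean_1 - mean_2) / math.sqrt(pooled_var)


def r_from_t(t, df):
    """Correlation implied by a reported t statistic and its degrees of freedom."""
    return t / math.sqrt(t ** 2 + df)


# Hypothetical values of the kind a published paper might report.
d = cohens_d(mean_1=10.0, mean_2=8.0, sd_1=2.0, sd_2=2.0, n_1=50, n_2=50)
r = r_from_t(t=2.0, df=96)
print(f"d = {d:.2f}, r = {r:.2f}")
```

Neither function touches individual-level data; each needs only the summary statistics that primary reports commonly include.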

2. Meta-Analysis as a Form of Literature Review

A second aspect of terminological consideration involves the place of meta-analysis within the larger family of literature reviews. A literature review can be defined as a synthesis of prior literature on a particular topic. Literature reviews differ along several dimensions, including their focus, goals, perspective, coverage, organization, intended audience, and method of synthesis (see Cooper, 1988, 2009a). Two dimensions are especially important in situating meta-analysis within the superordinate family of literature reviews: focus and method of synthesis. Figure 1.1 shows a schematic representation of how meta-analysis differs from other literature reviews in terms of focus and method of synthesis.

Meta-analyses, like other research syntheses, focus on research outcomes (not the conclusions reached by study authors, which Rosenthal noted are “only vaguely related to the actual results”; 1991, p. 13). Reviews focusing on research outcomes reach conclusions such as “The existing research shows X” or “These types of studies find X, whereas these other types of studies find Y.” Other types of literature reviews have different foci. Theoretical reviews focus on what theoretical explanations are commonly used within a field, attempt to explain phenomena using a novel theoretical alternative, or seek to integrate multiple theoretical perspectives. These are the types of reviews that are commonly reported in, for example, Psychological Review. Survey reviews focus on typical practices within a field, such as the use of particular methods in a field or trends in the forms of treatment used in published clinical trials (e.g., Card & Little, 2007, surveyed published research in child development to report the percentage of studies using longitudinal designs). Although reviews focusing on theories or surveying practices within the literature are valuable contributions to science, it is important to distinguish the focus of meta-analysis on research outcomes from these other types of reviews.

However, not all reviews that focus on research outcomes are meta-analyses. What distinguishes meta-analysis from other approaches to research synthesis is the method of synthesizing findings to draw conclusions. The methods shown in the bottom of Figure 1.1 can be viewed as a continuum from qualitative to quantitative synthesis. At the left is the narrative review. Here, the reviewer evaluates the relevant research and somehow draws conclusions. This “somehow” represents the limits of this qualitative, or narrative, approach to research synthesis. The exact process of how the reviewer draws conclusions is unknown, or at least not articulated, so there is considerable room for subjectivity in the research conclusions reached. Beyond just the potential for subjective bias to emerge, this approach to synthesizing research taxes the reviewer’s ability to process information. Reviewers who attempt to synthesize research results qualitatively tend to perceive more inconsistency and smaller magnitudes of effects than those performing meta-analytic syntheses (Cooper & Rosenthal, 1980). In sum, the most common method of reviewing research (reading empirical reports and “somehow” drawing conclusions) is prone to subjectivity and places demands on the reviewer that make conclusions difficult to reach.

Moving toward the right, or quantitative direction, of Figure 1.1 are two vote-counting methods, which I have termed informal and formal. Both involve considering the significance of effects from research studies in terms of significant positive, significant negative, or nonsignificant results, and then drawing conclusions based on the number of studies finding a particular result. Informal (also called conventional) vote counting involves simply drawing conclusions based on “majority rules” criteria; so, if more studies find a significant positive effect than find other effects (nonsignificant or significant negative), one concludes that there is a positive effect. A more formal vote-counting approach (see Bushman & Wang, 2009) uses statistical analysis of the expected frequency of results given the type I error rates (e.g., given a traditional type I error rate of .05, do significantly more than 5% of studies find an effect?). Although vote-counting methods can be useful when information on effect sizes is unavailable, I do not discuss them in this book for two reasons (for descriptions of these vote-counting methods, see Bushman & Wang, 2009). First, conclusions of the existence of effects (i.e., statistical significance) can be more powerfully determined using meta-analytic procedures described in this book. Second, conclusions of significance alone are unsatisfying, and the focus of meta-analysis is on effect sizes that provide information about the magnitude of the effect.
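The contrast between the two vote-counting approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the informal rule is implemented as "significant positive studies outnumber all others," and the formal rule as a one-sided binomial test against the type I error rate, which is a simplification and not necessarily the procedure Bushman and Wang (2009) describe. Function names and study counts are hypothetical.

```python
from math import comb


def informal_vote_count(n_positive, n_negative, n_nonsignificant):
    """'Majority rules': True only if significant-positive studies outnumber all others."""
    return n_positive > n_negative + n_nonsignificant


def binomial_upper_tail(n_significant, n_studies, alpha=0.05):
    """P(X >= n_significant) for X ~ Binomial(n_studies, alpha).

    Under the null hypothesis of no effect, each study is assumed to reach
    significance with probability alpha (an assumption of this sketch).
    """
    return sum(
        comb(n_studies, k) * alpha ** k * (1 - alpha) ** (n_studies - k)
        for k in range(n_significant, n_studies + 1)
    )


# Hypothetical literature: 20 studies, 4 significant positive, 1 significant
# negative, 15 nonsignificant.
print(informal_vote_count(4, 1, 15))          # False: no majority of positive results
print(round(binomial_upper_tail(4, 20), 4))   # about 0.016: more hits than 5% would predict
```

With these made-up counts the informal rule concludes there is no effect, while the binomial test suggests that four significant results out of twenty is more than chance alone would be expected to produce. Neither result says anything about the magnitude of the effect, which is the second limitation noted above.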

At the right side of Figure 1.1 is meta-analysis, which is a form of research synthesis in which conclusions are based on the statistical analysis of effect sizes from individual studies. I reserve further description of meta-analysis for the remainder of the book, but my hope here is that this taxonomy makes clear that meta-analysis is only one approach to conducting a literature review. Specifically, meta-analysis is a quantitative method of synthesizing empirical research results in the form of effect sizes. Despite this specificity, meta-analysis is a flexible and powerful approach to advancing scientific knowledge, in that it represents a statistically defensible approach to synthesizing empirical findings, which are the foundation of empirical sciences.

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

A Brief History of Meta-Analysis

In this section, I briefly outline the history of meta-analysis. My goal is not to be exhaustive in detailing this history (for more extensive treatments, see Chalmers, Hedges, & Cooper, 2002, Hedges, 1992, and Olkin, 1990; for a history intended for laypersons, see Hunt, 1997). Instead, I only hope to provide a basic overview to give you a sense of where the techniques described in this book have originated.

There exist several early individual attempts to statistically combine results from multiple studies. Olkin (1990) cites Karl Pearson’s work in 1904 to synthesize associations between inoculation and typhoid fever, and several similar approaches were described from the 1930s. Methods of combining probabilities advanced in the 1940s and 1950s (including the method that became well known as Stouffer’s method; see Rosenthal, 1991). But these approaches saw little application in the social sciences until the 1970s (with some exceptions such as work by Rosenthal in the 1960s; see Rosenthal, 1991).

It was only in the late 1970s that meta-analysis found its permanent place in the social sciences. Although several groups of researchers developed techniques during this time (e.g., Rosenthal & Rubin, 1978; Schmidt & Hunter, 1977), it was the work of Gene Glass and colleagues that introduced the term “meta-analysis” (Glass, 1976) and prompted attention to the approach, especially in the field of psychology. Specifically, Smith and Glass (1977) published a meta-analysis of the effectiveness of psychotherapy from 375 studies, showing that psychotherapy was effective and that there is little difference in effectiveness across different types of therapies. Although the former finding (that psychotherapy was effective) would probably have been received with little disagreement, the latter finding (little difference across types of therapies) was controversial and prompted considerable criticism (e.g., Eysenck, 1978). The controversial nature of Smith and Glass’s conclusion seems to have had both positive and negative consequences for meta-analysis. On the positive side, their convincing approach to the difficult question of the relative effectiveness of psychotherapies likely persuaded many of the value of meta-analysis. On the negative side, the criticisms of this particular study (which I believe were greater than would have been leveled against meta-analysis of a less controversial topic) have often been generalized to the entire practice of meta-analysis. I describe these criticisms in greater detail in Chapter 2.

Despite the controversial nature of this particular introduction of meta-analysis to psychology, the coming years witnessed a rapid increase in this approach. In the early 1980s, several books describing the techniques of meta-analysis were published (Glass, McGraw, & Smith, 1981; Hunter, Schmidt, & Jackson, 1982; Rosenthal, 1984). Shortly thereafter, Hedges and Olkin (1985) published a book on meta-analysis that was deeply rooted in traditional statistics. This rooting was important both in bringing formality and perceived statistical merit to the approach and in serving as a starting point for subsequent advances to meta-analytic techniques.

The decades since the introduction of meta-analysis to the social sciences have seen increasing use of this technique. Given its widespread use in social science research during the past three decades, it appears that meta-analysis is here to stay. For this reason alone, scholars need to be familiar with this approach in order to understand the scientific literature. However, understanding meta-analysis is valuable not only because it is widely used; more importantly, meta-analysis is widely used because it represents a powerful approach to synthesizing the existing empirical literature and contributing to the progression of science. My goal in the remainder of this book is to demonstrate this value to you, as well as to describe how one conducts a meta-analytic review.

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

The Scientific Process of Research Synthesis

Given the importance of research syntheses, including meta-analyses, to the progression of science, it is critical to follow scientific standards in their preparation. Most scientists are well trained in methods and data-analytic techniques to ensure objective and valid conclusions in primary research, yet methods and data-analytic techniques for research synthesis are less well known. In this section, I draw from Cooper’s (1982, 1984, 1998, 2009a) description of five stages of research synthesis to provide an overview of the process and scientific principles of conducting a research synthesis. These stages are formulating the problem, obtaining the studies, making decisions about study inclusion, analyzing and interpreting study results, and presenting the findings from the research synthesis.

As in any scientific endeavor, the first stage of a literature review is to formulate a problem. Here, the central considerations involve the question that you wish to answer, the constructs you are interested in, and the population about which you wish to draw conclusions. In terms of the questions answered, a literature review can only answer questions for which prior literature exists. For instance, to make conclusions of causality, the reviewer will need to rely on experimental (or perhaps longitudinal, as an approximation) studies; concurrent naturalistic studies would not be able to provide answers to this question. Defining the constructs of interest seems straightforward but poses two potential complications: The existing literature may use different terms or operationalizations for the same construct, or the existing literature may use similar terms to describe different constructs. Therefore, you need to define clearly the constructs of interest when planning the review. Similarly, you must consider which samples will be included in the literature review; for instance, you need to decide whether studies of unique populations (e.g., prison, psychiatric settings) should be included within the review. The advantages of a broad approach (in terms of constructs and samples) are that the conclusions of the review will be more generalizable and may allow for the identification of important differences among studies. However, a narrow approach will likely yield more consistent (i.e., more homogeneous, in the language of meta-analysis) results, and the quantity of literature that must be reviewed is smaller. Both of these features might be seen as advantages or disadvantages, depending on the goals (e.g., to identify average effects versus moderators) and ambition (in terms of the number of studies one is willing to code) of the reviewer.

The next step in a literature review is to obtain the literature relevant for the review. Here, the important consideration is that the reviewer is exhaustive, or at least representative, in obtaining relevant literature. It is useful to conceptualize the literature included as a sample drawn from a population of all possible studies. Adopting this conceptualization (and paralleling well-known principles of empirical primary research) highlights the importance of obtaining a representative sample of literature for the review. If the literature reviewed is not representative of the extant research, then the conclusions drawn will be a biased representation of reality. One common threat to all literature reviews is publication bias (also known as the file drawer problem). This threat is that studies that fail to find significant effects (or that find effects counter to what is expected) are less likely to be published, and therefore less likely to be accessible to the reviewer. To counter this threat, you should attempt to obtain unpublished studies (e.g., dissertations), which will either counter this threat or at least allow you to evaluate the magnitude of this bias (e.g., evaluating whether published versus unpublished studies find different effects). Another threat is that reviewers typically must rely on literature written in a language they know (e.g., English); this excludes literature written in other languages and therefore may exclude most studies conducted in other countries. Although it would be impractical to learn every language in which relevant literature may be written, you should be aware of this limitation and how it impacts the literature on which the review is based. To ensure the transparency of a literature review, the reviewer should report the means by which potentially relevant literature was searched and obtained.
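One way to evaluate whether published and unpublished studies find different effects is to compare their weighted mean effect sizes. The sketch below is a hedged illustration, not the procedure the source prescribes: it assumes a fixed-effect model with inverse-variance weights and a simple z test for the difference between two group means. The function names, effect sizes, and sampling variances are all hypothetical.

```python
import math


def weighted_mean(effects, variances):
    """Inverse-variance weighted mean effect size and the variance of that mean."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * e for w, e in zip(weights, effects)) / total
    return mean, 1.0 / total


def compare_groups(effects_a, variances_a, effects_b, variances_b):
    """Difference between two group means, its z statistic, and a two-sided p value."""
    mean_a, var_a = weighted_mean(effects_a, variances_a)
    mean_b, var_b = weighted_mean(effects_b, variances_b)
    diff = mean_a - mean_b
    z = diff / math.sqrt(var_a + var_b)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return diff, z, p


# Hypothetical effect sizes and sampling variances.
published = ([0.45, 0.52, 0.38], [0.010, 0.020, 0.015])
unpublished = ([0.20, 0.15], [0.020, 0.025])
diff, z, p = compare_groups(*published, *unpublished)
print(f"difference = {diff:.2f}, z = {z:.2f}, p = {p:.3f}")
```

A clearly larger mean among published studies would be consistent with publication bias, although other differences between the two sets of studies could produce the same pattern.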

The third, related, stage of a literature review is the evaluation of studies to decide which should inform the review. This stage involves reading the literature obtained in the prior stage (searching for relevant literature) and drawing conclusions regarding relevance. Obvious reasons to exclude works include investigation of constructs or samples that are irrelevant to the review (e.g., studies involving animals when one is interested in human behavior) or failure of the work to provide information relevant to the review (e.g., it treats the construct of interest only as a covariate without providing sufficient information about effects). Less obvious decisions need to be made for works that involve questionable quality or methodological features different from other studies. Including such works may improve the generalizability of the review but at the same time may contaminate the literature basis or distract from your focus. Decisions at this stage will typically involve refining the problem formulated at the first stage of the review.

The fourth stage is the most time-consuming and difficult: analyzing and interpreting the literature. As mentioned, there exist several approaches to how reviewers draw conclusions, ranging from qualitative to informal or formal vote counting to meta-analysis. For a meta-analysis, this stage involves systematically coding study characteristics and effect sizes, and then statistically analyzing these coded data. As I describe later in this book (Chapter 2), there are powerful advantages to using a meta-analytic approach.

The final stage of the literature review is the presentation of the review, often in written form. Although I suspend detailed recommendations on reporting meta-analyses until later in the book, a few general guidelines should be considered here. First, we should be transparent about the review process and decisions taken. Just as empirical works are expected to present sufficient details so that another researcher could replicate the results, a well-written research synthesis should provide sufficient detail for another scholar to replicate the review. Second, it is critical that the written report answers the original questions that motivated the review, or at least describes why such answers cannot be reached and what future work is needed to provide these answers. A third, related, guideline is that we should avoid a simple study-by-study listing. A good review synthesizes the literature rather than merely listing it. Meta-analysis provides a powerful way of drawing valuable information from multiple studies that goes far beyond merely listing their individual results.

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

Identifying Goals and Research Questions for Meta-Analysis

In providing a taxonomy of literature reviews (see Chapter 1), Cooper (1988, 2009a) identified the goals of a review to be one of the dimensions on which reviews differ. Cooper identified integration (including drawing generalizations, reconciling conflicts, and identifying links between theories or disciplines), criticism, and identification of central issues as general goals of reviewers. Cooper noted that the goal of integration “is so pervasive among reviews that it is difficult to find reviews that do not attempt to synthesize works at some level” (1988, p. 108). This focus on integration is also central to meta-analysis, though you should not forget that there is room for additional goals of critiquing a field of study and identifying key directions for future conceptual, methodological, and empirical work. Although these goals are not central to meta-analysis itself, a good presentation of meta-analytic results will usually inform these issues. After reading all of the literature for a meta-analysis, you certainly should be in a position to offer informed opinions on these issues.

Considering the goal of integration, meta-analyses follow one of two general approaches: combining and comparing studies. Combining studies involves using the effect sizes from primary studies to collectively estimate a typical effect size, or range of effect sizes. You will also typically make inferences about this estimated mean effect size in the form of statistical significance testing and/or confidence intervals. I describe these methods in Chapters 8 and 10. The second approach to integration using meta-analysis is to compare studies. This approach requires the existence of variability (i.e., heterogeneity) of effect sizes across studies, and I describe how you can test for heterogeneity in Chapter 8. If the studies in your meta-analysis are heterogeneous, then the goal of comparison motivates you to evaluate whether effect sizes found in studies systematically differ depending on coded study characteristics (Chapter 4) through meta-analytic moderator analyses (Chapter 9).
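The combining approach, together with the heterogeneity statistic that motivates comparison, can be sketched as follows. This is a minimal illustration under assumptions of my own choosing, not the method of Chapters 8 to 10 (which are not reproduced here): correlations are converted to Fisher's z, each study is weighted by n - 3 (the inverse of the sampling variance of z), and a fixed-effect model is assumed. The function name and the example correlations and sample sizes are hypothetical.

```python
import math


def combine_correlations(rs, ns):
    """Fixed-effect mean correlation, 95% confidence interval, and Q statistic."""
    zs = [math.atanh(r) for r in rs]   # Fisher's z transformation of each r
    ws = [n - 3 for n in ns]           # inverse of Var(z) = 1 / (n - 3)
    total_w = sum(ws)
    mean_z = sum(w * z for w, z in zip(ws, zs)) / total_w
    se = math.sqrt(1.0 / total_w)      # standard error of the mean z
    q = sum(w * (z - mean_z) ** 2 for w, z in zip(ws, zs))
    return {
        "mean_r": math.tanh(mean_z),   # back-transform to the r metric
        "ci_95": (math.tanh(mean_z - 1.96 * se), math.tanh(mean_z + 1.96 * se)),
        "Q": q,                        # compared against chi-square with df below
        "df": len(rs) - 1,
    }


# Hypothetical correlations and sample sizes from four studies.
result = combine_correlations(rs=[0.25, 0.40, 0.10, 0.33], ns=[80, 120, 60, 200])
print(result)
```

A Q value that is large relative to its degrees of freedom suggests heterogeneity, which is the condition under which comparing studies on coded characteristics becomes worthwhile.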

We might think of combination and comparison as the “hows” of meta-analysis; if so, we still need to consider the “whats” of meta-analysis. The goal of meta-analytic combination is to identify the average effect sizes, and meta-analytic comparison evaluates associations between these effect sizes and study characteristics. The common component of both is the focus on effect sizes, which represent the “whats” of meta-analysis. Although many different types of effect sizes exist, most represent associations between two variables (Chapter 5; see Chapter 7 for a broader consideration). Despite this simplicity, the methodology under which these two-variable associations were obtained is critically important in determining the types of research questions that can be answered in both primary analysis and meta-analysis. Concurrent associations from naturalistic studies inform only the degree to which the two variables co-occur. Across-time associations from longitudinal studies (especially those controlling for initial levels of the presumed outcome) can inform temporal primacy, as an imperfect approximation of causal relations. Associations from experimental studies (e.g., association between group random assignment and outcome) can inform causality to the extent that designs eliminate threats to internal validity. Each of these types of associations is represented as an effect size in the same way in a meta-analysis, but they obviously have different implications for the phenomenon under consideration. It is also worth noting here that a variety of other effect sizes index very different “whats,” including means, proportions, scale reliabilities, and longitudinal change scores; these possibilities are less commonly used but represent the range of effect sizes that can be used in meta-analysis (see Chapter 7).

Crossing the “hows” (i.e., combination and comparison) with the “whats” (i.e., effect sizes representing associations from concurrent naturalistic, longitudinal naturalistic, quasi-experimental, and experimental designs, as well as the variety of less commonly used effect sizes) suggests the wide range of research questions that can be answered through meta-analysis. For example, you might combine correlations between X and Y from concurrent naturalistic studies to identify the best estimate of the strength of this association. Alternatively, you might combine associations between a particular form of treatment (as a two-group comparison of receiving versus not receiving it) and a particular outcome, obtained from internally valid experimental designs, to draw conclusions about how strongly the treatment causes improvement in functioning. In terms of comparison, you might evaluate the extent to which X predicts later Y in longitudinal studies of different duration in order to evaluate the time frame over which prediction (and possibly causal influence) is strongest. Finally, you might compare the reliabilities of a particular scale across studies using different types of samples to determine how useful this scale is across populations. Although I could give countless other examples, I suspect that these few illustrate the types of research questions that can be answered through meta-analysis. Of course, the particular questions that are of interest to you are going to come from your own expertise with the topic; but considering the possible crossings between the “hows” (combination and comparison) and the “whats” (various types of effect sizes) offers a useful way to consider the possibilities.

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

The Limits of Primary Research and the Limits of Meta-Analytic Synthesis

Perhaps no statement is more true, and humbling, than this offered as the opening of Harris Cooper’s editorial in Psychological Bulletin (and likely stated in similar words by many others): “Scientists have yet to conduct the flawless experiment” (Cooper, 2003, p. 3). I would extend this conclusion further to point out that no scientist has yet conducted a flawless study, and even further by stating that no meta-analyst has yet performed a flawless review. Each approach to empirical research, and indeed each application of such approaches within a particular field of inquiry, has certain limits to the contributions it can make to our understanding. Although full consideration of all of the potential threats to drawing conclusions from empirical research is beyond the scope of this section, I next highlight a few that I think are most useful in framing consideration of the most salient limits of primary research and meta-analysis: those of study design, sampling, methodological artifacts, and statistical power.

1. Limits of Study Design

Experimental designs allow inferences of causality but may be of questionable ecological validity. Certain features of the design of experimental (and quasi-experimental) studies dictate the extent to which conclusions are valid (see Shadish, Cook, & Campbell, 2002). Naturalistic (a.k.a. correlational) designs are often advantageous in providing better ecological validity than experimental designs and are often useful when variables of interest cannot, or cannot ethically, be manipulated. However, naturalistic designs cannot answer questions of causality, even in longitudinal studies that represent the best nonexperimental attempts to do so (see, e.g., Little, Card, Preacher, & McConnell, 2009).

Whatever limits due to study design exist within a primary study (e.g., problems of internal validity in suboptimally designed experiments, ambiguity in causal influence in naturalistic designs) will also exist in a meta-analysis of those types of studies. For example, meta-analytically combining experimental studies that all have a particular threat to internal validity (e.g., absence of double-blind procedures in a medication trial) will yield conclusions that also suffer this threat. Similarly, meta-analysis of concurrent correlations from naturalistic studies will only tell you about the association between X and Y, not about the causal relation between these constructs. In short, limits to the design that are consistent across primary studies included in a meta-analysis will also serve as limits to the conclusions of the meta-analysis.

2. Limits of Sampling

Primary studies are also limited in that researchers can only generalize the results to populations represented by the sample. Findings from studies using samples homogeneous with respect to certain characteristics (e.g., gender, ethnicity, socioeconomic status, age, settings from which the participants are sampled) can only inform understanding of populations with characteristics like the sample. For example, a study sampling predominantly White, middle- and upper-class, male college students (primarily between 18 and 22 years of age) in the United States cannot draw conclusions about individuals who are ethnic minorities, of lower socioeconomic status, female, of a different age range, not attending college, and/or not living in the United States.

These limits of generalizability are well known, yet widespread, in much social science research (e.g., see Graham, 1992, for a survey of ethnic and socioeconomic homogeneity in psychological research). One feature of a well-designed primary study is to sample intentionally a heterogeneous group of participants in terms of salient characteristics, especially those about which it is reasonable to expect findings potentially to differ, and to evaluate these factors as potential moderators (qualifiers) of the findings. Obtaining a heterogeneous sample is difficult, however, in that the researcher must typically obtain a larger overall sample, solicit participants from multiple settings (e.g., not just college classrooms) and cultures (e.g., not just in one region or country), and ensure that the methods and measures are appropriate for all participants. The reality is that few if any single studies can sample the wide range of potentially relevant characteristics of the population about which we probably wish to draw conclusions.

These same issues of sample generalizability limit conclusions that we can draw from the results of meta-analyses. If all primary studies in your meta-analysis sample a similar homogeneous set of participants, then you should only generalize the results of meta-analytically combining these results to that homogeneous population. However, if you are able to obtain a collection of primary studies that are diverse in terms of sample characteristics, even if the studies themselves are individually homogeneous, then you can both (1) evaluate potential differences in results based on sample characteristics (through moderator analyses; see Chapter 9) and (2) draw conclusions that are generalizable to this more heterogeneous population. In this way, meta-analytic reviews have the potential to draw more generalizable conclusions than are often tractable within a primary study, provided you are able to obtain studies collectively consisting of a diverse range of participants. However, you should keep in mind the limits of the samples of studies included in your meta-analysis and be cautious not to extrapolate beyond these limits. Most meta-analyses contain some limits, whether intentional (specified by inclusion/exclusion criteria; see Chapter 3) or unintentional (required by the absence or unavailability of primary research with some populations, e.g., studies written in a language that you do not know), that restrict the generalizability of conclusions.

3. Limits of Methodological Artifacts

Researchers planning and conducting primary studies do not intentionally impose methodological artifacts, but these often arise. These artifacts, described in detail in Chapter 6, can arise from imperfect measures (imperfect reliability or validity), sampling homogeneity (resulting in direct or indirect restriction of ranges among variables of interest), or poor data-analytic choices (e.g., artificial dichotomization of continuous variables). These artifacts typically attenuate, or diminish, the effect sizes estimated in primary studies. This attenuation leads to lower statistical power (higher rates of type II error) and underestimation of the magnitude (and potentially the importance) of the results.

These artifacts can be corrected in the sense that it is possible to estimate the magnitude of “true” effect sizes disattenuated for these artifacts. In primary studies, this is rarely done, with the exception of those using latent variable analyses to correct for unreliability (see, e.g., Kline, 2005). This correction for attenuation of effect sizes is more common in meta-analyses, though the practice is somewhat controversial and varies across disciplines (see Chapter 6). Whether or not you correct for certain artifacts in your own meta-analyses should guide the extent to which you view these artifacts as potential limits (by attenuating your effect sizes and potentially introducing less meaningful heterogeneity).
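As an illustration of what such a correction looks like, the sketch below applies the classic correction for attenuation due to unreliability, in which the observed correlation is divided by the square root of the product of the two measures’ reliabilities. This is only one of the artifact corrections that exist, and I am assuming it here for illustration; the function name `disattenuate` and the reliability values are hypothetical.

```python
import math


def disattenuate(r_observed, reliability_x, reliability_y):
    """Estimate the correlation between perfectly reliable measures.

    Classic correction for unreliability: r_true = r_obs / sqrt(r_xx * r_yy).
    Reliabilities must lie in (0, 1]; lower reliability means a larger correction.
    """
    if not (0 < reliability_x <= 1 and 0 < reliability_y <= 1):
        raise ValueError("reliabilities must be in the interval (0, 1]")
    return r_observed / math.sqrt(reliability_x * reliability_y)


# Hypothetical study: observed r = .30, both measures with reliability .80
print(disattenuate(0.30, 0.80, 0.80))
```

With perfectly reliable measures (both reliabilities equal to 1) the function returns the observed correlation unchanged, which is a quick sanity check on the direction of the correction.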

4. Limits of Statistical Power

Statistical power refers to the probability of concluding that an effect exists when it truly does. The complement of statistical power is the type II error rate, or the probability of failing to conclude that an effect exists when it does. Although this concept of statistical power is rooted in the Null Hypothesis Significance Testing framework (which is problematic, as I describe in Chapter 5), statistical power is also relevant in other frameworks such as reliance on point estimates and confidence intervals in describing results (i.e., low statistical power leads to large confidence intervals).

The statistical power of a primary study depends on several factors, including the type I error rate (i.e., α) set by the researcher, the type of analysis performed, and the magnitude of the effect size within the population. However, because these other factors are typically out of the researcher’s control, statistical power is dictated primarily by sample size, where larger sample sizes yield greater statistical power. When planning primary studies, researchers should conduct power analyses to guide the number of participants needed to have a certain probability (often .80) of detecting an effect size of a certain magnitude (for details see, e.g., Cohen, 1969; Kraemer & Thiemann, 1987; Murphy & Myors, 2004).
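The dependence of power on sample size can be sketched for the simple case of testing whether a correlation differs from zero. The sketch below assumes a two-tailed test and the normal approximation based on Fisher’s z transformation, so its values are approximate; the function name `power_correlation` and the example values are hypothetical and are not taken from the references above.

```python
import math
from statistics import NormalDist


def power_correlation(rho, n, alpha=0.05):
    """Approximate power of a two-tailed test of H0: rho = 0.

    Uses Fisher's z transformation, whose standard error is 1 / sqrt(n - 3).
    """
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    # Expected test statistic under the assumed population correlation
    noncentrality = math.atanh(rho) * math.sqrt(n - 3)
    # Probability of landing in either rejection region
    return normal.cdf(noncentrality - z_crit) + normal.cdf(-noncentrality - z_crit)


# Hypothetical planning question: how does power grow with n when rho = .30?
for n in (30, 85, 200):
    print(n, round(power_correlation(0.30, n), 3))
```

Under these assumptions a sample of roughly 85 participants gives power near .80 for a population correlation of .30, whereas a sample of 30 falls well short of that target.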

Despite the potential for power analysis to guide study design, there are many instances when primary studies are underpowered. This might occur because the power analysis was based on an unrealistically high expectation of population effect size, because it was not possible to obtain enough participants due to limited resources or scarcity of appropriate participants (e.g., when studying individuals with rare conditions), or because the researcher failed to perform a power analysis in the first place. In short, although inadequate statistical power is not a problem inherent to primary research, it is plausible that in many fields a large number of existing studies do not have adequate statistical power to detect what might be considered a meaningful magnitude of effect (see, e.g., Halpern, Karlawish, & Berlin, 2002; Maxwell, 2004).

When a field contains many studies that fail to demonstrate an effect because they have inadequate statistical power, there is the danger that readers of this literature will conclude that an effect does not exist (or that it is weak or inconsistent). In these situations, a meta-analysis can be useful in combining the results of numerous underpowered studies within a single analysis that has greater statistical power. Although meta-analyses can themselves have inadequate statistical power, they will generally have greater statistical power than the primary studies comprising them (Cohn & Becker, 2003). For this reason, meta-analyses are generally less impacted by inadequate statistical power than are primary studies (but see Hedges & Pigott, 2001, 2004 for discussion of underpowered meta-analyses).
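The gain in power from combining studies can be illustrated with a rough sketch. It assumes a fixed-effect model in which every study estimates the same population correlation, and it uses the Fisher z approximation, so the pooled standard error is 1 divided by the square root of the summed (n − 3) terms. The function name `power_pooled_correlation` and the scenario of ten small studies are hypothetical.

```python
import math
from statistics import NormalDist


def power_pooled_correlation(rho, sample_sizes, alpha=0.05):
    """Approximate two-tailed power for the mean correlation across studies.

    Assumes a fixed-effect model; passing a single sample size gives the
    power of one primary study under the same approximation.
    """
    normal = NormalDist()
    z_crit = normal.inv_cdf(1 - alpha / 2)
    pooled_se = 1.0 / math.sqrt(sum(n - 3 for n in sample_sizes))
    noncentrality = math.atanh(rho) / pooled_se
    return normal.cdf(noncentrality - z_crit) + normal.cdf(-noncentrality - z_crit)


# Hypothetical literature: ten studies of 30 participants each, rho = .20
single = power_pooled_correlation(0.20, [30])
pooled = power_pooled_correlation(0.20, [30] * 10)
print(round(single, 3), round(pooled, 3))
```

In this scenario each study alone would detect the effect less than a quarter of the time, while the combined analysis would detect it more than nine times in ten. With substantial between-study heterogeneity the gain would be smaller than this fixed-effect sketch suggests.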

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

Critiques of Meta-Analysis: When Are They Valid and When Are They Not?

As I outlined in Chapter 1, attention to meta-analysis emerged in large part with the attention received by Smith and Glass’s (1977) meta-analysis of psychotherapy research (though others developed techniques of meta-analysis at about the same time; e.g., Rosenthal & Rubin, 1978; Schmidt & Hunter, 1977). The controversial nature of this meta-analysis drew criticisms, both of the particular paper and of the process of meta-analysis itself. Although these criticisms were likely motivated more by dissatisfaction with the results than the approach, there has been some persistence of these criticisms toward meta-analysis since its early years. The result of this extensive criticism, and efforts to address these critiques, is that meta-analysis as a scientific process of reviewing empirical literature has a deeper appreciation of its own limits; so this criticism was in the end fruitful.

In the remainder of this section, I review some of the most common criticisms of meta-analysis (see also, e.g., Rosenthal & DiMatteo, 2001; Sharpe, 1997). I also attempt to provide an objective consideration of the extent to which, and the conditions under which, these criticisms are valid. At the end of this section, I place these criticisms in perspective by noting that many apply to any literature review.

1. Amount of Expertise Needed to Conduct and Understand

Although not necessarily a critique, I think it is important first to address a common misperception I encounter: that meta-analysis requires extensive statistical expertise to conduct. Although very advanced, complex methods exist for various aspects of meta-analysis, most meta-analyses do not require especially complicated analyses. The techniques might seem rather obscure or complex when one is first reading meta-analyses; I believe that this is primarily because most of us received considerable training in primary analysis during our careers, but have little if any exposure to meta-analysis. However, performing a basic yet sound meta-analysis requires little more expertise than that typically acquired in a research-oriented graduate social science program, such as the ability to compute means, variances, and perhaps perform an analysis of variance (ANOVA) or regression analysis, albeit with some small twists in terms of weighting and interpretation.

Although I do not view the statistical expertise needed to conduct a sound meta-analysis as especially high, I do feel obligated to make clear that meta-analyses are not easy. The time required to search adequately for and code studies is substantial (see Chapters 3-7). The analyses, though not requiring an especially high level of statistical complexity, must be performed with care and by someone with the basic skills of meta-analysis (such as provided in Chapters 8-11). Finally, the reporting of a meta-analysis can be especially difficult given that you are often trying to make broad, authoritative statements about a field (see Chapters 13-14). My intention is not to scare anyone away from performing a meta-analysis, but I think it is important to recognize some of the difficulty in this process. However, needing a large amount of statistical expertise is not one of these difficulties for most meta-analyses you will want to perform.

2. Quantitative Analysis May Lack “Qualitative Finesse” of Evaluating Literature

Some complain that meta-analyses lack the “qualitative finesse” of a narrative review, presumably meaning that they fail to make creative, nuanced conclusions about the literature. I understand this critique, and I agree that some meta-analysts can get too caught up in the analyses themselves at the expense of carefully considering the studies. However, this tendency is certainly not inherent to meta-analysis, and there is nothing to preclude the meta-analyst from engaging in this careful consideration.

To place this critique in perspective, I think it is useful to consider the general approaches of qualitative and quantitative analysis in primary research. Qualitative research undoubtedly provides rich, nuanced information that has contributed substantially to understanding in nearly all areas of social sciences. At the same time, scientific progress would be limited if we did not also rely on quantitative methods and on methods of analyzing these quantitative data. Few scientists would collect quantifiable data from dozens or hundreds of individuals and then analyze them by simply looking at the data and “somehow” drawing conclusions about central tendency, variability, and co-occurrences of individual differences. In sum, there is substantial advantage to conducting primary research using qualitative analyses, quantitative analyses, or a combination of both.

Extending this value of qualitative and quantitative analyses in primary research to the process of research synthesis, I do not see careful, nuanced consideration of the literature and meta-analytic techniques as mutually exclusive processes. Instead, I recommend that you rely on the advantages of meta-analysis in synthesizing vast amounts of information and aiding in drawing probabilistic inferential conclusions, but also use your knowledge of your field where these quantitative analyses fall short. Furthermore, meta-analytic techniques provide results that are statistically justifiable (e.g., there is an effect size of a certain range of magnitude; some types of studies provide larger effect sizes than other types), but it is up to you to connect these findings to relevant theories in your field. In short, a good meta-analytic review requires both quantitative methodology and “qualitative finesse.”

3. The “Apples and Oranges” Problem

The critique known as the “apples and oranges problem” was first raised against Smith and Glass’s (1977) meta-analytic combination of studies using diverse methods of psychotherapy in treating a wide range of problems among diverse samples of people (see Sharpe, 1997). Critics charge that including such a diverse range of studies in a meta-analysis yields meaningless results.

I believe that this critique is applicable only to the extent that the meta-analyst wants to draw conclusions about apples or oranges; if you want to draw conclusions only about a narrowly defined population of studies (e.g., apples), then it is problematic to include studies from a different population (e.g., oranges). However, if you wish to make conclusions about a broad population of studies, such as all psychotherapy studies of all psychological disorders, then it is appropriate to combine a diverse range of studies. To extend the analogy: combining apples and oranges is appropriate if you want to draw conclusions about fruit; in fact, if you want to draw conclusions about fruit you should also include limes, bananas, figs, and berries! Studies are rarely identical replications of one another, so including studies that are diverse in methodology, measures, and sample within your meta-analysis has the advantage of improving the generalizability of your conclusions (Rosenthal & DiMatteo, 2001). So, the apples and oranges critique is not so much a critique about meta-analysis; rather, it just targets whether or not the meta-analyst has considered and sampled studies from an appropriate level of analysis.

In weighing this critique, it is useful to recognize that moderator analysis allows a meta-analysis to address multiple levels of analysis (see Chapter 9). Evoking the fruit analogy one last time: A meta-analysis can include studies of all fruit and report results about fruit, but then systematically compare apples, oranges, and other fruit through moderator analyses (i.e., do results involving apples and oranges differ?). Additional moderator analyses can go further by comparing studies involving, for example, McIntosh, Delicious, Fuji, and Granny Smith apples. The possibility of including diverse studies in your meta-analysis and then systematically comparing these studies through moderator analyses means that the apples and oranges problem is easily addressable.
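A categorical moderator analysis of this kind can be sketched as a partition of heterogeneity. The sketch below assumes fixed-effect, inverse-variance weighting and splits the total heterogeneity (Q) into a within-group part and a between-group part; a large between-group Q suggests that the groups of studies differ. The function names and the “apples” and “oranges” data are hypothetical illustrations.

```python
from collections import defaultdict


def weighted_mean_and_q(effects, variances):
    """Inverse-variance weighted mean and heterogeneity statistic Q."""
    weights = [1.0 / v for v in variances]
    mean_es = sum(w * es for w, es in zip(weights, effects)) / sum(weights)
    q = sum(w * (es - mean_es) ** 2 for w, es in zip(weights, effects))
    return mean_es, q


def subgroup_moderator_test(effects, variances, groups):
    """Partition total Q into within-group and between-group components."""
    _, q_total = weighted_mean_and_q(effects, variances)
    by_group = defaultdict(lambda: ([], []))
    for es, var, group in zip(effects, variances, groups):
        by_group[group][0].append(es)
        by_group[group][1].append(var)
    group_means = {}
    q_within = 0.0
    for group, (group_effects, group_variances) in by_group.items():
        group_means[group], q_group = weighted_mean_and_q(group_effects, group_variances)
        q_within += q_group
    return {
        "group_means": group_means,
        "Q_between": q_total - q_within,   # heterogeneity explained by the moderator
        "df_between": len(by_group) - 1,
    }


# Hypothetical studies: do effect sizes differ between "apples" and "oranges"?
result = subgroup_moderator_test(
    effects=[0.1, 0.1, 0.5, 0.5],
    variances=[0.01, 0.01, 0.01, 0.01],
    groups=["apples", "apples", "oranges", "oranges"],
)
print(result)
```

The between-group Q is evaluated against a chi-square distribution with the returned degrees of freedom (for one degree of freedom, the .05 critical value is about 3.84). The same function can be rerun within a single group, with a finer grouping variable, to compare varieties of apples.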

4. The “File Drawer” Problem

The “file drawer” problem is based on the possibility that the studies included in a meta-analysis are not representative of those that have been conducted because studies that fail to find significant or expected results are hidden away in researchers’ file drawers. Because I devote an entire chapter to this problem, also called publication bias, later in this book (Chapter 11), I do not treat this threat in detail here. Instead, I briefly note that this is indeed a threat to meta-analysis, as it is to any literature review. Fortunately, meta-analyses typically use systematic and thorough methods of obtaining studies (Chapter 3) that minimize this threat, and meta-analytic techniques for detecting and potentially correcting for this bias exist (Chapter 11).

5. Garbage In, Garbage Out

The critique of “garbage in, garbage out” is that the meta-analysis of poor quality primary studies only results in conclusions of poor quality. In many respects this critique is a valid threat, though there are some exceptions. First, we can consider what “poor quality” (i.e., garbage) really means. If studies are described as being of poor quality because they are underpowered (i.e., have low statistical power to detect the hypothesized effect), then meta-analysis can overcome this limitation by aggregating findings from multiple underpowered studies to produce a single analysis that is more powerful. If studies are considered to be of poor quality because they contain artifacts such as using measures that are less reliable or less valid than is desired, or if the primary study authors used certain inappropriate analytic techniques (e.g., artificially dichotomizing continuous variables), then methods of correcting effect sizes might help overcome these problems (see Chapter 6). For these types of “garbage” then, meta-analyses might be able to produce high-quality findings.

There are other types of problems of study quality that meta-analyses cannot overcome. For instance, if all primary studies evaluating a particular treatment fail to assign participants randomly to conditions, do not use double-blind procedures, or the like, then these threats to internal validity in the primary studies will remain when you combine the results across studies in a meta-analysis. Similarly, if the primary studies included in a meta-analysis are all concurrent naturalistic designs, then there is no way that meta-analytic combination of these results can inform causality. In short, the design limitations that consistently occur in the primary studies will also be limitations when you meta-analytically combine these studies.

Given this threat, some have recommended that meta-analysts exclude studies that are of poor study quality, however that might be defined (see Chapter 4). Although this exclusion does ensure that the conclusions you reach have the same advantages afforded by good study designs as are available in the primary studies, I think that uncritically following this advice is misguided for three reasons. First, for some research questions, there may be so few primary studies that meet strict criteria for “quality” that it is not very informative to combine or compare them; however, there may be many more studies that contain some methodological flaws. In these same situations, it seems that the progression of knowledge is unnecessarily delayed by stubborn unwillingness to consider all available evidence. I believe that most fields benefit more from an imperfect meta-analysis than no meta-analysis at all, provided that you appropriately describe the limits of the conclusions of your review. A second reason I think that dogmatically excluding poor quality studies is a poor choice is that this practice assumes that certain imperfections of primary studies result in biased effects, yet does not test this assumption. This leads to the third reason: Meta-analyses can evaluate whether systematic differences in effect sizes emerge from certain methodological features. If you code the relevant features of primary studies that are considered “quality” within your particular field (see Chapter 4), you can then evaluate whether these features systematically relate to differences in the results (effect sizes) found among studies through moderator analyses (Chapter 9). Having done this, you can (1) make statements about how the differences in specific aspects of quality impact the effect sizes that are found, which can guide future design of primary studies; (2) where differences are found, limit conclusions to the types of studies that you believe produce the most valid results; and (3) where differences are not found, have the advantage of including all relevant studies (versus a priori excluding a potentially large number of studies).

6. Are These Problems Relevant Only to Quantitative Reviews?

Although these critiques were raised primarily against the early meta-analyses and have since been raised as challenges primarily against meta-analytic (i.e., quantitative) reviews, most apply to all types of research syntheses. Aside from the first two I have reviewed (meta-analyses requiring extensive statistical expertise and lacking in finesse), which I have clarified as being generally misconceptions, the remainder can be considered as threats to all types of research syntheses (including narrative research reviews) and often all types of literature reviews (see Figure 1.1). However, because these critiques have most often been applied toward meta-analysis, we have arguably considered these threats more carefully than have scholars performing other types of literature reviews. It is useful to consider how each of the critiques I described above threatens both quantitative and other literature reviews (considering primarily the narrative research review), and how each type of review typically manages the problem.

The “apples and oranges” problem (i.e., inclusion of diverse types of studies within a review) is potentially threatening to both narrative and meta-analytic review. However, my impression is that meta-analyses more commonly attempt to draw generalized conclusions across diverse types of primary studies, whereas narrative reviews more often draw fragmented conclusions of the form “These types of studies find this. These other types of studies find this.” If practices stopped there, then the apples and oranges problem could more fairly be applied to meta-analyses than other reviews. However, meta-analysts usually perform moderator analyses to compare the diverse types of studies, and narrative reviews often try to draw synthesized conclusions about the diverse types of studies. Given that both types of reviews typically attempt to draw conclusions at multiple levels (i.e., about fruits in general and about apples and oranges in particular), the critique of focusing on the “wrong” level of generalization (if there is such a thing, versus just focusing on a different level of generalization than another scholar might choose) is equally applicable to both. However, both the process of drawing generalizations across diverse studies and the process of comparing diverse types of studies are more objective and lead to more accurate conclusions (Cooper & Rosenthal, 1980) when performed using meta-analytic versus narrative review techniques.

The “file drawer” problem (the threat of unpublished studies not being included in a review, and the resultant available studies being a biased representation of the literature) is a threat to all attempts to draw conclusions from this literature. In other words, if the available literature is biased, then this bias affects any attempt to draw conclusions from the literature, narrative or meta-analytic. However, narrative reviews almost never consider this threat, whereas meta-analytic reviews routinely consider it and often take steps to avoid it and/or evaluate it (indeed, there exists an entire book on this topic; see Rothstein, Sutton, & Borenstein, 2005b). Meta-analysts typically make greater efforts to systematically search for unpublished literature (and studies published in more obscure sources) than do those preparing narrative reviews (Chapter 3). Meta-analysts also have the ability to detect publication bias through comparison of published and available unpublished studies, funnel plots, or regression analyses, as well as the means to evaluate the plausibility of the file drawer threat through failsafe numbers (see Chapter 11). All of these capabilities are absent in the narrative review.
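To give a sense of what a failsafe number is, the sketch below computes one well-known version, Rosenthal’s failsafe N: the number of additional studies averaging a null result (Z = 0) that would have to exist in file drawers to bring the combined result down to the one-tailed significance threshold. I am assuming this particular formulation for illustration, and it may differ from the variants treated in Chapter 11; the function name `failsafe_n` and the Z values are hypothetical.

```python
from statistics import NormalDist


def failsafe_n(z_scores, alpha=0.05):
    """Rosenthal-style failsafe N for a set of study-level Z statistics.

    Solves sum(Z) / sqrt(k + N) = z_crit for N, using a one-tailed z_crit.
    """
    k = len(z_scores)
    z_crit = NormalDist().inv_cdf(1 - alpha)   # about 1.645 for alpha = .05
    return (sum(z_scores) ** 2) / (z_crit ** 2) - k


# Hypothetical review: five studies, each with Z = 2.0
print(round(failsafe_n([2.0] * 5), 1))
```

A failsafe number that is large relative to the number of included studies makes the file drawer explanation less plausible; a small one signals that a handful of missing null results could overturn the conclusion.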

Finally, the problem of “garbage in, garbage out”—that the inclusion of poor quality studies in a review leads to poor quality results from the review—is a threat to both narrative and meta-analytic reviews. However, I have described ways that you can overcome some problems of the primary studies in meta-analysis (low power, presence of methodological artifacts), as well as systematically evaluate the presumed impact of study quality on results, that are not options in a narrative review.

In sum, the problems that might threaten the results of a meta-analytic review are also threats to other types (e.g., narrative) of reviews, even though they are less commonly considered in other contexts. Moreover, meta-analytic techniques have been developed that partially or fully address these problems; parallel techniques for narrative reviews either do not exist or are rarely considered. For these reasons, although you should be mindful of these potential threats when performing a meta-analytic review, they are not limited to meta-analytic reviews and are often less threatening there than in other types of research reviews.

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

Practical Matters: The Reciprocal Relation between Planning and Conducting a Meta-Analysis

My placement of this chapter on identifying research questions for meta-analysis before chapters on actually performing a meta-analysis is meant to correspond to the order you would follow in approaching this endeavor. As with primary research, you want to know your goals and research questions, as well as potential limitations and critiques, of your meta-analysis before you begin.

However, such an ordering is somewhat artificial in that it misses the often reciprocal relation between planning and conducting a meta-analytic review. At a minimum, someone planning a meta-analysis almost certainly has read empirical studies in the area that would likely be included in the review, and conclusions that the reader takes from these studies will undoubtedly influence the type of questions asked when planning the meta-analysis.

Beyond this obvious example, I think that much of the process of conducting a meta-analysis is less linear than is typically presented, and more of an iterative, back-and-forth process among the various steps of planning, searching the literature, coding studies, analyzing the data, and writing the results. I do not view this reality as problematic; although we should avoid the practice of “HARKing” (Hypothesizing After Results are Known; Kerr, 1998), we do learn a lot during the process of conducting the meta-analysis that can refine our initial questions. Next, I briefly describe how each of the major steps of searching the literature, coding studies, analyzing the data, and writing the results can provide reasons to revise our initial plans of the meta-analysis.

As I discuss in detail in Chapter 3, an important step in meta-analysis is specifying inclusion/exclusion criteria (i.e., what type of studies will be included in the review) and searching for relevant literature. This process should be guided by the research questions you wish to answer, but the process might also change your research questions. For example, finding that there is little relevant literature to inform your meta-analysis research questions (either too few studies to obtain a good estimate of the overall effect size or too little variation over levels of moderators of interest) might force you to broaden your questions to include more studies. Conversely, finding that so many studies are relevant to your research question that it is not practical to include all of them might cause you to narrow your research question (e.g., to a more limited sample, type of measure, and/or type of intervention).

Research questions can also be modified after you begin coding studies (see Chapters 4-7). Not only might your careful reading of the studies lead you to new or modified research questions, but also the more formal process of coding might necessitate changes in your research questions. If studies do not provide sufficient information to compute effect sizes consistently, and it is not possible to obtain this information from study authors, then it may be necessary to abandon or modify your original research questions. If your research questions involve comparing studies (i.e., moderator analyses), you may have to alter this research question if the studies do not provide adequate variability or coverage of certain characteristics. For example, if you were interested in evaluating whether an effect size differs across ethnic groups, but during the coding of studies found that most studies sampled only a particular ethnic group, then you would not have adequate variability across the studies and would have to abandon this particular research question (or else modify it in some way to make it more tractable).

Analyzing the data (see Chapters 8-12) is probably where the most modifications to original study questions will occur. Although you should thoroughly investigate your original research questions, and you should avoid entirely exploratory “fishing expeditions,” you will invariably form new research questions during the data analysis phase. Some of these new questions will be formed as you learn answers to your original questions (e.g., “Having found this, I wonder if . . . ?”), whereas other questions will come from simply looking at the data (e.g., thinking about why a particular study, or set of studies, has discrepant effect sizes). Although both approaches are post hoc, the latter is certainly more exploratory (and therefore more likely to capitalize on chance) than the former. However, both approaches to creating new research questions are valuable, as long as you are upfront about their source when presenting and drawing conclusions from your meta-analysis (see Chapter 13).

As is true of analyzing the data, the process of writing your results may lead to refinement of research questions or even the development of new ones. Furthermore, the process of presenting your findings to colleagues (through either conference presentations or the peer review process) is likely to generate further refinement and creation of research questions.

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

Developing and Articulating a Sampling Frame for Meta-Analysis

Given that meta-analysis uses the individual study as its unit of analysis, it is useful to think of your meta-analysis as consisting of a sample of studies, just as primary analyses sample people or other units (e.g., families, businesses). In primary analyses, we typically wish to make inferences to a larger population that is represented by the sampled individuals; in meta-analysis, we typically wish to make inferences to a larger population of possible studies from the sample of studies included in our review. In both cases, we want our sample to be representative of this larger population, as opposed to a biased (nonrepresentative) set.

To illustrate the importance of obtaining an unbiased sample of studies, we can consider the threat of publication bias (discussed in further detail in Chapter 11). The top of Figure 3.2 displays a hypothetical population of effect sizes, with the horizontal (x) axis representing the effect sizes obtained in studies of this population and the vertical (y) axis representing the frequency with which studies yield this effect size. We see that the mean effect size in this population is somewhere around 0.20 and that there is a certain amount of deviation around this mean due to either sampling fluctuation or unspecified (random) differences. The bottom part of this figure shows the distribution of a biased sample of studies drawn from this population. I have used arrows of different width to represent the likelihood of studies from the population being included in this sample. The arrows to the right are thick to represent studies with large effect sizes being very likely to be included in the sample (i.e., very likely to be found in a search), whereas the arrows to the left are thin to represent studies with small effect sizes being very unlikely to be included in the sample (i.e., likely not found in a search). We can see that this differential likelihood of inclusion by effect sizes results in a biased sample. If you were to meta-analyze studies from this sample, you would find a mean effect size somewhere around 0.30 rather than the 0.20 found in the population. Thus, analysis of this biased sample of studies leads to biased results in a meta-analysis.
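The mechanism described above can be mimicked with a small simulation. This is only a sketch under assumptions that are not in the text: the population is taken to be normal with a mean of 0.20 and a standard deviation of 0.15, and the chance of a study being found is assumed to rise linearly with its effect size. The function name and all parameter values are hypothetical, and the output is not meant to reproduce Figure 3.2.

```python
import random
import statistics

def simulate_biased_search(n_studies=20000, true_mean=0.20, sd=0.15, seed=1):
    """Draw a population of study effect sizes, then 'find' each study with a
    probability that rises with its effect size (a stand-in for publication bias)."""
    rng = random.Random(seed)
    population = [rng.gauss(true_mean, sd) for _ in range(n_studies)]

    lo, hi = min(population), max(population)
    found = []
    for es in population:
        # Assumed inclusion rule: probability grows linearly from 0.05
        # (smallest effect) to 0.95 (largest effect).
        p_found = 0.05 + 0.90 * (es - lo) / (hi - lo)
        if rng.random() < p_found:
            found.append(es)
    return statistics.mean(population), statistics.mean(found)

pop_mean, sample_mean = simulate_biased_search()
print(f"population mean effect size:    {pop_mean:.3f}")
print(f"biased-sample mean effect size: {sample_mean:.3f}")
```

Under these assumptions the mean of the retrieved studies comes out higher than the population mean, which is the direction of bias the figure illustrates; the size of the gap depends entirely on the assumed inclusion rule.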

The goal of searching and retrieving the literature for a meta-analytic review is to obtain a representative, unbiased collection of studies from which inferences can be made about a larger population of studies. Meta-analyses differ from primary analyses in that your goal is typically to obtain all of the studies comprising this population as it currently exists. Whether or not you are successful in obtaining all available studies (and it is not possible to know with certainty that you have), it is still appropriate to consider this set of studies as a sample, from which you might draw inferences about a larger population including studies you did not locate or studies performed in the future (assuming that these studies are part of the same population as those included in your meta-analysis).

This approach, in which you think of the studies included in your meta-analysis as a sample from a population to which you wish to make inferences, has two important implications. First, this conceptualization properly frames the conclusions you draw from results after completing your meta-analysis; this is important in allowing you to avoid either understating or overstating the generalizability of your findings. Second, and more relevant during the planning stages of your review, this conceptualization should guide your criteria for which type of studies should or should not be included in your meta-analysis, as described next.

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

Inclusion and Exclusion Criteria for Meta-Analysis

The inclusion criteria, and conversely the exclusion criteria, are a set of explicit statements about the features of studies that will or will not (respectively) be included in your meta-analysis. Ideally, you should specify these criteria before searching the literature so that you can then determine whether each study identified in your search should be included in your meta-analysis. Practically speaking, however, you are likely to find studies that are ambiguous given your initial criteria, so you will need to modify these criteria as these unanticipated types of studies arise.

1. The Importance of Clear Criteria

Developing an explicit set of inclusion and exclusion criteria is important for three reasons. First, as I noted earlier, these criteria should reliably guide which studies you will (or will not) include in your meta-analysis. This guidance is especially important if others are assisting in your search. Even if you are conducting the search alone, however, these criteria can reduce subjectivity that might be introduced if the criteria are ambiguous.

The second reason that explicit criteria are important is that these criteria define the population about which you can draw conclusions. A statement of exclusion (i.e., an exclusion criterion) means that your conclusions cannot extend to studies with the excluded characteristic. For example, in the example meta-analysis I will present throughout this book (considering various effects involving relational aggression), my colleagues and I excluded samples with an average age of 18 years or older. It would therefore be inappropriate to attempt to draw any conclusions regarding adults from this meta-analysis. A statement of inclusion (i.e., an inclusion criterion) implies that the population is defined, at least in part, by this criterion. For example, a criterion specifying that included studies must use experimental manipulation with double-blind procedures would mean that the population is of studies with this design (and any other inclusion criteria stated).

The third reason that explicit criteria are important relates to the goal of transparency, which is an important general characteristic to consider when reporting your meta-analysis (see Chapter 13). Here, I mean that your inclusion/exclusion criteria should be so explicit that a reader could, after performing the same searches as you perform, come to the same conclusions regarding which studies should be included in your meta-analysis. To illustrate, imagine that you perform a series of searches that identify 100 studies, and based on your inclusion/exclusion criteria you decide that 60 should be included in your meta-analysis. If another person were to evaluate those same 100 studies using your inclusion/exclusion criteria, he or she should (if your criteria are explicit enough) identify the same 60 studies as appropriate for the review. To achieve this level of transparency in your meta-analysis, it is important to record and report the full set of inclusion/exclusion criteria you used.

2. Potential Inclusion/Exclusion Criteria

The exact inclusion/exclusion criteria you choose for your meta-analysis should be based on the goals of your review (i.e., What type of studies do you want to make conclusions about?) and your knowledge of the field. Nevertheless, there are several common elements that you should consider when developing your inclusion/exclusion criteria (from Lipsey & Wilson, 2001, pp. 18-23):

2.1. Definitions of Constructs of Interest

The most important data in meta-analyses are effect sizes, which typically are some index of an association between X and Y. In any meta-analysis of these effect sizes, it is important to specify criteria involving operational definitions of both constructs X and Y. Although it is tempting for those with expertise in the area to take an “I know it when I see it” approach, this approach is inadequate for the reader and for deciding which studies should be included. One challenge is that the literature often refers to the same (or similar enough) construct by different names (e.g., in the example meta-analysis, the construct I refer to as “relational aggression” is also called “social aggression,” “indirect aggression,” and “covert aggression”). A second challenge is that the literature sometimes refers to different constructs with the same name (e.g., in the example meta-analysis, several studies used a scale of “indirect aggression” that included such aspects as diffuse anger and resentment that were inconsistent with the more behavioral definition of interest). By providing a clear operational definition of the constructs of interest, you can avoid ambiguities due to these challenges.

2.2. Sample Characteristics

It is also important to consider the samples used in the primary studies that you will want to include or exclude. Here, numerous possibilities may or may not be relevant to your review, and may or may not appear in the literature you consider. Some basic demographic variables to consider include gender (e.g., Will you include studies sampling only males or only females?), ethnicity (e.g., Will you include only representative samples, or those that sample one ethnic group exclusively?), and age (e.g., Will you include studies sampling infants, toddlers, children, adolescents, young adults, and/or older adults?). It is also worth considering what cultures or nationalities will be included. Even if you place no restrictions on nationality, you will need to exclude reports written in languages you do not know, which likely precludes many studies of samples from many areas of the world. Beyond these examples, you might encounter countless others, such as samples drawn from unique settings (e.g., detention facilities, psychiatric hospitals, bars), selected using atypical screening procedures (e.g., certain personality types), or based on atypical recruitment strategies (e.g., participants navigating to a website). Although it is useful if you can anticipate some of these irregular sample characteristics in advance, many will invariably arise unexpectedly and you will have to deal with these on a case-by-case basis.

2.3. Study Design

A third consideration for inclusion/exclusion criteria for almost every meta-analysis is the type of research design that included studies should have. Some obvious possibilities are to include only experimental, quasi-experimental, longitudinal naturalistic, or concurrent naturalistic designs. Even within these categories, however, there are innumerable possibilities. For example, if you are considering only experimental treatment studies, should you include only those with a certain type of control group, only those using blinded procedures, and so on? Among quasi-experimental studies, are you interested only in between-group comparisons or pre-post designs? Answers to these sorts of questions must come from your knowledge of the field in which you are performing the review, as well as your own goals for the meta-analysis.

2.4. Time Frame

The period of time from which you will draw studies is a consideration that may or may not be relevant to your meta-analysis. By “period of time,” I mean the year in which the primary study was conducted, for which you might use the proxy variable year of publication (or completion, presentation, etc., for unpublished works). For many phenomena, it might be of more interest to include studies from a broad range of time and evaluate historic effects through moderator analyses (i.e., testing whether effect sizes vary regularly across time; see Chapter 9) rather than a priori excluding studies. However, in some situations it may make sense to include only those studies performed within a certain time period. These situations might include when you are only interested in a phenomenon after some historic changes (e.g., correlates of unprotected sex after the AIDS crisis) or when the phenomenon has only existed during a certain period of time (e.g., studies of cyberbullying have only been performed since the popularity of the Internet has increased).

2.5. Publication Type

The reporting format of the studies is another consideration for potential inclusion/exclusion criteria. Although including only published studies is generally considered problematic (due to the high possibility of publication bias; see Chapter 11), it is important to consider what types of reports will be included. Possibilities include dissertations, other unpublished written reports (e.g., reports to funding agencies), conference presentations, or information that the researcher provides you upon request.

2.6. Effect Size Information

Finally, a necessary inclusion criterion is that the studies provide sufficient information to compute an effect size. In most situations, this will be information provided in the written report that allows you directly to compute an effect size (see Chapter 5). However, you should also consider whether you would include studies that provide only enough information to compute a lower-bound estimate (e.g., probability ranges such as p < .05, statements that results were nonsignificant; see Chapter 5). When studies do not report sufficient information to compute effect sizes, you should contact the study authors to request more information; here, a necessary inclusion criterion is that the authors supply this information.
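One way to keep criteria like those above explicit enough that a second screener would reach the same decisions is to write them down as a single screening rule that also records why a study was excluded. The sketch below is purely illustrative: the `Study` fields, the `screen` function, and the specific cut-offs (an age limit borrowed from the relational aggression example, plus a made-up list of eligible designs) are hypothetical and would need to be replaced by your own criteria.

```python
from dataclasses import dataclass

@dataclass
class Study:
    label: str
    mean_age: float        # average age of the sample, in years
    design: str            # e.g., "experimental", "longitudinal", "concurrent"
    has_effect_size: bool  # enough information to compute an effect size?

# Hypothetical criteria, stated once so every screener applies the same rules.
MAX_MEAN_AGE = 18  # exclude samples with an average age of 18 years or older
ALLOWED_DESIGNS = {"experimental", "longitudinal", "concurrent"}

def screen(study):
    """Return (included, reasons); reasons lists every criterion the study fails."""
    reasons = []
    if study.mean_age >= MAX_MEAN_AGE:
        reasons.append("sample mean age is 18 or older")
    if study.design not in ALLOWED_DESIGNS:
        reasons.append(f"design '{study.design}' is not eligible")
    if not study.has_effect_size:
        reasons.append("insufficient information to compute an effect size")
    return (not reasons, reasons)

candidates = [
    Study("Study A", mean_age=12.5, design="longitudinal", has_effect_size=True),
    Study("Study B", mean_age=21.0, design="experimental", has_effect_size=True),
    Study("Study C", mean_age=9.0, design="case study", has_effect_size=False),
]

for s in candidates:
    included, reasons = screen(s)
    print(s.label, "included" if included else "excluded: " + "; ".join(reasons))
```

Recording the reasons for exclusion alongside the decision also makes it easier to report the full set of criteria, and to revise them consistently when ambiguous studies turn up.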

3. Relative Advantages of Broad versus Narrow Inclusion Criteria

In developing inclusion/exclusion criteria, broad and narrow sets of criteria each have notable advantages. By broad criteria, I refer to a set of criteria that include most possible studies and exclude few, whereas narrow criteria will exclude many studies and include few. Of course, these two choices represent end points along a continuum. Selecting a set of criteria that falls along this continuum has several implications for your meta-analysis.

Perhaps the most important consideration in weighing a broad versus narrow set of criteria is that of the population of studies about which you want to draw conclusions. Put simply: Would you prefer to make conclusions about a very specific, well-defined population, or would you rather make more generalizable conclusions about a potentially messy population (i.e., one with likely fuzzy boundaries, likely inconsistent representation in your sample of studies, and possibly undistinguished subpopulations)? Specific to the issue of study quality (see Chapter 4) is the question of whether you want to include only the most methodologically rigorous studies or are willing to include methodologically flawed studies (risking the “garbage in, garbage out” criticism described in Chapter 2). There is not a universal “right answer” to these questions, just as there is not a right answer to the issue of level of generalization to the “apples and oranges” problem described in Chapter 2. If you choose a narrow set of criteria, you should be cautious to draw conclusions only about this narrowly defined population. In contrast, if you choose a broad set of criteria, it is probably advisable to code for study characteristics that contribute to this breadth and to evaluate these as potential moderators of effect sizes (see Chapter 9).

A second consideration is the number of studies that will ultimately be included in your meta-analysis by specifying a broad versus narrow set of criteria. Broad criteria will result in a meta-analysis of more studies that are more diverse in their features, whereas narrow criteria will result in fewer studies that are more similar in their features. Having fewer studies will sometimes result in inadequate power to evaluate the average effect size (see Chapters 8 and 10), will usually preclude thorough consideration of study characteristics that account for differences in effect sizes (i.e., moderator analyses; see Chapter 9), and might even lead your audience to view your review as too small to be important to the field. In contrast, having more studies increases the amount of work involved in the meta-analysis (especially the coding of studies), perhaps to the point where a meta-analysis of the full collection of studies is impossible. Therefore, one consideration is to specify inclusion/exclusion criteria that yield a reasonable number of studies given your research questions, your available time and resources, and typical practices in your field. This is not the only, or even primary, consideration, but it is a realistic factor to consider.

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

Finding Relevant Literature for Meta-Analysis

After specifying inclusion/exclusion criteria, the next step is to begin searching for empirical studies that fit within this sampling frame. In searching for this relevant literature, you have many options, each with advantages and limitations over the others. Although it is not always necessary to use all of the options I list next, it is useful to consider at least most of them and how reliance on some but not others might bias the sample of studies you obtain for your meta-analysis.

Before describing these search options, it is useful to consider the concepts of recall and precision (see White, 2009). Recall is the percentage of the studies that should be retrieved (i.e., the existing studies that meet your inclusion criteria) that your search actually retrieves; it is a theoretical value that can never be known because you never know how many studies actually exist. Precision is the percentage of retrieved studies that are relevant (i.e., actually meet your inclusion criteria). Ideally, both would be 100%, such that your search strategies yield every available study that meets your criteria and none that do not. In reality, you can never meet this goal, so you must balance the relative costs of one or the other being less than 100%. The cost of imperfect recall is that you will miss studies that should have been included, resulting in reduced statistical power and potentially biased results if the missed studies differ from those you included. The cost of imperfect precision is that you will waste resources retrieving and reading studies that will not be included in your meta-analysis. Although this might not seem like a tremendous cost, it is if it means that you cannot complete your meta-analysis. The goal of your search strategy should be to achieve high recall without diminishing precision beyond an unacceptable level, where “unacceptable” depends on your available resources and the expected benefits of increasing recall in terms of statistical power and reducing bias.
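The two definitions can be made concrete with a short calculation. The sketch below assumes, unrealistically, that the full set of relevant studies is known; in a real search that set is never known, so recall can only be reasoned about, not computed. The function name and the counts (50 relevant studies, a search returning 200 records of which 40 are relevant) are invented for illustration.

```python
def recall_and_precision(retrieved, relevant):
    """Recall: share of relevant studies that were retrieved.
    Precision: share of retrieved records that are relevant.
    Both inputs are assumed to be non-empty collections of study identifiers."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    return len(hits) / len(relevant), len(hits) / len(retrieved)

# Hypothetical scenario: 50 relevant studies exist; the search returns 200
# records, 40 of which are relevant and 160 of which are not.
relevant = {f"study_{i}" for i in range(50)}
retrieved = {f"study_{i}" for i in range(40)} | {f"other_{i}" for i in range(160)}

recall, precision = recall_and_precision(retrieved, relevant)
print(f"recall = {recall:.0%}, precision = {precision:.0%}")
# prints: recall = 80%, precision = 20%
```

In this made-up case the search misses 10 relevant studies (imperfect recall) and requires reading 160 irrelevant records (imperfect precision), which is the trade-off described above.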

1. Electronic Databases

Modern electronic databases, available via the Internet through most university libraries (or available for subscription for others), have made the task of searching for relevant studies much easier than in the early days of meta-analysis. Electronic databases exist in many fields, such as economics (EconLit), education (ERIC), medicine (Medline), psychology (PsycINFO), and sociology (Sociological Abstracts), to name just a few. These databases often have wide coverage (though see cautions below) and therefore serve as one of the primary search tools in modern meta-analysis. In fact, these databases are typically the first searches performed by meta-analysts, and I would consider them necessary (though not sufficient) for your meta-analysis.

Despite their power and apparent simplicity, using electronic databases is a more complex process than might be initially apparent (see Reed & Baxter, 2009). I next describe three considerations in using these databases, attempting to consider these generically rather than focusing on any one database.

1.1. What Is Included and What Is Excluded?

The first question you should ask before using any electronic database is “What is included in (and what is excluded from) this database?” Answering this question requires you to read the documentation of the databases you are considering; consulting with librarians in your topical area is invaluable, as they have considerable expertise on this question.

Some databases include dissertations and other unpublished works, whereas others do not. If the database you plan to use does not include dissertations, you should certainly supplement your search of this database with one that includes dissertations (such as the ProQuest dissertation and thesis database). If the database does not include other unpublished work, and your inclusion criteria allow for this work, then you will need to ensure that other search strategies will find these works. If the database does include unpublished works, you should investigate how these works are selected for inclusion; databases that include works unsystematically (e.g., primary study authors being willing to submit works to the database) should be treated cautiously as the sample of unpublished work may be biased.

Another consideration is the breadth of published work included in the database. Prominent journals are more likely to be included than peripheral journals, and books by larger publishers are more likely to be included than those by lesser-known publishers. If it is plausible that the results (effect sizes) could differ in studies published in outlets included (e.g., prominent journals) versus excluded (e.g., peripheral journals) in the database(s) you are using, then reliance on this database may yield a biased sample of studies.

1.2. Key Words

After researching the databases you will use to understand their coverage, you then search the databases for relevant studies. To perform this search, you generally enter key words, for which the search engine will return records containing these key words. Selection of appropriate key words goes far in increasing recall and precision, so you should consider these key words carefully and report them in your meta-analytic review.

A first consideration is the key words you select. You can select key words based on your knowledge of the literature in your area, by examining the key words specified in studies that you know contain data about the phenomenon of interest, and through thesauri available in some electronic databases. Your goal is to create a list of words or phrases that (1) are as specific to the phenomenon you are investigating as possible and (2) cover the range of terms used to describe the phenomenon. Considering the example meta-analysis involving associations of relational aggression with various other constructs (e.g., gender, peer rejection), our goal was to search for all studies of relational aggression. Terms such as “aggression” were too broad, as these would identify studies investigating constructs aside from that in which we were interested. Using the term “relational aggression” was more specific, but by itself would have been inadequate because different researchers use different terms for this construct. We ultimately developed a list of four terms to use in our search (“relational aggression,” “social aggression,” “indirect aggression,” and “covert aggression”) that represent the terms typically used by primary study authors investigating this construct.

Wildcard marks (e.g., “*” in PsycINFO) are useful in combination with key words. Wildcard marks are used in conjunction with a stem, specifying that the search engine returns all studies containing the specified stem followed by any characters where the wildcard mark is typed. For example, submitting the phrase “relational agg*” would return studies containing the phrases “relational aggression,” “relational aggressor,” and so on. Using wildcard marks can also return unexpected and unwanted results, however (e.g., the example stem and wildcard would also return any studies that used the phrase “relational aggravation”). These can generally be recognized quickly and skipped, or you can modify the wildcard search term or use the Boolean statement “not” as described next.
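The behavior of a stem plus wildcard can be imitated locally with a regular expression, which may help in anticipating unwanted matches before running a search. This is only an approximation: the translation of “*” into “any run of word characters” is an assumption, real databases define their own wildcard rules, and the titles below are invented.

```python
import re

def wildcard_to_regex(term):
    """Translate a search term with '*' wildcards into a case-insensitive regex.
    Assumption: '*' stands for zero or more word characters."""
    parts = [re.escape(part) for part in term.split("*")]
    return re.compile(r"\b" + r"\w*".join(parts), re.IGNORECASE)

pattern = wildcard_to_regex("relational agg*")

# Invented titles, used only to show which ones the stem would pick up.
titles = [
    "Relational aggression and peer rejection",
    "Profiles of the relational aggressor",
    "Relational aggravation in workplace disputes",
    "Physical aggression in early childhood",
]

matches = [title for title in titles if pattern.search(title)]
for title in matches:
    print(title)
```

The first three titles match, including the unwanted “relational aggravation” title, while the fourth does not because it lacks the stem.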

Boolean statements are a tremendous asset when you are searching electronic databases. These statements include “or,” “and,” and “not” in most databases. The use of “or” is especially valuable in combining alternative key words for the same construct; for example, we connected the four terms for the construct of interest using “or” in our example meta-analysis (i.e., the search phrase was: “relational aggression” or “social aggression” or “indirect aggression” or “covert aggression”). The logical statement “and” is useful for either limiting the studies returned or specifying two construct associations that are of interest in many meta-analyses. For example, in the example meta-analysis, we could have combined the above search (various key words for relational aggression combined using “or”) with a phrase limiting the samples to childhood or adolescence (“child* or adolesc*”) using the “and” statement. Similarly, if we were only interested in studies reporting associations between relational aggression and peer rejection (one of the examples I use commonly throughout the book), we could have used “and” to link the phrases for relational aggression with a set of phrases for peer rejection. Finally, you can use the key word “not” either to exclude unwanted wildcard phrases (e.g., in the example above, I could specify “not ‘relational aggravation’ ” to remove the unwanted studies using this term), or to specify exclusion criteria (e.g., specifying “not ‘adult’ ”).
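Because the exact search string should be reported in the review, it can be convenient to assemble it from lists of key words rather than typing it by hand. The helper functions below are hypothetical, and the quoting and parenthesis conventions are assumptions; each database has its own query syntax, so the resulting string would need to be adapted before use.

```python
def any_of(terms):
    """Join alternative key words for one construct with 'or'.
    Multi-word phrases are quoted; single words (including wildcards) are not."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " or ".join(quoted) + ")"

def build_query(required_groups, excluded_terms=()):
    """Link groups of alternatives with 'and', then append 'not' exclusions."""
    query = " and ".join(any_of(group) for group in required_groups)
    for term in excluded_terms:
        query += f' not "{term}"'
    return query

aggression_terms = [
    "relational aggression",
    "social aggression",
    "indirect aggression",
    "covert aggression",
]
age_terms = ["child*", "adolesc*"]

query = build_query([aggression_terms, age_terms],
                    excluded_terms=["relational aggravation"])
print(query)
```

This prints a single string in which the four aggression terms are joined by “or”, linked by “and” to the age terms, and followed by the “not” exclusion, mirroring the combinations described above.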

1.3. Cautions

Electronic databases are incredibly powerful and time-efficient tools for searching for relevant studies, and I believe that every modern meta-analysis should use these databases. However, at least three cautions merit consideration.

First, as I described earlier, you should carefully consider what is not included in the electronic databases you use. If a database does not include (or if it has poor rates of inclusion) unpublished works or studies published in peripheral outlets, then reliance on this database alone would result in diminished recall. This diminished recall can threaten your meta-analysis by decreasing statistical power and, if the studies not included in the database systematically differ from those included (e.g., publication bias, Chapter 11), by producing biased results. To avoid these problems, you should identify alternative electronic databases and other search strategies that are likely to identify relevant studies not included in the electronic database you are using.

A related caution comes from the fact that most electronic databases are discipline specific. Although the databases vary in the extent to which they include works in related disciplines, this disciplinary specificity suggests that you should not rely on only a single database within your discipline. Many, if not most, phenomena that social scientists study are considered within multiple disciplines. For example, research on relational aggression might appear not only in psychology (e.g., in the PsycINFO database), but also in criminal justice, education, gender studies, medicine, public health, and sociology (to name just a few possibilities). I recommend that you consider searching at least one or two databases outside of your primary discipline to explore how much literature might be obtained from other disciplines.

A third caution in using electronic databases relates to their very nature: You perform a search and a list of studies is provided, but you have no idea how many potentially relevant studies were not provided. In other words, relying only on electronic databases provides no information about what studies were not identified in your search, so the possibility remains that some studies (possibly even some very well-known studies) did not match your specified search criteria. You can address this problem in several ways. One possibility is to perform some additional searches within your selected database(s) that use broader terms (e.g., “aggression” rather than more specific terms such as “relational aggression”) and visually scan results to see if any additional relevant studies could be identified with broader search criteria. Second, you can rely on additional search procedures besides the electronic database. I return to this topic of assessing the adequacy of your search (including the adequacy of electronic database searches) in Section 3.4.

1.4. Conclusions about the Use of Electronic Databases

Electronic databases of journal articles, books and chapters, and often some unpublished works exist in most social science disciplines. These searchable databases can provide an efficient method of searching for studies to include in your meta-analysis if you carefully consider the coverage of the databases you use and select appropriate key words along with wildcard marks and Boolean statements. These electronic databases should not be your only method of searching the literature, however, as several cautions need to be considered when using them. Nevertheless, the electronic databases are likely to be one of the primary ways you will search for studies, and every modern meta-analysis should use these tools.
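As a rough illustration of combining key words with wildcard marks and Boolean statements, the sketch below assembles a search string by joining synonyms with OR and distinct concepts with AND. The function name `build_query`, the example terms, and the `*` truncation mark are assumptions for illustration; the exact wildcard and phrase syntax differs between databases, so check the help pages of the database you use.

```python
def build_query(concept_groups):
    """Join synonyms with OR inside each group, then join the groups with AND."""
    clauses = []
    for terms in concept_groups:
        # Quote multi-word phrases so they are searched as phrases.
        quoted = ['"{}"'.format(t) if " " in t else t for t in terms]
        clauses.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(clauses)


# Hypothetical key words: one group for the construct, one for the population.
query = build_query([
    ["relational aggression", "indirect aggression", "social aggression"],
    ["child*", "adolescen*"],
])
print(query)
```

Keeping the key word groups in one place like this also makes it easier to record exactly what was searched, and to rerun the same search in a second database.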

2. Bibliographical Reference Volumes

Bibliographical reference volumes are printed works that provide information similar to electronic databases (e.g., titles, authors, abstracts), often listing studies by broad topics and/or including an index of key words. These volumes were frequently published by large research societies and were intended to aid literature searches in specific fields in much the same way that electronic databases now do in most fields. For example, the American Psychological Association published Psychological Abstracts from 1927 to 2006. In many fields, publication of these printed reference volumes has been discontinued in favor of online electronic versions (though exceptions may exist).

Searching these reference volumes is not nearly as convenient as searching electronic databases, and few meta-analysts currently rely on these volumes as their primary search instrument (though you are likely to see them used when you read older meta-analytic reviews). Nevertheless, there still may be instances when you would consider using these printed volumes. Specifically, if studies potentially relevant for your meta-analysis include older studies, and the electronic databases that you use have not yet incorporated all of these older studies, then it may be useful to consult these reference volumes to ensure that you do not systematically exclude these older studies.

3. Listings of Unpublished Works

As I mentioned briefly in Chapter 2, and describe in detail in Chapter 11, one of the most challenging threats to many meta-analyses is that of publication bias (a.k.a. the “file drawer problem”). The extent to which you can avoid and evaluate this threat depends on your searching for and including unpublished studies in your meta-analysis. I have already mentioned the value of searching electronic databases that include dissertations as one method of obtaining unpublished studies. Next I describe three additional listings that might allow you to find more unpublished studies. For each, I suggest searching with the same careful rigor I suggested for searching electronic databases.

3.1. Conference Programs

A potentially valuable way to find unpublished studies is to search the programs of academic conferences in which relevant work is likely to be presented. Dedicated meta-analysts often have shelves of these programs, though even this idea is becoming antiquated as more conference programs are archived and searchable online. In this approach, you search the titles of presentations listed in conference books (larger conferences typically have at least crude indices) and request copies of these works from authors (whose contact information is usually listed in these books).

From my experience, it is usually possible to identify a large number of unpublished works by searching conference programs; however, retrieving copies of these presentations for coding can be more difficult. Typically, you are better able to contact authors and more likely to receive requested presentations if you make your request shortly after the conference rather than several years later. Therefore, studies obtained through conference programs probably underrepresent older studies. Some other tips I have learned through experience include: (1) whenever you request a conference presentation, provide exact details such as the title of the presentation and the year and conference where it was presented; (2) contact coauthors if you do not receive a response from the first author, as some authors of the presentation may have graduated or left academia; (3) tell the author why you are requesting this information (I will elaborate on this piece of general advice below).

Although I think conference presentations are a valuable source of unpublished studies, there are some limitations and cautions to consider. First, your search for conference presentations should be systematic. If you decide to search the programs of a particular conference, you should make reasonable efforts to search the programs’ books across a reasonable number of years (vs. the years you attended but not the years in between when you did not attend), and you should certainly search for works within the entire conference book (vs. just the presentations you happened to attend). Second, you should recognize that the response rate to your requests might be low (you should track this response rate as it might be useful to report), and you should consider the possibility that responses might be systematically related to effect sizes. Finally, you should anticipate that conference presentations will often present information needed for study coding (Chapter 4) and effect size calculation (Chapters 5-7) in less detail than other formats (e.g., journal articles). It is still better to code what you can from these studies than not to consider them at all, and it is possible to request further information from study authors.

3.2. Funding Agency Lists

Another valuable way to obtain unpublished studies is to search funding award listings from relevant funding agencies (e.g., National Institutes of Health, National Science Foundation, private foundations). Because funding decisions are made before results are known, studies obtained through this approach will not likely be subject to biases in findings of significance/nonsignificance. Furthermore, searching these listings is likely to yield studies that have been started but have not yet gone through the publication process (i.e., more recent studies).

3.3. Research Registries

Some fields of clinical science have established listings in which researchers are expected to register a study before conducting it. To encourage registration, some journals will only publish results from studies registered prior to conducting the study. Such registries, by creating a listing of studies in advance of knowing the results, should yield a collection of results unbiased by the findings (e.g., nonsignificant or counterintuitive findings). If the field in which you are performing your meta-analysis has such registries, these will be a very valuable search avenue for obtaining an unbiased set of studies.

4. Backward Searches

After accumulating a set of studies for potential inclusion in your meta-analysis, you will begin the process of coding these studies (see Chapters 4-8). You should read these articles completely (vs. going straight for the method and results sections where most information you will code appears), searching for cited studies that might be relevant for your review that you did not identify through your other strategies. Similarly, you should carefully read prior reviews (narrative or meta-analytic) searching for potentially relevant studies.

This process of searching for relevant studies cited in the works you have found is referred to as “backward searching” (sometimes also called “footnote chasing”); that is, you are working from the studies you have “backward” in time to identify previously conducted studies cited in these works. This approach is especially useful in identifying older studies, whereas it is unlikely to identify newer studies that have not yet been cited. An important potential bias of this approach comes from the possibility that studies yielding certain “favorable” results (e.g., significant findings, effects favoring expectations) are probably more likely to be cited than studies with “unfavorable” results (e.g., null findings, counterintuitive findings).

Despite the potential biases of backward searches, I believe that they represent a valuable method of searching. My own experience is that many studies come from this approach even with what I consider quite thorough initial searches using other means. This approach is also valuable in identifying literature that might have been missed in other search approaches due to failures to use appropriate key words or to search literatures in other disciplines.

5. Forward Searches

Whereas backward searches attempt to find studies cited in the studies you have, forward searches attempt to find studies that cite the studies you have. Forward searches are often performed using special databases for this purpose (e.g., Social Science Citation Index), though some field-specific databases are incorporating this approach (e.g., the psychology database PsycINFO now has this capacity). To perform a forward search, you enter information for a study you know is relevant to the topic of your meta-analysis, and the search engine finds works that cite this study. Because these citing studies necessarily occur after the cited study, the search is moving “forward” in time and is more likely to find newer articles than a backward search.

There are various degrees of intensity in engaging in forward searches. A less intense approach is to identify several of the earliest and most seminal works on the topic, then perform forward searches to identify studies citing these seminal papers. At the other end of the spectrum, you could perform forward searches of all works that you have determined meet the inclusion criteria for your review.

Forward searches are likely to yield high recall, as it is unlikely that many relevant studies would fail to cite at least some of the seminal works in the area. However, my experience is that forward searches are often quite low in precision. This is because many papers will cite a seminal work in an area when this area is of tangential interest to the paper.
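To make the contrast between recall and precision concrete, here is a minimal sketch using the usual definitions: recall is the share of all relevant studies that the search retrieved, and precision is the share of retrieved works that are relevant. The counts are invented for illustration, and in a real search the full set of relevant studies is never known, so recall can only be estimated.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for a search, given two collections of study IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant studies that the search actually returned
    recall = len(hits) / len(relevant) if relevant else float("nan")
    precision = len(hits) / len(retrieved) if retrieved else float("nan")
    return recall, precision


# Hypothetical forward search: 200 citing papers returned, of which 18 are relevant;
# 2 further relevant studies exist but did not cite the seminal work.
retrieved = ["paper%d" % i for i in range(200)]
relevant = ["paper%d" % i for i in range(18)] + ["missed_a", "missed_b"]

recall, precision = recall_and_precision(retrieved, relevant)
print("recall = %.2f, precision = %.2f" % (recall, precision))
```

In this invented case the search finds most of what exists (high recall), but more than nine of every ten returned papers must be screened and discarded (low precision).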

6. Communication with Researchers in the Field

The final search approach that I will describe is to consult experts/researchers in the field in which you are performing your meta-analysis. This approach actually consists of several possibilities.

At a minimum, you should ask some experts to examine your inclusion/exclusion criteria and the list of studies you have identified, requesting that they note additional studies that should have been included. If you examine these suggested studies and some do meet your inclusion criteria, then you should not only include these studies, but also consider why your search strategy failed to identify these studies (and revise your search strategy accordingly). I recommend that you consult colleagues who have a somewhat different perspective in the field than your own (i.e., different “camps”) to provide a unique perspective.

Another valuable approach to communicating with researchers is simply to e-mail those individuals who conduct research in the area of your meta-analysis, asking them if they have any additional studies on the topic. This effort can also vary in intensity, ranging from e-mailing just the most active researchers in the field to e-mailing every author of studies you have identified through other means. Although you will have to identify an approach that works best for you given your field and relationships with other researchers, some practices that I have found valuable are: (1) to clearly state why I am requesting studies (e.g., “I am conducting a meta-analytic review of the associations between X and Y”); (2) to provide a small number of the most critical inclusion criteria (e.g., “I am interested in obtaining studies involving children or adolescents”); and (3) to state the various ways that they could provide the requested information to me (e.g., “I would like the correlation between X and Y, but can compute this from most other statistics you might have available, such as t-tests, ANOVA results, or raw means and standard deviations. I am also happy to compute this correlation myself, if you are willing to share the raw data with the agreement that I will delete this data file after computing this correlation.”).
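The conversions mentioned in the sample request can be sketched as follows for the simple two-group case. These are the standard textbook conversions for an independent-samples t-test, a one-way ANOVA with two groups, and group means with standard deviations; the function names and the numbers in the example are assumptions for illustration, and other designs (e.g., repeated measures) need different formulas.

```python
import math


def r_from_t(t, df):
    """Correlation from an independent-samples t statistic with df = n1 + n2 - 2."""
    return t / math.sqrt(t * t + df)


def r_from_f(f, df_error):
    """Magnitude of r from a two-group, one-way ANOVA F; the sign must be taken from the means."""
    return math.sqrt(f / (f + df_error))


def r_from_means(m1, sd1, n1, m2, sd2, n2):
    """Correlation from two group means and SDs, via the pooled-variance t statistic."""
    df = n1 + n2 - 2
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    t = (m1 - m2) / (pooled_sd * math.sqrt(1.0 / n1 + 1.0 / n2))
    return r_from_t(t, df)


# Hypothetical report: group means 10 and 8, both SDs 4, 50 participants per group.
r = r_from_means(10.0, 4.0, 50, 8.0, 4.0, 50)
print(round(r, 3))
```

Because F equals t squared when only two groups are compared, `r_from_f` and `r_from_t` agree in magnitude for the same data.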

A related but less targeted approach is to post requests on listservs, webpages, or similar forums. Many of the same practices that are valuable when e-mailing are useful in such postings, though the standards of particular forums might necessitate briefer requests.

These communications with researchers are extremely valuable, though several considerations are important. First, my impression is that the response rates vary widely for different meta-analysts, with some receiving almost no responses but others receiving tremendous responses. I suspect that the factors that improve response rates include your ability to convince others that your request is important and worth their time, your ability to minimize the burden on the researchers, and the quality of relationships you have with these colleagues. A second consideration is the obvious fact that the more widespread your requests (i.e., numerous e-mails or public postings), the more people know that you are conducting this particular meta-analysis, which is a consideration in terms of the review process. Perhaps the most important consideration, however, is one that I believe means that you absolutely must, to at least some degree, involve colleagues in the area of your meta-analysis: Meta-analytic reviews synthesize the body of knowledge in an area of study and typically provide the foundation for the next wave of empirical study in this area. Thus, the research community has a vested interest in this process, and the meta-analyst has an obligation to consider their input. This statement does not mean that you need to send the initial draft of your meta-analysis to everyone in your field (you should not), nor that your review needs to support the conclusions of everyone in your field (your conclusions are hopefully empirically driven). Instead, by soliciting input from others in your field, whether by simply including the full body of their empirical results in your review or obtaining input from a smaller number of colleagues, your meta-analysis will benefit from this collective knowledge.

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

Reality Checking: Is My Meta-Analysis Search Adequate?

Regardless of what methods of searching the literature you rely upon, the most important question is whether your search is adequate. You can think of the adequacy of your search in three ways. First, is the sample of studies you have obtained representative of the population of studies, or is it instead biased (as illustrated in Figure 3.2)? Second, does the sample of studies you have obtained provide sufficient statistical power to evaluate the hypotheses you are interested in (or, similarly, does it provide sufficiently narrow confidence intervals of effect size estimates to be useful)? Third, would the typical scholar in your field find the sample of studies complete, or have you missed studies that obviously should be included? The first two questions directly affect the quality of the empirical conclusions of your meta-analysis and so are obviously important. The third question is less important to the conclusions drawn, but is pragmatically relevant to others’ viewing of your review as adequate. This is a worthy consideration affecting both the likelihood of publication of your review and the impact it will have on your field.

The question of whether the sample you have obtained is an unbiased representation of the population is impossible to answer with certainty. However, there do exist methods of evaluating the most likely bias, publication bias, which I describe in Chapter 11.

Probably the best way to answer all of these questions satisfactorily is to make every reasonable effort to ensure that your search is exhaustive; that is, to ensure that the sample of studies for your meta-analysis contains as close to all the studies that exist in the current population as possible. This goal is probably never entirely attainable, yet if you have made every effort to obtain all available studies, it is reasonable to conclude that you have come “close enough.” No one knows when “close enough” is adequate, and there is less empirical evidence to inform this decision than is desired, but I offer the following suggestions for your own consideration of this topic.

First, you should conduct an initial search using some combination of the methods described above that you expect will provide a reasonably thorough sample of studies. For example, you might decide to consult prior (narrative or meta-analytic) reviews in this area, search several electronic databases in which you believe relevant studies might exist (ensuring that these electronic searches include searches of unpublished studies such as dissertations), search several listings of unpublished studies (i.e., conference programs, funding databases, and any research registries that exist in your field), and send out a request to authors via e-mail or listserv/website postings.

Second, you should create a list of studies obtained from these sources and ask some colleagues familiar with this research area to examine this list along with your inclusion/exclusion criteria. If they view it as complete, you have a good beginning. However, if they identify studies that are missing but should have been found, you should revise your search strategies (e.g., specifying different key words for electronic searches) and repeat the prior step.

The third suggestion is to take this list and begin forward and backward searches. You might start with forward searches, as this is less time-consuming. Here, you would start with a small number of the most seminal works in the area (in the absence of a clear idea of the seminal works, you might create a short list of the first studies and the studies published in the top journals in your field). After performing forward searches with these seminal works (spending considerable time reviewing the citing papers to ensure relevance, as these types of searches are usually low in precision), you probably will have identified some additional studies; if not, you can reasonably conclude that forward searching will not yield any additional studies. Then, you can begin performing forward searches with the remaining studies, perhaps starting with the oldest studies first, as these have existed for the longest time and have therefore had more opportunity to be cited. Eventually, you will likely reach a point where forward searches of more articles no longer yield new articles, and you can stop forward searching.
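The stopping logic for forward searches can be sketched as a simple loop. This is only an illustration under stated assumptions: `find_citing` and `is_relevant` are hypothetical stand-ins for the manual work of running a citation search and screening each citing paper, and the `patience` threshold (how many consecutive unproductive searches to tolerate before stopping) is an invented parameter, not a rule from the text.

```python
def forward_search(seeds, find_citing, is_relevant, patience=3):
    """Search seeds in the given order; stop after `patience` consecutive searches add nothing."""
    found = set(seeds)      # studies already in hand
    new_studies = []        # relevant studies discovered by forward searching
    dry_runs = 0
    for seed in seeds:
        added = 0
        for candidate in find_citing(seed):
            if candidate not in found and is_relevant(candidate):
                found.add(candidate)
                new_studies.append(candidate)
                added += 1
        dry_runs = 0 if added else dry_runs + 1
        if dry_runs >= patience:
            break
    return new_studies


# Toy citation data (entirely hypothetical); keys are studies in hand, oldest first.
citing = {
    "A1990": ["B1995", "C1998", "X2001"],
    "B1995": ["C1998", "D2003"],
    "E1999": [],
    "F2000": ["C1998"],
    "G2002": ["D2003"],
    "H2004": ["Z2010"],
}
seeds = sorted(citing, key=lambda s: int(s[1:]))
new_studies = forward_search(
    seeds,
    find_citing=lambda s: citing[s],
    is_relevant=lambda s: not s.startswith("X"),  # pretend "X" papers are tangential
)
print(new_studies)
```

In this toy data the loop stops before searching the most recent seed and so never sees the one paper that cites it. Any stopping rule of this kind trades effort against a small risk of missed studies, which is one reason the additional reality checks described later remain worthwhile.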

At this point, you can begin coding studies (see Chapters 4-7). While doing so, you should also perform backward searches (i.e., reading the works carefully for citations to other potentially relevant studies). My experience is that I often find a considerable number of additional studies when I begin coding, but that this number quickly diminishes as I progress in coding studies. If you find that you are almost never identifying additional studies near the end of your coding, you can be reasonably confident that your search is approaching exhaustion.

Despite this confidence, I recommend two additional steps to serve as a reality check. First, sit down with a few years of journals that are likely to publish studies relevant to your meta-analysis, and simply flip through the tables of contents and potentially relevant studies. If you do not find any additional articles, then this adds to your confidence that you have conducted an exhaustive search. However, if you do find additional articles, then you obviously need to revise your search procedures (if you find relevant articles, carefully consider why they were not found; e.g., did the authors use different key words or terminology than you used in your search?). The second step, if your flipping through the journals suggests the adequacy of your search, is to send the list of studies again to some experts in your field (preferably some who did not evaluate the initial list). If they identify studies you have missed, you should revise your search procedures; but if they do not, you can feel reasonably confident that your search is adequate.

My intention is not to be prescriptive in the process you should take in searching the literature. In fact, I think that the search process I described is more intensive than that used for most published meta-analyses. Nevertheless, I present these steps as a model of a process that I believe leaves little uncertainty that your search is “close enough” to exhaustive. Although there is no guarantee that you have obtained every study from the population, I believe that after taking these steps you have reached a point where more efforts are unlikely to identify additional studies and are therefore not worthwhile. I also believe that no other potential meta-analyst would be willing to engage in significantly greater efforts, so your search represents the best that is likely to be contributed to the field. Moreover, by consulting with experts in your field, you have ensured that your peers view the search as reasonable, which usually means that reviewers will have a favorable view during the review process, and readers will view it as adequate after it is disseminated. In sum, I believe that strategies similar to the one I have described can provide a high degree of confidence that your search is adequate.


Practical Matters: Beginning a Meta-Analytic Database

Aside from perhaps persistence and patience, the most important virtue you can have for searching the literature for a meta-analysis is organization. As you have likely inferred, searching for studies is a time-intensive process, and you certainly do not want to add to this time by repeating work because of poor organization.

A good organizational scheme for the literature search will include several key components. First, you should have a clear, written statement of the inclusion/exclusion criteria that you will use in evaluating studies found through this search. Toward this end, it might be useful to record studies identified in your search that were excluded for one reason or another (recording why they were excluded). Second, you should have a clear list of methods for searching the literature, with enough details to replicate these searches. For example, you might have a list that begins:

Step 1: Read the following review papers and chapters (listing these works).

Step 2: Search the PsycINFO database using the following key words (listing the key words, including any wildcard marks and logical operations).

Step 3: Search the ERIC database using the following key words (listing the same set of key words as the step 2 search, unless there is reason to use other key words or logical operations).

You then record the dates—and names, if multiple people are conducting the searches—of each search.
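One way to keep such a list replicable is to store each search step as a structured record and save the log as a CSV file. The sketch below is an assumption about how this could be organized, not a prescribed format: the field names, the dates, and the searcher initials are hypothetical.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields
from datetime import date


@dataclass
class SearchLogEntry:
    step: int
    method: str         # e.g., prior reviews, electronic database
    source: str         # which reviews or which database
    query: str          # key words, wildcard marks, and logical operations
    searched_on: date   # when the search was run
    searched_by: str    # who ran it, if several people are searching


# Hypothetical log mirroring the three example steps above.
log = [
    SearchLogEntry(1, "prior reviews", "review papers and chapters", "", date(2024, 1, 15), "AB"),
    SearchLogEntry(2, "electronic database", "PsycINFO", '"relational aggression" AND child*', date(2024, 1, 16), "AB"),
    SearchLogEntry(3, "electronic database", "ERIC", '"relational aggression" AND child*', date(2024, 1, 17), "CD"),
]


def log_to_csv(entries):
    """Serialize the search log to CSV text, one row per search step."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(SearchLogEntry)])
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))
    return buffer.getvalue()


csv_text = log_to_csv(log)
print(csv_text)
```

The same log later supplies the last search dates for the write-up and shows which searches to rerun when you update the review.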

During the course of these searches, you will scan many titles and abstracts in an attempt to determine whether each study is relevant for your meta-analysis. I suggest that you be rather inclusive during this initial screening, retaining any studies that might meet your inclusion criteria. You should also retain any nonempirical works, such as reviews or theoretical papers; although these do not provide empirical results for your meta-analysis, it will be worthwhile to read them (1) to identify additional studies cited in these papers, and (2) to inform interpretation of results of your meta-analysis.

As you are identifying works that you will retain, it is critical to have some way of organizing this information. I use spreadsheets such as that shown in Table 3.1. (I have shown only four studies here; your spreadsheet will likely be much larger.) Although you should develop an approach that meets your own needs, this example spreadsheet contains several pieces of information that I recommend recording. The first column contains a number for each paper (article, chapter, dissertation, etc.) identified in the search. The number is arbitrary, but it is useful for filing purposes (as the number of papers becomes large, it is useful to file them by number rather than, e.g., author name). The next four columns contain citation information for the paper. This information is useful not only for citing the paper in your write-up, but in identifying repetitive papers during your multiple search strategies (for this purpose, having this information in a searchable spreadsheet is useful). The sixth column contains the abstract, which is useful if you want to search for specific terms within your spreadsheet. I recommend copying this information into your spreadsheet if it is electronically available, but it probably is not worth the time needed to type this in manually. The seventh column identifies where and when the paper was found; recording the date is important because (1) you might want to update the search near the completion of your meta-analysis, and (2) you should report the last search dates in your presentation of your meta-analysis. The two rightmost columns (columns eight and nine) contain information for retrieving and coding the reports. One column indicates whether you have the report, or the status of your attempt to retrieve it (e.g., the third paper notes that I had requested this dissertation through my university’s interlibrary loan system). The last column will become relevant when you begin coding the studies (see Chapters 4-8). Here, I have recorded the person (BS = Brian Stucky, the second author on this paper) who coded this report and the date it was coded. Recording both pieces of information is valuable in case you later identify a problem in the coding (e.g., if one coder was making a consistent error) or if you revise the coding protocol (you then need to modify the coding of all studies coded before this change). In this column, I also record when studies are excluded for a particular reason; for instance, the fourth study was excluded because it used an adult sample (which was one of the specified exclusion criteria in this review).
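For readers who prefer a script to a spreadsheet, the nine columns described above can be mirrored in a small record type, with a helper that flags papers found more than once across search strategies. This is a sketch under assumptions: the text does not name the four citation columns, so authors, year, title, and outlet are guesses, and matching duplicates on year plus a normalized title is one simple heuristic among many. All records shown are invented.

```python
import re
from dataclasses import dataclass


@dataclass
class PaperRecord:
    number: int      # arbitrary filing number
    authors: str     # citation information (four assumed columns)
    year: int
    title: str
    outlet: str
    abstract: str    # pasted when electronically available
    found_via: str   # where and when the paper was found
    retrieval: str   # have the report, or status of the request
    coding: str      # coder and date, or reason for exclusion


def dedup_key(record):
    """Year plus the title reduced to lowercase words, so punctuation and case are ignored."""
    title = re.sub(r"[^a-z0-9]+", " ", record.title.lower()).strip()
    return (record.year, title)


def find_duplicates(records):
    """Return (first_number, repeat_number) pairs for papers that share a key."""
    seen = {}
    duplicates = []
    for rec in records:
        key = dedup_key(rec)
        if key in seen:
            duplicates.append((seen[key], rec.number))
        else:
            seen[key] = rec.number
    return duplicates


# Invented records: papers 1 and 3 are the same work found by two search strategies.
records = [
    PaperRecord(1, "Author, A.", 2005, "Relational Aggression in Childhood", "Journal X",
                "", "PsycINFO, 2024-01-16", "have", ""),
    PaperRecord(2, "Author, B.", 2007, "Overt aggression and peer rejection", "Journal Y",
                "", "ERIC, 2024-01-17", "requested via interlibrary loan", ""),
    PaperRecord(3, "Author, A.", 2005, "Relational aggression in childhood.", "Journal X",
                "", "backward search, 2024-02-01", "have", ""),
]
duplicates = find_duplicates(records)
print(duplicates)
```

Flagged pairs still deserve a manual look, since the same title and year can occasionally belong to different reports (e.g., a dissertation and the article based on it).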

Of course, you may use a different way of organizing information from your literature search. The point is that you should have some way of organiz­ing information that clearly records important information and avoids any duplication of effort.


Identifying Interesting Moderators in Meta-Analysis

Decisions about which study characteristics to code need to be heavily informed by your knowledge of the content area in which you are performing a meta-analytic review. Nevertheless, I describe two sets of general considerations that I believe apply to meta-analytic reviews across fields: considering the research questions you are interested in and considering coding certain specific aspects of studies.

1. Considering Research Questions of Interest

Just as planning a primary research study requires you to select variables based on your research questions, planning a meta-analysis requires that you base your decisions about which study characteristics to code on the research questions that you wish to answer. If your research questions are exclusively about average effect sizes across studies (i.e., combining studies), then you might not need to code much beyond effect sizes, sample sizes, and information for any artifact corrections you wish to make. I qualify this statement by noting that it is still valuable to be able to provide basic descriptive information about this sample of studies to inform the generalizability of your review. Nevertheless, the number of study characteristics that you will need to code to address this research question adequately is small.

In contrast, if at least some of your research questions involve comparing studies (i.e., identifying whether studies with certain features yield larger effect sizes than studies with other features), then it will be much more important to code many study characteristics. Obviously, if you put forth a research question about a specific characteristic moderating effect sizes (e.g., do studies with this characteristic yield larger effect sizes than studies without this characteristic?), then it will be necessary to code this specific characteristic. However, you should also consider what study characteristics might commonly co-occur with the characteristic you are interested in, and code these. For example, if you are interested in investigating whether studies with certain types of samples yield different effect sizes (e.g., children vs. adults), you should carefully consider the other study characteristics that are likely to differ across these types of samples (e.g., studies of adults might frequently rely on self-reports, whereas studies of children might frequently rely on parent reports, observations, etc.). If you fail to code these other study characteristics, then you cannot empirically rule out the possibility that your results involving the coded study characteristic of interest are really due to these co-occurring characteristics. In contrast, if you do code these characteristics, then you are able to evaluate empirically such competing explanations (see Chapter 9).
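A quick cross-tabulation of two coded characteristics shows how strongly they co-occur before any moderator analysis is run. The sketch below uses invented study codes that follow the children versus adults example; the helper `levels_never_shared` is a hypothetical name for a simple check of whether one characteristic is completely confounded with the other.

```python
from collections import Counter, defaultdict

# Invented codes for six studies.
studies = [
    {"id": 1, "sample": "children", "reporter": "parent"},
    {"id": 2, "sample": "children", "reporter": "observation"},
    {"id": 3, "sample": "children", "reporter": "parent"},
    {"id": 4, "sample": "adults", "reporter": "self"},
    {"id": 5, "sample": "adults", "reporter": "self"},
    {"id": 6, "sample": "adults", "reporter": "self"},
]


def crosstab(studies, row, col):
    """Count studies in each (row level, column level) cell."""
    return Counter((s[row], s[col]) for s in studies)


def levels_never_shared(studies, row, col):
    """True if every level of `col` occurs under only one level of `row`."""
    rows_by_col = defaultdict(set)
    for s in studies:
        rows_by_col[s[col]].add(s[row])
    return all(len(rows) == 1 for rows in rows_by_col.values())


table = crosstab(studies, "sample", "reporter")
confounded = levels_never_shared(studies, "sample", "reporter")
print(dict(table), confounded)
```

Here no reporter type appears in both sample types, so these six studies alone could not separate a difference due to age from a difference due to reporter.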

As a more extreme version of research questions involving specific moderators, some meta-analysts aim to predict all heterogeneity in effect sizes by coded study characteristics. Although this goal tends to be quite exploratory, and you would therefore view the findings of moderation by specific characteristics cautiously, it nevertheless is a goal you might consider. If so, then you will necessarily code a large number of study characteristics; specifically, you will code any study characteristics that meet two conditions. First, the study characteristics are consistently reported in many or even most studies; this is necessary to avoid a preponderance of missing data when you evaluate the coded characteristic as a moderator. The second condition is that the study characteristic varies across at least some studies; this variability across studies is necessary for the study characteristic to covary with effect sizes. You would then enter these coded study characteristics into some large predictive model (e.g., forward stepwise regression) to explore relations between them and variation in effect sizes.
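The two conditions can be checked mechanically once studies are coded. In this sketch, a characteristic is kept if it is reported in at least a chosen share of studies and takes more than one value; the 60% threshold, the function name, and the coded values are assumptions for illustration, since the text does not specify a cutoff.

```python
def screen_characteristics(coded_studies, min_reported=0.6):
    """Return names of characteristics that are widely reported and vary across studies."""
    names = sorted({name for study in coded_studies for name in study})
    keep = []
    for name in names:
        # Treat a missing key or an explicit None as "not reported".
        reported = [s[name] for s in coded_studies if s.get(name) is not None]
        share_reported = len(reported) / len(coded_studies)
        varies = len(set(reported)) > 1
        if share_reported >= min_reported and varies:
            keep.append(name)
    return keep


# Invented codes for five studies.
coded_studies = [
    {"reporter": "self", "mean_age": 9.5, "country": "US", "mean_iq": 101.0},
    {"reporter": "parent", "mean_age": 12.0, "country": "US"},
    {"reporter": "self", "mean_age": None, "country": "US"},
    {"reporter": "teacher", "mean_age": 15.2, "country": "US"},
    {"reporter": "self", "mean_age": 10.1, "country": "US"},
]
candidates = screen_characteristics(coded_studies)
print(candidates)
```

In this example `country` is dropped because it never varies and `mean_iq` because only one study reports it, leaving `mean_age` and `reporter` as candidate moderators.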

2. Considering Specific Aspects of Studies

As I mentioned, the exact study characteristics you code will depend on your research questions and be informed by your knowledge of the topic area. Nevertheless, four general types of characteristics should be considered in any meta-analysis in the social sciences: characteristics of the sample, measurement, design, and source (see also Lipsey, 2009; Lipsey & Wilson, 2001, pp. 83-86). These are summarized in Table 4.1.

2.1. Sample Characteristics

Potentially relevant characteristics of the sample that you might consider include aspects of the sampling procedure and the demographic features of the sample. For instance, you might code sampling procedures such as whether the sample was drawn from unique settings (e.g., from a university setting, some sort of clinical setting, a correctional facility, or specific other settings relevant to the area), whether the study attempted to draw a sample representative of a larger population (e.g., a nationally representative sample) versus relying on a convenience sample, and the country from which the sample was drawn. Potentially relevant demographic features to consider include the gender or ethnic composition of the sample, the mean socioeconomic status or age of the sample, or any other potentially relevant descriptors (e.g., average IQ). Although you will not necessarily code all of these possible characteristics, either because you do not believe they are relevant or because the primary studies do not consistently report these features, I believe that most meta-analyses in the social sciences should at least consider coding some sample characteristics.

2.2. Measurement Characteristics

In many areas of social science, there exist multiple approaches to measure­ment and multiple specific measures of the variables that comprise your effect sizes. For this reason, you may want to code the measurement characteris­tics of either or both variables comprising your effect size. Potential aspects that can be coded include both the source of information (e.g., self-report; some other reporter such as a spouse, parent, or teacher; observations by the researcher) and specific features of the measurement process (e.g., covert ver­sus overt observations, timed versus untimed performance on a test). In areas where a small number of measurement instruments are widely used, you might also consider coding the specific measure used. In survey research, you might code whether the original version of an instrument, a shortened form, or a translated form of the scale was used. These suggestions repre­sent just a few of the possibilities you might consider. A thorough knowledge of the strengths and limitations about measurement processes and specific measures in your field will be extremely influential in guiding your decisions about the measurement characteristics you might decide to code.

2.3. Design Characteristics

You might also consider coding both broad and narrow characteristics of the designs of studies included in your meta-analysis. At the broad level, you might code, for example, whether studies used experimental group comparisons, quasi-experimental group comparisons, single-group pre-post comparisons, or regression discontinuity designs. At a narrower level, you could consider specific design features; for example, if you were conducting a meta-analysis of treatment studies, you might code various aspects of the control groups (e.g., no contact, attention only, treatment as usual, placebo).

2.4. Source Characteristics

Finally, in some instances coding characteristics of the research report itself may be valuable. As described in Chapter 11, you should always code whether or not the report is published (and potentially more nuanced codes such as publication quality) to evaluate evidence of publication bias. There may be instances when it is useful to code the year of publication (or year of presentation for conference presentations, year of defense for dissertations, etc.), which might serve as a proxy for the year the data were collected. Evaluation of this year as a moderator might illuminate historic trends in the effect sizes across time. It might also be useful to evaluate whether or not studies were funded, or perhaps the specific sources of funding, if you suspect that these factors could bias the results. A fourth set of source characteristics to consider are the potentially relevant characteristics of the researchers themselves (e.g., discipline, gender, ethnicity). Evaluating these in relation to effect sizes might indicate either the presence of uncoded methodological features (related to, e.g., disciplinary styles) or systematic differences in results potentially caused by biases of the researchers (e.g., different magnitudes of gender differences found by male versus female researchers).

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

Coding Study “Quality” in Meta-Analysis

Some have recommended that meta-analysts code for study quality and then either (1) include only studies meeting a certain level of quality or (2) evaluate quality as a moderator of effect sizes. This recommendation is problematic, in my view, because it assumes (1) that “quality” is a unidimensional construct and (2) that we are always interested in whether this overarching construct of “quality” directly relates to effect sizes. I believe that each of these assumptions is inaccurate, as I describe next.

1. The Multidimensional Nature of Study Quality

Study quality can be defined in many ways (see Valentine, 2009; for an example scoring instrument see Valentine & Cooper, 2008). At a broad level, you can consider quality in terms of study validity, specifically internal, external, construct, and statistical conclusion validity (Cook & Campbell, 1979; Shadish et al., 2002). Of course, within each of these four broad levels, there exist numerous specific aspects of studies contributing to the validity (and hence, quality). For example, internal validity is impacted by whether or not random assignment was used, the comparability of groups in quasi-experimental designs both initially and throughout the study, and the rates of attrition (to name just a few influences). Even more specifically, many fields of research have adopted, whether explicitly or implicitly, certain empirical practices that many researchers agree contribute to higher or lower quality of studying the phenomenon of interest (for a summary of three explicit sets of criteria, see Valentine, 2009).

If you believe that these numerous features of studies reflect an underlying dimension of study quality, then you would expect these various features to co-occur across studies. For example, studies that have certain features reflecting internal validity should have other features reflecting internal validity, and studies with high internal validity should also have high external, construct, and statistical conclusion validity. Whether or not these co-occurrences exist in the particular collection of studies in your meta-analysis is both a conceptual and an empirical question. Conceptually, do you expect that the various features are reflections of a unidimensional quality construct in this area of study? Empirically, do you find substantial and consistent positive correlations among these features across the studies in your meta-analysis? If both of these conditions hold, then it may be reasonable to conceptualize an underlying (latent) construct of study quality (see left side of Figure 4.1). However, my impression is that in most fields, the conceptual argument is doubtful and the empirical evaluation is not made.

2. Usefulness of Moderation by Study Quality versus Specific Features

If you cannot provide conceptual and empirical support for an underlying “quality” construct leading to manifestations of different aspects of quality across studies, I believe that it becomes more difficult to describe some phenomenon of “quality.” Nevertheless, even if such a construct that produces consistent variation in features across studies does not exist, the collection of these features might still define something meaningful that could be termed “quality.” This situation is displayed on the right side of Figure 4.1. The difference between these two situations, one in which the features of the studies reflecting quality can be argued, conceptually and empirically, to co-occur (left) versus one in which the concept of quality is simply defined by these features (right), parallels the distinction between reflective versus formative indicators (see Figure 4.1 and, e.g., Edwards & Bagozzi, 2000; Howell, Breivik, & Wilcox, 2007; MacCallum & Browne, 1993). However, this approach would also suffer from the same problems as formative measurement (e.g., Howell et al., 2007), including difficulties in defining the construct if some of the formative indicators differentially relate to presumed outcomes. In terms of meta-analytic moderator analyses (see Chapter 9), the problem arises when some study features predict effect sizes with a different magnitude, or even direction, than others. This situation will lead to a conceptual change in the definition of the “study quality” construct; more importantly, this situation obscures your ability to detect which specific features of the studies are related to variation in effect sizes across studies. I argue that it is typically more useful to understand the specific aspects of study quality that relate to the effect sizes found, rather than some broader, ill-defined construct of “study quality.”

3. Recommendations for Coding Study Quality

In sum, I have argued that (1) the conditions (conceptual unidimensionality and empirically observed substantial correlations) in which study features might be used as reflective indicators of a “study quality” construct are rare, and (2) attempting simply to combine the conceptually multidimensional and empirically uncorrelated (or modestly correlated) features as formative indicators of a “study quality” construct is problematic. This does not mean that I suggest not considering study quality. Instead, I suggest coding the various aspects of study quality that are potentially important within your field and evaluating these as multiple moderators of the effect sizes among your studies.

My recommendation to code, and later analyze (see Chapter 9), individual aspects of study quality means that you must make decisions about what aspects of study quality are important enough to code. These decisions can be guided by the same principles that guided your decision to code other potential moderators (see Section 4.1). Based on your knowledge of the area in which you are performing a meta-analysis, you should consider the research questions you are interested in and generally consider coding at least some of the aspects of study quality contributing to internal, external, and construct validity. My decision to organize coding around these aspects of validity follows Valentine (2009), and these possibilities are summarized at the bottom of Table 4.1.

3.1. Internal Validity

Internal validity refers to the extent to which the study design allows conclusions of causality from observed associations (e.g., association between group membership and outcome). The most important influence on internal validity is likely the study design, with experimental (i.e., random assignment) studies providing more internal validity than quasi-experimental studies (such as matched naturally occurring groups, regression discontinuity, and single-subject designs; see Shadish et al., 2002). However, other study characteristics might also impact internal validity. The degree to which condition is concealed from participants (also known as the “blinding” of participants to condition) impacts internal validity. For example, studies comparing a group receiving treatment (e.g., medication, psychotherapy) to a control group (e.g., placebo, treatment as usual) can have questionable internal validity if participants know which group they are in. Similarly, studies that are “double blind,” in that the researcher measuring the presumed outcome is unaware of participants’ group membership, are considered more internally valid in some areas of research. Finally, attrition, specifically selective and differential attrition between groups, can impact internal validity, especially in studies not using appropriate imputation techniques (see Schafer & Graham, 2002).

3.2. External Validity

External validity refers to the extent to which the findings from a particular study can be generalized to different types of samples, conditions, or different ways of measuring the constructs of interest. However, attention to external validity focuses primarily on generalization to other types of samples/participants. The most externally valid studies will randomly sample participants from a defined population (e.g., all registered voters in a region, all schoolchildren in the United States). In many, if not most, fields, this sort of random sampling is rare. So, it is important for you to determine what you consider a reasonably broad level of generalization in the research context of your meta-analysis, and code whether studies achieve this or focus on a more limited subpopulation (and likely the specific types of subpopulations). Fortunately, meta-analytic aggregation of individual studies with limited external validity can lead to conclusions that have greater external validity (see Chapter 2), provided that the studies within the meta-analysis collectively cover a wide range of relevant sample characteristics (see Section 4.1.2.a above on coding these characteristics).

3.3. Construct Validity

Construct validity refers to the degree to which the measures used in a study correspond to the theoretical construct the researchers intend to measure. The heading of “construct validity” is often used to refer to a wider range of measurement properties, including both the reliability and validity of the measure. I suggest coding the reliability of the measures comprising the two variables for potential effect size corrections (see Chapter 6). I do not support decisions to exclude studies with measures below a certain threshold reliabil­ity given that any choice of the threshold is arbitrary and because reliability scores reported in a study are imperfect parameter estimates (e.g., it is very plausible that two studies with identical sampling procedures from the same population, same methodology, and using the same measures could obtain slightly different estimates of internal consistency, perhaps with one at 0.78 and the other at 0.82 around an arbitrary 0.80 cutoff). It is more difficult to make such clear recommendations regarding the validity of the measures. Certainly, you should have a clear operational definition of the constructs of interest that can guide decisions about whether a study should or should not be included in your meta-analysis (see Chapter 3). Beyond this, it is possible to correct for imperfect validity (see Chapter 6), at least in situations where you have a good estimate of the correlation (i.e., validity coefficient) between the measure used in the study and some “gold standard” for the construct. Probably the safest approach is to treat this issue as an empirical question, and code relevant measurement characteristics (see Section 4.1.2.b) for use as potential moderators of study effect sizes.

3.4. Conclusions Regarding Coding Aspects of Quality

This consideration of study “quality” in terms of aspects of validity highlights the range of potential characteristics you can code for your meta-analysis. Given this range, I have treated the issue of coding study “quality” similarly to that of coding study characteristics (see Section 4.1), and do not see these aspects of “quality” holding a greater value than any other study characteristics.

At the same time, you should be aware of the “garbage in, garbage out” criticism (see Chapter 2), and consider this critique in light of the goals and intended audience of your meta-analysis. If your goal is to inform policy or practice, and the intended audience consists primarily of individuals who want clear, defensible answers (e.g., policymakers), then I suggest that you use aspects of study quality primarily as inclusion/exclusion criteria (see Chapter 3) in selecting studies (assuming that enough studies meet these inclusion criteria so as to provide a reasonably precise effect size estimate). In contrast, if your goal is to inform understanding of a phenomenon, and the intended audience is primarily individuals comfortable with nuanced, quali­fied conclusions (e.g., scientists), then I suggest coding these aspects of study quality for analysis as potential moderators of effect sizes (see Chapter 9). Perhaps a middle ground between these two recommendations is to code and evaluate moderation by various aspects of study quality, but to base policy or practice recommendations on results from higher-quality studies when these aspects are found to moderate effect sizes. Regardless of how these aspects of study quality are used (i.e., as inclusion/exclusion criteria versus coded mod­erators), I believe that a focus on specific aspects of study qualities is prefer­able to a single code intended to represent an overall “quality” construct.

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

Evaluating Coding Decisions in Meta-Analysis

Once you have decided what study characteristics to code, the next step, of course, is to do it—to carefully read obtained reports and to record informa­tion about the studies. The information recorded is that regarding both the study characteristics you have decided to code (see previous two sections) and the effect sizes. I defer discussion of computing effect sizes until Chapter 5, but the same principles of evaluating coding decisions of effect sizes apply as for coding study characteristics that I describe in this section.

Two important qualities of your coding system are the related concepts of transparency and replicability (Wilson, 2009). In addition to these quali­ties, it is also important to consider the reliability of your coding.

1. Transparency and Replicability of Coding

When writing or otherwise presenting your meta-analysis, you should pro­vide enough details of the coding process that your audience knows exactly how you made coding decisions (transparency) and, at least in principle, could reach the same decisions as you did if they were to apply your coding strategy to studies included in your meta-analysis (replicability). To achieve these principles of transparency and replicability, it is important to describe fully how each study characteristic is quantified.

Some study characteristics are coded in a straightforward way; the characteristics that require little or no judgment on the part of the coder are sometimes termed “low inference codes” (e.g., Cooper, 2009b, p. 33). For example, coding the mean age of the sample will usually involve simply recording information stated in the research reports. You should fully describe even such simple coding (e.g., that you recorded age in years). In my experience, however, even such seemingly simple study characteristics yield complexities. For example, a study might report an age range (from which you might record the midpoint) or a proxy such as grade in school (from which you might estimate a likely age). Ideally, your original coding plan will have ways of determining a reasonable value from such information, or you might have to make these decisions as unexpected situations arise. In either case, it is important to report these rules to ensure the transparency of your coding process.

When study characteristics are less obvious (i.e., high inference coding, in which the coder must make judgment decisions), it is critical to fully report the coding process to ensure transparency and replicability (and this process should already be written to ensure reliability of coding; discussed next in Section 4.3.2). For example, if you have decided to code for types of measures or designs of the studies, you should report the different values for this categorical code and define each of the categories. During the planning stages, you should consider whether it is possible to reduce a high inference code into a series of more specific low inference codes.

2. Reliability of Coding

One way to evaluate empirically the replicability of your coding system is to assess the reliability of independent efforts of coding the same studies. You can evaluate this reliability either between coders (intercoder reliability) or within the same coder (intracoder reliability; Wilson, 2009).

Intercoder reliability is assessed by having two coders from the coding team independently code a subset of overlapping studies. The coders should be unaware of which studies each other is coding because an awareness of this fact is likely to increase the vigilance of coding and therefore provide an overestimate of the actual reliability. The number of doubly coded studies should be large enough to ensure a reasonably precise estimate of reliability. Lipsey and Wilson (2001, p. 86) recommended 20 to 50 studies, and your decision to choose a number within this range might depend on your per­ception of the level of inference of the coding. If your protocol calls for low inference coding, then a lower number should suffice in confirming inter­coder agreement, whereas higher inference coding would necessitate a higher number of overlapping studies to more precisely quantify this agreement.

Intracoder agreement is assessed by having the same person code a subset of studies twice. Because it is likely that the coder will be aware of previously coding the study, it is not possible to conceal the studies used to assess this reliability. However, if the coder is unaware during the first coding trials of which studies they will recode (e.g., a random sample of studies is selected for recoding after the initial coding is completed), the overestimation of reliability is likely reduced. One reason for assessing intracoder agreement is to evaluate potential “drift,” that is, changes in the coding process over time that could come about from the coding experience, increasing biases from “expecting” certain results, and/or fatigue. A second reason is that it may not be possible to assess intercoder agreement. This inability to assess intercoder agreement is certainly a realistic possibility if you are conducting the meta-analysis alone and you have no colleagues with sufficient expertise or time to code a subset of studies. Intracoder agreement is not a perfect substitute for intercoder agreement because one coder might hold potential biases or consistently make the same coding errors during both coding sessions. However, it can serve as reasonable evidence of reliability if efforts are made to ensure the independence of the coding sessions. For example, the coder should work with unmarked copies of the studies (not with copies containing notes from the previous coding), and the coding sessions should be separated by as much time as practical.
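Drawing the recoding subset at random only after the first pass is finished can be sketched as follows. The function name `select_for_recoding`, the study identifiers, and the subset size of 20 are illustrative assumptions, not part of the source.

```python
import random


def select_for_recoding(study_ids, k, seed=None):
    """Pick k distinct studies at random for a second, independent coding pass.

    Call this only after the initial coding is complete, so the coder cannot
    know during the first pass which studies will be recoded.
    """
    rng = random.Random(seed)  # a seed makes the draw reproducible for reporting
    return sorted(rng.sample(list(study_ids), k))


# Hypothetical identifiers for 60 coded studies.
all_ids = [f"study_{i:03d}" for i in range(1, 61)]
recode_ids = select_for_recoding(all_ids, k=20, seed=2024)
print(recode_ids[:5])
```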

Using either an intercoder or intracoder approach, it is useful to report the reliability of each coded study characteristic (i.e., each study characteristic, artifact correction, and effect size), just as you would report the reliability for each variable in a primary study. It is deceptive to report only a single reliability across codes, as this might obscure important differences in coding reliability across study characteristics (Yeaton & Wortman, 1993). Several indices are available for quantifying this reliability (see Orwin & Vevea, 2009), three of which are the agreement rate, Cohen’s kappa, and the Pearson correlation.

2.1. Agreement Rate

The most common index is the agreement rate (AR), which is simply the proportion of studies on which two coders (or single coder on two occasions) assign the same categorical code (Equation 4.1; from Orwin & Vevea, 2009, p. 187):
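Following the verbal definition just given, the agreement rate can be written as below. The wording of the numerator and denominator is a reconstruction from that definition rather than a quotation of the original equation.

$$
\mathrm{AR} = \frac{\text{number of studies assigned the same code}}{\text{number of studies coded twice}}
$$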

The agreement rate is simple and intuitive, and for these reasons is the most commonly reported index of coding reliability. At the same time, there are limitations to this index; namely, it does not account for base rates of coding (i.e., some values are coded more often than others) and the resulting chance levels of agreement.

2.2. Cohen’s Kappa

An alternative index for reliability of categorical codes is Cohen’s kappa (κ), which does account for chance level agreement depending on base rates for coding. Kappa is estimated using Equation 4.2 (from Orwin & Vevea, 2009, pp. 187-188):
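In its standard form, kappa compares the observed proportion of agreement with the proportion expected by chance. The symbols below ($P_O$, $P_E$, $p_{1c}$, $p_{2c}$) are notation assumed for this sketch and may differ from the cited source.

$$
\kappa = \frac{P_O - P_E}{1 - P_E}, \qquad P_E = \sum_{c} p_{1c}\, p_{2c}
$$

Here $P_O$ is the observed proportion of agreement (the agreement rate), and $P_E$ is the agreement expected by chance, obtained by multiplying, for each category $c$, the proportion of studies that the first coder (or occasion) assigned to $c$ by the proportion that the second assigned to $c$, and summing over categories.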

Cohen’s kappa is a very useful index of coding reliability for study characteristics with nominal categorical levels. When used with ordinal coding, it has the limitation of not distinguishing between “close” and “far” disagreements (e.g., a close disagreement might be two coders recording 4 and 5 on a 5-point ordinal scale, whereas a far disagreement might be two coders recording 1 and 5 on this scale). However, ordinal coding can be accommodated by using a weighted kappa index (see Orwin & Vevea, 2009, p. 188). The major limitation of using kappa is that it requires a fairly large number of studies to produce precise estimates of reliability. Although I cannot provide concrete guidelines as to how many studies is “enough” (because this also depends on the distributions of the base rate), I suggest that you use either the upper end of Lipsey and Wilson’s recommendations (i.e., about 40 to 50 studies) or all studies in your meta-analysis if it is important to obtain a precise estimate of coding reliability in your meta-analysis.
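To see how base rates affect the two categorical indices, the following minimal sketch computes the agreement rate and the standard (unweighted) Cohen's kappa for one coded characteristic. The function names and the ten "published"/"unpublished" codes are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from collections import Counter


def agreement_rate(codes_1, codes_2):
    """Proportion of studies on which the two sets of codes match."""
    if len(codes_1) != len(codes_2) or not codes_1:
        raise ValueError("Both coders must code the same non-empty set of studies.")
    matches = sum(a == b for a, b in zip(codes_1, codes_2))
    return matches / len(codes_1)


def cohens_kappa(codes_1, codes_2):
    """Unweighted Cohen's kappa: agreement corrected for chance agreement."""
    n = len(codes_1)
    p_observed = agreement_rate(codes_1, codes_2)
    counts_1, counts_2 = Counter(codes_1), Counter(codes_2)
    # Chance agreement: product of the two coders' base rates, summed over categories.
    p_expected = sum(counts_1[c] * counts_2[c] for c in counts_1) / n**2
    if p_expected == 1:
        raise ValueError("Kappa is undefined when both coders use a single category.")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical codes for ten doubly coded studies (illustration only).
coder_1 = ["pub"] * 8 + ["unpub"] * 2
coder_2 = ["pub"] * 7 + ["unpub", "unpub", "pub"]

ar = agreement_rate(coder_1, coder_2)
kappa = cohens_kappa(coder_1, coder_2)
print(f"AR = {ar:.3f}, kappa = {kappa:.3f}")
```

With these codes the coders agree on 8 of 10 studies, yet because both assign "pub" to 80% of studies, chance agreement is already 0.68 and kappa is considerably lower than the agreement rate.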

2.3. Pearson Correlation

When study characteristics are coded continuously or on an ordinal scale with numerous categories, a useful index of reliability is the Pearson correlation (r) between the two sets of coded values. One caveat is that the correlation coefficient does not evaluate potential mean differences between coders or coding occasions. For example, two coders might exhibit a perfect correlation between their recorded scores of mean ages of samples, but one coded the values in years and the other in months. This discrepancy would obviously be problematic in using the coded study characteristic “mean age of sample” for either descriptive purposes or moderator analyses. Given this limitation of the correlation coefficient, I suggest also examining difference scores, or equivalently, performing a repeated-measures (a.k.a. paired samples) t-test to ensure that such discrepancies have not emerged.
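The years-versus-months problem can be illustrated with a short sketch that computes the Pearson correlation alongside the mean of the paired differences. The five mean ages are hypothetical values, and the helper names are assumptions for this example.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences of codes."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_difference(x, y):
    """Average of the paired differences (second coder minus first)."""
    return mean(b - a for a, b in zip(x, y))


# Hypothetical mean ages for five studies: coder 1 in years, coder 2 in months.
coder_1_years = [8.5, 10.0, 12.5, 14.0, 16.5]
coder_2_months = [age * 12 for age in coder_1_years]

r = pearson_r(coder_1_years, coder_2_months)
diff = mean_difference(coder_1_years, coder_2_months)
print(f"r = {r:.3f}, mean difference = {diff:.1f}")
```

The correlation is perfect even though the two sets of codes are on different scales; only the difference scores reveal the discrepancy.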

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

Practical Matters: Creating an Organized Protocol for Coding in Meta-Analysis

Once you have decided what study characteristics to code, the next step is to plan to code them (likely coding effect sizes at the same time, as described in Chapter 5). The guidance for this coding comes from a coding protocol, which consists of both the interface that coders use to record information from the studies and a coding manual providing instructions for this coding process (see Wilson, 2009). Through using this coding protocol, your goal is to create a usable database for later meta-analyses. There are several considerations for each aspect, which I describe next.

1. Coding Interface

Considering first the interface coders use to record information, three options include using paper forms that coders complete, using a computer­ized form to collect this information, or coding directly into the electronic format to be used for analyses. Part of an example paper coding form (from a meta-analysis of the association between relational aggression and peer rejection described throughout this book; Card et al., 2008) is shown in Figure 4.2.

Using paper forms would require coders to write information into pre­defined questions (e.g., “Sample age in years:___ ”), which would then be transferred into an electronic database for analyses. The advantages of this approach are (1) that coders need training only in the coding process (guided by the manual of instructions described in Section 4.4.2) rather than proce­dures for entering data into a computer, and (2) the information is checked for plausibility when entered into the computer.

A computerized form would present the same information to coders but would require them to input the coded data electronically, perhaps using a relational database program (e.g., Microsoft Access). This type of interface would require only a small amount of training beyond using paper forms and would reduce the time (and potentially errors) in transferring informa­tion from paper to the electronic format. However, this advantage is also a disadvantage in that it bypasses the check that would occur during this entry from paper forms.

A third option with regard to a coding interface is to code information directly into an electronic format (e.g., Microsoft Excel, SAS, SPSS) later used for analysis. This option is perhaps the most time-efficient of all in reducing the number of steps, but it is also the most prone to errors. I strongly discour­age this third method if multiple coders will be coding studies.

2. Coding Manual

A coding manual is a detailed collection of instructions describing how information reported in research reports is quantified for inclusion in your meta-analysis. Creating a detailed coding manual serves three primary purposes. First, this coding manual provides a guide for coders to transfer information in the study reports to the coding interface (e.g., paper forms). As such, it should be a clear set of instructions for coding both “typical” studies and more challenging coding situations. Second (and relatedly), this coding manual aims to ensure consistency across multiple coders by providing a clear, concrete set of instructions that each coder should study and have at hand during the coding process. Third, this coding manual serves as documentation of the coding process that should guide the presentation of the meta-analysis or be made available to others to ensure transparency of the coding (see beginning of this section).

With regard to the coding manual, the amount of instruction for each study characteristic coded depends on the level of inference of the coding: low-inference coding requires relatively little instruction, whereas high-inference coding requires more instruction. In addition, the coding manual is most often a work in progress. Although an initial coding manual should be developed before beginning the coding, ambiguities discovered during the coding process likely will force ongoing revision.

Turning again to the example coding form of Figure 4.2, we should note that this form would be accompanied by a detailed coding manual that all coders have been trained in and have present while completing this form. To provide illustrations of the type of information that might be included in such a manual, we can consider two of the coded study characteristics. First, item 5 (mean age) might be accompanied by the rather simple instruction “Record the mean age of the sample in years.” However, even this relatively simple (low-inference) code requires fairly extensive elaboration: “If study analyzed a subset of the data, record the mean age of the subset used in analyses. If study reported a range of ages but not the mean, record the midpoint of this range.” My colleagues and I also had to change the coding protocol rather substantially when we found that many studies failed to report ages, but did report the grades in school of participants. This led us to add the “grade” code (item 6) along with instructions for entering this information in the database: “If sample age is not reported in the study, then an estimated age can be entered from grade using the formula Age = Grade + 5.” A second study characteristic shown in Figure 4.2 that illustrates typical instruction is item 10 (aggression—source of information). The coding manual for this item specifies the choices that should be coded (self-report, peer nomination, peer rating, teacher report, parent report, researcher observations, or other) and definitions of each code.
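The age rules quoted above translate naturally into a small fallback function. The sketch below assumes a priority order of reported mean, then midpoint of a reported range, then the grade-based estimate; the function name and argument names are hypothetical.

```python
def coded_mean_age(mean_age=None, age_range=None, grade=None):
    """Return the sample's mean age in years following the coding manual's rules.

    Assumed priority: reported mean > midpoint of reported range > Grade + 5.
    Returns None when the study reports none of these.
    """
    if mean_age is not None:
        return mean_age
    if age_range is not None:
        low, high = age_range
        return (low + high) / 2  # midpoint of the reported range
    if grade is not None:
        return grade + 5  # the manual's rule: Age = Grade + 5
    return None


print(coded_mean_age(mean_age=13.2))       # reported mean is used as is
print(coded_mean_age(age_range=(10, 12)))  # midpoint of the range
print(coded_mean_age(grade=4))             # estimated from grade
```

Writing such rules down as explicit logic, whether in code or in the manual, is one way to keep the handling of irregular reports consistent across coders.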

3. Database for Meta-Analysis

The product of your coding should be an electronic file with which to conduct your meta-analysis. Table 4.2 provides an example of what this database might look like (if complete, the table would extend far to the right to include other coded study characteristics, coded effect sizes [Chapter 5], information for any artifact corrections [Chapter 6], and several calculations for the actual meta-analysis [Chapters 8-10]). Although the exact variables (columns) you include will depend on the study characteristics you decide to code, the general layout of this file should be considered. Here, each row represents a single coded study, and each column represents a coded study characteristic.
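The row-per-study, column-per-characteristic layout can be sketched with Python's standard `csv` module. The column names and the three rows below are hypothetical placeholders, not the contents of Table 4.2.

```python
import csv
import io

# Hypothetical columns; a real database would extend far to the right.
COLUMNS = ["study_id", "sample_size", "mean_age", "aggression_source", "published"]

# One dictionary per coded study (placeholder values for illustration).
rows = [
    {"study_id": "Study 1", "sample_size": 120, "mean_age": 9.5,
     "aggression_source": "peer nomination", "published": 1},
    {"study_id": "Study 2", "sample_size": 85, "mean_age": 12.0,
     "aggression_source": "teacher report", "published": 0},
    {"study_id": "Study 3", "sample_size": 310, "mean_age": "",
     "aggression_source": "self-report", "published": 1},
]


def to_csv(rows, columns):
    """Serialize coded studies as CSV text: one row per study, one column per code."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(rows, COLUMNS)
print(csv_text)
```

An empty cell (as for the mean age of "Study 3") marks a characteristic the study did not report.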

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.

The Common Metrics in Meta-Analysis: Correlation, Standardized Mean Difference, and Odds Ratio

1. Significance Tests Are Not Effect Sizes

Before describing what effect sizes are, I describe what they are not. Effect sizes are not significance tests, and significance tests are not effect sizes. Although you can usually derive effect sizes from the results of significance tests, and the magnitude of the effect size influences the likelihood of find­ing statistically significant results (i.e., statistical power), it is important to distinguish between indices of effect size and statistical significance.

Imagine that a researcher, Dr. A, wishes to investigate whether two groups (male versus female, two treatment groups, etc.) differ on a particular variable X. So she collects data from five individuals in each group (N = 10). She finds that Group 1 members have scores of 4, 4, 3, 2, and 2, for a mean of 3.0 and (population estimated) standard deviation of 1.0, whereas Group 2 members have scores of 6, 6, 5, 4, and 4, for a mean of 5.0 and standard deviation of 1.0. Dr. A performs a t-test and finds that t(8) = 3.16, p = .013. Finding that Group 2 was significantly higher than Group 1 (according to traditional criteria of α = .05), she publishes the results.

Further imagine that Dr. B reads this report and is skeptical of the results. He decides to replicate this study, but collects data from only three individuals in each group (N = 6). He finds that individuals in Group 1 had scores of 4, 3, and 2, for a mean of 3.0 and standard deviation of 1.0, whereas Group 2 members had scores of 6, 5, and 4, for a mean of 5.0 and standard deviation of 1.0. His comparison of these groups results in t(4) = 2.45, p = .071. Dr. B concludes that the two groups do not differ significantly and therefore that the findings of Dr. A have failed to replicate.

Now we have a controversy on our hands. Graduate student C decides that she will resolve this controversy through a definitive study involving 10 individuals in each group (N = 20). She finds that individuals in Group 1 had scores of 4, 4, 4, 4, 3, 3, 2, 2, 2, and 2, for a mean of 3.0 and standard deviation of 1.0, whereas individuals in Group 2 had scores of 6, 6, 6, 6, 5, 5, 4, 4, 4, and 4, for a mean of 5.0 and a standard deviation of 1.0. Her inferential test is highly significant, t(18) = 4.74, p = .00016. She concludes that not only do the groups differ, but also the difference is more pronounced than previously thought!

This example illustrates the limits of relying on the Null Hypothesis Significance Testing Framework in comparing results across studies. In each of the three hypothetical studies, individuals in Group 1 had a mean score of 3.0 and a standard deviation of 1.0, whereas individuals in Group 2 had a mean score of 5.0 and a standard deviation of 1.0. The hypothetical researchers’ focus on significance tests led them to inappropriate conclusions: Dr. B’s conclusion of a failure to replicate is inaccurate (because it does not consider the inadequacy of statistical power in the study), as is Student C’s conclusion of a more pronounced difference (which mistakenly interprets a low p value as informing the magnitude of an effect). A focus on effect sizes would have alleviated the confusion that arose from a reliance only on statistical significance and, in fact, would have shown that these three studies provided perfectly replicating results. Moreover, if the researchers had considered effect sizes, they could have moved beyond the question of whether the two groups differ to consider also the question of how much the two groups differ. These limitations of relying exclusively on significance tests have been the subject of much discussion (see, e.g., Cohen, 1994; Fan, 2001; Frick, 1996; Harlow, Mulaik, & Steiger, 1997; Meehl, 1978; Wilkinson & the Task Force on Statistical Inference, 1999), yet this practice unfortunately persists.
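To make the point concrete, the following sketch recomputes the three hypothetical t-tests from the raw scores listed above. The helper name `independent_t` is my own; it assumes the usual pooled-variance independent-samples t-test and uses only the Python standard library.

```python
import math
from statistics import mean, variance


def independent_t(group1, group2):
    """Pooled-variance independent-samples t statistic and its degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    # variance() uses the n - 1 (population-estimate) denominator.
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / se, df


studies = {
    "Dr. A": ([4, 4, 3, 2, 2], [6, 6, 5, 4, 4]),
    "Dr. B": ([4, 3, 2], [6, 5, 4]),
    "Student C": ([4, 4, 4, 4, 3, 3, 2, 2, 2, 2], [6, 6, 6, 6, 5, 5, 4, 4, 4, 4]),
}

for name, (g1, g2) in studies.items():
    t, df = independent_t(g1, g2)
    print(f"{name}: mean difference = {mean(g1) - mean(g2):.1f}, t({df}) = {t:.2f}")
```

The t values come out negative here only because Group 1 is subtracted from Group 2's larger mean; their magnitudes grow with sample size even though the raw mean difference is the same in every study.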

For the purposes of most meta-analyses, I find it useful to define an effect size as an index of the direction and magnitude of association between two variables. As I describe in more detail later in this chapter, this definition includes traditional measures of correlation between two variables, differences between two groups, and contingencies between two dichotomies. When conducting a meta-analysis, it is critical that effect sizes be comparable across studies. In other words, a useful effect size for meta-analysis is one to which results from various studies can be transformed and therefore combined and compared. In this chapter I describe ways that you can compute the correlation (r), standardized mean difference (g), or odds ratio (o) from a variety of information commonly reported in primary studies; this is another reason that these indexes are useful in summarizing or comparing findings across studies.

A second criterion for an effect size index to be useful in meta-analysis is that it must be possible to compute or approximate its standard error. Although I describe this importance more fully in Chapter 8, I should say a few words about it here. Standard errors describe the imprecision of a sample-based estimate of a population effect size; formally, the standard error represents the typical magnitude of differences of sample effect sizes around a population effect size if you were to draw multiple samples (of a certain size N) from the population. It is important to be able to compute standard errors of effect sizes because you generally want to give more weight to studies that provide precise estimates of effect sizes (i.e., have small standard errors) than to studies that provide less precise estimates (i.e., have large standard errors). Chapter 8 of this book provides further description of this idea.

Having made clear the difference between statistical significance and effect size, I next describe three indices of effect size that are commonly used in meta-analyses.

2. Pearson Correlation

The Pearson correlation, commonly represented as r, represents the association between two continuous variables (with variants existing for other forms, such as rpb when one variable is dichotomous and the other is continuous, φ when both are dichotomous). The formula for computing r (the sample estimate of the population correlation, ρ) within a primary data set is:
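A standard way to write this formula, consistent with the description that follows, is given here. The label 5.1 is taken from the later cross-reference, and the use of N − 1 with sample-based standard deviations is an assumed convention.

```latex
r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{(N - 1)\, s_x s_y}
  = \frac{\sum_{i=1}^{N} Z_{x_i} Z_{y_i}}{N - 1}
\tag{5.1}
```

Here s_x and s_y are standard deviations computed with N − 1 in the denominator, and the Z scores are standardized with those same values. If the Z scores are instead standardized with N-denominator standard deviations, the rightmost divisor becomes N, which is the literal "average cross product."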

The conceptual meaning of positive correlations is that individuals who score high on X (relative to the sample mean on X) also tend to score high on Y (relative to the sample mean on Y), whereas individuals who score low on X also tend to score low on Y. The conceptual meaning of negative correlations is that individuals who score high on one variable tend to score low on the other variable. The rightmost portion of Equation 5.1 provides an alternative representation that illustrates this conceptual meaning. Here, Z scores (standardized scores) represent high values as positive and low values as negative, so a preponderance of high scores with high scores (product of two positive) and low scores with low scores (product of two negative) yields a positive average cross product, whereas high scores with low scores (product of positive and negative) yield a negative average cross product.

You are probably already familiar with the correlation coefficient, but perhaps are not aware that it is an index of effect size. One interpretation of the correlation coefficient is in describing the proportion of variance shared between two variables with r². For instance, a correlation between two variables of r = .50 implies that 25% (i.e., .50²) of the variance in these two variables overlaps. It can also be kept in mind that the correlation is standardized, such that correlations can range from 0 to ± 1. Given the widespread use of the correlation coefficient, many researchers have an intuitive interpretation of the magnitude of correlations that constitute small or large effect sizes. To aid this interpretation, you can consider Cohen’s (1969) suggestions of r = ± .10 representing small effect sizes, r = ± .30 representing medium effect sizes, and r = ± .50 representing large effect sizes. However, you should bear in mind that the typical magnitudes of correlations found likely differ across areas of study, and one should not be dogmatic in applying these guidelines to all research domains.

In conclusion, Pearson’s r represents a useful, readily interpretable index of effect size for associations between two continuous variables. In many meta-analyses, however, r is transformed before effect sizes are combined or compared across studies (for contrasting views see Hall & Brannick, 2002; Hunter & Schmidt, 2004; Schmidt, Hunter, & Raju, 1988). Fisher’s transformation of r, denoted as Zr, is commonly used and shown in Equation 5.2.

The reason that r is often transformed to Zr in meta-analyses is because the distribution of sample r’s around a given population ρ is skewed (except in sample sizes larger than those commonly seen in the social sciences), whereas a sample of Zrs around a population Zr is symmetric (for further details see Hedges & Olkin, 1985, pp. 226-228; Schulze, 2004, pp. 21-28). This symmetry is desirable when combining and comparing effect sizes across studies. However, Zr is less readily interpretable than r both because it is not bounded (i.e., it can have values greater than ±1.0) and simply because it is unfamiliar to many readers. Typical practice is to compute r for each study, convert these to Zr for meta-analytic combination and comparison, and then convert results of the meta-analysis (e.g., mean effect size) back to r for reporting. Equation 5.3 converts Zr back to r.

Although I defer further description until Chapter 8, I provide the equation for the standard error here, as you should enter these into your meta-analytic database during the coding process. The standard error of Zr is shown in Equation 5.4.

This equation reveals an obvious relation inherent to all standard errors: As sample size (N) increases, the denominator of Equation 5.4 increases and so the standard error decreases. A desirable feature of Zr is that its standard error depends only on sample size (as I describe later, standard errors of some effects also depend on the effect sizes themselves).
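The three steps just described (transform r to Zr, back-transform, and compute the standard error) can be sketched as follows. The formulas are the standard ones for Fisher's transformation, which I take to correspond to Equations 5.2 through 5.4; the function names and the illustrative values of r and N are my own.

```python
import math


def r_to_zr(r):
    """Fisher's transformation: Zr = 0.5 * ln((1 + r) / (1 - r))."""
    return 0.5 * math.log((1 + r) / (1 - r))


def zr_to_r(zr):
    """Back-transformation: r = (e^(2 Zr) - 1) / (e^(2 Zr) + 1)."""
    return (math.exp(2 * zr) - 1) / (math.exp(2 * zr) + 1)


def se_zr(n):
    """Standard error of Zr, which depends only on sample size: 1 / sqrt(N - 3)."""
    return 1 / math.sqrt(n - 3)


r, n = 0.50, 103  # illustrative values, not from any study
zr = r_to_zr(r)
print(round(zr, 4), round(zr_to_r(zr), 4), round(se_zr(n), 4))
```

Because the standard error involves only N, two studies with the same sample size receive the same precision regardless of the correlation each one found.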

3. Standardized Mean Difference

The family of indices of standardized mean difference represents the magnitude of difference between the means of two groups as a function of the groups’ standard deviations. Therefore, you can consider these effect sizes to index the association between a dichotomous group variable and a continuous variable.

There are three commonly used indices of standardized mean difference (Grissom & Kim, 2005; Rosenthal, 1994). These are Hedges’s g, Cohen’s d, and Glass’s index (which I denote as gGlass), defined by Equations 5.5, 5.6, and 5.7, respectively:
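Written in a form consistent with the surrounding description, the three indices are as follows. The notation is assumed: M for group means, s for population-estimated standard deviations, sd for sample standard deviations, and group 1 as the single group used for Glass's index.

```latex
\begin{align}
g &= \frac{M_1 - M_2}{s_{\text{pooled}}} \tag{5.5} \\
d &= \frac{M_1 - M_2}{sd_{\text{pooled}}} \tag{5.6} \\
g_{\text{Glass}} &= \frac{M_1 - M_2}{s_1} \tag{5.7}
\end{align}
```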

These three equations are identical in consisting of a raw (unstandardized) difference of means as their numerators. The difference among them is in the standard deviations comprising the denominators (i.e., the standardization of the mean difference). The equation (5.5) for Hedges’s g uses the pooled estimates of the population standard deviations of each group, which is the familiar s = √[Σ(x – x̄)² / (n – 1)]. The equation (5.6) for Cohen’s d is similar but instead uses the pooled sample standard deviations, sd = √[Σ(x – x̄)² / n]. The sample standard deviation is a biased estimate of the population standard deviation, with the underestimation greater in smaller than larger samples. However, with even modestly large sample sizes, g and d will be virtually identical, so the two indices are not always distinguished. Often, people describe both indices as Cohen’s d, although it is preferable to use precise terminology in your own writing.

The third index of standardized mean difference is gGlass (sometimes denoted with Δ or g’), shown in Equation 5.7. Here the denominator consists of the (population estimate of the) standard deviation from one group. This index is often described in the context of therapy trials, in which the control group is said to provide a better index of standard deviation than the therapy group (for which the variability may have also changed in addition to the mean). One drawback to using gGlass in meta-analysis is that it is necessary for the primary studies to report these standard deviations for each group; you can use results of significance tests (e.g., t-test values) to compute g or d, but not gGlass. Reliance on only one group to estimate the standard deviation is also less precise if the standard deviations of the two populations are equal (i.e., homoscedastic; see Hedges & Olkin, 1985). For these reasons, meta-analysts less commonly rely on gGlass relative to g or d. On the other hand, if the population standard deviations of the groups being compared differ (i.e., heteroscedasticity), then g or d may not be meaningful indexes of effect size, and computing these indexes from inferential statistics reported (e.g., t-tests, F-ratios) can be inappropriate. In these cases, reliance on gGlass is likely more appropriate if adequate data are reported in most studies (i.e., means and standard deviations from both groups). If it is plausible that heteroscedasticity might exist, you may wish to compare the standard deviations (see Shaffer, 1992) of two groups among the studies that report these data and then base the decision to use gGlass versus g or d depending on whether or not (respectively) the groups have different variances.

Examining Equations 5.5-5.7 leads to two observations regarding the standardized mean difference as an effect size. First, these can be either positive or negative depending on whether the mean of Group 1 or 2 is higher. This is a desirable quality when your meta-analysis includes studies with potentially conflicting directions of effects. You need to be consistent in considering a particular group (e.g., treatment vs. control, males vs. females) as Group 1 versus 2 across studies. Second, these standardized mean differences can take on any values. In other words, they are not bounded by ±1.00 like the correlation coefficient r. Like r, a value of 0 implies no effect, but standardized mean differences can also have values greater than 1. For example, in the hypothetical example of three researchers given earlier in this chapter, all three researchers would have found g = (3 – 5)/1 = -2.00 if they considered effect sizes.
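The following sketch computes the three indices from raw scores, using Dr. A's data from the earlier example. The function names are my own. The pooling used for d (summed squared deviations divided by n1 + n2) and the choice of group 1 as the reference group for gGlass are assumptions made for illustration.

```python
import math
from statistics import mean


def sum_sq(scores):
    """Sum of squared deviations from the group mean."""
    m = mean(scores)
    return sum((x - m) ** 2 for x in scores)


def standardized_mean_differences(group1, group2):
    """Return Hedges's g, Cohen's d, and Glass's index for two groups of raw scores."""
    n1, n2 = len(group1), len(group2)
    diff = mean(group1) - mean(group2)
    ss = sum_sq(group1) + sum_sq(group2)
    s_pooled = math.sqrt(ss / (n1 + n2 - 2))            # population-estimated SDs, pooled
    sd_pooled = math.sqrt(ss / (n1 + n2))               # sample SDs, pooled (assumed form)
    s_group1 = math.sqrt(sum_sq(group1) / (n1 - 1))     # one group's SD (assumed: group 1)
    return {"g": diff / s_pooled, "d": diff / sd_pooled, "g_glass": diff / s_group1}


result = standardized_mean_differences([4, 4, 3, 2, 2], [6, 6, 5, 4, 4])
print({name: round(value, 2) for name, value in result.items()})
```

With only five scores per group, d is noticeably larger in magnitude than g, which illustrates why the two indices should be distinguished in small samples.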

Although not as commonly used in primary research as r, these standardized mean differences are intuitively interpretable effect sizes. Knowing that the two groups differ by one-tenth, one-half, one, or two standard deviations (i.e., g or d = 0.10, 0.50, 1.00, or 2.00) provides readily interpretable information about the magnitude of this group difference. As with r, Cohen (1969) provided suggestions for interpreting d (which can also be used with g or gGlass), with d = 0.20 considered a small effect, d = 0.50 considered a medium effect, and d = 0.80 considered a large effect. Again, it is important to avoid becoming fixated on these guidelines, as they do not apply to all research situations or domains. It is also interesting to note that transformations between r and d (described in Section 5.5) reveal that the guidelines for interpreting each do not directly correspond.

As I did for the standard error of Zr, I defer further discussion of weighting until Chapter 8. However, the formulas for the standard errors of the commonly used standardized mean difference, g, should be presented here:

I draw your attention to two aspects of this equation. First, you should use the first equation when sample sizes of the two groups are known; the second part of the equation is a simplified version that can be used if group sizes are unknown, but it is reasonable to assume approximately equal group sizes. Second, you will notice that the standard errors of estimates of the standardized mean differences are dependent on the effect size estimates themselves (i.e., the effect size appears in the numerators of these equations). In other words, there is greater expected sampling error when the magnitudes (positive or negative) of standardized mean differences are large than when they are small. As discussed later (Chapter 8), this means that primary studies finding larger effect sizes will be weighted relatively less (given the same sample size) than primary studies with smaller effect sizes when results are meta-analytically combined.
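A minimal sketch of both versions of this standard error follows. It assumes the widely used large-sample approximation SE = √[(n1 + n2)/(n1 n2) + g²/(2(n1 + n2))], with the simplified form √[4/N + g²/(2N)] for approximately equal groups; the function name and example values are hypothetical.

```python
import math


def se_g(g, n1=None, n2=None, total_n=None):
    """Approximate standard error of a standardized mean difference g."""
    if n1 is not None and n2 is not None:
        # Full version: group sizes are known.
        return math.sqrt((n1 + n2) / (n1 * n2) + g ** 2 / (2 * (n1 + n2)))
    if total_n is not None:
        # Simplified version: only total N is known; assumes roughly equal groups.
        return math.sqrt(4 / total_n + g ** 2 / (2 * total_n))
    raise ValueError("Provide either n1 and n2, or total_n.")


print(round(se_g(0.5, n1=50, n2=50), 4))   # group sizes known
print(round(se_g(0.5, total_n=100), 4))    # same value when groups are equal
print(round(se_g(2.0, total_n=100), 4))    # larger effect, larger standard error
```

The last two calls show the dependence described above: with N held constant, the study reporting the larger effect has the larger standard error and would therefore receive less weight.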

Before ending this discussion of standardized mean difference effect sizes, it is worth considering a correction that you should use when primary study sample sizes are small (e.g., less than 20). Hedges and Olkin (1985) have shown that g is a biased estimate of the population standardized mean differences, with the magnitude of overestimation becoming nontrivial with small sample sizes. Therefore, if your meta-analysis includes studies with small samples, you should apply the following correction of g for small sample size (Hedges & Olkin, 1985, p. 79; Lipsey & Wilson, 2001, p. 49):
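The correction can be sketched as below, assuming the commonly cited form g_adjusted = g – 3g / (4(n1 + n2) – 9). The function name is my own, and the example reuses Dr. A's hypothetical result (g = -2.00, five individuals per group).

```python
def small_sample_corrected_g(g, n1, n2):
    """Shrink g slightly toward zero to offset small-sample overestimation."""
    return g - (3 * g) / (4 * (n1 + n2) - 9)


print(round(small_sample_corrected_g(-2.0, 5, 5), 3))      # small study: visible shrinkage
print(round(small_sample_corrected_g(-2.0, 100, 100), 3))  # large study: nearly unchanged
```

The adjustment matters mostly for very small studies; as total sample size grows, the corrected and uncorrected values converge.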

4. Odds Ratio

The odds ratio, which I denote as o (OR is also commonly used), is a useful index of effect size of the contingency (i.e., association) between two dichotomous variables. Because many readers are likely less familiar with odds ratios than with correlations or standardized mean differences, I first describe why the odds ratio is advantageous as an index of association between two dichotomous variables.

To understand the odds ratio, you must first consider the definition of odds. The odds of an event is defined as the probability of the event (e.g., of scoring affirmative on a dichotomous measure) divided by the probability of the alternative (e.g., of scoring negative on the measure), which can be expressed as odds = p / (1 – p), where p equals the proportion in the sample (which is an unbiased estimate of population proportion, π) having the characteristic or experiencing the event. For example, if you conceptualized children’s aggression as a dichotomous variable of occurring versus not occurring, you could find the proportion of children who are aggressive (p) and estimate the odds of being aggressive by dividing by the proportion of children who are not aggressive (1 – p). Note that you can also compute odds for nominal dichotomies; for example, you could consider biological sex in terms of the odds of being male versus female, or vice versa.

The next challenge is to consider how you can compare probabilities or odds across two groups. This comparison actually indexes an association between two dichotomous variables. For instance, you may wish to know whether boys or girls are more likely to be aggressive, and our answer would indicate whether, and how strongly, sex and aggression are associated. Several ways of indexing this association have been proposed (see Fleiss, 1994). The simplest way would be to compute the difference between probabilities in two groups, p1 – p2 (where p1 and p2 represent proportions in each group; in this example, these values would be the proportions of boys and girls who are aggressive). An alternative might be to compute the rate ratio (sometimes called risk ratio), which is equal to the proportion experiencing the event (or otherwise having the characteristics) in one group divided by the proportion experiencing it in the other, RR = p1 / p2. A problem with both of these indices, however, is that they are highly dependent on the rate of the phenomenon in the study (for details, see Fleiss, 1994). Therefore, studies in which different base rates of the phenomenon are found (e.g., one study finds a high prevalence of children are aggressive, whereas a second finds that very few are aggressive) will yield vastly different differences and risk ratios, even given the same underlying association between the variables. For this reason, these indices are not desirable effect sizes for meta-analysis.

The phi (φ) coefficient is another index of association between two dichotomous variables. It is estimated via the following formula (where φ is the estimated population association):

Despite the lack of similarity in appearance between this equation and Equation 5.1, φ is identical to computing r between the two dichotomous variables, X and Y. In fact, if you are meta-analyzing a set of studies involving associations between two continuous variables in which a small number of studies artificially dichotomize the variables, it is appropriate to compute φ and interpret this as a correlation (you might also consider correcting for the attenuation of correlation due to artificial dichotomization; see Chapter 6).
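The equivalence between φ and r can be checked numerically. The sketch below assumes the standard formula φ = (n00 n11 – n01 n10) / √(n0• n1• n•0 n•1), where the first subscript indexes X and the second indexes Y, and compares it with Pearson's r computed on the same data coded as 0/1. The cell counts are hypothetical.

```python
import math
from statistics import mean


def phi_coefficient(n00, n01, n10, n11):
    """Phi from 2 x 2 cell counts (first subscript = X, second = Y)."""
    row0, row1 = n00 + n01, n10 + n11   # marginals for X = 0 and X = 1
    col0, col1 = n00 + n10, n01 + n11   # marginals for Y = 0 and Y = 1
    return (n00 * n11 - n01 * n10) / math.sqrt(row0 * row1 * col0 * col1)


def pearson_r(xs, ys):
    """Pearson correlation from paired raw scores."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical cell counts, expanded into one (x, y) pair per individual.
n00, n01, n10, n11 = 40, 10, 20, 30
xs = [0] * (n00 + n01) + [1] * (n10 + n11)
ys = [0] * n00 + [1] * n01 + [0] * n10 + [1] * n11

print(round(phi_coefficient(n00, n01, n10, n11), 4), round(pearson_r(xs, ys), 4))
```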

However, φ (like the difference in proportions and rate ratio) also suffers from the limitation that it is influenced by the rates of the variables of interest (i.e., the marginal frequencies). Thus studies with different proportions of the dichotomies can yield different effect sizes given the same underlying association (Fleiss, 1994; Haddock, Rindskopf, & Shadish, 1998). To avoid this problem, the odds ratio (o) is preferred when one is interested in associations between two dichotomous variables. The odds ratio in the population is often represented as omega (ω) and is estimated from sample data using the following formula:

Although this equation is not intuitive, it helps to consider that it represents the ratio of the odds of Y being positive when X is positive [(n11/n1•) / (n10/n1•)] divided by the odds of Y being positive when X is negative [(n01/n0•) / (n00/n0•)], algebraically rearranged.

The odds ratio is 1.0 when there is no association between the two dichotomous variables, ranges from 0 to 1 when the association is negative, and ranges from 1 to positive infinity when the association is positive. Given these ranges, the distribution of sample estimates around a population odds ratio is necessarily skewed. Therefore, it is common to use the natural log transformation of the odds ratio when performing meta-analytic combination or comparison, expressed as ln(o). The standard error of this logged odds ratio is easily computed (whereas computing the standard error of an untransformed odds ratio is more complex; see Fleiss, 1994), using the following equation:
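These quantities can be sketched as follows, assuming the standard formulas o = (n00 n11) / (n01 n10) and SE of ln(o) = √(1/n00 + 1/n01 + 1/n10 + 1/n11). The function names and cell counts are hypothetical. The example also checks the "ratio of two odds" reading described earlier.

```python
import math


def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)


def odds_ratio(n00, n01, n10, n11):
    """Odds ratio from 2 x 2 cell counts (first subscript = X, second = Y)."""
    return (n00 * n11) / (n01 * n10)


def se_log_odds_ratio(n00, n01, n10, n11):
    """Standard error of the natural log of the odds ratio."""
    return math.sqrt(1 / n00 + 1 / n01 + 1 / n10 + 1 / n11)


n00, n01, n10, n11 = 40, 10, 20, 30          # hypothetical counts
o = odds_ratio(n00, n01, n10, n11)

# Same value obtained as odds(Y = 1 | X = 1) divided by odds(Y = 1 | X = 0).
o_from_odds = odds(n11 / (n10 + n11)) / odds(n01 / (n00 + n01))

print(o, round(o_from_odds, 4), round(math.log(o), 4),
      round(se_log_odds_ratio(n00, n01, n10, n11), 4))
```

Note that a zero in any cell makes both the odds ratio and this standard error undefined; handling that case is outside what this sketch covers.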


Computing r from Commonly Reported Results

You can compute r from a wide range of results reported in primary studies. In this section, I describe how you can compute this effect size when primary studies report correlations, inferential statistics (i.e., t-tests or F-ratios from group comparisons, χ² from analyses of contingency tables), descriptive data, and probability levels of inferential tests. I then describe some more recent approaches to computing r from omnibus test results (e.g., ANOVAs with more than two groups). Table 5.1 summarizes the equations that I describe for computing r, as well as those for computing standardized mean differences (e.g., g) and odds ratios (o).

1. From Reported Correlations

In the ideal case, primary reports would always report the correlations between variables of interest. This reporting certainly makes our task much easier and reduces the chances of inaccuracies due to computational errors or rounding imprecision. When studies report correlation coefficients, these are often in a tabular form.

TABLE 5.1. Summary of Equations for Computing Effect Sizes from Results of Primary Studies

Although it may be tempting to simply identify this correlation within a table and consider the study coded, it is still important to read the text of the report carefully. This reading may reveal additional information not included in the table that may be of interest, such as other effect sizes or correlations separately for subgroups. The text (as well as notes to the tables) may also contain important information regarding the correlation itself, such as whether it was computed for only a subset of the data, was based on a smaller sample size due to pairwise deletion of missing data, or is actually a partial or semipartial correlation due to the control of some other variables.

Although you still need to carefully read studies reporting correlation coefficients, these are much easier to code accurately. Unfortunately, many studies do not report these correlations, so you must turn to other data to code effect sizes. The following can be considered options when studies fail to report actual correlations.

2. From Inferential Statistics

Primary studies will often report results of inferential tests without reporting actual effect sizes (despite recommendations against this practice). This situation can arise in several ways. First, the primary study may simply report the significance test of a correlation coefficient without reporting the coefficient itself; most commonly, studies report this significance as a t-test. Second, the authors of the primary study may have artificially dichotomized one of the variables to form two groups, then compared the groups using an independent sample t-test or an ANOVA reported as a one degree of freedom (in numerator) F-ratio. Artificially dichotomizing a continuous variable attenuates the effect size estimate (see Hunter & Schmidt, 1990; MacCallum et al., 2002), so you might not only compute r as described below but also consider correcting for this dichotomization using the approaches described in Chapter 6. A third situation is that the authors of the primary study dichotomized both variables involved in the correlation, then analyzed the data as contingency tables and reported a χ² statistic with one degree of freedom (a situation in which you would also want to consider corrections for dichotomization described in Chapter 6).

The following formulas allow us to compute correlations in each of these situations. When primary studies report a t-test value (either for the significance of the correlation or in comparing two groups), the following equation can be used (Rosenthal, 1991, 1994):

Note that this equation provides either the positive or the negative square root, and it is important to take the value that reflects the direction of the effect in the same way across studies. For instance, if I am interested in the association between relational aggression and rejection, and compute r from a t-test comparing rejection in aggressive versus nonaggressive groups, I need to consider which group has a higher mean when computing the sign of r (positive if the aggressive group is more rejected and negative if the nonaggressive group is more rejected).

Primary studies might alternatively conduct an analysis of variance (ANOVA) between two groups. The resulting inferential statistic is an F-ratio with 1 degree of freedom in the numerator (i.e., F(1, df)). Because this F-ratio is the same as the square of the parallel t-test, it follows that you can compute a correlation from this value with this equation:

As with the t-test, you must be sure to take either the positive or the negative square root, depending on the direction of mean differences.
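Both conversions can be sketched together. I assume the standard forms r = √[t² / (t² + df)] and r = √[F / (F + df)], where df is the error (denominator) degrees of freedom. The function names and the `positive` flag are my own; the flag has to be set by inspecting which group has the higher mean, since the formulas alone cannot supply the sign.

```python
import math


def r_from_t(t, df, positive=True):
    """Convert an independent-samples t value (with df) to r."""
    r = math.sqrt(t ** 2 / (t ** 2 + df))
    return r if positive else -r


def r_from_f(f, df_error, positive=True):
    """Convert a one-numerator-df F-ratio (with error df) to r."""
    r = math.sqrt(f / (f + df_error))
    return r if positive else -r


# Dr. A's hypothetical result from earlier in the chapter: t(8) = 3.16.
print(round(r_from_t(3.16, 8), 3))
# The equivalent F(1, 8) is t squared, so it yields the same r.
print(round(r_from_f(3.16 ** 2, 8), 3))
```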

Equations 5.13 and 5.14 are for use in converting tests of independent sample t-tests or F-ratios to r. An alternative, albeit less frequent, situation occurs when primary studies report these statistics for repeated-measures (a.k.a. within-subject) comparisons. For instance, a study might report levels of rejection for a sample of children who were aggressive at one time point but not at another. Some recommend against combining independent sample and repeated-measures results in the same meta-analysis (e.g., Lipsey & Wilson, 2001, state that these should be considered in separate meta-analyses). However, when you believe that the two methodologies address the same effect, you should also explore the moderator variable “type of methodology” (i.e., independent sample versus repeated measures). When computing r, you can use the same formulas (Equations 5.13 and 5.14) for either the independent sample or repeated-measures t-tests or F-ratios. This is not the case when computing standardized mean differences.

Primary studies might dichotomize both variables of interest and report results as a 2 × 2 contingency table analysis with a reported χ² with 1 degree of freedom. The formula to convert this χ² into r is (Rosenthal, 1991, 1994):

As with computing r from the t-test or F-ratio, it is critical that you take the correct positive or negative square root. To determine which is correct, it is necessary to examine the reported contingency tables: A positive association is indicated if observed cell frequencies are higher than expected (under the null hypothesis of no association) in the major diagonal (if the contingency table is arranged with higher variable values as the lower row and right column), whereas a negative association is indicated if these frequencies are lower than expected. For example, you would consider aggression and rejection to be positively correlated if children who were aggressive and rejected and children who were not aggressive and not rejected occurred more frequently than expected.
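A sketch of this conversion and of the sign check follows, assuming the standard form r = √(χ² / N) for a one-degree-of-freedom χ². The function names, the reported χ² value, and the cell counts are hypothetical.

```python
import math


def r_from_chi2(chi2, n, positive=True):
    """Convert a 1-df chi-square from a 2 x 2 table (total sample size n) to r."""
    r = math.sqrt(chi2 / n)
    return r if positive else -r


def association_is_positive(n00, n01, n10, n11):
    """True when the high-high cell (n11) exceeds its expected count under independence."""
    n = n00 + n01 + n10 + n11
    expected_n11 = (n10 + n11) * (n01 + n11) / n
    return n11 > expected_n11


# Hypothetical report: chi-square(1) = 16.67 with N = 100 and the cell counts below.
sign_is_positive = association_is_positive(40, 10, 20, 30)
print(sign_is_positive, round(r_from_chi2(16.67, 100, positive=sign_is_positive), 3))
```

In a 2 × 2 table, checking one diagonal cell is enough: if n11 exceeds its expected count, n00 does as well, and the off-diagonal cells fall below theirs.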

3. From Descriptive Data

Often primary studies do not report all results that you are interested in as significance tests, but will instead provide descriptive data (often in a table) that can be used to compute r.

Primary studies may present descriptive data (means and standard deviations) for one variable based on two groups formed by dichotomizing the other variable (paralleling the case of reported t-tests or F-ratios described above). In this case, it is convenient to compute a standardized mean difference (using Equation 5.5), then transform these into r using Equation 5.26. As with computing r from t-tests and F-ratios, you should consider correcting for the attenuation of effect size due to the artificial dichotomization of the grouping variable (see Chapter 6).

It is also common for primary studies to report results in 2 × 2 contingency tables when both variables are dichotomized. In this situation, one can compute φ using Equation 5.8 and then interpret this φ as r. You should then correct this correlation for the dichotomization of both variables (see Chapter 6).

4. From Probability Levels of Significance Tests

In some situations, primary studies will provide no information other than the results of significance tests. The first potential reason for this is simply inadequate reporting of results of a parametric inferential test (i.e., the authors report the statistical significance of a t-test, F-ratio, or χ² test of a contingency, but not the value itself). If the exact significance probability is reported for a t-test, two-group ANOVA, or 2 × 2 contingency analysis, one can simply find the corresponding t, F, or χ² value at that level of significance and then use Equation 5.13, 5.14, or 5.15 (respectively) to compute r.

A second reason why you might only have probabilities from a significance test is the primary study’s report of probabilities from nontraditional inferential tests (e.g., nonparametric tests). In these situations in which other methods of computing effect size are unavailable, one can compute effect sizes from these significance tests. To do so, you first identify the exact probability, p, of the significance test and look up the standard normal deviate (i.e., Z) score beyond which that proportion of the normal distribution falls, using the one-tailed equivalent of the reported two-tailed p (it is important to avoid confusing this Z-score with Fisher’s transformation of r, denoted as Zr, described earlier). For example, if a primary study reported a two-tailed (which is assumed if the study did not specify) p = .032, you would identify the one-tail p as .016 and the corresponding Z = 2.14. You can find this corresponding Z-score in tables in many introductory statistics books, although you need to be careful to correctly use these tables (e.g., many tables will list p as the proportion or percentage of the normal distribution between the mean and Z, so it is necessary to look up the Z associated with 0.50 – p or 50 – p, for proportions and percentages, respectively). These tables are also often limited with very small values of p because they frequently do not list these extreme values with enough precision to accurately identify Z. For these reasons, it is often useful to find Z using a computer to identify the inverse of the standard normal cumulative distribution; you can use basic programs such as Microsoft Excel (using the “normsinv” function) to compute exact Z from p.

After computing Z, it is straightforward to compute the corresponding effect size given this value and sample size. The following equation converts Z to r for a given sample size N:

$$ r = \sqrt{\frac{Z^{2}}{N}} \qquad (5.16) $$

As when computing r from significance tests (t or F), it is important to take either the positive or negative square root from Equation 5.16 to represent the direction of the effect.
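The conversion and the choice of sign can be sketched in a few lines of Python. This is an illustration only: the function name, the `positive_direction` flag, and the sample size of 100 are hypothetical, and the direction must come from your own reading of the primary study, since Z itself carries no sign here.

```python
import math


def r_from_z(z: float, n: int, positive_direction: bool = True) -> float:
    """Convert a standard normal deviate to r via r = sqrt(Z^2 / N), then sign it."""
    if n <= 0:
        raise ValueError("n must be positive")
    magnitude = math.sqrt(z ** 2 / n)
    # The sign is supplied by the coder, based on the direction reported in the study.
    return magnitude if positive_direction else -magnitude


# Z = 2.14 is the example from the text; N = 100 is a hypothetical sample size.
print(round(r_from_z(2.14, 100), 3))
print(round(r_from_z(2.14, 100, positive_direction=False), 3))
```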

In all-too-many primary studies, researchers report a range of probability but not the exact probability (or the associated t or F). For instance, it is not uncommon for primary studies to report that an association or comparison of groups was significant, and then only state that p < .05 (or some other value). In these instances, if the report provides no other information, you cannot compute an exact effect size. You then have two options. The first option is to contact the study authors to request more information, such as the actual effect size, inferential statistic (t or F), or exact significance probability (p). This option is certainly preferable for obtaining accurate effect sizes; unfortunately, it is not always possible, because authors may have retired, left academia, be unwilling to respond to your request, or be unreachable for any number of other reasons. In these situations, the second option is to compute the best estimate of effect size given the reported results, which is typically the lower-bound effect size given the upper-bound probability. In other words, if a study reports that p < .05 (let’s say for a sample size of N = 100), you can make the conservative assumption that p = .05 and then compute the associated Z (= 1.96) and r (from Equation 5.16, r = √(1.96²/100) = .20). It is important to recognize that this value of r is a lower-bound estimate of the actual effect size found in the primary study. To illustrate, if the true p = .03, then r = .22; if p = .01, r = .26; if p = .001, r = .33; if p = .0001, r = .39; and so on. In other words, if a study only reports that p from a significance test is less than some value (e.g., p < .05), you can only conclude that the magnitude of the effect size is greater than some value (e.g., r > .20). Common convention is to be conservative and conduct analyses using this minimum value.
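The lower-bound logic can be checked numerically. The sketch below assumes a two-tailed p and N = 100, as in the example above; the function name `lower_bound_r` is illustrative only.

```python
import math
from statistics import NormalDist


def lower_bound_r(p_upper: float, n: int) -> float:
    """Smallest |r| consistent with a reported two-tailed 'p < p_upper'."""
    # Treat the reported bound as if it were the exact p (the conservative choice).
    z = -NormalDist().inv_cdf(p_upper / 2.0)
    # Z / sqrt(N) is the same quantity as sqrt(Z^2 / N) for positive Z.
    return z / math.sqrt(n)


for p in (0.05, 0.03, 0.01, 0.001, 0.0001):
    print(f"p = {p:<7} lower-bound r = {lower_bound_r(p, 100):.2f}")
```

Run as written, the loop reproduces the values given in the text (.20, .22, .26, .33, and .39), which makes plain how much effect size is given up when only p < .05 is coded.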

A similar situation of inadequate reporting arises when primary studies report only that a particular effect is not statistically significant. In this situation, it is possible to compute a range of possible values of the effect size. To do so, you can compute the Z-score associated with the chosen α (assume α = .05 if not otherwise stated) and then apply Equation 5.16 to determine the maximum magnitude of r that would fail to yield a statistically significant effect given the sample size. You can conclude that the actual effect size of the study was greater than the negative r and less than the positive r. For example, if N = 100, you know that −.20 < r < .20. However, common convention is to take the smallest-magnitude effect size; in other words, to assume r = .00.
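Written out for the N = 100 example, and assuming a two-tailed α = .05 (so that the critical Z is 1.96), the bound works out as follows:

```latex
|r|_{\max} = \sqrt{\frac{Z_{\alpha/2}^{2}}{N}}
           = \sqrt{\frac{1.96^{2}}{100}}
           \approx .196,
\qquad \text{so} \qquad -.20 < r < .20 .
```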

Taking the minimal effect sizes from primary studies reporting only that the p is less than some value or that an effect size is not significant is clearly not an ideal situation. When this practice is used for a substantial number of studies, the result will be that the mean effect size will be biased toward smaller magnitude (and tests of heterogeneity and moderation also may be biased). The best way to avoid this problem would be to (1) carefully read primary studies for any other information from which effect sizes can be computed and (2) persistently seek further information from authors of the primary studies. If you are still forced to make lower-bound estimates of effect sizes for some studies, it is good practice to (1) report the percentage of included studies for which these lower-bound estimates were made; and (2) conduct a sensitivity analysis by comparing results obtained with these studies versus without them (e.g., conducting two sets of analyses including and excluding these studies, or else evaluating a dichotomous moderator variable identifying these studies; one hopes that the impact of these studies is trivial). Alternatively, if many effect sizes (or coded study characteristics) are missing, it might be useful to rely on more recent methods of missing data management (see Pigott, 2009). In Chapters 9 and 10, I describe a structural equation modeling (SEM) representation of meta-analysis that uses sophisticated full information maximum likelihood (FIML) methods of handling missing data (Cheung, 2008).
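A rough sketch of the two reporting practices recommended above (the percentage of lower-bound studies, and a with-versus-without comparison) might look like the following. Everything here is an assumption made for illustration: the five studies are invented, and the mean is a simple fixed-effect average of Fisher's Zr weighted by N − 3, which may not be the model you ultimately use.

```python
import math

# Hypothetical studies: (r, N, is_lower_bound_estimate)
studies = [
    (0.31, 120, False),
    (0.25, 80, False),
    (0.20, 100, True),   # coded from "p < .05" only
    (0.00, 60, True),    # coded from "not significant" only
    (0.28, 150, False),
]


def mean_r(rows):
    """Fixed-effect mean r via Fisher's Zr, weighting each study by N - 3."""
    weights = [n - 3 for _, n, _ in rows]
    zrs = [math.atanh(r) for r, _, _ in rows]
    mean_zr = sum(w * z for w, z in zip(weights, zrs)) / sum(weights)
    return math.tanh(mean_zr)  # back-transform the mean Zr to r


all_studies = mean_r(studies)
exact_only = mean_r([s for s in studies if not s[2]])
pct_lower_bound = 100 * sum(s[2] for s in studies) / len(studies)

print(f"{pct_lower_bound:.0f}% of studies use lower-bound estimates")
print(f"mean r, all studies:       {all_studies:.3f}")
print(f"mean r, exact-only subset: {exact_only:.3f}")
```

With these invented numbers the mean is noticeably smaller when the lower-bound studies are included, which is the direction of bias the text warns about.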

5. From Results of Omnibus Tests

The effect sizes of interest to meta-analysts typically involve associations between two variables. As illustrated earlier, this information is sometimes obtained from two-group comparisons on a continuous variable (t-tests or F-ratios with 1 df in numerator). In contrast, some primary studies report results of omnibus tests involving differences among three or more groups (F-ratios with 2 or more numerator dfs). Although exceptions might exist, these omnibus results are generally of little direct use within a meta-analysis. As Rosenthal (1991) poignantly stated, “only rarely is one interested in knowing . . . that somewhere in the thicket of df there lurk one or more meaningful answers to meaningful questions that we had not the foresight to ask of our data” (p. 13). In other words, you are more often interested in identifying the linear (or other specified form) relations between two variables or the magnitudes of differences between two specific groups, more so than whether a number of groups differ in some unspecified way. For example, you might be interested in the linear relation between aggression and rejection from results comparing rejection among children who are aggressive never, sometimes, or often, whereas the question of whether there are some differences among these groups (i.e., the omnibus ANOVA) is of less interest. Similarly, you might be interested in a specific comparison of psychosocial intervention versus control conditions from a three-level ANOVA of control, psychosocial intervention, and pharmacological intervention conditions. These situations require us to extract meaningful information (i.e., effect sizes) from less meaningful omnibus tests.

Techniques for computing effect sizes from these omnibus tests are described in detail by Rosenthal, Rosnow, and Rubin (2000), and I refer readers to this source for a complete description. Here, I briefly outline the approach to computing r from descriptive data (i.e., means and standard deviations) or results of one-way ANOVAs with three or more groups. Procedures for managing repeated-measures and factorial ANOVAs are described in Rosenthal et al. (2000).

5.1. From Descriptive Statistics

The first situation I consider is when the primary study reports group sizes, means, and standard deviations from three or more groups. The first step in computing the linear association between the independent (i.e., grouping) and dependent (i.e., outcome) variables is to determine a set of contrast weights for the groups, denoted as λg for the g groups, such that these contrast weights sum to zero. The most typical choices of contrast weights are −1, 0, and 1 for three groups; −3, −1, 1, and 3 for four groups; and −2, −1, 0, 1, and 2 for five groups (contrast weights for more groups could be obtained through tables of orthogonal contrast codes, e.g., Cohen, Cohen, West, & Aiken, 2003, p. 215; Rosenthal et al., 2000, p. 153).

After determining appropriate contrast weights (λg), the next step is to use these and the reported group sizes (ng) and means (Mg) to compute the average squared deviation due to the linear contrast, MScontrast:
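For reference, the usual expression for the mean square of a 1-df contrast in the Rosenthal et al. (2000) framework, which I assume here is what Equation 5.17 refers to, can be written with the symbols just defined as:

```latex
MS_{\text{contrast}}
  = \frac{\left( \sum_{g} \lambda_g M_g \right)^{2}}
         {\sum_{g} \dfrac{\lambda_g^{2}}{n_g}}
```

The numerator is the squared value of the contrast applied to the group means, and the denominator scales it by the group sizes.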

Given this squared deviation due to the contrast, one can then evaluate the statistical significance of the linear contrast, if this is of interest. This statistical significance can be evaluated as the Fcontrast, which has 1 df in the numerator and dferror, or Σ(ng − 1), in the denominator. Regardless of whether you are interested in the significance of this contrast, the next step is to compute Fcontrast as MScontrast divided by MSwithin, where MSwithin might be reported in the primary study or can be computed as the group-size weighted average of within-group variances, Σ(ng sg²) / Σng.

From this Fcontrast, the final step in computing an effect size from this three-or-more-group situation is to compute r (called r_effect size by Rosenthal et al., 2000) using the following equation:
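The effect size r described by Rosenthal et al. (2000) is conventionally written as below; I assume this is the form intended as Equation 5.18, with dfwithin equal to Σ(ng − 1):

```latex
r_{\text{effect size}}
  = \sqrt{\frac{F_{\text{contrast}}}
               {F_{\text{between}} \, df_{\text{between}} + df_{\text{within}}}}
```

Multiplying the numerator and denominator by MSwithin shows that this is the square root of the contrast sum of squares divided by the total sum of squares. As with the other conversions to r, the sign is assigned according to the direction of the effect.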

Because Fbetween and dfbetween are from the original omnibus test of group differences, primary studies typically report these values. If a study does not, you can easily compute them from the reported sample sizes, means, and standard deviations.
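The whole sequence for descriptive statistics can be sketched in Python. This is a sketch under stated assumptions rather than a definitive implementation: the contrast mean square and the final r follow the standard Rosenthal et al. (2000) expressions (assumed to match Equations 5.17 and 5.18), MSwithin follows the group-size weighted average described above, the omnibus Fbetween is computed from the descriptives, and the three groups of data are invented.

```python
import math


def r_from_descriptives(ns, means, sds, weights):
    """Linear-contrast r from group sizes, means, and SDs (three or more groups)."""
    if abs(sum(weights)) > 1e-9:
        raise ValueError("contrast weights must sum to zero")
    n_groups = len(ns)
    total_n = sum(ns)

    # Contrast mean square (1 df): (sum lambda*M)^2 / sum(lambda^2 / n).
    contrast_value = sum(w * m for w, m in zip(weights, means))
    ms_contrast = contrast_value ** 2 / sum(w ** 2 / n for w, n in zip(weights, ns))

    # MS within: group-size weighted average of the within-group variances.
    ms_within = sum(n * sd ** 2 for n, sd in zip(ns, sds)) / total_n
    f_contrast = ms_contrast / ms_within

    # Omnibus F, computed here because it is treated as unreported.
    grand_mean = sum(n * m for n, m in zip(ns, means)) / total_n
    df_between = n_groups - 1
    ms_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means)) / df_between
    f_between = ms_between / ms_within
    df_within = total_n - n_groups

    r = math.sqrt(f_contrast / (f_between * df_between + df_within))
    # Sign follows the contrast: positive when means rise with the weights.
    return math.copysign(r, contrast_value)


# Hypothetical data: three groups ordered never / sometimes / often.
r = r_from_descriptives(
    ns=[30, 30, 30], means=[2.0, 2.5, 3.0], sds=[1.0, 1.0, 1.0], weights=[-1, 0, 1]
)
print(round(r, 3))
```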

5.2. From df > 1 F-Ratio

Another common method of reporting results of comparisons of three or more groups in primary studies is to report the omnibus F-ratio. To compute an effect size from this F-ratio, the primary study must also report the means (but standard deviations are not necessary) of the three or more groups. If the primary study does not report the means of the groups, it is not possible to compute an effect size indexing the association between the independent (grouping) and dependent (outcome) variables (note that simply using the formula for the two-group ANOVA, Equation 5.14, is not appropriate).

Computing r from reported means and an omnibus F-ratio is similar to the computation from means, standard deviations, and sample sizes described in the previous section. Specifically, you still (1) determine appropriate contrast weights (λg); (2) compute MScontrast using Equation 5.17; and (3) compute Fcontrast for use in subsequent computations as described earlier. The difference here is that you do not use the reported group standard deviations to compute MSwithin (which is used to compute Fcontrast). If MSwithin is reported in an ANOVA table, you can obtain it directly. Otherwise, you must compute MSwithin from the reported omnibus F-ratio, based on the fact that MSwithin = MSbetween/F. Although MSbetween will typically not be reported if an ANOVA table is not provided, it can be computed from the reported means of the G groups: MSbetween = Σ(Mg − GM)² / (G − 1), where GM is the grand mean.

You then follow the same steps described in the previous section: (1) computing Fcontrast as MScontrast/MSwithin; and (2) computing r using Equation 5.18. Thus, obtaining r from data where there are three or more groups is similar when studies report either descriptive statistics or results of an omnibus one-way ANOVA.
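The variant that starts from an omnibus F-ratio can be sketched as follows. Several assumptions are made and should be read as such: group sizes are taken to be reported alongside the means; MSbetween is computed with the conventional ANOVA definition, which weights each squared deviation by its group size ng (the formula as printed in the text omits this weight); the contrast mean square and final r use the standard Rosenthal et al. (2000) expressions; and the data are invented.

```python
import math


def r_from_omnibus_f(ns, means, f_omnibus, weights):
    """Linear-contrast r from group sizes, means, and a reported omnibus F."""
    if abs(sum(weights)) > 1e-9:
        raise ValueError("contrast weights must sum to zero")
    n_groups = len(ns)
    total_n = sum(ns)
    df_between = n_groups - 1
    df_within = total_n - n_groups

    # ASSUMPTION: conventional MS between, weighted by group size n_g.
    grand_mean = sum(n * m for n, m in zip(ns, means)) / total_n
    ms_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(ns, means)) / df_between

    # Recover MS within from the reported F, since F = MS between / MS within.
    ms_within = ms_between / f_omnibus

    contrast_value = sum(w * m for w, m in zip(weights, means))
    ms_contrast = contrast_value ** 2 / sum(w ** 2 / n for w, n in zip(weights, ns))
    f_contrast = ms_contrast / ms_within

    r = math.sqrt(f_contrast / (f_omnibus * df_between + df_within))
    return math.copysign(r, contrast_value)


# Hypothetical report: three groups of 30, means 2.0 / 2.5 / 3.0, F(2, 87) = 7.5.
r = r_from_omnibus_f(ns=[30, 30, 30], means=[2.0, 2.5, 3.0], f_omnibus=7.5, weights=[-1, 0, 1])
print(round(r, 3))
```

In this invented case the means fall exactly on a line, so the contrast accounts for all of the between-group variation and Fcontrast equals Fbetween × dfbetween.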

5.3. Final Words Regarding Computing r from Omnibus Tests

In this section, I provide only a brief overview of computing r from the results of omnibus tests reported in primary studies. Although the simple situations I have described will likely help in most situations, others that I have not described here may emerge. My recommendation to readers who commonly encounter these situations is to first consult the book by Rosenthal et al. (2000), which provides further details on computing r in situations I have described as well as others, including factorial designs and repeated-measures ANOVAs. These authors also describe alternative assignment of contrast weights that may be of interest.

If you encounter situations not described here or in Rosenthal et al. (2000), several options are available to you. First, I recommend consulting the literature for more recent treatments that might apply to this situation. Computing effect sizes such as r from omnibus test results has only recently gained attention (due largely to the Rosenthal et al. book), and it is likely that more will be written on this topic. Second, you might be able to apply the logic of this approach to develop reasonable ways of computing a meaningful effect size from omnibus results. It seems safe to suggest that if you can (1) identify the amount of variance due to the desired effect (e.g., a linear relation between the independent and dependent variables) and (2) determine a direction of effect, then it is possible to compute an r that indexes this effect. A third option, of course, is to request further information from the authors of the primary studies. Although this approach might deprive you of the joys of discovering ingenious ways of computing an effect size, you should remember that this is usually the most straightforward and most accurate way of obtaining the desired information.

Source: Card Noel A. (2015), Applied Meta-Analysis for Social Science Research, The Guilford Press; Annotated edition.