
This information is for reference purposes only. It was current when produced and may now be outdated. Archive material is no longer maintained, and some links may not work. Persons with disabilities having difficulty accessing this information should contact us at: https://info.ahrq.gov. Let us know the nature of the problem, the Web address of what you want, and your contact information.

Please go to www.ahrq.gov for current information.

Literature Search and Abstraction

All searches cover at least the English-language literature in MEDLINE® and the Cochrane Collaboration Library, using appropriate search terms to retrieve studies that meet the previously established inclusion and exclusion criteria. The search also includes other databases when indicated by the topic. The topic teams supplement these searches with references from reviews, current articles, and suggestions from experts in the field. Two members of the topic team (typically EPC staff) review abstracts of all articles. If either reviewer believes that the abstract meets the inclusion criteria, the EPC retrieves the full text of the article. One reviewer then reapplies the eligibility criteria to the full text and, if the article is included, abstracts information about the patient population, study design, interventions (where appropriate), quality indicators, and findings.

Evaluating Evidence: Rethinking Quality

The Methods Work Group, recognizing the central role that evaluating the quality of the evidence plays in the process of making evidence-based guidelines, focused much effort on this issue and decided to refine the process used by the previous Task Force. Specifically, the third Task Force adopted three important changes to the process: adding a rating of internal validity to the study design criterion for judging individual studies, explicitly assessing evidence at three different strata, and separating the magnitude of effect from the assessment of quality.

Evaluating Quality at Three Strata: Stratum 1, the Individual Study

For some years, the standard approach to evaluating the quality of individual studies was based on a hierarchical grading system of research design in which RCTs received the highest score (Table 2). The maturation of critical appraisal techniques has drawn attention to the limitations of this approach, which gives inadequate consideration to how well the study was conducted, a dimension known as internal validity (20). A well-designed cohort study may be more compelling than an inadequately powered or poorly conducted RCT (21,22).

Table 2. Hierarchy of Research Design

I: Evidence obtained from at least one properly randomized controlled trial.
II-1: Evidence obtained from well-designed controlled trials without randomization.
II-2: Evidence obtained from well-designed cohort or case-control analytic studies, preferably from more than one center or research group.
II-3: Evidence obtained from multiple time series with or without the intervention. Dramatic results in uncontrolled experiments (such as the results of the introduction of penicillin treatment in the 1940s) could also be regarded as this type of evidence.
III: Opinions of respected authorities, based on clinical experience, descriptive studies and case reports, or reports of expert committees.


To accompany the standard categorization of research design, the third Task Force added a three-category rating of the internal validity of each study: "good," "fair," and "poor." To distinguish among good, fair, and poor, the Task Force modified criteria developed by others (23-26) to create a set of operational parameters for evaluating the internal validity of five different study designs: systematic reviews, case-control studies, RCTs, cohort studies, and diagnostic accuracy studies (Table 3). These criteria are used not as rigid rules but as guidelines; exceptions are made with adequate justification. In general, a good study meets all criteria for that study design; a fair study does not meet all criteria but is judged to have no fatal flaw that invalidates its results; and a poor study contains a fatal flaw.

Table 3. Criteria for grading the internal validity of individual studies

Study design: Systematic reviews
Criteria:

  • Comprehensiveness of sources/search strategy used
  • Standard appraisal of included studies
  • Validity of conclusions
  • Recency and relevance

Study design: Case-control studies
Criteria:

  • Accurate ascertainment of cases
  • Nonbiased selection of cases/controls with exclusion criteria applied equally to both
  • Response rate
  • Diagnostic testing procedures applied equally to each group
  • Appropriate attention to potential confounding variables

Study design: Randomized controlled trials (RCTs) and cohort studies
Criteria:

  • Initial assembly of comparable groups. For RCTs: adequate randomization, including concealment and whether potential confounders were distributed equally among groups. For cohort studies: consideration of potential confounders with either restriction or measurement for adjustment in the analysis; consideration of inception cohorts
  • Maintenance of comparable groups (includes attrition, crossovers, adherence, contamination)
  • Important differential loss to follow-up or overall high loss to follow-up
  • Measurements: equal, reliable, and valid (includes masking of outcome assessment)
  • Clear definition of interventions
  • All important outcomes considered
  • Analysis: adjustment for potential confounders for cohort studies, or intention-to-treat analysis for RCTs

Study design: Diagnostic accuracy studies
Criteria:

  • Screening test relevant, available for primary care, adequately described
  • Study uses a credible reference standard, performed regardless of test results
  • Reference standard interpreted independently of screening test
  • Handles indeterminate results in a reasonable manner
  • Spectrum of patients included in study
  • Sample size
  • Administration of reliable screening test

Thus, the topic team assigns each study two separate ratings: one for study design and one for internal validity. A well-performed RCT, for example, would receive a rating of I-good, whereas a fair cohort study would be rated II-2-fair. In many cases, narrative text is needed to explain the rating of internal validity for the study, especially for those studies that play a pivotal role in the analytic framework. When the quality of an individual study is the subject of significant disagreement, the entire Task Force may be asked to rate the study and the final rating is applied after debate and discussion.
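The two-part rating can be illustrated with a short sketch. It is not a Task Force tool: the names `rate_internal_validity` and `StudyRating` are hypothetical, and the good/fair/poor rule is only the general guideline stated above, which the Task Force applies with judgment and allows exceptions to.

```python
from dataclasses import dataclass
from typing import Dict

# Design codes from Table 2 and the three internal-validity ratings.
DESIGN_CODES = ("I", "II-1", "II-2", "II-3", "III")
VALIDITY_RATINGS = ("good", "fair", "poor")


def rate_internal_validity(criteria_met: Dict[str, bool], fatal_flaw: bool) -> str:
    """General guideline only: a fatal flaw means poor, all criteria met
    means good, anything else is fair. Real ratings involve judgment."""
    if fatal_flaw:
        return "poor"
    if all(criteria_met.values()):
        return "good"
    return "fair"


@dataclass(frozen=True)
class StudyRating:
    """Hypothetical container for the two separate ratings of one study."""
    design: str    # research design code from Table 2
    validity: str  # internal validity: good, fair, or poor

    def __post_init__(self):
        if self.design not in DESIGN_CODES:
            raise ValueError(f"unknown design code: {self.design}")
        if self.validity not in VALIDITY_RATINGS:
            raise ValueError(f"unknown validity rating: {self.validity}")

    def label(self) -> str:
        # Combined label in the form used in the text, e.g. "I-good".
        return f"{self.design}-{self.validity}"


# A cohort study that misses one criterion but has no fatal flaw.
cohort_criteria = {"comparable groups": True, "follow-up": False, "valid measurements": True}
cohort = StudyRating("II-2", rate_internal_validity(cohort_criteria, fatal_flaw=False))
print(cohort.label())
```

Keeping the design code and the validity rating as separate fields mirrors the point of the text: a high position in the design hierarchy does not by itself imply a well-conducted study.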

Even well-designed and well-conducted studies may not supply the evidence needed if the studies examine a highly selected population of little relevance to the general population seen in primary care. Thus, external validity (the extent to which the studies reviewed are generalizable to the population of interest) is considered on a par with internal validity. Deciding whether generalizing in specific situations is appropriate is based on explicit principles developed by the Task Force (see the "Extrapolation and Generalization" section).

Evaluating Quality at Three Strata: Stratum 2, the Linkage

The quality of evidence in a single study constitutes only one stratum in analyzing the quality of evidence for a preventive service. One might also consider two additional levels of assessment: the quality of the body of evidence for each linkage (key question) in an analytic framework, and the overall quality of the body or bodies of evidence for a preventive service, including all linkages in the analytic framework (Table 4).

Table 4. Evaluating the quality of evidence at three strata

Level of evidence: Individual study
Criteria for judging quality:

  • Internal validity [a]
  • External validity [b]

Level of evidence: Linkage in the analytic framework
Criteria for judging quality:

  • Aggregate internal validity [a]
  • Aggregate external validity [b]
  • Coherence/consistency

Level of evidence: Entire preventive service
Criteria for judging quality:

  • Quality of the evidence from Stratum 2 for each linkage in the analytic framework
  • Degree to which there is a complete chain of linkages supported by adequate evidence to connect the preventive service to health outcomes
  • Degree to which the complete chain of linkages "fit" together [c]
  • Degree to which the evidence connecting the preventive service and health outcomes is "direct" [d]

[a] Internal validity is the degree to which the study(ies) provides valid evidence for the population and setting in which it was conducted.
[b] External validity is the extent to which the evidence is relevant and generalizable to the population and conditions of typical primary care practice.
[c] "Fit" refers to the degree to which the linkages refer to the same population and conditions. For example, if studies of a screening linkage identify people who are different from those involved in studies of the treatment linkage, the linkages are not supported by evidence that "fits" together.
[d] "Directness" of evidence is inversely proportional to the number of bodies of evidence required to make the connection between the preventive service and health outcomes. Evidence is direct when a single body of evidence makes the connection, and more indirect if two or more bodies of evidence are required.



In assessing quality at the second level, the body of evidence supporting a given linkage in the analytic framework, the Task Force recognizes three important criteria. The first two follow directly from criteria for the first stratum. Internal validity (including research design) and external validity (generalizability) remain important, but at this level they are considered in the aggregate for all relevant studies (Table 4).

The third criterion for evaluating the quality of the body of evidence concerning the linkage in an analytic framework is consistency and coherence. Coherence means that a body of evidence makes sense, that is, that the evidence fits together in an understandable model of the situation. The Task Force does not necessarily require consistency, recognizing that studies may produce different results in different populations, and heterogeneity of this sort may still be coherent with the hypothesized model of how interventions relate to outcomes. Consistent results of several studies across different populations and study designs do, however, contribute to coherence.

A topic team considers these three criteria—aggregate internal validity, aggregate external validity, and coherence/consistency—in evaluating the quality of the body of evidence concerning the linkage in an analytic framework (Table 4). It assigns good, fair, or poor ratings to each of these three factors. In making these judgments, the Task Force has no simple formula but rather considers all the evidence, giving greater weight to studies of higher quality. Topic teams write brief explanatory narratives to provide the rationale for their ratings.

Evaluating Quality at Three Strata: Stratum 3, the Entire Preventive Service

The third level of assessing quality considers the evidence for the entire preventive service. Previous Task Forces used the hierarchical rating of research design (Table 2) to describe the best evidence for a preventive service. The evidence for a preventive service would receive a II-2 code, for example, if the best evidence consisted of a controlled cohort study. As noted above, the current USPSTF has added to this grading of research design an assessment of how well the study was conducted.

Even with this addition, however, examination of the analytic framework shows the difficulty in using this rating scheme alone to judge the quality of the evidence for an entire preventive service. The quality of the evidence may depend on which linkage it is examining. For example, the evidence for smoking cessation counseling could be described as grade I-good evidence (because well-performed RCTs have shown that counseling and nicotine replacement therapy reduce smoking rates) or as grade II-2-good evidence (because only cohort studies have shown that stopping smoking improves health). The more precise conceptualization is that smoking cessation counseling consists of multiple components, as reflected in the linkages for its analytic framework (e.g., Figure 2) and that different levels of evidence support each linkage.

The third Task Force adopted an approach that systematically examines the evidence for each linkage, and all linkages together, in the analytic framework. The underlying issue is whether the evidence is adequate to determine the existence and magnitude of a causal connection between the preventive service (on the left side of the analytic framework) and health outcomes (on the right side of the analytic framework).

Rather than applying formal rules for determining the overall quality of evidence, the Task Force adopted a set of general criteria that it considers when making this judgment (Table 4). These criteria are as follows:

  • Quality of the evidence from Stratum 2 for each linkage in the analytic framework.
  • Degree to which a complete chain of linkages supported by adequate evidence connects the preventive service to health outcomes.
  • Degree to which the linkages fit together.
  • Degree to which the evidence connecting the preventive service and health outcomes is direct.

As noted earlier, the directness of evidence is inversely proportional to the number of linkages (bodies of evidence) that must be pieced together to infer that a preventive service has an impact on health. The evidence is most direct if a single body of evidence, corresponding to the overarching linkage in the analytic framework, provides adequate evidence concerning the existence and magnitude of health effects resulting from the use of the preventive service. The evidence is indirect if, instead of having overarching evidence, one must rely on two or more bodies of evidence corresponding to linkages in the analytic framework to make an adequate connection between the use of the preventive service and health.

Based on these considerations, the Task Force grades the overall quality of the evidence using the same tripartite scheme (good, fair, and poor) applied to other levels of evidence. The Task Force decided against a formal system for assigning these grades. Instead, it makes its reasoning explicit in an explanatory narrative in the recommendation statement, providing the overall assessment of the quality of the evidence and the rationale behind this assessment.

In general, good overall evidence includes a high-quality direct linkage between the preventive service and health outcomes. Fair evidence is typically indirect but it is adequate to complete a chain of linkages across the analytic framework from the preventive service to health outcomes. The evidence is inadequate to make this connection unless the linkages fit together in a meaningful way. For example, in some situations screening may detect people who are different from those involved in studies of treatment efficacy. In this case, the screening and treatment linkages do not fit together. Poor evidence has a formidable break in the evidence chain such that information is inadequate to connect the preventive service and health outcomes.
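The general pattern in the preceding paragraph can be sketched in code, with the caveat that the Task Force explicitly declined to adopt a formal grading system. The sketch below is therefore only a simplified reading of the text. The `Linkage` class and `overall_quality` function are hypothetical, and two assumptions go beyond the text: a fair-quality direct linkage is treated as fair overall evidence, and the listed indirect linkages are taken to be the whole chain, so a missing link should be entered with a "poor" rating.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Linkage:
    """One linkage (key question) in a hypothetical analytic framework."""
    name: str
    quality: str               # Stratum 2 rating: "good", "fair", or "poor"
    overarching: bool = False  # True if it links the service directly to health outcomes


def overall_quality(linkages: List[Linkage], linkages_fit: bool) -> str:
    """Simplified illustration of the general pattern, not a Task Force rule."""
    direct = [link for link in linkages if link.overarching]
    indirect = [link for link in linkages if not link.overarching]

    # Good: a high-quality direct linkage between service and health outcomes.
    if any(link.quality == "good" for link in direct):
        return "good"
    # Assumption: adequate but not high-quality direct evidence counts as fair.
    if any(link.quality == "fair" for link in direct):
        return "fair"
    # Fair: an indirect chain in which every linkage is adequately supported
    # and the linkages fit together (same population and conditions).
    chain_adequate = bool(indirect) and all(
        link.quality in ("good", "fair") for link in indirect
    )
    if chain_adequate and linkages_fit:
        return "fair"
    # Poor: a break in the chain, or linkages that do not fit together.
    return "poor"


chain = [
    Linkage("screening -> detection", "good"),
    Linkage("treatment -> health outcomes", "good"),
]
print(overall_quality(chain, linkages_fit=True))
print(overall_quality(chain, linkages_fit=False))
```

The second call shows the situation described above in which screening detects people who differ from those in the treatment studies: each linkage is well supported on its own, but the chain as a whole does not connect the service to health outcomes.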

To make its reasoning explicit, the Task Force includes an explanatory narrative about its overall rating of the evidence in the recommendation statement.

Separating Magnitude of Effect from Quality

When reviewers consider the quality of evidence, they often confound quality of evidence with magnitude of effect. Evidence for an intervention is sometimes described as good if it shows a dramatic effect on outcomes. Strictly speaking, whether a study provides accurate information should be independent of its findings. The magnitude of observed benefits and/or harms from a service, although of critical importance to decisions about whether it should be recommended, is a separate issue from the quality of the data. The Task Force examines magnitude (or effect size) separately from the quality of evidence, but it merges both issues in making its recommendations (see the "Assessing Magnitude of Net Benefit" section).


Assessing Magnitude of Net Benefit

When the overall quality of the evidence is judged to be good or fair, the Task Force proceeds to consider the magnitude of net benefit to be expected from implementation of the preventive service. Determining net benefit requires assessing both the magnitude of benefits and the magnitude of harms and weighing the two. When the evidence is considered to be poor, the Task Force has no scientific basis for making conjectures about magnitude.

The Task Force classifies benefits, harms, and net benefits on a 4-point scale: "substantial," "moderate," "small," and "zero/negative." It has adopted no standardized metric (such as number needed to screen, number needed to treat, number of lives extended, years of life saved, and/or quality-adjusted life years) for comparing net benefit across preventive services. Ideally, a quantitative definition for such terms as substantial or moderate benefit would make these categorizations more defensible, less arbitrary, and more useful to policymakers in ranking the relative priority of preventive services. Unfortunately, the Task Force has not yet solved the methodologic challenges to deriving such a metric.

Although the Task Force has decided against a rigid formula for defining these terms, it has developed a conceptual framework and a process for making these distinctions. In assessing the magnitude of benefits and harms, the Task Force uses a modification of the statistical concept of the confidence interval. The magnitude of effect in individual studies is given by a point estimate surrounded by a confidence interval. Point estimates and confidence intervals often vary among studies of the same question, sometimes considerably. The Task Force examines all relevant studies to construct a general, conceptual "confidence interval" of the range of effect-size values consistent with the literature. It considers the upper and lower bounds of this confidence interval in assessing the magnitude of benefits and harms.
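One crude way to picture the conceptual "confidence interval" is as the envelope of the intervals reported by individual studies. This is an assumption made for illustration: the Task Force forms its range through judgment and gives more weight to higher-quality studies, and it does not describe a formula. The function name `conceptual_interval` and the study values are hypothetical.

```python
from typing import List, Tuple


def conceptual_interval(study_intervals: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Return the envelope (lowest lower bound, highest upper bound) of the
    study confidence intervals. A crude stand-in for a judgment-based range."""
    if not study_intervals:
        raise ValueError("at least one study interval is required")
    lowers = [lower for lower, _ in study_intervals]
    uppers = [upper for _, upper in study_intervals]
    return min(lowers), max(uppers)


# Hypothetical 95% confidence intervals for a relative risk reduction
# reported by three studies of the same question (made-up values).
studies = [(0.05, 0.20), (0.10, 0.35), (-0.02, 0.15)]
lower_bound, upper_bound = conceptual_interval(studies)
print(lower_bound, upper_bound)
```

In this made-up case the lower bound falls below zero, so a reviewer looking at both bounds would have to allow for the possibility of no benefit as well as for a sizable one.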


Assessing Magnitude of Benefits

The Task Force thinks of benefit from both population and individual perspectives. For the benefit to be considered substantial, the service must have:

  • At least a small relative impact on a frequent condition with a substantial population burden, or
  • A large impact on an infrequent condition that poses a significant burden at the individual patient level.

For example, counseling for tobacco cessation produces a change in behavior in only a small proportion of patients (27), but the societal implications are sizable because of the large number of tobacco users in the population and the burden of illness and death that is averted if even a small percentage of people stop smoking. Conversely, phenylketonuria is a grave condition that affects a very small proportion of the population, but neonatal screening markedly reduces morbidity and mortality from the disease (6). Although the target conditions in these examples differ considerably in prevalence, the Task Force views both preventive services as having a substantial magnitude of benefit.

"Outcomes tables" (similar to "balance sheets" [28]) are the Task Force's standard resource for estimating the magnitude of benefit (28,29). These tables, prepared by the topic teams for use at Task Force meetings, compare the condition-specific outcomes expected for a hypothetical primary care population with and without use of the preventive service. These comparisons may be extended to consider only people of specified age or risk groups or other aspects of implementation. Thus, outcomes tables allow the Task Force to examine directly how the preventive service affects benefits for various groups.

One important problem with outcomes tables is that the evidence typically differs across table cells. For some services and some groups, the frequency of the outcome may be clear, but for others one can calculate the frequency of the outcome only by making broad assumptions, some with greater scientific support than others. Thus, outcomes tables must provide information about both the frequency of outcomes and how certain we are about that information.
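A minimal sketch of an outcomes table row follows, assuming a hypothetical population of 1,000 people and a certainty label attached to each row. The `OutcomeRow` class, the outcome names, and all numbers are invented for illustration and are not taken from any Task Force review.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OutcomeRow:
    """One row of a hypothetical outcomes table, per 1,000 people."""
    outcome: str
    events_without: int  # expected events without the preventive service
    events_with: int     # expected events with the preventive service
    certainty: str       # how well supported the estimate is: good, fair, or poor

    @property
    def events_averted(self) -> int:
        # Difference attributable to the service under the stated assumptions.
        return self.events_without - self.events_with


def format_table(rows: List[OutcomeRow]) -> List[str]:
    """Render rows as fixed-width text lines, header first."""
    lines = [f"{'Outcome':<26}{'Without':>9}{'With':>7}{'Averted':>9}  Certainty"]
    for row in rows:
        lines.append(
            f"{row.outcome:<26}{row.events_without:>9}{row.events_with:>7}"
            f"{row.events_averted:>9}  {row.certainty}"
        )
    return lines


# Made-up values for a hypothetical screening service.
table = [
    OutcomeRow("Deaths from condition", 10, 7, "fair"),
    OutcomeRow("False-positive work-ups", 0, 40, "poor"),
]
for line in format_table(table):
    print(line)
```

Carrying the certainty label in every row keeps the two kinds of information side by side: how often an outcome is expected, and how much scientific support stands behind that estimate. A negative value for events averted indicates events added by the service, which is how a harm would appear in the same table.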


Assessing Magnitude of Harms

The Task Force considers all types of potential harms of a service, both direct harms of the service itself (e.g., those from a screening test or preventive medication) and indirect harms that may be downstream consequences of the initial intervention (e.g., invasive follow-up tests or harms of treatments). The Task Force considers potential medical, psychological, and nonhealth harms (e.g., effects on insurability).

All analytic frameworks include linkages concerning the potential harms of preventive services, and all topic teams search for evidence about these harms. The Task Force strives to give equal weight to benefits and harms in its assessment of net benefit, but the amount of evidence about benefits is usually greater. Few studies provide useful information on adverse outcomes. Thus, the Task Force often finds itself trying to estimate harms based on little evidence. Methods of making this estimation are lacking, but the Task Force continues to discuss ways to frame the range of reasonable estimates of harm for each preventive service.

When evidence on harms is available, the topic teams assess its quality in a manner like that for benefits and include adverse events in the outcomes tables. When few harms data are available, the Task Force does not assume that harms are small or nonexistent. It recognizes a responsibility to consider which harms are likely and to judge their potential frequency and the severity that might ensue from implementing the service (30). It uses whatever evidence exists to construct a general confidence interval on the 4-point scale (i.e., substantial, moderate, small, and zero/negative) described above.


Assessing Net Benefits: Weighing Benefits and Harms

Value judgments are involved in using the information in an outcomes table to rate either benefits or harms on the Task Force's 4-point scale. Value judgments are also needed to weigh benefits against harms to arrive at a rating of net benefit.

The need to invoke value judgments is most obvious when the Task Force must weigh benefits and harms of different types against each other in coming to a collective assessment of net benefits. For example, although breast cancer screening for certain age groups may reduce deaths from breast cancer (31), it also increases the number of women who must experience the anxiety of a work-up for a false-positive mammogram (32). Determining which of the four categories of net benefit to assign to this service depends greatly on the value one places on each outcome.

In making its determinations of net benefit, the Task Force strives to consider what it believes are the general values of most people. It does this with greater confidence for certain outcomes (e.g., death) about which there is little disagreement about undesirability, but it recognizes that the degree of risk people are willing to accept to avert other outcomes (e.g., cataracts) can vary considerably (33). When the Task Force perceives that preferences among individuals vary greatly, and that these variations are sufficient to make the average trade-off of benefits and harms a "close call", then it will often assign a C recommendation (below). This recommendation indicates that the decision is likely to be sensitive to individual patients' preferences.


Extrapolation and Generalization

As noted in the "Review of the Evidence" section, the Task Force regularly faces the issue of generalization in determining the quality of evidence. The Task Force makes recommendations intended for the general primary care situation; for this purpose, high-quality evidence is evidence that is relevant and valid for this setting. When studies examine different situations and settings, the issue of generalization arises.

Likewise, the magnitude of the effect of interest to the Task Force is that resulting from implementation in the primary care setting. Calculations based on extrapolation are usually required to estimate the likely magnitude of effect for the primary care situation.

Some degree of extrapolation and generalization is invariably required to use evidence in the research literature to make guidelines for the primary care situation. For some services, the evidence may provide high-quality information about the efficacy of a preventive service in the hands of experts for a specific subpopulation. For others, evidence about efficacy often comes from studies of symptomatic patients who are more severely ill than patients who would be discovered by screening. Even when good randomized trials of therapeutic efficacy in asymptomatic patients exist (e.g., therapy of lipid disorders), female, elderly, and younger patients may be underrepresented, and eligibility criteria might exclude patients with characteristics that are typical of a general primary care population. Other commonly encountered issues are whether the efficiency of screening in one practice setting can be replicated in other settings and whether efficacy persists or diminishes beyond the length of time usually covered by available studies.

In the absence of good evidence, to what extent can one use reasoned judgments based on assumptions with varying degrees of scientific support to draw conclusions about the potential benefits and harms of a preventive service? The Task Force developed a policy for determining the conditions under which extrapolation and generalization are reasonable. These conditions include:

  • Biologic plausibility.
  • Similarities of the populations studied and primary care patients (in terms of risk factor profile, demographics, ethnicity, gender, clinical presentation, and similar factors).
  • Similarities of the test or intervention studied to those that would be routinely available or feasible in typical practice.
  • Clinical or social environmental circumstances in the studies that could modify the results from those expected in a primary care setting.

Judgments about extrapolation and generalization, because they are often matters of policy and subjective judgment rather than hard science, are made by the Task Force and not the EPCs.
