U.S. Department of Health and Human Services
Agency for Healthcare Research and Quality

Systems to Rate the Strength of Scientific Evidence


Evidence Report/Technology Assessment: Number 47

This information is for reference purposes only. It was current when produced and may now be outdated. Archive material is no longer maintained, and some links may not work.


Under its Evidence-based Practice Program, the Agency for Healthcare Research and Quality (AHRQ) is developing scientific information for other agencies and organizations on which to base clinical guidelines, performance measures, and other quality improvement tools. Contractor institutions review all relevant scientific literature on assigned clinical care topics and produce evidence reports and technology assessments, conduct research on methodologies and the effectiveness of their implementation, and participate in technical assistance activities.


Introduction / Methods / Results / Systems for Rating the Quality of Individual Articles / Systems for Grading the Strength of a Body of Evidence / Discussion / Factors Important in Developing and Using Rating Systems / Selecting Systems for Use Today: A "Best Practices" Orientation / Recommendations for Future Research / Conclusion / References / Availability of Full Report

Introduction


Health care decisions are increasingly being made on research-based evidence rather than on expert opinion or clinical experience alone. Systematic reviews represent a rigorous method of compiling scientific evidence to answer questions regarding health care issues of treatment, diagnosis, or preventive services. Traditional opinion-based narrative reviews and systematic reviews differ in several ways. Systematic reviews (and evidence-based technology assessments) attempt to minimize bias by the comprehensiveness and reproducibility of the search for and selection of articles for review. They also typically assess the methodologic quality of the included studies—i.e., how well the study was designed, conducted, and analyzed—and evaluate the overall strength of that body of evidence. Thus, systematic reviews and technology assessments increasingly form the basis for making individual and policy-level health care decisions.

Throughout the 1990s and into the 21st century, the Agency for Healthcare Research and Quality (AHRQ) has been the foremost Federal agency providing research support and policy guidance in health services research. In this role, it gives particular emphasis to quality of care, clinical practice guidelines, and evidence-based practice—for instance through its Evidence-based Practice Center (EPC) program. Through this program and a group of 12 EPCs in North America, AHRQ seeks to advance the field's understanding of how best to ensure that reviews of the clinical or related literature are scientifically and clinically robust.

The Healthcare Research and Quality Act of 1999, Part B, Title IX, Section 911(a) mandates that AHRQ, in collaboration with experts from the public and private sectors, identify methods or systems to assess health care research results, particularly "methods or systems to rate the strength of the scientific evidence underlying health care practice, recommendations in the research literature, and technology assessments." AHRQ also is directed to make such methods or systems widely available.

AHRQ commissioned the Research Triangle Institute—University of North Carolina EPC to undertake a study to produce the required report, drawing on earlier work from the RTI-UNC EPC in this area.1 The study also advances AHRQ's mission to support research that will improve the outcomes and quality of health care through research and dissemination of research results to all interested parties in the public and private sectors both in the United States and elsewhere.

The overarching goals of this project were to describe systems to rate the strength of scientific evidence, including evaluating the quality of individual articles that make up a body of evidence on a specific scientific question in health care, and to provide some guidance as to "best practices" in this field today. Critical to this discussion is the definition of quality. "Methodologic quality" has been defined as "the extent to which all aspects of a study's design and conduct can be shown to protect against systematic bias, nonsystematic bias, and inferential error."(Ref. 1, p. 472) For purposes of this study, the authors hold quality to be the extent to which a study's design, conduct, and analysis have minimized selection, measurement, and confounding biases, with their assessment of study quality systems reflecting this definition.

The authors acknowledge that measured quality varies depending on the instrument used. In a study using 25 different scales to assess the quality of 17 trials comparing low molecular weight heparin with standard heparin to prevent post-operative thrombosis, Juni and colleagues reported that studies considered to be of high quality using one scale were deemed low quality on another scale.2 Consequently, when study quality was used as an inclusion criterion for meta-analyses, summary relative risks for thrombosis depended on which scale was used to assess quality. The end result is that variable quality in efficacy or effectiveness studies may lead to conflicting results that affect analysts' or decisionmakers' confidence about findings from systematic reviews or technology assessments.

The remainder of this summary briefly describes the methods used to accomplish these goals and provides the results of the authors' analysis of relevant systems and instruments identified through literature searches and other sources. They present a selected set of systems that they believe are ones that clinicians, policymakers, and researchers can use with reasonable confidence for these purposes, giving particular attention to systematic reviews, randomized controlled trials (RCTs), observational studies, and studies of diagnostic tests. Finally, they discuss the limitations of this work and of evaluating the strength of the practice evidence for systematic reviews and technology assessments and offer suggestions for future research. The authors do not examine issues related to clinical practice guideline development or assigning grades or ratings to formal guideline recommendations.

Methods


To identify published research related to rating the quality of studies and the overall strength of evidence, the authors conducted two extensive literature searches and sought further information from existing bibliographies, members of a technical expert panel, and other sources. They then developed and completed descriptive tables—hereafter "grids"—that enabled them to compare and characterize existing systems. These grids focus on important domains and elements that the authors concluded any acceptable instrument for these purposes ought to cover. These elements reflect steps in research design, conduct, or analysis that have been shown through empirical work to protect against bias or other problems in such investigations or that are long-accepted practices in epidemiology and related research fields. They assessed systems against domains and assigned scores of fully met (Yes), partially met (Partial), or not met (No).
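As a rough illustration of how such a grid might be represented, the sketch below records a Yes/Partial/No rating per domain and checks coverage. The domain names are the seven systematic-review domains named later in this summary; the function names and the three example systems are hypothetical and do not reproduce any instrument the authors reviewed.

```python
from enum import Enum


class Rating(Enum):
    YES = "Yes"          # domain fully met
    PARTIAL = "Partial"  # domain partially met
    NO = "No"            # domain not met


# Key domains for systematic reviews, as named in this summary.
KEY_DOMAINS = (
    "study question",
    "search strategy",
    "inclusion and exclusion criteria",
    "data abstraction",
    "study quality and validity",
    "data synthesis and analysis",
    "funding or sponsorship",
)


def summarize_coverage(ratings, key_domains=KEY_DOMAINS):
    """Count ratings across key domains; an unrated domain counts as NO."""
    counts = {rating: 0 for rating in Rating}
    for domain in key_domains:
        counts[ratings.get(domain, Rating.NO)] += 1
    return counts


def addresses_all(ratings, key_domains=KEY_DOMAINS):
    """True if every key domain is met at least partially."""
    return all(ratings.get(d, Rating.NO) is not Rating.NO for d in key_domains)


def fully_addresses_all(ratings, key_domains=KEY_DOMAINS):
    """True if every key domain is fully met."""
    return all(ratings.get(d, Rating.NO) is Rating.YES for d in key_domains)


# Hypothetical grid rows, for illustration only.
system_a = {d: Rating.YES for d in KEY_DOMAINS}
system_b = {**system_a,
            "study question": Rating.PARTIAL,
            "study quality and validity": Rating.PARTIAL}
system_c = {d: Rating.YES for d in KEY_DOMAINS if d != "funding or sponsorship"}

for name, row in [("A", system_a), ("B", system_b), ("C", system_c)]:
    print(name, addresses_all(row), fully_addresses_all(row))
```

Treating an unrated domain as "No" is a design choice of this sketch; a real grid might distinguish "not applicable" from "not met".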

Then, drawing on the results of their analysis, the authors identified existing quality rating scales or checklists that in their view can be used in the production of systematic evidence reviews and technology assessments and laid out the reasons for highlighting these specific instruments. An earlier version of the entire report was subjected to extensive external peer review by experts in the field and AHRQ staff, and the authors revised that draft as part of the steps to produce this report.

Results


Data Collection

The authors reviewed the titles and abstracts for a total of 1,602 publications for this project. From this set, they retained 109 sources that dealt with systems (i.e., scales, checklists, or other types of instruments or guidance documents) pertinent to rating the quality of individual systematic reviews, RCTs, observational studies, or investigations of diagnostic tests, or with systems for grading the strength of bodies of evidence. In addition, they reviewed 12 reports from various AHRQ-supported EPCs. In all, the authors considered 121 systems as the basis for this report.

Specifically, they assessed 20 systems relating to systematic reviews, 49 systems for RCTs, 19 for observational studies, and 18 for diagnostic test studies. For final evaluative purposes, the authors focused on scales and checklists. In addition, they reviewed 40 systems that addressed grading the strength of a body of evidence (34 systems identified from their searches and prior research and 6 from various EPCs). The number of systems reviewed across the grids exceeds 121 because several were reviewed for more than one grid.
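The counts above can be reconciled with simple arithmetic. The short sketch below uses only the figures given in this section; the variable names are illustrative.

```python
# Number of systems assessed in each grid, as reported in this summary.
grid_counts = {
    "systematic reviews": 20,
    "RCTs": 49,
    "observational studies": 19,
    "diagnostic test studies": 18,
    "strength of a body of evidence": 40,
}

# Distinct systems considered: 109 retained sources plus 12 EPC reports.
distinct_systems = 109 + 12

total_grid_entries = sum(grid_counts.values())
extra_entries = total_grid_entries - distinct_systems

print(total_grid_entries)  # 146 grid entries in all
print(distinct_systems)    # 121 distinct systems
print(extra_entries)       # 25 entries arise from systems reviewed in more than one grid
```

Note that 25 is the number of additional grid entries, not the number of systems that appeared in more than one grid; a system reviewed in three grids contributes two extra entries.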


Systems for Rating the Quality of Individual Articles

Important Evaluation Domains and Elements

For evaluating systems related to rating the quality of individual articles, the authors defined important domains and elements for four types of studies. Boxes A and B list the domains and elements used in this work, highlighting (in italics) those domains they regarded as critical for a scale or checklist to cover before they could identify a given system as likely to be acceptable for use today.

Systematic Reviews

Of the 20 systems concerned with systematic reviews or meta-analyses, the authors categorized one as a scale3 and 10 as checklists.4-14 The remainder are considered guidance documents.15-23

To arrive at a set of high-performing scales or checklists pertaining to systematic reviews, the authors took account of seven key domains (see Box A): study question, search strategy, inclusion and exclusion criteria, data abstraction, study quality and validity, data synthesis and analysis, and funding or sponsorship. One checklist fully addressed all seven domains.7 A second checklist also addressed all seven domains but merited only a "Partial" score for study question and study quality.8 Two additional checklists6,12 and the one scale3 addressed six of the seven domains. These latter two checklists excluded funding; the scale omitted data abstraction and had a "Partial" score for search strategy.

Box A. Important Domains and Elements for Systems to Rate Quality of Individual Articles

Systematic Reviews

  • Study question*
  • Search strategy*
  • Inclusion and exclusion criteria*
  • Interventions
  • Outcomes
  • Data extraction*
  • Study quality and validity*
  • Data synthesis and analysis*
  • Results
  • Discussion
  • Funding or sponsorship*

Randomized Clinical Trials

  • Study question
  • Study population*
  • Randomization*
  • Blinding*
  • Interventions*
  • Outcomes*
  • Statistical analysis*
  • Results
  • Discussion
  • Funding or sponsorship*

* Key domain (set in italics in the original).

Randomized Clinical Trials

In evaluating systems concerned with RCTs, the authors reviewed 20 scales,18,24-42 11 checklists,12-14,43-50 one component evaluation,51 and seven guidance documents.1,11,52-57 In addition, they reviewed 10 rating systems used by AHRQ's EPCs.58-68

The authors designated a set of high-performing scales or checklists pertaining to RCTs by assessing their coverage of the following seven domains (see Box A): study population, randomization, blinding, interventions, outcomes, statistical analysis, and funding or sponsorship. They concluded that eight systems for RCTs represent acceptable approaches that could be used today without major modifications.14,18,24,26,36,38,40,45

Two systems fully addressed all seven domains24,45 and six addressed all but the funding domain.14,18,26,36,38,40 Two were rigorously developed,38,40 but the significance of this factor has yet to be tested.

Of the 10 EPC rating systems, most included randomization, blinding, and statistical analysis,58-61,63-68 and five EPCs covered study population, interventions, outcomes, and results as well.60,61,63,65,66

Users wishing to adopt a system for rating the quality of RCTs will need to do so on the basis of the topic under study, whether a scale or checklist is desired, and apparent ease of use.

Observational Studies

Seventeen non-EPC systems concerned observational studies. Of these, the authors categorized four as scales31,32,40,69 and eight as checklists.12-14,45,47,49,50,70 They classified the remaining five as guidance documents.1,71-74 Two EPCs used quality rating systems for evaluating observational studies; these systems were identical to those used for RCTs.

To arrive at a set of high-performing scales or checklists pertaining to observational studies, the authors considered the following five key domains: comparability of subjects, exposure or intervention, outcome measurement, statistical analysis, and funding or sponsorship. As before, they concluded that systems that cover these domains represent acceptable approaches for assessing the quality of observational studies.

Of the 12 scales and checklists the authors reviewed, all included comparability of subjects either fully or in part. Only one covered funding or sponsorship as well as the other four domains the authors considered critical for observational studies. Five systems fully included all four domains other than funding or sponsorship.14,32,40,47,50

Two EPCs evaluated observational studies using a modification of their RCT quality system.60,64 Both addressed the empirically derived domain comparability of subjects, in addition to outcomes, statistical analysis, and results.

In choosing among the six high-performing systems for assessing study quality, users will have to evaluate which system is most appropriate for the task being undertaken, how long it takes to complete each instrument, and its ease of use. The authors were unable to evaluate these three instrument properties in the project.

Box B. Important Domains and Elements for Systems to Rate Quality of Individual Articles

Observational Studies

  • Study question
  • Study population
  • Comparability of subjects*
  • Exposure or intervention*
  • Outcome measurement*
  • Statistical analysis*
  • Results
  • Discussion
  • Funding or sponsorship*

Diagnostic Test Studies

  • Study population*
  • Adequate description of test*
  • Appropriate reference standard*
  • Blinded comparison of test and reference*
  • Avoidance of verification bias*

* Key domain (set in italics in the original).

Studies of Diagnostic Tests

Of the 15 non-EPC systems identified for assessing the quality of diagnostic studies, six are checklists.12,14,49,75-78 Five domains are key for making judgments about the quality of diagnostic test reports: study population, adequate description of the test, appropriate reference standard, blinded comparison of test and reference, and avoidance of verification bias. Three checklists met all these criteria.49,77,78 Two others did not address test description, but this omission is easily remedied should users wish to put these systems into practice.12,14 The oldest system appears to be too incomplete for wide use.75,76

With one exception, the three EPCs that evaluated the quality of diagnostic test studies included all five domains either fully or in part.59,68,79,80 The one EPC that omitted an adequate test description probably included this information apart from its quality rating measures.79

Systems for Grading the Strength of a Body of Evidence

The authors reviewed 40 systems that addressed grading the strength of a body of evidence: 34 from sources other than AHRQ EPCs and 6 from the EPCs. Their evaluation criteria involved three domains—quality, quantity, and consistency (Box C)—that are well-established variables for characterizing how confidently one can conclude that a body of knowledge provides information on which clinicians or policymakers can act.

The 34 non-EPC systems incorporated quality, quantity, and consistency to varying degrees. Seven systems fully addressed the quality, quantity, and consistency domains.11,81-86 Nine others incorporated the three domains at least in part.12,14,39,70,87-91

Of the six EPC grading systems, only one incorporated quality, quantity, and consistency.93 Four others included quality and quantity either fully or partially.59-60,67-68 The one remaining EPC system included quantity; study quality is measured as part of its literature review process, but this domain appears not to be directly incorporated into the grading system.66

Box C. Important Domains and Elements for Systems to Grade the Strength of Evidence

Quality: The aggregate of quality ratings for individual studies, predicated on the extent to which bias was minimized.

Quantity: Magnitude of effect, numbers of studies, and sample size or power.

Consistency: For any given topic, the extent to which similar findings are reported using similar and different study designs.
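To make the three domains in Box C concrete, the sketch below summarizes a small body of evidence. It is one possible reading of the definitions, not a system the authors reviewed: the `Study` fields, the use of a mean as the quality aggregate, and the example studies are all hypothetical, and magnitude of effect is left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Study:
    design: str            # e.g. "RCT" or "observational"
    sample_size: int
    quality_score: float   # 0.0 to 1.0 from some study-level instrument (assumed)
    effect_direction: int  # +1 benefit, 0 no effect, -1 harm


def summarize_evidence(studies):
    """Summarize quality, quantity, and consistency for a list of studies."""
    if not studies:
        raise ValueError("at least one study is required")

    by_design = defaultdict(list)
    for study in studies:
        by_design[study.design].append(study)

    # Consistency within a design: all studies of that design agree in direction.
    consistent_within = all(
        len({s.effect_direction for s in group}) == 1
        for group in by_design.values()
    )
    # Consistency across designs: every study agrees in direction.
    consistent_across = len({s.effect_direction for s in studies}) == 1

    return {
        "quality": sum(s.quality_score for s in studies) / len(studies),
        "n_studies": len(studies),
        "total_sample_size": sum(s.sample_size for s in studies),
        "consistent_within_designs": consistent_within,
        "consistent_across_designs": consistent_across,
    }


# Hypothetical studies, for illustration only.
example = [
    Study("observational", 1200, 0.6, +1),
    Study("observational", 800, 0.7, +1),
    Study("RCT", 500, 0.9, 0),
]
summary = summarize_evidence(example)
print(summary)
```

In this made-up example the findings agree within each study design but not across designs, which is the kind of situation in which a grading system alone may not settle how strong the evidence is.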

Discussion


Identification of Systems

The authors identified 1,602 articles, reports, and other materials from their literature searches, web searches, referrals from their technical expert advisory group, suggestions from independent peer reviewers of an earlier version of this report, and a previous project conducted by the RTI-UNC EPC. In the end, the authors' formal literature searches were the least productive source of systems for this report. Of the more than 120 systems they eventually reviewed that dealt with either quality of individual articles or strength of bodies of evidence, the searches per se generated a total of 30 systems that they could review, describe, and evaluate. Many articles from the searches related to study quality were essentially reports of primary studies or reviews that discussed "the quality of the data"; few addressed evaluating study quality itself.

The literature search was most problematic for identifying systems to grade the strength of a body of evidence. Medical Subject Headings (MeSH) terms were not very sensitive for identifying such systems or instruments. The authors attribute this phenomenon to the lag in development of MeSH terms specific for the evidence-based medicine field.

For those involved in evidence-based practice and research, the authors caution that they may not find it productive simply to search for quality rating or evidence grading schemes through standard (systematic) literature searches. This is one reason that the authors are comfortable with identifying a set of instruments or systems that meet reasonably rigorous standards for use in rating study quality and grading bodies of evidence. Little is to be gained by directing teams seeking to produce systematic reviews or technology assessments (or indeed clinical practice guidelines) to initiate wholly new literature searches in these areas.

At the moment, the authors cannot provide concrete suggestions for efficient search strategies on this topic. Some advances must await expanded options for coding the peer-reviewed literature. Meanwhile, the authors suggest that investigators wishing to build on these efforts might well consider tactics involving citation analysis and extensive contact with researchers and guideline developers to identify the rating systems they are presently using. In this regard, the efforts of at least some AHRQ-supported EPCs will be instructive.


Factors Important in Developing and Using Rating Systems

Distinctions Among Types of Studies, Evaluation Criteria, and Systems

The authors decided early on that comparing and contrasting study quality systems without differentiating among study types was likely to be less revealing or productive than assessing quality for systematic reviews, RCTs, observational studies, and studies of diagnostic tests independently. In the worst case, in fact, combining all such systems into a single evaluation framework risked nontrivial confusion and misleading conclusions, and they were not willing to take the chance that users of this report would conclude that "a single system" would suit all purposes. That is clearly not the case.

The authors defined quality based on certain critical domains, which comprised one or more elements. Some were based directly on empirical results that show that bias can arise when certain design elements are not met; they considered these factors as critical elements for the evaluation. Other domains or elements were based on best practices in the design and conduct of research studies. These are widely accepted methodologic standards, and investigators (especially for RCTs and observational studies) would probably be regarded as remiss if they did not observe them. The authors' evaluation of study quality systems was done, therefore, against rigorous criteria.

Finally, they contrasted systems on descriptive factors such as whether the system was a scale, checklist, or guidance document; how rigorously it was developed; whether instructions were provided for its use; and similar factors. This approach enabled the authors to home in on scales and checklists as the more likely methods for rating articles that might be adopted more or less as is.

Numbers of Quality Rating Systems

The authors identified at least three times as many scales and checklists for rating the quality of RCTs as for other types of studies. Ongoing methodological work addressing the quality of observational and diagnostic test studies will likely affect both the number and the sophistication of these systems. Thus, the findings and conclusions with respect to these latter types of studies may need to be readdressed once results from more methodological studies in these areas are available.

Challenges of Rating Observational Studies

An observational study by its very nature "observes" what happens to individuals. Thus, to prevent selection bias, the comparison groups in an observational study are supposed to be as similar as possible except for the factors under study. For investigators to derive a valid result from their observational studies, they must achieve this comparability between study groups (and, for some types of prospective studies, maintain it by minimizing differential attrition). Ensuring adequate comparability between study groups in an observational study is difficult, both when the project is being designed and upon review after the work has been published. The authors therefore raise the question of whether researchers without methodologic training can identify when potential selection bias or other biases more common with observational studies have occurred.

Instrument Length

Older systems for rating individual articles tended to be most inclusive for the quality domains the authors chose to assess.24,45 However, these systems also tended to be very long and potentially cumbersome to complete. Shorter instruments have the obvious advantage of brevity, and some data suggest that they will provide sufficient information on study quality. Simply asking about three domains (randomization, blinding, and withdrawals) apparently can differentiate between higher- and lower-quality RCTs that evaluate drug efficacy.34

The movement from longer, more inclusive instruments to shorter ones is a pattern observed throughout the health services research world for at least 25 years, particularly in areas relating to the assessment of health status and health-related quality of life. Thus, this model is not surprising in the field of evidence-based practice and measurement. However, the lesson to be drawn from efforts to derive shorter, but equivalently reliable and valid, instruments from longer ones (with proven reliability and validity) is that substantial empirical work is needed to ensure that the shorter forms operate as intended. More generally, the authors are not convinced that shorter instruments per se will always be better, unless demonstrated in future empirical studies.

Reporting Guidelines

Reporting guidelines such as the CONSORT, QUOROM, and forthcoming STARD statements are not to be used for assessing the quality of RCTs, systematic reviews, or studies of diagnostic tests, respectively. However, the statements can be expected to lead to better reporting and two downstream benefits. First, the unavoidable tension (when assessing study quality) between the actual study design, conduct, and analysis and the reporting of these traits may diminish. Second, if researchers consider these guidelines at the outset of their work, they are likely to have better designed studies that will be easier to understand when the work is published.

Conflicting Findings When Bodies of Evidence Contain Different Types of Studies

A significant challenge arises in evaluating a body of knowledge comprising observational and RCT data. A contemporary case in point is the association between hormone replacement therapy (HRT) and cardiovascular risk. Several observational studies but only one large and two small RCTs have examined the association between HRT and secondary prevention of cardiovascular disease for older women with preexisting heart disease. In terms of quantity, the number of studies and participants is high for the observational studies and modest for the RCTs. Results are fairly consistent across the observational studies and across the RCTs, but between the two types of studies the results conflict. Observational studies show a treatment benefit, but the three RCTs showed no evidence that hormone therapy was beneficial for women with established cardiovascular disease.

Most experts would agree that RCTs minimize an important potential bias in observational studies, namely selection bias. However, experts also prefer more studies with larger aggregate samples and/or with samples that address more diverse patient populations and practice settings—often the hallmark of observational studies. The inherent tension between these factors is clear. The lesson the authors draw is that a system for grading the strength of evidence, in and of itself and no matter how good it is, may not completely resolve the tension. Users, practitioners, and policymakers may need to consider these issues in light of the broader clinical or policy questions they are trying to solve.


Selecting Systems for Use Today: A "Best Practices" Orientation

Overall, many systems covered most of the domains that are considered generally informative for assessing study quality. From this set, the authors identified 19 generic systems that fully address their key quality domains (with the exception of funding or sponsorship for several systems).3,6-8,12,14,18,24,26,32,36,38,40,45,47,49,50,77,78 Three systems were used for both RCTs and observational studies.

In the authors' judgment, those who plan to incorporate study quality into a systematic review, evidence report, or technology assessment can use one or more of these 19 systems as a starting point, being sure to take into account the types of study designs occurring in the articles under review. Other considerations for selecting or developing study quality systems include the key methodological issues specific to the topic under study, the available time for completing the review (some systems seem rather complex to complete), and whether the preference is for a scale or a checklist. They caution that systems used to rate the quality of both RCTs and observational studies—what they refer to as "one size fits all" quality assessments—may prove to be difficult to use and, in the end, may measure study quality less precisely than desired.

The authors identified seven systems that fully addressed all three domains for grading the strength of a body of evidence. The earliest system was published in 1994 (Ref. 81); the remaining systems were published in 1999 (Ref. 11) and 2000 (Refs. 82-86), indicating that this is a rapidly evolving field.

Systems for grading the strength of a body of evidence are much less uniform than those for rating study quality. This variability complicates the job of selecting one or more systems that might be put into use today. Two properties of these systems stand out. Consistency has only recently become an integral part of the systems reviewed in this area. The authors see this as a useful advance. Also continuing is the use of a study design hierarchy to define study quality as an element of grading overall strength of evidence. However, reliance on such a hierarchy without consideration of the domains discussed throughout this report is increasingly seen as unacceptable. As with the quality rating systems, selecting among the evidence grading systems will depend on the reason for measuring evidence strength, the type of studies that are being summarized, and the structure of the review panel. Some systems appear to be rather cumbersome to use and may require substantial staff, time, and financial resources.

Although several EPCs used methods that met the authors' criteria at least in part, these were topic-specific applications (or modifications) of generic parent instruments. The same is generally true of efforts to grade the overall strength of evidence. The authors refer users interested in systems deliberately focused on a specific clinical condition or technology to the citations given in the main report.


Recommendations for Future Research

Despite being able to identify various rating and grading systems that can more or less be taken off the shelf for use today, the authors found many areas in which information or empirical documentation was lacking. They recommend that future research be directed to the topics listed below, because until these research gaps are bridged, those wishing to produce authoritative systematic reviews or technology assessments will be somewhat hindered in this phase of their work. Specifically, they highlight the need for work on:

  • Identifying and resolving quality rating issues pertaining to observational studies.
  • Evaluating inter-rater reliability of both quality rating and strength-of-evidence grading systems.
  • Comparing the quality ratings from different systems applied to articles on a single clinical or technology topic.
  • Similarly, comparing strength-of-evidence grades from different systems applied to a single body of evidence on a given topic.
  • Determining what factors truly make a difference in final quality scores for individual articles (and by extension a difference in how quality is judged for bodies of evidence as a whole).
  • Testing shorter forms in terms of reliability, reproducibility, and validity.
  • Testing applications of these approaches for "less traditional" bodies of evidence (i.e., beyond preventive services, diagnostic tests, and therapies)—for instance, for systematic reviews of disease risk factors, screening tests (as contrasted with tests also used for diagnosis), and counseling interventions.
  • Assessing whether the study quality grids that the authors developed are useful for discriminating among studies of varying quality and, if so, refining and testing the systems further using typical instrument development techniques (including testing the study quality grids against the instruments they considered to be "high quality").
  • Comparing and contrasting approaches to rating quality and grading evidence strength in the United States and abroad, because of the substantial attention being given to this work outside this country; such work would identify what advances are taking place in the international community and help determine where these are relevant to the U.S. scene.


