Proof and Policy from Medical Research Evidence (continued)
Table 4. Sources and Designs of Research
Primary Studies in Humans
Numerous rating schemes, in the form of checklists and scales, can help delineate the types of research that are most appropriate to answer particular questions. There are also multiple rating schemes for appraising particular study designs such as randomized trials. These approaches chiefly grade the quality of individual studies, but their reliability, validity, feasibility, and utility remain largely unmeasured or quite variable (Sacks, Chalmers, and Smith 1983; Schulz et al. 1994; Guyatt et al. 1995, 1998; Moher et al. 1995; Hadorn et al. 1996; U.S. Preventive Services Task Force 1996; Lohr and Carey 1999; SIGN 1999a, 1999b).
The value of any single piece of medical research evidence is derived from how it fits with and expands previous work and from the study's intrinsic properties (Cooper 1984: 79-113). Integrating an entire body of relevant medical research, and then assessing the strength of that collection of research, is usually more important than critiquing a single piece of research evidence. This often requires piecing together heterogeneous items of direct and indirect evidence. (Medical evidence is considered indirect if two or more bodies of evidence are required to relate the exposure, diagnostic strategy, or intervention to the principal outcome.)
Integrating evidence is invariably a subjective process, dependent on the skills and values of the individuals who are trying to synthesize multiple pieces of diverse medical evidence. Individuals summarizing medical research make judgments about the relevance, legitimacy, and relative uncertainty of particular pieces of evidence, the importance of missing evidence, the soundness of any models for linking evidence, and the appropriateness of conducting a quantitative summary (Mulrow, Langhorne, and Grimshaw 1997). Conclusions of any synthesis of indirect research evidence are inferential and based on a combination of facts, arguments, and analogies. An important pitfall to avoid is confusing lack of high-level evidence with evidence against effectiveness: absence of proof is not the same as proof of absence.
Several frameworks can help guide, standardize, and make explicit the process of synthesizing bodies of medical research evidence (Hill 1965; Naranjo et al. 1981; Cadman et al. 1984; Pere et al. 1986; Sox et al. 1989; Woolf et al. 1990; Woolf 1991; Eddy, Hasselblad, and Shachter 1992; Huff 1992; NHMRC 1995; Fleming and DeMets 1996; Cook et al. 1997; Mulrow, Langhorne, and Grimshaw 1997). An example of a classic framework for assessing a body of evidence relating to harm is given in Table 5 (Hill 1965). Some of these criteria are similar to those noted in Table 4 regarding critical evaluation of individual pieces of evidence relating to harm. However, the framework for synthesizing a body of evidence and for designating the strength of that evidence differs in significant ways: it makes apparent a hierarchy of relevant valid evidence (e.g., experimental evidence in humans) and an emphasis on consistent and coherent results across multiple types and sources of evidence.
Table 5. Framework for Synthesizing Body of Evidence Relating to Harm
In the end, those compiling medical research evidence may be able to define and assign only relatively subjective classifications of the strength of evidence on a given question—such as "excellent" to "poor" or "strong," "moderate," or "weak." For example, "good" evidence may exist when data in individual studies are sufficient for assessing the quality of those findings, when data across studies are consistent, and when they indicate that the intervention in question is superior to alternative treatments. By contrast, evidence may be only "fair" when information from individual studies can be graded but is subject to challenge on quality grounds and/or when reasonably acceptable data across studies are inconsistent in their findings. Finally, a body of evidence may be characterized as "poor" when the number of relevant studies is minimal, when the quality of individual studies is highly suspect because of flaws in design or conduct, or when the evidence is so conflicting that no logical or defensible conclusions can be drawn.
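The verbal grades described above can be read as a simple decision rule. The following Python fragment is a minimal sketch of that reading, not a validated grading instrument: the field names, the `MIN_STUDIES` threshold, and the choice to treat every body of evidence that is neither "good" nor "poor" as "fair" are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass

# Hypothetical threshold: the text says only that the number of relevant
# studies is "minimal"; the value 2 is an assumption for illustration.
MIN_STUDIES = 2


@dataclass
class EvidenceBody:
    """Summary judgments about a body of evidence (all fields are illustrative)."""
    n_relevant_studies: int     # how many relevant studies were found
    quality_gradable: bool      # studies report enough data to assess their quality
    quality_suspect: bool       # quality highly suspect (flaws in design or conduct)
    consistent: bool            # findings agree across studies
    favors_intervention: bool   # findings indicate superiority over alternatives
    irreconcilable: bool        # so conflicting that no defensible conclusion is possible


def grade(body: EvidenceBody) -> str:
    """Return 'good', 'fair', or 'poor' following the verbal definitions in the text."""
    # "Poor": too few studies, highly suspect quality, or hopelessly conflicting results.
    if (body.n_relevant_studies < MIN_STUDIES
            or body.quality_suspect
            or body.irreconcilable):
        return "poor"
    # "Good": quality can be assessed, results are consistent, and they favor the intervention.
    if body.quality_gradable and body.consistent and body.favors_intervention:
        return "good"
    # Assumption: everything in between is treated as "fair".
    return "fair"
```

Even in this reduced form, each input is itself a judgment made by the reviewer, which is why the resulting labels remain relatively subjective.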
Applicability of Medical Research Evidence to Populations or Individuals
Much research evidence applies to probabilities of occurrences in groups or populations and not in individual patients. In either instance, accurate prediction or proof of causality (or both) applicable to real-life settings is difficult and relies on judgment regarding the magnitude of probability and uncertainty (reasonable doubt) that one considers as acceptable proof. For example, even therapies that are "proven effective" will not work in every patient, and therapies or exposures that are "proven harmful" will not harm every patient to whom they are given.
Guides for applying medical research evidence to the individual patient situation call for the following actions (Glasziou et al. 1998; Ross 1998): (a) stratify research findings according to an individual's characteristics (often not possible); (b) ask whether the underlying pathophysiology and presence of comorbid conditions in the individual patient situation are so different that the research is not applicable; (c) assess whether the intervention or exposure in the real-life setting approximates that tested in research; (d) estimate benefits and harms from research obtained from groups, but apply those estimates based on established knowledge of the individual's characteristics or risks; and (e) take into account individual preferences, competing priorities, and resources.
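Action (d) can be illustrated with a small calculation. The sketch below assumes, purely for illustration, that the relative risk reduction observed in group-level research is the same for every patient and that the rate of treatment-induced harm does not depend on the patient's baseline risk. Neither assumption is guaranteed to hold in practice, and all function names and numbers are hypothetical.

```python
def absolute_risk_reduction(baseline_risk: float,
                            relative_risk_reduction: float) -> float:
    """Expected absolute benefit for a patient with the given baseline risk.

    Assumes the group-level relative risk reduction applies unchanged
    to this individual (an illustrative assumption).
    """
    for name, value in (("baseline_risk", baseline_risk),
                        ("relative_risk_reduction", relative_risk_reduction)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return baseline_risk * relative_risk_reduction


def net_benefit(baseline_risk: float,
                relative_risk_reduction: float,
                harm_rate: float) -> float:
    """Absolute benefit minus an assumed constant rate of treatment-induced harm."""
    return absolute_risk_reduction(baseline_risk, relative_risk_reduction) - harm_rate


def number_needed_to_treat(baseline_risk: float,
                           relative_risk_reduction: float) -> float:
    """Patients who must be treated for one to benefit; infinite if there is no benefit."""
    arr = absolute_risk_reduction(baseline_risk, relative_risk_reduction)
    return float("inf") if arr == 0 else 1.0 / arr
```

With a hypothetical relative risk reduction of 0.25 and a harm rate of 0.01, a patient whose baseline risk is 0.20 has an expected net benefit of about 0.04, whereas a patient whose baseline risk is 0.02 has an expected net benefit of about -0.005. The same group-level evidence thus favors treatment for the first patient and not for the second.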
Recommendations Based on Evidence: Guidelines versus Standards
Medical recommendations based on research evidence can be formed as guidelines or standards. Clinical practice guidelines are "systematically developed statements to assist practitioner and patient decisions about appropriate health care for specific clinical circumstances" (Institute of Medicine 1990: 38). Methods of formulating guidelines may differ in several respects, including methods for identifying, appraising, and ranking relevant research evidence; models for integrating indirect evidence; methods for incorporating experience and opinion; whether harms, costs, and values are explicitly considered; and sponsorship (ibid.).
The four critical concepts to understand about the creation of defensible guidelines are (1) that the development process is open, documented, and reproducible; (2) that the resulting product or products can be of use to both clinicians and patients; (3) that the concept of "appropriateness" of services is well reflected in the guideline (where appropriateness means essentially that the potential health benefits of the service exceed the potential harms or risks by a margin sufficiently large that the service is worth providing); and (4) that the guideline relates specifically to clearly defined clinical issues.
Explicit criteria have been available for a decade to use in assessing the soundness of practice guidelines and in directing the development of new guidelines from systematic reviews of evidence (Woolf 1992; Carter et al. 1995; Cluzeau and Littlejohns 1999; Shaneyfelt, Mayo-Smith, and Rothwangl 1999). Such criteria emphasize two broad attributes of guidelines: that they be credible with practitioners, patients, payers, and policy makers, and that the developers be accountable for the conclusions they draw from the evidence and for the recommendations they base on those conclusions.
Important criteria concerning the process of guideline development call for developers to write clearly, to use a multidisciplinary approach, to date their work and identify a point in the future when the guidelines ought to be revisited in light of possible new evidence, and to document the entire process. Equally important criteria about the substance of the guideline require that its clinical scope be explicit, that it allow appropriate flexibility for clinical decision makers when medical evidence is not clear-cut, and that it have acceptable reliability and validity.
Arguably the most important attribute of guidelines is validity. That is, guidelines should, when followed, lead to the health and cost outcomes expected for them. Elements of their validity consider the substance and quality of the research evidence cited, the ways that such evidence is evaluated, the strength of the collective body of evidence in question, the intensity or force of recommendations in light of the strength of evidence, and judgments about likely net benefits to patient populations. In some instances, empirical evaluations of the validity and utility of specific guideline recommendations may be available.
Whether created or adapted locally or nationally, most guidelines are an amalgam of clinical experience, expert opinion, and research evidence (Institute of Medicine 1992; Woolf 1999). In the United States, there are literally thousands of practice guidelines. Not surprisingly, some of these vary in content and conclusions, conflict with one another, or both.
Guidelines most often apply to the general and not the particular. They require extrapolation to individual circumstance. Whether individual circumstances warrant a different standard can be judged only case by case. Following evidence-based guidelines may generally but not always assure good medical care; diverging from guidelines does not always signal poor care (Mulrow 1996; Weingarten 1997; Woolf et al. 1999).
Unlike guidelines, which are recommendations for best practices, standards are practices that are medically necessary and services that any practitioner under any circumstance would be required to render (Brook 1991; Leape 1995; Eddy 1996). Guidelines are meant to be flexible and amenable to tailoring to meet individual circumstances; standards are meant to be inflexible and should always be followed, not tailored (Eddy 1996). Formulating standards rather than guidelines requires a higher bar. One needs to consider the relative effectiveness and harms of a wide variety of diagnostic and treatment options for multiple possible medical conditions that a patient or population may face. One also needs to assess feasibility and costs of those options.
Evidence-based guidelines that focus on single conditions likely will inform, but not determine, standards of medical care that our society deems necessary. Likewise, research evidence can and should inform standards of care, but research evidence in and of itself will invariably be inadequate to establish standards because standards will require priority setting based on cost and value judgments.
At the present time, consumers, health care providers, judges, and policy makers lack ready, scientific means for comparing the relative effectiveness and harms of various types of medical care (Woolf 1999). Such information is critical for setting priorities and standards. An irony of our medical information age and of evidence-based medicine is that we have thousands of studies and systematic summaries of those studies that focus on effects of specific exposures or treatments on particular outcomes. Although valuable, this narrowly focused repository of data provides a piecemeal rather than an integrative approach when choosing among competing priorities and setting the standards that are most likely to improve health.
Moreover, we have little scientific work from the perspective of defining global or national health goals and examining the relative effectiveness of various strategies for achieving those goals. A recent suggestion regarding the creation of a bibliographic research evidence collection center, paired with a simulation modeling program, could aid better estimation of the potential benefits and harms of competing health care strategies (ibid.). Such projections could help policy makers, clinicians, and patients give due priority to the strategies most likely to improve health. Regardless, we need greater emphasis on formulating broader evidence-based guidelines and standards that at least (a) address clusters of conditions (e.g., cardiovascular disease or cancer) rather than single specific conditions and (b) define and translate harms as well as they define and translate benefits. For evidence-based medicine, a final irony may be that these more integrative approaches are sorely needed, yet they rely on more assumptions than do simple but less integrative techniques.
All these factors point to an important conclusion about the role of evidence-based practice and guidelines in the courts today. The gaps and deficiencies in current guidelines make them difficult to apply as the definitive information for legal or judicial decision making, just as they may often be difficult to implement in medical decision making. The field of evidence-based medicine is progressing rapidly in clinical substance and methodology, but the day has not yet come when it undergirds all that is or could be done in medicine or the medicolegal context.
Medical research is continually evolving and accumulating; yesterday's precedent may be today's anachronism. Interpreting and judging medical research evidence involves explicit as well as subjective processes. Although neither research evidence nor its synthesis is always neutral and objective, we do have evidence-based techniques that aid comprehensive collation, unbiased and explicit evaluation, and systematic summarization of available research. For example, hierarchies of types of research evidence that are relevant for different types of questions have been developed. In addition, techniques exist by which to appraise the relevance and validity of individual pieces as well as bodies of research evidence and to link them to guidelines and standards.
Such developments in evidence-based medicine are an aid, not a panacea, for definitively establishing benefits and harms of medical care and prognoses of patients. First, interpreting and judging continually evolving medical research involves subjective processes that are inherently dependent on the "eye of the observer." Second, although methods of rating and integrating research evidence are evolving and being tested, any single or uniform "best method" for such a complex task is unlikely to be available in the near future (if ever). Third, guidelines, even when based firmly on high-quality research, are not always relevant or valid for individual situations; nor, usually, are they adequate for establishing medical necessity across different conditions. Fourth, much research applies to groups of patients or populations and not to individuals. Fifth, for both medicine and law, accurate prediction and/or absolute proof of causality applicable to individuals or to real-life settings are difficult, if not impossible, in many instances. Finally, the contributions of medical research evidence to proof or policy for any given clinical (or legal) situation will come in a context in which judgment and values, understanding of probability, and tolerance for uncertainty all have their place.
References
Ad Hoc Working Group for Critical Appraisal of the Medical Literature. 1987. Academia and Clinic: A Proposal for More Informative Abstracts of Clinical Articles. Annals of Internal Medicine 106:598-604.
Carter, A. O., R. N. Battista, M. J. Hodge, S. Lewis, A. Basinski, and D. Davis. 1995. Report on Activities and Attitudes of Organizations Active in the Clinical Practice Guidelines Field. Canadian Medical Association Journal 153:901-907.
Cluzeau, F. A., and P. Littlejohns. 1999. Appraising Clinical Practice Guidelines in England and Wales: The Development of a Methodologic Framework and Its Application to Policy. Journal of Quality Improvement 25:514-521.
Cook, D. J., D. L. Sackett, and W. O. Spitzer. 1995. Methodologic Guidelines for Systematic Reviews of Randomized Controlled Trials in Health Care from the Potsdam Conference on Meta-Analysis. Journal of Clinical Epidemiology 48:167-171.
Guyatt, G. H., D. L. Sackett, J. C. Sinclair, R. Hayward, D. J. Cook, R. J. Cook, et al. 1995. Users' Guides to the Medical Literature IX: A Method for Grading Health Care Recommendations. Evidence-Based Medicine Working Group. Journal of the American Medical Association 274:1800-1804.
Huff, J. 1992. A Historical Perspective on the Classification Developed and Used for Chemical Carcinogens by the National Toxicology Program during 1983-1992. Scandinavian Journal of Work and Environmental Health 18(suppl.):74-82.
Michaud, C., and C. J. L. Murray. 1996. Resources for Health Research and Development in 1992: A Global Overview. In Investing in Health Research and Development. Report of the Ad Hoc Committee on Health Research Relating to Future Intervention Options. Geneva: World Health Organization.
Moher, D., A. R. Jadad, G. Nichol, M. Penman, P. Tugwell, and S. Walsh. 1995. Assessing the Quality of Randomized Controlled Trials: An Annotated Bibliography of Scales and Checklists. Controlled Clinical Trials 16:62-73.
Naranjo, C. A., U. Busto, E. M. Sellers, P. Sandor, I. Ruiz, E. A. Roberts, et al. 1981. A Method for Estimating the Probability of Adverse Drug Reactions. Clinical Pharmacology and Therapeutics 30:239-245.
National Health and Medical Research Council (NHMRC), Quality of Care and Health Outcomes Committee. 1995. Guidelines for the Development and Implementation of Clinical Practice Guidelines. Canberra: Australian Government Publishing.
Schulz, K. F., I. Chalmers, D. A. Grimes, and D. G. Altman. 1994. Assessing the Quality of Randomization from Reports of Controlled Trials Published in Obstetrics and Gynecology Journals. Journal of the American Medical Association 272:125-128.
Scottish Intercollegiate Guidelines Network (SIGN). 1999b. SIGN Guidelines: An Introduction to SIGN Methodology for the Development of Evidence-Based Clinical Guidelines. Report no. 39. Edinburgh: SIGN.
Shaneyfelt, T. M., M. F. Mayo-Smith, and J. Rothwangl. 1999. Are Guidelines Following Guidelines? The Methodological Quality of Clinical Practice Guidelines in the Peer-Reviewed Literature. Journal of the American Medical Association 281:1900-1905.
Silverman, W. A. 1993. Doing More Good Than Harm. In Doing More Good Than Harm: The Evaluation of Health Care Interventions, ed. K. S. Warren and F. Mosteller. New York: New York Academy of Sciences, pp. 5-11.
Woolf, S. H. 1991. Manual for Clinical Practice Guideline Development. Publication no. 91-0007. Rockville, MD: U.S. Department of Health and Human Services, Public Health Service, Agency for Health Care Policy and Research.
Woolf, S. H., R. N. Battista, G. M. Anderson, A. G. Logan, and E. Wang. 1990. Assessing the Clinical Effectiveness of Preventive Maneuvers: Analytic Principles and Systematic Methods in Reviewing Evidence and Developing Clinical Practice Recommendations. A Report by the Canadian Task Force on the Periodic Health Examination. Journal of Clinical Epidemiology 43:891-905.