Decisions Encountered During Key Task Number 6: Creating Performance Reports

Methodological Considerations in Generating Provider Performance Score

Creating a performance report involves decisions about methods for calculating and categorizing performance scores as well as other, equally important decisions that affect the usability (or evaluability) of the report. A way of distinguishing these two sets of considerations follows:

Usability (or evaluability) considerations focus on the information a user can extract from a performance report. If performance data are displayed in a confusing way, patients may be unable to extract any information at all, or they may misinterpret the data and be misled by the report. Examples of usability decisions include:

  • Should a report contain many performance categories or only a few?
  • Should providers be displayed in order of performance ranking or in some other kind of order (e.g., alphabetical order)?
  • Should numbers indicate performance, or should some other kind of symbol be used (e.g., star ratings)?
  • How should the concept of statistical uncertainty be displayed in order to maximize public understanding?
  • How many measures should be reported?

    These usability decisions can draw guidance from studies that have investigated which kinds of data displays are most understandable to patients. Separate AHRQ reports by Hibbard and Sofaer provide guidance on usability and evaluability decisions.1-2

The methodological considerations covered here focus on whether the information in a performance report might be misleading, even when this information is understood perfectly by the patient. In other words, even if a performance report is so clear that patients and providers can extract and understand all the information it contains, this performance information may contain fundamental problems. Providers who are truly higher performing may be reported as lower performing, and vice versa. The degree to which these problems might be present depends on the available data and the methodological decisions a Chartered Value Exchange (CVE) makes.

In creating performance reports, decisions about usability and other methodological issues are linked to each other. The desirability of each methodological option may change, depending on decisions about usability, and vice versa. For example, a CVE may initially make a usability-based decision to create a performance report that ranks providers in five performance categories on two composite measures of performance. However, once the performance data are computed into performance scores for providers, the CVE may find that the number of misclassified providers is unacceptably high. There also may be methodological problems with the composites (e.g., the individual measures may not "agree" with each other—and therefore cannot create a "coherent" composite measure).

Faced with methodological issues, a CVE may want to compromise between usability and methodological considerations. Providers might be reported in only three categories instead of five, and eight individual measures might be reported instead of two composites. But these changes may make the report more confusing to patients. Striking the right balance between usability and other methodological considerations is a "value judgment" that may be negotiated among CVE stakeholders.


A. Will performance be reported at single points in time, or as trends?

Once provider performance scores are calculated, these scores can be reported in many different formats. This section and the following two sections illustrate the methodological tradeoffs a CVE may encounter when choosing among some commonly discussed ways of reporting performance. These tradeoffs can apply regardless of which performance measure is being reported.

Here, we discuss reporting performance at a single point in time or reporting trends. These two options are not mutually exclusive, and they can be combined in a performance report.

  1. Option 1: Report performance at a single point in time ("achieved performance"). For each provider included in a report, a CVE may report a performance level representing care delivered over a single period. The period is usually chosen to be as recent as possible. If provider performance is not changing, reported performance may predict the kind of future performance a patient is likely to receive.


    • The reporting period can be lengthened to help deal with misclassification risk (the "rolling average" approach, discussed in the section on Task Number 5).


    • If provider performance is changing, "achieved performance" may be misleading due to the lag between the time health care is delivered and the time performance is reported. In other words, past performance may not accurately predict the future performance a patient is likely to receive.

  2. Option 2: Report performance change over time ("performance trends"). Performance change is rarely reported on its own, but performance change can be added to reports of "achieved performance."


    • If provider performance is changing, then reporting which providers are improving (or not) may enable patients to better predict the kind of performance they are likely to receive from a provider.


    • Reporting performance change may increase the complexity of methods for dealing with misclassification risk (go to Introduction) and performing case mix adjustment (go to section on Task Number 5). Consultation with a statistician will be necessary.
    • It may not be possible to adequately separate changes in "true" performance from random variation in measured performance (caused by chance alone). This may result in unacceptably high risk of misclassification due to chance.

Return to Contents


B. How will numeric performance scores be reported?

As in the preceding section, the following options are not mutually exclusive. A performance report can simultaneously use combinations of these strategies.

  1. Option 1: Report numeric performance scores. An example of a numeric performance score is the actual percentage of patients receiving a measured service. This numeric score is inherently meaningful. Other numeric scores, such as average ratings on a patient satisfaction survey, may not be inherently meaningful.


    • Provides detailed performance score data to patients. For example, a patient will be able to see that one provider delivers a measured service to 80 percent of patients while another only delivers the service to 79 percent.


    • The performance classes implied by numeric scores may lead to unacceptably high rates of misclassification due to chance (go to Introduction). In the 80 percent versus 79 percent example, this 1 percentage point difference may be almost entirely due to chance. But reporting numeric scores may mislead patients to believe that the provider with the 80 percent observed score has a higher "true" score than the provider measured at 79 percent.

  2. Option 2: Report performance scores with a representation of measurement error. Measurement error can be represented as a numeric range of uncertainty (often a 95% confidence interval). Alternatively, measurement error can be represented in a categorical fashion. For example, performance scores with high measurement error might be marked with a special symbol.


    • Still provides detailed performance score data to patients, but may also convey a sense of the range over which "true" performance is likely to be located.
    • Even if no range of uncertainty is included, gives "fair warning" to patients through the use of special symbols to mark scores with high measurement error.


    • Numeric ranges of uncertainty can be very difficult to understand, even for sophisticated users of performance reports.
    • Ranges of uncertainty may not communicate the right information to patients who are trying to compare providers. Numeric ranges of uncertainty based on statistical significance are only valid for comparing one provider's performance to a fixed (nonrandom) benchmark. However, a comparison between two providers is a comparison between two random variables (because measurement error is present for both providers). Uncertainty about the performance difference between every pair of providers is what determines the chances that these providers are misclassified relative to each other. Without a series of charts showing the range of uncertainty about performance differences between all possible combinations of providers, users cannot know the likelihood of misclassification (even if patients understand the report and use it exactly as instructed). For further explanation on this point, it may be advisable to consult a statistician.
    • Despite the instruction, "Do not use this report to make comparisons between providers," patients may still compare providers. If comparisons are still made, the effective risk of misclassification may be unacceptably high.
    • Patients may not understand what is meant by a special symbol of measurement error. They may ignore this special symbol and be misled about relative provider performance.
    • There is no "best" range of numeric uncertainty to display. The 95 percent confidence interval, although conventional, is essentially the result of a value judgment.

  3. Option 3: Report "shrunken" performance scores. "Shrunken" performance refers to performance estimates produced by special statistical techniques that incorporate measurement error into the performance score itself. When within-provider error is high (e.g., because of low numbers of observations), shrunken estimates "shrink" performance scores back toward the average of the entire provider distribution. When within-provider error is lower, the shrunken estimates still pull performance back toward the mean, but the amount of this pulling is lower.vii

    Put another way, each shrunken performance score is a weighted average of each provider's performance and the average performance of all providers. When there is high uncertainty about an individual provider's score, the average performance of all providers is more heavily weighted. When there is less uncertainty about a provider's score, the average performance of all providers is less heavily weighted (so the provider's shrunken score is close to the raw score). Shrunken performance scores and a related strategy called "mixed performance classification" are discussed in more detail in the section on Task Number 5.


    • Relative to reports of raw performance scores, shrunken scores may be less likely to mislead patients about relative provider performance. Because shrunken scores incorporate uncertainty about provider performance and about the entire provider population, shrunken scores provide better predictions of future provider performance.


    • Generating shrunken performance estimates is methodologically complex and can be difficult to explain. Stakeholders may not understand why the performance reported for a given provider incorporates information about the entire population of providers in a report.
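The weighted-average logic behind shrunken scores can be sketched with a simplified normal-normal (empirical Bayes) estimator. All values here are hypothetical, and real implementations estimate the variance components from the data rather than assuming them:

```python
def shrunken_score(raw_score, se, population_mean, population_sd):
    """Shrink a provider's raw score toward the population mean.

    The weighted average described in the text: the weight on the
    provider's own score grows as its measurement error (se) shrinks.
    Simplified sketch; real methods estimate these variances from data.
    """
    signal = population_sd ** 2          # between-provider variance
    noise = se ** 2                      # within-provider variance
    weight = signal / (signal + noise)   # reliability of the raw score
    return weight * raw_score + (1 - weight) * population_mean

pop_mean, pop_sd = 0.75, 0.05

# Provider A: few observations, noisy estimate -> pulled strongly toward 0.75.
a = shrunken_score(0.90, se=0.10, population_mean=pop_mean, population_sd=pop_sd)
# Provider B: many observations, precise estimate -> stays near its raw 0.90.
b = shrunken_score(0.90, se=0.02, population_mean=pop_mean, population_sd=pop_sd)
print(round(a, 3), round(b, 3))
```

Both providers have the same raw score, but the noisier estimate is shrunk much further toward the population mean.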

  4. Option 4: Report provider rankings. Ranking can be done by ordering providers from highest to lowest based on their performance scores.


    • Easy to understand; facilitates comparison between providers.
    • High degree of detail. For example, patients can see which provider was ranked seventh and which was ranked eighth.


    • The rate of provider misclassification due to chance is likely to be unacceptably high because ranking maximizes the number of reporting categories: each rank defines a category. As a rule, the more reporting categories are included in a report, the higher the misclassification risk (go to "classification system" in Appendix 2).

Return to Contents


C. How will performance be categorized?

While numeric performance scores and performance rankings implicitly categorize provider performance (i.e., by allowing comparisons between providers, with each score or ranking constituting a "category"), strategies for explicitly categorizing performance are also common.

  1. Option 1: Report categories of performance based on national benchmarks. For example, a CVE might report local providers as having performance in categories defined by the 25th, 50th, and 75th percentiles of national performance.


    • Using benchmarks enables comparison of local provider performance relative to national performance.
    • Reporting performance in a small number of categories may reduce the risk of misclassification due to chance (go to "classification system" in Appendix 2).


    • If nearly all local providers are in the same category relative to national performance (e.g., all are above the 75th percentile), then the report will not be useful to patients in choosing among local providers. On the other hand, it may be reassuring to patients that all local providers are "good enough" on a given measure (assuming providers are indistinguishable because they are all high performers).
    • National benchmarks may not have the intended meaning if measure specifications have been locally modified (go to section on Task Number 2).
    • If categories are too wide (i.e., include a broad range of scores), then meaningful performance variation may be hidden.
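The benchmark-based categorization in Option 1 can be sketched as follows, assuming hypothetical national percentile cutpoints and provider names:

```python
import bisect

# Hypothetical 25th/50th/75th percentiles of national performance.
national_cutpoints = [0.60, 0.72, 0.85]
labels = ["bottom quartile", "below median", "above median", "top quartile"]

def categorize(score, cutpoints=national_cutpoints):
    """Return the national-benchmark category for a local provider's score."""
    return labels[bisect.bisect_right(cutpoints, score)]

local_scores = {"Provider A": 0.58, "Provider B": 0.74, "Provider C": 0.91}
categories = {name: categorize(s) for name, s in local_scores.items()}
print(categories)
```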

  2. Option 2: Report categories of performance based on local benchmarks. For example, a CVE might report local providers as having performance in categories defined by the 25th, 50th, and 75th percentiles of local performance. Under this system, there will always be some providers reported in the highest category of performance and some reported in the lowest.


    • Always includes some providers in each performance category, increasing the likelihood that the report provides useful information to patients choosing among local providers.
    • May motivate performance competition between local providers (however, if this harms professional relationships that benefit patients, it may not be desirable).


    • May make it difficult to compare local provider performance to national benchmarks.

  3. Option 3: Report categories of performance based on tests of statistical significance. Tests of statistical significance compare each provider's observed performance to some reference value. This reference value is often, but does not have to be, the average performance of the entire provider population (local or national). Statistical significance-based thresholds commonly use a 5 percent "level of significance" (or "95% confidence"), but there is nothing special about the 5 percent figure. The level of statistical significance that is acceptable to CVE stakeholders is a value judgment that can be negotiated among the stakeholders of each CVE. The section on Task Number 5 explains how statistical significance relates to misclassification risk.


    • May limit the number of providers who, due to chance alone, are misclassified as having performance that is different from the reference value (usually the mean).


    • When providers' true performance is different from the reference value, this approach may increase the number of such providers who are misclassified as having performance that is the same as the reference value. In other words, more truly high- or low-performing providers will be reported as having average performance.
    • This approach may result in categories that are too wide, especially if only three categories are reported (e.g., above average, average, and below average). Meaningful performance variation may be hidden.
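A minimal sketch of Option 3: classifying each provider against a fixed reference rate with a two-sided z-test at the conventional 5 percent level. The provider numbers are hypothetical, and real reports may use exact or adjusted tests:

```python
import math

def significance_category(successes, n, reference_rate, z_crit=1.96):
    """Classify a provider as above/average/below a fixed reference rate.

    Two-sided z-test of an observed rate against a nonrandom benchmark.
    The 5% significance level (z_crit = 1.96) is conventional but, as
    the text notes, ultimately a value judgment.
    """
    p = successes / n
    se = math.sqrt(reference_rate * (1 - reference_rate) / n)
    z = (p - reference_rate) / se
    if z > z_crit:
        return "above average"
    if z < -z_crit:
        return "below average"
    return "average"

# Hypothetical providers compared to a reference rate of 75%.
print(significance_category(170, 200, 0.75))   # 85% on n=200: distinguishable
print(significance_category(155, 200, 0.75))   # 77.5% on n=200: not distinguishable
```

Note how the second provider, though above the reference rate, is reported as "average" — the wide-middle-category disadvantage described above.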

Return to Contents


D. Will composite measures be used?

Composite measures (also known as "summary measures") combine the performance data from two or more individual performance measures into a single performance score. For example, a provider's performance on four individual measures of diabetes care might be combined into a composite measure of "overall diabetes care."

A separate AHRQ decision guide titled Selecting Quality and Resource Use Measures: A Decision Guide for Community Quality Collaboratives provides a complementary discussion that defines composite measures in more detail and describes their possible uses more broadly.14 Here, the discussion focuses on key methodological decision points regarding composite measures for public reporting:

  • Will composite measures be used?
  • If composites will be used, which individual measures will be combined?
  • For a given collection of individual measures, exactly how will these measures be combined? In other words, how will the composite measure be constructed?

The following options illustrate the tradeoffs involved in making these decisions. These options are not mutually exclusive and may be chosen in various combinations. The first question is considered below, and the second and third questions are presented in the following two sections.

To avoid redundancy, we present only the advantages and disadvantages of the "yes" answer to the question, "Will composite measures be used?"

  1. Option 1: Report composite measures of provider performance. The alternative to this option is to only report performance on individual measures. Relative to reporting performance on individual measures, creating and reporting composites has advantages and disadvantages.


    • Compared to reports of a large number of individual performance measures, reports of a small number of composite measures may be less overwhelming to patients who are trying to choose a provider.viii
    • Use of composites may reduce the risk of performance misclassification due to chance (go to section on Task Number 5).


    • The inherent meaning of individual performance measures may be lost. For example, an individual measure score such as "75% of diabetic patients receive lipid screening" has clear clinical meaning. But the meaning of a composite score such as "75% on overall diabetes quality" is unclear. This concern is less important when the individual measures themselves have no clear inherent meaning (e.g., measures of patient experience).ix This concern may also be less important when categories of performance will be reported, rather than numeric performance scores.
    • By reducing the amount of data detail, performance reports may be less useful to providers who are trying to improve.
    • Patients with particular health conditions may care about specific individual measures. Presenting these measures as composites means that these patients will not be able to see the individual measures.
    • Composite scores may be very sensitive to exactly which measures are included and how they are combined.37 There is no single "right" way to make composites. The best choice for many decisions about composite construction will be uncertain, even when CVE stakeholders have agreed on the value judgments of performance reporting. If sensitivity analysis (redoing the performance report using different but justifiable methods; go to item G, below) reveals that scores change dramatically when alternative composite construction strategies are used, then reporting composite measures may mislead patients. Patients may not know that other possible composite constructions would produce different results.
    • Some types of composites may unintentionally overemphasize certain individual measures and underemphasize others. This can happen when one measure has too much weight (go to discussion of weighting in item F).

Return to Contents


E. If composite measures will be used, which individual measures will be combined?

Once a CVE has decided to use composite measures, the next methodological question is which individual performance measures will be combined into a composite measure.

  1. Option 1: Choose individual measures for inclusion in a composite based on whether they statistically "belong together." Each composite measure contains two or more individual measures. But which measures should be included in a composite? One way to decide is to use special statistical techniques to let the data decide which measures to include. These techniques work by looking for sets of measures that are "correlated" or associated with each other: when a provider does well on one of these measures, the provider tends to also do well on the others. Composites that are constructed in this way are called "reflective" or "latent" composites.38 A statistical technique known as "factor analysis" is a common approach used to identify the measures included in these composites.


    • Using statistical procedures to select the measures that will go into each composite is a relatively automatic process. However, consultation with a statistician may be necessary.
    • There is extensive precedent for this methodology: Such composites are the most common way to present data from patient experience surveys such as CAHPS® (Consumer Assessment of Healthcare Providers and Systems).39
    • Individual measures within a composite will be "correlated." When a provider does well on one of these measures, the provider will also tend to do well on the others.


    • This methodology may result in composite measures that do not make intuitive clinical sense. For example, four individual diabetes measures may be available to a CVE. It might make clinical sense to expect these four measures to form a composite. However, if statistical techniques are used to determine which measures will be included in which composite, these four measures could end up in two or more composites (where they might be combined with measures of depression and cancer screening).
    • This methodology may not identify a composite for every individual measure. Some individual performance measures may not be correlated with the others. These "orphan" measures can be reported individually or excluded from performance reports.
    • This approach relies on complex statistical methodology that may be difficult to explain.
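The correlation idea behind "reflective" composites can be illustrated with simulated data in which two pairs of measures are driven by two underlying quality factors. This is a toy stand-in for full factor analysis, using invented data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical provider-by-measure matrix (50 providers, 4 measures).
# Measures 0 and 1 share one underlying "quality" factor; measures 2
# and 3 share another.
n = 50
factor_a = rng.normal(size=n)
factor_b = rng.normal(size=n)
scores = np.column_stack([
    factor_a + 0.3 * rng.normal(size=n),
    factor_a + 0.3 * rng.normal(size=n),
    factor_b + 0.3 * rng.normal(size=n),
    factor_b + 0.3 * rng.normal(size=n),
])

corr = np.corrcoef(scores, rowvar=False)
# Measures that "belong together" show high pairwise correlation ...
print(round(corr[0, 1], 2))   # high: candidates for one composite
# ... while unrelated measures do not.
print(round(corr[0, 2], 2))   # near zero: should not be combined
```

Factor analysis automates this kind of grouping across many measures; the correlation matrix above is the raw material it works from.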

  2. Option 2: Choose individual measures for inclusion in a composite based on nonstatistical judgment. With this option, a CVE uses its own judgment (clinical or otherwise) to choose which measures to include in a composite. For example, a CVE may decide to make composites that include all measures for a health condition (e.g., all measures for heart disease or all measures for cancer screening or preventive care). The Apgar score for newborns is a commonly used example of a clinical composite that combines a variety of vital signs and physical findings.40 In the development of such composites, statistical correlation between the constituent individual measures is a secondary concern. Composites that are constructed in this way are sometimes called "formative" composites.38


    • When they are based on clinical judgment, these composites make intuitive clinical sense.


    • CVE stakeholders may not agree on which individual performance measures belong in which composite.
    • The individual measures within a composite may not be correlated, which is especially likely when trying to create a single "global" composite that combines all measures.41-42 Worse, individual measures may be inversely correlated, so that high performance on one measure tends to predict low performance on another. When the measures within a composite are not correlated, patients will not be able to see when a provider is truly better on one kind of measure than on another. For example, a provider may be very good at delivering colorectal cancer screening but not as good at delivering cervical cancer screening. If a clinical "cancer screening" composite combines these measures, this difference in performance will be masked. A CVE can check composites for "internal consistency" to see how well the constituent measures are correlated with each other.x
    • If the individual measures within a composite are not statistically correlated with each other, the risk of performance misclassification due to chance may increase.
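One common internal-consistency check mentioned above is Cronbach's alpha, sketched here with made-up data:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha: a standard internal-consistency check for a
    proposed composite.

    `scores` is a providers-by-measures array.  Values near 1 mean the
    constituent measures move together; low or negative values warn
    that the composite may mask real performance differences.
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of measures
    item_vars = scores.var(axis=0, ddof=1).sum() # sum of per-measure variances
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of composite sums
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Hypothetical composite whose three measures track each other closely.
consistent = [[0.80, 0.82, 0.78],
              [0.60, 0.63, 0.58],
              [0.90, 0.88, 0.91],
              [0.50, 0.52, 0.49]]
print(round(cronbach_alpha(consistent), 2))
```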

  3. Option 3: Choose composite measures that have been endorsed by a national body. The National Quality Forum (NQF) has endorsed a small number of composite measures. These include composites for measuring inpatient quality of care that were developed by AHRQ.43 Patient experience surveys also generally include instructions on how data from individual survey items should be combined into specific composite measures.39


    • Documented rationales and usage advice for these measures may be available.


    • Endorsed composites may not cover all the measures a CVE would like to report.
    • A CVE may not have access to all the measures included in an endorsed composite.
    • Even though endorsed composites may be internally consistent in national data, these composites may not be internally consistent within the performance data being reported by a CVE. In other words, individual measures may behave differently in a CVE's local area. In this case, nationally endorsed composites may mask differences in performance within a composite (as discussed in Option 2 earlier in this section).

Return to Contents


F. How will each composite measure be constructed from a given set of individual measures?

Once a CVE has determined which individual performance measures will be combined into a composite measure, the next methodological question is exactly how the composite measure will be constructed from the individual measures. There are many options for calculating composite performance from performance data on a given set of individual measures.

  1. Option 1: Combine performance data from individual measures using a weighted average approach. Once a CVE has identified the measures that will go into a composite, data from these measures can be combined in many different ways. "Weighted averaging approaches" refers to a large family of specific strategies that multiply scores on individual measures by a weight and then take the average of these weighted scores. In creating such a composite, a CVE will need to specify the following:
    • The weights that will be used (i.e., how much each individual measure will matter within a composite). There is no single "best" weighting strategy. When the measures included in a composite are identified based on whether they statistically correlate with each other, the same kinds of statistical techniques can also determine how much weight to give each measure. When measures are included in a composite based on nonstatistical judgment, a common strategy is to give more weight to the measures for which more observations are available. However, many other weighting strategies can be used: equal weighting, weighting based on local health priorities, etc.

      The important thing is to purposefully choose and understand the effects of the weights that are used. Composite measures always weight their constituent measures, implicitly or explicitly. If a CVE does not explicitly consider its weighting strategy, then unintended results may occur. For example, if measures are weighted by their numbers of observations and one measure has many more observations than the others, then this one measure will dominate the entire composite. If a CVE does not intend for one measure to dominate the composite, then the composite may mislead patients who believe the composite reflects all its constituent measures (i.e., the composite measure will have low validity).

    • The way measures will be standardized. Some measures of performance have higher "degrees of difficulty" than others. A measure's degree of difficulty is generally felt to be higher when its average performance score is lower. The reasoning is that if all providers have low performance on a measure, the measure must be difficult. To avoid penalizing providers who have more observations on measures with high degrees of difficulty, the individual measures can be "standardized" so that they have equal average scores. Other standardization techniques are also possible, such as using the exact same measure weighting scheme for every provider in a report.43 From a mathematical perspective, standardizing the measures within a composite is no different from performing case mix adjustment; the kinds of techniques that can be used are the same. However, unlike case mix adjustment, standardizing the measures within a composite is unlikely to cause controversy. (For weighted average composites, standardization is necessary to avoid unintentional systematic performance misclassification.)
    • The way missing data will be handled in computing composite scores. When a provider has no data on an individual measure, this can affect the calculation of a composite that contains this measure. The section on Task Number 4 discusses ways a CVE can deal with missing data. If a strategy for handling missing data is not identified, a report can unintentionally report misleading composite scores. Consultation with a statistician is advisable.

    Examples of weighted average approaches are available in published papers,34,37,42,44-46 and several are summarized in the AHRQ decision guide by Romano, et al.14

    Advantages of weighted average approaches:

    • The composite score takes all its constituent measures into account.
    • Weighted averages are a good conceptual fit for areas of care such as screening and chronic disease management (where imperfect care is probably better than no care at all).

    Disadvantages of weighted average approaches:

    • Not a good conceptual fit for sets of measures in which one failure is clinically equivalent to multiple failures (e.g., a breach in operating room sterility).

    Example: Reporting weighted average composite measures

    For acute myocardial infarction, the Wisconsin Healthcare Value Exchange reports hospital performance on a composite measure that combines seven individual process measures with one measure of patient survival. The composite quality measure is created in two steps. First, the seven process measures are averaged, weighting each measure by its denominator. Second, the "process composite" created in the first step is averaged with the patient survival measure, with the process composite having seven times the weight of the survival measure. Of note, the process and survival components of the overall composite are calculated differently. The process component is a performance rate between 0 and 1, but survival is expressed as a ratio of observed-to-expected survival events (a ratio that may exceed 1). Combining measures based on different units (and with different scales of measurement) complicates the interpretation of the weighting scheme.

    Similar composite measures are also reported for hospital quality of care for pneumonia and heart failure.
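The weighting and standardization steps described earlier in this option can be sketched as follows. The data and weighting choices are hypothetical and do not reproduce the Wisconsin methodology:

```python
import numpy as np

# Each provider's rate and denominator (number of eligible patients) on
# three individual measures.  All numbers are invented.
rates = np.array([
    [0.80, 0.55, 0.90],   # provider A
    [0.75, 0.60, 0.85],   # provider B
    [0.70, 0.50, 0.95],   # provider C
])
denoms = np.array([
    [120, 40, 200],
    [ 90, 55, 150],
    [110, 45, 180],
])

# Step 1: standardize each measure to a common (zero) mean, so a "hard"
# measure with a low average rate does not systematically penalize the
# providers with many observations on it.
standardized = rates - rates.mean(axis=0)

# Step 2: weighted average, weighting each measure by its denominator.
weights = denoms / denoms.sum(axis=1, keepdims=True)
composite = (standardized * weights).sum(axis=1)
print(np.round(composite, 3))   # scores relative to the local average
```

Changing the weights (e.g., to equal weighting) can reorder providers, which is why the text stresses choosing and understanding the weighting strategy deliberately.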

  2. Option 2: Combine performance data from individual measures using an "all-or-none" approach.47 "All-or-none" performance composites start by giving a score of one for each patient who receives satisfactory performance on every measured service included in a composite. But if a single service was not delivered, the all-or-none composite score is zero for that patient. The all-or-none performance score for a provider is the number of ones divided by the number of patients. For example, suppose a composite includes four measures. Only a patient for whom performance is satisfactory on all four measures will count in the numerator of an all-or-none composite. In other words, the all-or-none composite measures the percentage of patients who receive "perfect" care. This means that all-or-none composites treat "almost perfect" performance for a given patient the same as the lowest possible performance. Somewhat less stringent variations of all-or-none composites are also possible.37


    • Easy to explain.
    • In situations where one failure produces the same result as multiple failures (such as a breach in sterility in an operating room), may be clinically meaningful.48
    • May encourage providers to design system-level strategies for delivering all necessary care.
    • When performance on individual measures is already high, will result in lower performance scores, potentially motivating further improvement efforts by providers.47


    Limitations and caveats:

    • Interpretation of performance scores may be unclear for all-or-none composites in most clinical situations (e.g., composites for diabetes or screening). If a provider has an all-or-none score of 40 percent, it is impossible to tell whether the remaining 60 percent of the provider's patients are receiving almost perfect care or very poor care. Therefore, important differences in provider performance may be masked by all-or-none composites.
    • All-or-none composites may unintentionally encourage providers to "give up" on a patient for whom there is a failure on just one measure within the composite.

    Examples: Reporting all-or-none composite measures

    Leading organizations of the Minnesota Healthcare Value Exchange report "all-or-none" composite measures for both ambulatory and hospital quality of care. The "all-or-none" approach was selected because it was felt to be easy to explain to patients and clinicians and to represent a more comprehensive view of a condition or episode. In addition, this approach was chosen because it enabled more providers with smaller volumes of patients to be included in public reporting. Also, there was a larger amount of between-provider performance variation on these composites (relative to individual performance measures). The measures included in each composite were chosen on the basis of clinical, nonstatistical judgment (go to Option 2 earlier in this section). These composite measures enjoy stakeholder buy-in.

    The Wisconsin Healthcare Value Exchange also reports an "all-or-none" composite measure for "diabetes optimal testing" in the ambulatory setting.

Return to Contents


G. What final validity checks might improve the accuracy and acceptance of performance reports?

By checking the validity of performance reports before they are made public, CVEs may improve their acceptance by key stakeholders. Here are some final checks a CVE may consider performing.

  1. Assess and report the risk of misclassification due to chance. Misclassification is unavoidable in performance measurement and reporting. As discussed in the Introduction, the risk of misclassification due to chance can be assessed for each provider included in a performance report. For some types of measures, each provider in a report can, in theory, have a different probability of performance misclassification.
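    One simplified way to estimate a single provider's risk of misclassification due to chance is a binomial model. The sketch below assumes pass/fail measures behave as independent Bernoulli trials and that the provider's true performance rate is known, a strong assumption made purely for illustration; real analyses are more sophisticated and warrant a statistician's input:

```python
from math import comb, ceil

def misclassification_risk(true_rate, n, threshold):
    """Probability that a provider whose true performance rate is `true_rate`
    is observed, by chance alone, on the wrong side of the classification
    threshold, given n patients (independent Bernoulli trials)."""
    def binom_cdf(k):
        # P(X <= k) for X ~ Binomial(n, true_rate)
        return sum(comb(n, i) * true_rate**i * (1 - true_rate)**(n - i)
                   for i in range(k + 1))

    # Smallest observed-success count that clears the threshold
    # (small epsilon guards against floating-point round-up).
    cutoff = ceil(threshold * n - 1e-9)
    if true_rate >= threshold:
        # Truly higher performer: risk of being observed below the cutoff.
        return binom_cdf(cutoff - 1)
    # Truly lower performer: risk of being observed at or above the cutoff.
    return 1 - binom_cdf(cutoff - 1)
```

    Even this toy model shows why each provider can carry a different misclassification probability: the risk depends on both the provider's true rate and the number of patients measured.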


    Advantages:

    • The degree to which a performance report could misrepresent provider performance and mislead patients will be known.


    Limitations and caveats:

    • Will require consultation with a statistician.


    • The overall amount of misclassification due to chance that is actually found in the report may be higher than the allowable amount negotiated by CVE stakeholders (if a level was negotiated earlier, as suggested in the section on Task Number 1). In this case, a CVE may consider renegotiating the maximum acceptable level of misclassification or choose one of the options discussed in the section on Task Number 5.

  2. Gather feedback from the providers in the report and make corrections. Before releasing a performance report to the public, a CVE may give providers a confidential preview of how their performance will be reported. These providers can also be given a mechanism for responding to the CVE with their questions and concerns, and the CVE may use provider feedback to make corrections to the performance report.


    Advantages:

    • May uncover previously unknown problems with data quality. A CVE may be able to address these problems prior to publication of the final performance report.
    • May enhance provider buy-in.
    • May create an incentive for providers to create more accurate data for performance measurement (e.g., an incentive to submit more accurate billing codes to health plans).
    • Depending on the level of data detail available to a CVE, can include information that might be useful to providers seeking to improve their performance. For example, if a CVE has access to patient-level performance data, the CVE may be able to give each provider a list of patients who have not received a measured service (e.g., a list of patients overdue for cervical cancer screening).

    Limitations and caveats:

    • A CVE may not have access to all the data a provider might want to see (e.g., a list of patients who were included in a performance measure).
    • Some providers may not respond to requests for feedback.
    • If data problems are uncovered, addressing these problems can consume time and resources.

    Examples: Gathering feedback from providers

    • The New York Quality Alliance (NYQA) has not yet produced a public report of provider performance, but the NYQA has produced confidential performance reports in preparation for public reporting. In order to comply with New York's Patient Charter, the NYQA has instituted a correction loop that allows physicians, through a secure Web portal, to correct the patient-by-patient claims data used to calculate Healthcare Effectiveness Data and Information Set (HEDIS) quality measures. This correction process involves uploading clinical notes to support the requested corrections. The Web portal also allows physicians to confirm whether they provide primary care.
    • Massachusetts Health Quality Partners (MHQP) gives each physician group a preview of the group's scores on HEDIS measures. If the physician groups think these scores are inaccurate, they can make an appeal. However, because MHQP uses a distributed data model in which health plans calculate the measures, MHQP does not know which patients are included in the HEDIS measure scores. Therefore, if a physician group requests a HEDIS score correction, the group must communicate directly with the health plans. MHQP staff report that over time, these communications with health plans have encouraged physician groups to submit more accurate billing codes (since these are the basis for the HEDIS measures).
    • When including a provider in a report for the first time (or including a new performance measure), the Oregon Health Care Quality Corporation performs a confidential round of reporting to providers before starting public reporting. This step "lets off steam" and allows an accuracy check. For subsequent public reports, providers can still check their performance data on a patient-by-patient basis using a secure Web site. Finally, each provider is given a one-time chance to opt out of public reporting, if needed, to sort out why the reporting is not working in that setting. This opt-out also is intended to give low-performing providers a chance to improve.
    • In preparation for publicly reporting the performance of individual physicians on measures of ambulatory care quality, the California Physician Performance Initiative (CPPI) has produced confidential performance reports. CPPI has requested that physicians review their performance scores, affirm their results, identify patient exclusions, and supply missing information. The CVE has received feedback that checking these data on a patient-by-patient basis can be quite onerous, especially for cancer screening measures with large denominators (i.e., large numbers of patient events for review). The CVE is therefore trying to find a sampling strategy for this data accuracy check (i.e., a strategy in which physicians will only need to check a sample of their patients, and reported performance scores will be based on this checked sample).

  3. Assess the sensitivity of performance reports to earlier decisions. In producing reports of provider performance, CVEs must make choices at the decision points discussed in this paper. In general, these decisions do not have "right answers," and CVEs may justifiably select from among many options. Sometimes, however, the tradeoffs involved at a decision point may not be entirely clear. The impact of each decision on the final published performance report may depend on the particular combination of decisions a CVE makes. CVE stakeholders may therefore be justified in asking, "What would have happened to the final report if we had made decision X differently?" The answer to this question is especially important in areas where a methodological decision was the result of contentious negotiation.

    One way to see how methodological decisions have affected performance reports is to conduct "sensitivity analysis." This process involves going back to a certain decision point (or combination of decision points), choosing another option, and recreating the performance report. Sensitivity analyses can also be incorporated at each step in the report-generating process: methodological decisions identified early as having dramatic effects on performance scores (or categories) can then be addressed by stakeholders before a full report is created.
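    As an illustration, a sensitivity analysis might rebuild provider rankings under two composite-scoring strategies and compare the results. The clinics and per-patient data below are invented for the sketch:

```python
def weighted_average(per_patient):
    """Simple averaging strategy: fraction of all measured services delivered."""
    flat = [m for patient in per_patient for m in patient]
    return sum(flat) / len(flat)

def all_or_none(per_patient):
    """All-or-none strategy: fraction of patients receiving every service."""
    return sum(all(p) for p in per_patient) / len(per_patient)

# Hypothetical per-patient results (one boolean per measure):
providers = {
    "Clinic A": [[True, True, False], [True, True, True], [True, False, True]],
    "Clinic B": [[True, True, True], [False, False, False], [True, True, True]],
}

for strategy in (weighted_average, all_or_none):
    ranking = sorted(providers, key=lambda name: strategy(providers[name]),
                     reverse=True)
    print(strategy.__name__, ranking)
```

    In this made-up example the two strategies reverse the ranking (Clinic A leads on the weighted average, Clinic B on the all-or-none composite), which is exactly the kind of finding stakeholders would want to surface before publication.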

    CVEs may want to start with the following sensitivity analyses. These analyses include some of the decision points where (1) the "best" choice of methods is least certain, and (2) the impact of methodological choices on performance reports is likely to be greatest.

    • Assess sensitivity to choice of attribution strategy (discussed in the section on Task Number 5). Many different attribution strategies can be considered, and research suggests that the choice of attribution strategy may affect which providers and measures can be included in a report.26,49 Choice of attribution strategy may also affect the reported performance of providers.
    • Assess sensitivity to choice of strategy for creating composite measures, if composites are used. There are many different strategies for creating composite measures (as discussed above), and research suggests that the choice of strategy can have a substantial impact on the reported performance of providers.37 Both absolute performance (i.e., the composite performance score itself) and relative performance (i.e., how providers compare with each other in a report) can be affected.
    • Assess sensitivity to choice of strategy for limiting the risk of misclassification due to chance. Many different strategies for limiting misclassification risk can be used, alone and in combination (go to the section on Task Number 5). Each of these strategies has strengths and weaknesses, and choice of strategy can affect the reported performance of providers.
    • Assess sensitivity to choice of strategy for handling outliers. As mentioned in the section on Task Number 5, outliers are an especially important concern when reporting measures of the cost of care (but there may also be outliers on other performance measures). Because of multiple options for handling outliers, sensitivity analyses that try different approaches can provide valuable guidance (and possibly reassurance) to stakeholders when reporting performance on cost measures.
    • Assess sensitivity to case mix adjustment. As discussed in the section on Task Number 5, whether to adjust (or stratify) performance data for patient characteristics can be a controversial decision. When such controversy is present, producing performance reports with and without case mix adjustment can help CVE stakeholders get a sense of whether case mix adjustment meaningfully changes the report.
    • Assess sensitivity to type of performance data. As discussed in the section on Task Number 3, many different types of performance data can be used to generate the scores included in performance reports. Research suggests that the type of data might have a substantial impact on performance reports.21 Obtaining some types of data (e.g., hybrid or medical record data) may require significant time and resources, but a CVE could consider performing sensitivity analysis on just a subset of the providers included in a report.

    Advantages of sensitivity analysis:

    • Sensitivity analysis provides a sense of how sensitive a report is to methodological decisions where the best answer is unclear. If the performance report is essentially the same regardless of the methodological decisions (i.e., the same providers are categorized as higher and lower performers), then acceptance of the report may improve.
    • This analysis may improve buy-in from CVE stakeholders who dissented on a key methodological decision, because they get to see what would have happened with their way of doing things.

    Limitations of sensitivity analysis:

    • In a distributed data model (discussed in the section on Task Number 3), some sensitivity analyses will require the cooperation of each data source. For example, a CVE might obtain HEDIS measure numerators and denominators from a health plan. Performing sensitivity analyses on the attribution strategy will require the health plan to recalculate these numerators and denominators for each new attribution rule.
    • If "prescored" performance data are used (discussed in the section on Task Number 3), it may not be possible to conduct many important sensitivity analyses.

vi See reports by Hibbard and Sofaer for guidance on whether patients can generally understand such ranges of uncertainty.1-2
vii Technical note: There is increasing interest in shrinking performance scores not to the overall mean, but to a stratified mean based on a relevant stratifying variable. For example, Dimick and colleagues have shrunken mortality rates for selected procedures to the corresponding volume-stratified mean, given that the best a priori estimate of a hospital's performance (in the absence of actual data) is based on its procedure-specific volume.36
viii See reports by Hibbard and Sofaer for guidance on which kinds of reporting formats might be preferable for purposes such as helping patients choose providers.1-2
ix Note that many patient experience (or patient satisfaction) survey results are reported as composite measures.
x A special statistic called "Cronbach's alpha" is a common way of checking internal consistency. This technique will be familiar to statisticians.
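    For reference, Cronbach's alpha for a composite of k measures is commonly written as:

```latex
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma^{2}_{Y_i}}{\sigma^{2}_{X}}\right)
```

    where \sigma^{2}_{Y_i} is the variance of scores on measure i and \sigma^{2}_{X} is the variance of the total composite score; values closer to 1 indicate greater internal consistency.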

Page last reviewed September 2011
Internet Citation: Decisions Encountered During Key Task Number 6: Creating Performance Reports: Methodological Considerations in Generating Provider Performance Score. September 2011. Agency for Healthcare Research and Quality, Rockville, MD.