Decisions Encountered During Key Task Number 5: Computing Provider-Level Performance Scores
Methodological Considerations in Generating Provider Performance Scores for Use in Public Reporting
Once aggregated performance data have been audited and initial performance measures have been chosen, provider-level performance scores can be computed. Before a Chartered Value Exchange (CVE) computes these scores, however, performance data must be attributed to providers. Attribution is relatively simple when each patient receives the care that is being measured from only one provider. However, attribution becomes more complex when a performance measure encompasses care that patients receive from more than one physician or hospital (e.g., measuring care delivered over a time period, as in an episode of care).
The parties responsible for attribution and computation may vary depending on what kind of data a CVE receives. For example, if a CVE receives raw performance data, then the CVE will likely complete these tasks internally or hire a vendor to complete them. If a CVE uses a distributed data model in which performance scores are generated by the sources of performance data (go to the section on Task Number 3), then these sources will perform these tasks with the CVE's guidance. If other organizations (e.g., health plans) will complete the attribution and computation, a CVE's role may be to ensure methodological consistency. In other words, every source that supplies performance scores in a distributed data model should attribute performance data and compute performance scores in exactly the same way.
A. How will performance data be attributed to providers?
To generate reports of provider performance, a CVE needs to ensure that the data used to calculate performance scores are attributed to providers. In other words, each piece of performance data that goes into a report must be associated with a provider. There are many different ways to attribute performance data to providers. The best way to attribute performance data depends on the purpose of performance reporting and the type of measure being reported. For example:
- If the purpose of performance reporting is to foster a sense of teamwork and shared responsibility among a group of providers, then performance data might be attributed to the group. Attribution to individual physicians (or other types of providers) might be reserved for confidential reports to each group of providers.
- If performance reporting is intended to raise community awareness or foster cooperation among the providers serving a community, it may be sufficient to attribute performance data at the community level (without attributing data to any particular provider). This strategy may be especially attractive when improving performance on a measure is likely to require the coordinated activities of many different providers. Public health measures such as infant mortality are one type of performance measure commonly attributed at the community level.
- If performance reporting is intended to help patients choose a given type of provider, then attribution of performance data to each provider of that type would be optimal. The goal is to make the types of providers in the report match the types of providers that patients are seeking. If patients are looking for individual physicians, then a report that attributes performance data to individual physicians might be the most useful to these patients.
In attributing data to a given provider, the goal is often to identify those patients for whom the provider is responsible for providing care and for whom the provider can affect the health services, patient experiences, costs of care, or clinical outcomes being measured. The same rules can be applied when attributing episodes of care (rather than patients) to providers. The goal is to identify episodes of care for which a provider is responsible and for which the provider can have an impact on the measure being applied to the episode.
On the surface, this sounds straightforward, and in some situations it is. For example, a hospital performance measure might indicate the rate of an immediate complication of surgery (e.g., intraoperative mortality). A straightforward rule would be to attribute performance data to the hospital in which the surgery took place. Another example of a straightforward attribution rule occurs when patients fill out surveys about their hospitalizations; the survey responses are generally attributed to the entire hospital, and this is the approach taken by Hospital Compare.23
However, attribution of performance data is often not straightforward. Attribution problems are especially pertinent when a CVE wants to attribute performance data to individual practitioners. Patients often receive care from many different providers, and the processes, outcomes, costs, and experiences of care may be influenced by all these providers. For example, it might seem logical to assign measures of care for hypertension to the primary care provider (PCP) for each patient. But how can a CVE tell who a patient's PCP is (especially when the patient is enrolled in a fee-for-service or preferred provider organization [PPO] product)? This can be especially difficult if only administrative data are available.
In an analysis of Medicare claims data that were supplemented by a physician survey, Pham and colleagues found that in a single year, patients saw a median of seven unique physicians.24 Using some attribution rules discussed below, Pham and colleagues found that only 79 percent of patients could be assigned to a PCP. Moreover, PCP assignment changed for nearly half of patients over a 2-year period. Attribution to specialists was possible for fewer patients.
In another study of community health centers, Landon and colleagues found that many patients infrequently received care (less than once per year), and simulated performance on quality measures would depend on how these patients were attributed to each community health center.25 An additional recent study by Mehrotra and colleagues found that 12 different attribution rules would lead to substantially different reports of the performance of individual physicians on episode-based measures of the costs of care.26
Held against the standards of responsibility and impact mentioned above, these are sobering results. With patients seeing so many different providers, deciding which provider is responsible for which performance measures can be a challenging task. This may be especially true of measures of health outcomes. Even if a PCP (who presumably would accept responsibility for providing certain services) can be reliably identified, can this PCP justifiably be held responsible for health outcomes that may have developed over decades? Assigning responsibility for certain health outcomes may not be justifiable unless patients can be consistently assigned to a provider over time.
There is no single, generally accepted "best way" to attribute performance data to providers. CVEs may choose from a variety of attribution strategies, and these strategies may vary by measure and by provider type. We suggest that CVEs include all stakeholders in negotiations over attribution rules.
To help guide these negotiations, CVEs can refer to the guiding questions discussed above:
- What is the purpose of reporting?
- For a given patient or episode of care and the type of performance being measured, which providers are plausibly responsible for providing the associated care?
- For a given patient or episode of care and the type of performance being measured, which providers can plausibly have an impact on measured performance?
- Later, once different attribution rules have been tried: How much of a difference in reported performance do different attribution rules really make?
The options presented below illustrate some attribution strategies a CVE might consider, but many others are possible. Because there are so many possibilities, a CVE may want to revisit this decision at a later point and determine what effects a different choice of attribution strategy would have had on performance reports (go to the section on Task Number 6).
Option 1: Attribute performance data based on other sources of information. In other words, a CVE can use information that is not derived directly from the performance data to determine attribution. For example, a health maintenance organization (HMO) may require all its members to choose a PCP soon after they enroll, regardless of whether these members generate any performance data. This list of chosen PCPs can be used to attribute measures of performance to the PCP identified for each patient. Surveys of patient experience may similarly ask patients to identify the provider on which they are reporting their experiences.
Advantages:
- High face validity, especially if corroborated by a service-based definition (see below).
- Does not require patients to receive health care.
- Easy to explain.
Disadvantages:
- Patient self-identification data may not be available for many patients (e.g., enrollees in fee-for-service or PPO health plans). If these data are unavailable, they can be difficult and expensive to obtain.
- This strategy may only work for certain types of providers (e.g., PCPs) and certain types of performance measures.
Option 2: Attribute performance data based on simple plurality of services (or visits). This approach requires administrative data such as health plan claims or other kinds of data about services that have been delivered (e.g., records of the number of visits to a provider).24 Patients can be assigned to the provider who has seen them the greatest number of times during the measurement period. In the case of ties, patients can be assigned to the most recently seen provider (or assigned to both).
Advantages:
- Necessary data likely to be available for all patients who have received health care during the measurement period.
- Relatively inexpensive.
- Easy to explain.
Disadvantages:
- Patients who have received no health care cannot be assigned. This is an especially concerning problem in the case of quality measures that are based on underused health services (e.g., colorectal cancer screening).
- When patients see many providers, the "plurality" provider may actually only provide a small fraction of the total care received by the patient. Depending on the performance measure in question, responsibility and ability to affect care under these circumstances may be less clear.
- Level of face validity may vary, depending on the performance measure. Plurality attribution may make more sense for primary care performance measures than for measures intended to assess specialty care.
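The plurality rule in Option 2 can be illustrated with a minimal Python sketch. It assumes that visit records are available as (patient, provider, date) tuples; the function name, the identifiers, and the sample visits are hypothetical and are not drawn from any CVE's data. Ties are broken in favor of the most recently seen provider, which is one of the two tie-handling choices mentioned above.

```python
from collections import Counter
from datetime import date

def attribute_by_plurality(visits):
    """visits: iterable of (patient_id, provider_id, visit_date) tuples.

    Returns {patient_id: provider_id}. Patients with no visits never
    appear in the input, so they cannot be assigned.
    """
    counts = {}     # patient -> Counter of visits per provider
    last_seen = {}  # (patient, provider) -> most recent visit date
    for patient, provider, when in visits:
        counts.setdefault(patient, Counter())[provider] += 1
        key = (patient, provider)
        if key not in last_seen or when > last_seen[key]:
            last_seen[key] = when

    assignment = {}
    for patient, per_provider in counts.items():
        # Most visits wins; a tie goes to the most recently seen provider.
        assignment[patient] = max(
            per_provider,
            key=lambda p: (per_provider[p], last_seen[(patient, p)]),
        )
    return assignment

# Hypothetical visits within one measurement period.
visits = [
    ("pt1", "drA", date(2009, 1, 5)),
    ("pt1", "drA", date(2009, 3, 2)),
    ("pt1", "drB", date(2009, 6, 9)),
    ("pt2", "drA", date(2009, 2, 1)),
    ("pt2", "drB", date(2009, 8, 15)),
]
assignment = attribute_by_plurality(visits)
```

In this made-up example, pt1 is assigned to drA (two visits versus one), and pt2, who saw each provider once, is assigned to drB as the most recently seen provider.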
Option 3: Attribute performance data based on "enhanced" plurality of services. This option includes a family of strategies based on the plurality strategy discussed in Option 2. In addition to requiring that a provider have the most visits with a patient, "enhanced" plurality strategies impose other requirements. For example, such a strategy may require that a patient have at least 50 percent of his or her visits with a provider before making an assignment.24 Or a strategy may require that the duration of the relationship between a patient and provider be a certain length (measured as the time elapsed between the earliest and most recent services).
For individual practitioners, such strategies can be devised on a specialty-by-specialty basis when practitioner specialty is known. For some specialties and some performance measures, even a single visit may be enough to allow credible attribution. However, practitioner specialty data may not be available in administrative sources. More complicated strategies, such as those used to determine patient-physician "connectedness," also may be used.27
Advantages:
- Improves face validity, relative to simple plurality.
- Relatively inexpensive.
Disadvantages:
- As more requirements are added before assignment of patients to providers, the number of patients who cannot be assigned will grow.
- Patients who have received no health care cannot be assigned.
- This approach may get methodologically complex and hard to explain.
- This approach may require data that are not commonly available in administrative sources. Such data may be difficult and costly to obtain.
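As a rough illustration of two "enhanced" plurality requirements described in Option 3 (a minimum share of visits and a minimum relationship duration), the following Python sketch may be helpful. The function name, the parameters `min_share` and `min_days`, and the sample visits are hypothetical; specialty-specific rules and "connectedness" methods are not covered.

```python
from collections import defaultdict
from datetime import date

def attribute_enhanced_plurality(visits, min_share=0.5, min_days=0):
    """visits: iterable of (patient_id, provider_id, visit_date) tuples.

    Returns {patient_id: provider_id or None}; None means the patient
    could not be assigned under the stated requirements.
    """
    dates_by_patient = defaultdict(lambda: defaultdict(list))
    for patient, provider, when in visits:
        dates_by_patient[patient][provider].append(when)

    assignment = {}
    for patient, providers in dates_by_patient.items():
        total_visits = sum(len(d) for d in providers.values())
        # Plurality provider (on a tie, this sketch keeps the first one seen).
        top = max(providers, key=lambda p: len(providers[p]))
        share = len(providers[top]) / total_visits
        # Relationship length: days between earliest and most recent visit.
        span_days = (max(providers[top]) - min(providers[top])).days
        meets_rules = share >= min_share and span_days >= min_days
        assignment[patient] = top if meets_rules else None
    return assignment

# Hypothetical visits within one measurement period.
visits = [
    ("pt1", "drA", date(2009, 1, 5)),
    ("pt1", "drA", date(2009, 3, 2)),
    ("pt1", "drB", date(2009, 6, 9)),
    ("pt1", "drC", date(2009, 7, 1)),
    ("pt2", "drA", date(2009, 2, 1)),
    ("pt2", "drB", date(2009, 8, 15)),
    ("pt2", "drC", date(2009, 9, 3)),
]
share_only = attribute_enhanced_plurality(visits)
share_and_duration = attribute_enhanced_plurality(visits, min_days=90)
```

With the 50 percent requirement alone, pt1 is assigned to drA and pt2 is left unassigned. Adding a hypothetical 90-day duration requirement leaves both patients unassigned, which illustrates how each added requirement increases the number of patients who cannot be attributed.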
Option 4: Attribute performance data to multiple providers. It may not be necessary to choose just one provider when assigning a patient (or assigning an episode of care). In fact, sometimes assigning a single patient's data to multiple providers makes sense (e.g., when calculating a measure of coordination of care among providers). An example of a multiple-provider attribution strategy is to assign a patient to every provider who accounts for at least 25 percent of services delivered to the patient during the measurement period.24 Under this strategy, performance data from a single patient could be attributed to between one and four different providers. Similarly, it may make sense to attribute episodes of care to multiple providers when the actions of each provider affect the measure being applied to the episode (e.g., joint attribution of long-term hip replacement outcomes to the surgeon and to providers of rehabilitation care).
Advantages:
- May encourage cooperation between providers.
- Relatively inexpensive.
Disadvantages:
- Providers may be attributed performance data for which they do not accept responsibility.
- Lack of single-provider attribution may dilute the incentive to improve. Some providers may behave as "free riders," benefiting from the performance improvement efforts of others.
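The 25 percent rule mentioned under Option 4 can be sketched in a few lines of Python. This is only an illustration under assumed inputs: services are represented as (patient, provider) pairs, every service counts equally, and the names used are hypothetical.

```python
from collections import Counter, defaultdict

def attribute_to_multiple(services, min_share=0.25):
    """services: iterable of (patient_id, provider_id) pairs, one per service.

    Returns {patient_id: sorted list of providers whose share of the
    patient's services is at least min_share}.
    """
    counts = defaultdict(Counter)
    for patient, provider in services:
        counts[patient][provider] += 1

    attribution = {}
    for patient, per_provider in counts.items():
        total = sum(per_provider.values())
        attribution[patient] = sorted(
            p for p, n in per_provider.items() if n / total >= min_share
        )
    return attribution

# Hypothetical services delivered during one measurement period.
services = [
    ("pt1", "drA"), ("pt1", "drA"), ("pt1", "drB"), ("pt1", "drC"),
    ("pt2", "drA"), ("pt2", "drB"), ("pt2", "drC"), ("pt2", "drD"), ("pt2", "drE"),
]
attribution = attribute_to_multiple(services)
```

In this sketch pt1 is attributed to three providers. No more than four providers can each account for 25 percent of a patient's services, and a patient such as pt2, whose services are spread evenly across five providers, is returned with an empty list, so a CVE using a rule of this kind may want to decide in advance how such patients will be handled.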
Examples: Attributing performance data to providers
The Puget Sound Health Alliance (http://www.wacommunitycheckup.org) uses different attribution rules for different types of measures. For measures of screening and first contact care, data are attributed to a single PCP for each patient based on a modified plurality algorithm that applies the following ordered rules: greatest number of "evaluation and management" (E&M) visits, highest sum of RVUs ("relative value units" associated with the E&M visits), and most recent service date. Each rule is applied only when the previous rule results in a tie between two or more providers who self-identify as PCPs. However, measures of chronic disease care can be attributed to multiple providers, including both PCPs and non-primary care specialists. For a given measure, all providers in certain specialties with any E&M visits in the past 24 months are attributed patients eligible for the measured service. For example, asthma measures are attributed to PCPs, allergists, and pulmonologists. More detail on this attribution strategy is available in the technical specifications at http://www.wacommunitycheckup.org/editable/files/July_2009/TechSpecs_CommunityCheckupJul09_final.pdf.
The New York Quality Alliance (http://www.nyqa.org) is using adjudicated health plan claims data to calculate Healthcare Effectiveness Data and Information Set (HEDIS) measures of primary care quality. For patients in an HMO or point of service (POS) product, the CVE attributes patients to PCPs based on the identification supplied by each health plan (since HMO and POS enrollees are required to choose a PCP). However, for patients not enrolled in these products, attribution is based on an "enhanced plurality" strategy: Patients must have a plurality of visits with the PCP during the time period specified in the HEDIS measure, including at least one preventive visit or two E&M visits.
For each provider, Aligning Forces for Quality-South Central Pennsylvania (http://www.aligning4healthpa.org) allows any patient seen at least once in the past year for any purpose (including urgent care) to be sampled for performance score calculation. Therefore, this CVE allows patients to be attributed to multiple providers.
B. What are the options for handling outlier observations?
Outlier observations are performance measure values that are far outside the usual range (e.g., a patient or episode of care for which costs are 20 times the average, or a hospital stay that costs $1). These outlier observations are a critical concern when measuring the costs of care, because within-provider (patient-to-patient) variation in costs can greatly exceed between-provider variation in average costs of care. In other words, when outliers are present, they can greatly increase the average error per observation (discussed in Appendix 2), which can reduce measurement reliability and raise misclassification risk. In addition, outlier observations often reflect data values that are erroneous or that are being incorrectly interpreted.
Options for handling outliers include:
- Exclude the outlier data from calculations of performance scores. When there are few outliers, this may be a reasonable option.
- Change the values of outlier data so that they are within the range that is usually seen. Values can be truncated (or "Winsorized") so that the outlier values are replaced by a more commonly-seen value (e.g., the 5th percentile value for low outliers and the 95th percentile value for high outliers).28-29
To address outliers, it may be advisable to consult a statistician with experience in performance reporting.
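The two options above can be compared with a short Python sketch. It assumes a simple "nearest-rank" percentile convention, which is only one of several conventions in use, and the cost values are invented for illustration. The function names are hypothetical.

```python
import math

def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list (one common convention)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]

def exclude_outliers(values, low_pct=5, high_pct=95):
    """Drop observations below the low percentile or above the high percentile."""
    ordered = sorted(values)
    low, high = percentile(ordered, low_pct), percentile(ordered, high_pct)
    return [v for v in values if low <= v <= high]

def winsorize(values, low_pct=5, high_pct=95):
    """Replace outliers with the low or high percentile value instead of dropping them."""
    ordered = sorted(values)
    low, high = percentile(ordered, low_pct), percentile(ordered, high_pct)
    return [min(max(v, low), high) for v in values]

# Hypothetical episode costs in dollars: one $1 stay and one very costly episode.
costs = [1, 50] + [100] * 35 + [150, 160, 2000]
kept = exclude_outliers(costs)
capped = winsorize(costs)
mean_before = sum(costs) / len(costs)
mean_after = sum(capped) / len(capped)
```

Exclusion shrinks the number of observations (here from 40 to 37), whereas Winsorizing keeps all 40 but pulls the extreme values in to the 5th and 95th percentile values. Which percentile cutoffs are appropriate is a judgment for the CVE and its statistical consultant.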
C. Will case mix adjustment be performed? (If so, how?)
Case mix adjustment[iv] (also known as "risk adjustment") refers to statistical techniques that "adjust" performance scores to compensate for the characteristics of providers' patients and other factors felt to be beyond providers' control. To see why case mix adjustment might be desirable, suppose Provider A cares for patients who tend to be older than average. If Provider A is rated on patient mortality, Provider A's performance will probably be worse than average. This may be entirely due to Provider A's older patient population. After all, mortality rates increase dramatically as patients grow older. If these differences in patient age are not taken into account, performance reports might systematically mislead patients who are trying to assess the performance of Provider A. This kind of systematically and predictably misleading information is due to a problem called "statistical bias."
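The Provider A example can be made concrete with a small Python sketch of one common adjustment technique, an observed-to-expected ratio (sometimes called indirect standardization). The choice of technique, the reference mortality rates, and the patient counts are all assumptions made for illustration; endorsed measure specifications may prescribe different methods.

```python
def crude_rate(patients):
    """patients: list of (age_group, died) pairs."""
    return sum(1 for _, died in patients if died) / len(patients)

def observed_to_expected(patients, reference_rates):
    """Ratio of observed deaths to the deaths expected if each patient had the
    reference mortality rate for his or her age group. Values below 1 suggest
    fewer deaths than expected given the provider's age mix."""
    observed = sum(1 for _, died in patients if died)
    expected = sum(reference_rates[group] for group, _ in patients)
    return observed / expected

# Hypothetical reference mortality rates by age group.
reference_rates = {"younger": 0.02, "older": 0.10}

# Provider A: mostly older patients (20 younger, 80 older; 8 deaths, all older).
provider_a = ([("younger", False)] * 20
              + [("older", True)] * 8 + [("older", False)] * 72)
# Provider B: mostly younger patients (80 younger, 20 older; 4 deaths).
provider_b = ([("younger", True)] * 2 + [("younger", False)] * 78
              + [("older", True)] * 2 + [("older", False)] * 18)

crude_a, crude_b = crude_rate(provider_a), crude_rate(provider_b)
ratio_a = observed_to_expected(provider_a, reference_rates)
ratio_b = observed_to_expected(provider_b, reference_rates)
```

In these invented data, Provider A's crude mortality rate (8 percent) is twice Provider B's (4 percent), yet Provider A's observed-to-expected ratio is below 1 and Provider B's is above 1. The unadjusted comparison is reversed once patient age is taken into account.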
There is an important difference between statistical bias and misclassification due to chance.[v] If there is a large amount of statistical bias, a provider will have measured performance that is either consistently worse than "true performance" (e.g., Provider A) or consistently better than true performance (e.g., a provider with younger, healthier patients). On the other hand, if the risk of misclassification due to chance is high, a provider's measured performance may deviate from true performance in a manner that is unpredictable and inconsistent in direction.
This distinction between statistical bias and misclassification risk has important real-world methodological implications. Getting more observations may reduce the risk of misclassification due to chance, but more observations will not address statistical bias. Only case mix adjustment methods (or a related method called "stratification") can address statistical bias. On the other hand, case mix adjustment is unlikely to improve the risk of misclassification due to chance.
CVE stakeholders may benefit from understanding a very important point about case mix adjustment: Choosing which characteristics (if any) to include in case mix adjustment is a value judgment. There is a strong case for adjusting for a patient characteristic when two criteria are met: (1) The patient characteristic is beyond the control of a provider, and (2) CVE stakeholders agree that the existing overall relationship between the patient characteristic and measured performance is acceptable.
The first criterion seems straightforward for patient characteristics such as age and socioeconomic status (i.e., patient income and education level); a provider will have little or no influence over these characteristics. But what about patient adherence with recommended medical treatments and counseling? Some stakeholders might argue that providers should not be held accountable for patient adherence since it is entirely up to patients to adhere to their recommended health care. Such a point of view seems to be a value judgment and a potentially controversial one. Other stakeholders might argue that providers in fact exert a significant influence over patient adherence, since providers can give patients enhanced self-management support, education, appointment reminders, and other services that improve adherence.
If performance reports are adjusted for patient adherence, this adjustment will prevent the report from incentivizing providers to improve patient adherence. After all, if a provider focuses on improving performance by improving adherence and succeeds, the provider's adherence-adjusted performance will not budge (because the newly adherent patients are effectively held to a higher standard, assuming adherence is reassessed). On the other hand, not adjusting for patient adherence may incentivize providers to avoid treating nonadherent patients, especially when providers doubt their ability to improve adherence (or believe that improving adherence will require unrealistic levels of effort and expense).
The second criterion for adjustment—whether the relationship between the patient characteristic and measured performance is acceptable—is fundamentally a value judgment as well. CVE stakeholders may generally accept that older patients will have higher mortality rates than younger patients and believe this will always be the case. By adjusting mortality rates for patient age, a CVE would implicitly accept the mortality "disparity" between older and younger patients. This does not mean that there would be no incentive to reduce mortality for older patients; it just means that all other things being equal, publicly reporting age-adjusted mortality rates will not incentivize providers to reduce the age-related mortality "disparity."
However, when CVE stakeholders find performance disparities unacceptable (e.g., racial or socioeconomic disparities), there are important drawbacks to adjusting for patient characteristics. This is true even though the first criterion for case mix adjustment may be satisfied (as it is for patient race and socioeconomic status, or SES). For example, adjusting performance for patient SES implicitly accepts the continuation of performance disparities between providers that serve greater and lesser shares of low-SES patients.
Providers would still be incentivized to improve performance for all patients. But there would not be a systematically different degree of improvement incentive for providers serving greater shares of low-SES patients (i.e., a greater incentive that could result in greater improvement of care for low-SES patients and a reduction in the overall performance disparity in a CVE's area). Whether such a disparity-reducing incentive would in fact be created by the performance report would depend on exactly how performance is reported (go to the section on Task Number 6) and on how providers respond to the report. On the other hand, not adjusting for patient characteristics such as SES could demoralize providers serving low-SES populations and create an unintended incentive to avoid serving vulnerable patient groups.
One option for balancing the advantages and disadvantages of case mix adjustment is to pursue an alternative approach: stratification of performance results (discussed below). Table 2 presents a framework for thinking through the "value judgments" inherent in deciding which patient characteristics to include in case mix adjustment.
A complementary discussion of case mix adjustment is available in a separate AHRQ decision guide, Selecting Quality and Resource Use Measures: A Decision Guide for Community Quality Collaboratives.14 Here, we repeat an important limitation about case mix adjustment: Case mix adjustment works by accounting for observable differences in the characteristics of patients. However, not all differences between patients are observable in the data available to a CVE. Therefore, case mix adjustment cannot guarantee that a provider's low (or high) performance is not due to some unobserved patient characteristic.
In general, case mix adjustment is considered appropriate for measures of health outcomes, patient experience, and costs of care. However, case mix adjustment is generally not performed for measures of processes of care (such as checking cholesterol levels in patients with diabetes). This is because process measures generally have restrictive criteria for patient inclusion, and these criteria enforce a kind of uniformity among these patients (at least regarding the need for the measured service). In other words, all patients who qualify for a process measure should be receiving the measured service according to guidelines.
CVEs should consider an important caveat to the "do not adjust process measures" rule. When aggregating performance data across different data sources, a CVE may want to consider performing case mix adjustment based on the source of each observation (e.g., the health plan reporting each observation). This adjustment accounts for the different ways data sources may collect and report performance data.
Without accounting for data source, a CVE may find that some providers have performance that is higher or lower than others' simply because their data came predominantly from a different source. Adjusting for data source when aggregating multisource performance data is analogous to "standardizing" individual performance measures in the creation of performance composites (go to the section on Task Number 6).
Option 1: Perform case mix adjustment using predetermined methods. As mentioned in the section on Task Number 2, case mix adjustment instructions may already be available for performance measures with nationally endorsed specifications.
Advantages:
- A CVE can perform case mix adjustment without having to derive a custom case mix adjustment methodology.
- Use of predetermined methods improves the likelihood that performance will be reported in an accurate category.
Disadvantages:
- If a measure's specifications have been altered by the CVE, then the predetermined case mix adjustment methods will probably not be valid.
Option 2: Perform case mix adjustment using locally derived "custom" methods. With statistical consultation, a CVE may be able to use the performance data it has aggregated to generate risk-adjusted performance scores.
Advantages:
- Allows flexibility in measure specification and in choosing the reference value to which case mix adjusted performance scores can be compared (e.g., the local average rather than a national average).
Disadvantages:
- Deriving case mix methodologies can be a complex undertaking, and significant time, resources, and expertise may be required.
- If insufficient numbers of observations are present, deriving valid case mix adjustment methods may not be possible.
Option 3: Report "stratified" performance. Stratification involves calculating multiple performance scores for each provider on a given measure. For example, a report might separately display providers' mortality rates for younger and older patients. Or a report might separately display performance for Medicare, Medicaid, and commercial health plan enrollees. Stratification can be an alternative to case mix adjustment because within each "stratum" (or subset of patients), patients are similar to each other, which reduces the amount of statistical bias. However, like case mix adjustment, stratification can only account for observable patient characteristics.
Advantages:
- Methodologically simpler than case mix adjustment; easier to explain and understand (one can see that "apples are compared to apples, and oranges compared to oranges").
- Enables "fair" comparisons between providers without hiding performance disparities that a CVE would like to reduce.
Disadvantages:
- Sample sizes can become small within each stratum, increasing the risk of misclassification due to chance. This is especially true when trying to account for more than one or two patient characteristics.
- This approach increases the number of scores included in a report, which may make the report more difficult to understand.
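Stratified reporting as described in Option 3 amounts to computing one score per provider per stratum. The Python sketch below assumes pass/fail observations tagged with a provider and a stratum; the function name, the minimum sample size of 30, and the clinic data are hypothetical.

```python
from collections import defaultdict

def stratified_scores(records, min_n=1):
    """records: iterable of (provider, stratum, passed) tuples.

    Returns {(provider, stratum): (rate or None, n)}. The rate is None when
    fewer than min_n observations fall in that provider-stratum cell.
    """
    tally = defaultdict(lambda: [0, 0])  # (provider, stratum) -> [passes, total]
    for provider, stratum, passed in records:
        cell = tally[(provider, stratum)]
        cell[0] += int(passed)
        cell[1] += 1
    return {
        key: (passes / total if total >= min_n else None, total)
        for key, (passes, total) in tally.items()
    }

# Hypothetical observations for two clinics, stratified by type of insurance.
records = (
    [("clinicA", "commercial", True)] * 45 + [("clinicA", "commercial", False)] * 5
    + [("clinicA", "medicaid", True)] * 3 + [("clinicA", "medicaid", False)] * 1
    + [("clinicB", "medicaid", True)] * 30 + [("clinicB", "medicaid", False)] * 10
)
scores = stratified_scores(records, min_n=30)
```

Here clinicA has a reportable score for commercial enrollees but too few Medicaid observations to report under the assumed minimum, which is the small-sample drawback noted in the list above.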
Examples: Using stratification instead of case mix adjustment
For clinic and medical group performance on ambulatory quality measures, the Puget Sound Health Alliance (http://www.wacommunitycheckup.org) displays provider performance stratified by insurance (commercial health plan vs. Medicaid enrollees). Natasha Rosenblatt, Data Projects Manager of the Alliance, explained that the stratified reports were added because some clinics that predominantly served Medicaid enrollees "were doing terrific work with a difficult population, but this performance wasn't showing up in the overall results." By comparing the overall performance report with the stratified reports, one can see how stratification reduces the number of observations within each stratum. For example, there are more clinics with no reported performance information in the stratified reports than in the overall reports.
The Oregon Health Care Quality Corporation (http://www.q-corp.org) is also planning to stratify performance for commercial health plan and Medicaid enrollees. Nancy Clarke, formerly Executive Director of Q-Corp, explains: "Trying to make disparities disappear by adjustment won't help anybody with anything. Stratification shines a light on disparities."
D. What strategies will be used to limit the risk of misclassification due to chance?
Even though some amount of misclassification risk will be present whenever performance reports have more than one provider (or more than one performance category), there are ways to minimize the risk of performance misclassification to a level acceptable to CVE stakeholders. CVEs and community stakeholders will need to determine a rate of performance misclassification that is reasonably acceptable to all parties. Ideally, this rate will balance the consequences of misclassification with the purposes and expected benefits of performance measurement and reporting.
There is no mathematically or scientifically "best" rate of misclassification due to chance. A 2006 survey found that patients vary widely in their tolerance for misclassification in physician performance reports. Roughly a third of patients thought misclassification risk needed to be less than 5 percent, another third thought misclassification risks between 6 and 20 percent were acceptable, and the remaining third would tolerate levels of misclassification risk between 21 and 50 percent.12
There are also important tradeoffs associated with limiting the risk of misclassification due to chance. These tradeoffs include potentially:
- Reducing the number of providers for which performance can be reported.
- Reducing the number of measures that can be reported.
- Reducing the precision with which performance can be reported (i.e., reducing the number of performance categories).
In addition, there may be tradeoffs between types of misclassification due to chance. For example, a CVE may want to reduce the probability that providers are reported as low performing when they are actually high performing. One way to accomplish this goal is to report performance using a "zone of uncertainty" (a "buffer zone" that may give the benefit of the doubt to providers just below a performance threshold, which is discussed later in this section). However, this way of reducing the risk of reporting performance in too low a category will raise the risk of reporting a provider's performance in too high a category. How a CVE chooses to weigh the risks involved in this tradeoff is a value judgment.
In this section, we present some of the most commonly used options for limiting the risk of misclassification due to chance. These options work by influencing the factors that determine misclassification risk (i.e., the factors listed in Figure 5 in Appendix 2). Many combinations of these and other options may be used by CVEs. But first, we offer some general guidance on approaching misclassification risk.
General guidance on proceeding through the options.
- If possible, negotiate a maximum risk of misclassification due to chance that is acceptable to stakeholders. This may require making some general decisions about the classification system to be used in performance reports. For example, how many performance categories will there be? How will performance thresholds be determined?
- In addition, discuss the magnitude of misclassification. For example, if there are four reported performance categories (e.g., a report of stars on a 4-star scale), it might be more acceptable to be off by just one star than to be off by two stars.
- Once performance data have been collected and the reporting format has been decided, calculate the risk of misclassification for providers in the report. This step will require consultation with a statistician who has expertise in performance measurement and reporting.
- If the calculated risk of misclassification is higher than CVE stakeholders can accept, then the following options for limiting misclassification risk may be considered.
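The calculation in the third step can be illustrated with a minimal sketch. It assumes a simple pass/fail measure, a single performance threshold, and a provider whose true rate is already known. These assumptions, the function name, and the numbers are hypothetical and do not come from this guide. A real analysis would have to estimate true performance from the data and should be carried out with a statistician.

```python
from math import comb


def misclassification_risk(true_rate, n_obs, threshold):
    """Chance that the observed pass rate falls on the wrong side of
    `threshold`, given the provider's true rate and n_obs observations.

    Assumes each observation is an independent pass/fail event
    (an illustrative binomial model, not a method from the guide).
    """
    truly_above = true_rate >= threshold
    risk = 0.0
    for passes in range(n_obs + 1):
        observed_above = (passes / n_obs) >= threshold
        if observed_above != truly_above:
            # Binomial probability of seeing exactly `passes` successes.
            risk += (comb(n_obs, passes)
                     * true_rate ** passes
                     * (1 - true_rate) ** (n_obs - passes))
    return risk


# Hypothetical provider: true rate 80 percent, threshold 75 percent.
risk_with_30 = misclassification_risk(0.80, 30, 0.75)
risk_with_200 = misclassification_risk(0.80, 200, 0.75)
```

In this hypothetical case, the provider is reported below the threshold far more often with 30 observations than with 200, which is the pattern the options below try to manage.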
Option 1: Exclude providers for which the risk of misclassification due to chance is too high from reporting. Once the risk of misclassification has been calculated for each provider, it is possible to exclude providers with high risks of misclassification.
- Limits the risk of misclassification due to chance.
- May result in the exclusion of many providers, limiting the usefulness of performance reports. Go to the section titled "Missing data" for options on including providers without reportable performance data in performance reports. In particular, it may be important to understand how patients, providers, and other users of the report will interpret the absence of performance information for certain providers. Will such providers be presumed to have good performance? Poor performance? How will these interpretations affect the goals that the CVE wants to achieve?
- May result in the exclusion of entire categories of providers (e.g., providers of certain specialized health care services who only manage a small number of patients with a measured clinical condition) from reporting.
Option 2: Exclude measures for which the risk of misclassification due to chance is too high for too many providers. Just as misclassification risk varies from provider to provider, misclassification risk may vary from measure to measure. Those measures for which the risk of misclassification is too high for too many providers can be excluded from public reporting.
- Limits the risk of misclassification due to chance.
- Performance measures that are important to stakeholders may not be available for reporting. This may limit the usefulness of performance reports.
Option 3: Modify the classification system used in the performance report. We present five general types of options for modifying the classification system. All of these options may require the assistance of a statistical consultant.
Option 3a: Report performance using fewer categories. For example, move from reporting provider rankings (in which the number of categories equals the number of providers) to broader provider categories such as quartiles.
Note that at the extreme option of reporting just one performance category, there is zero risk of misclassification due to chance. Provider rankings, which represent the opposite extreme, maximize the risk of performance misclassification due to chance.
- Limits the risk of misclassification due to chance.
- Moving to a "coarser" scale of reporting may cause small but possibly important differences in performance to be missed. In other words, when performance categories get big, each category may actually contain many distinct levels of performance.
Option 3b: Change the thresholds used for deciding categories. Without changing the number of reported categories, change the performance thresholds used to decide the performance category in which a given provider will be placed. For example, the definition of a 4-star provider may change from "performance above 75%" to "performance above 90%" on a given measure. Because threshold changes may either decrease or increase the risk of misclassification, recalculating misclassification risk after making these changes is recommended. The type of threshold can also be altered. Thresholds can be based on absolute observed performance (relative to some predetermined standard) or on relative performance (e.g., a percentile- or ranking-based approach).
Changing performance thresholds can have complex effects on the risk of misclassification due to chance. It is possible to simultaneously lower one kind of misclassification risk while raising another. For example, moving from the 75 percent to 90 percent performance threshold may decrease the risk that a true 3-star provider is misclassified as a 4-star provider and increase the risk that a true 4-star provider is misclassified as a 3-star provider.
In addition, thresholds can be based on tests of statistical significance.30 Tests of statistical significance compare observed performance to some reference value. This reference value is often, but does not have to be, the average performance of the entire provider population.
Performance thresholds based on statistical significance have a special property. They automatically limit the risk of one kind of misclassification due to chance: Type I statistical error, or the probability that a provider whose true performance is equal to the reference value is misclassified as having performance that is different from the reference value. Statistical significance-based thresholds commonly limit this kind of misclassification risk to 5 percent, but there is nothing special about the 5 percent figure. Other levels of misclassification risk may be acceptable to CVE stakeholders. A recent survey of patients found that only a minority think that a risk of misclassification below 5 percent is necessary for reporting provider performance.12
One notable drawback of significance-based thresholds is that while they limit one type of misclassification risk (Type I statistical error), they may increase Type II statistical error. Type II statistical error is the probability that a provider whose true performance is different from the reference value is misclassified as having performance that is indistinguishable from the reference value. So all else being equal, efforts to reduce Type I statistical error may misclassify more providers as having average performance (i.e., the rate of Type II statistical error will be higher).
- May limit misclassification risk of one type (such as Type I statistical error).
- May increase other types of misclassification risk (such as Type II statistical error). Rechecking misclassification risk is advisable.
Examples: Basing performance thresholds on tests of statistical significance
The Puget Sound Health Alliance (http://www.wacommunitycheckup.org) reports the performance of clinics, medical groups, and hospitals in three categories: above regional average, at regional average, and below regional average. Providers are classified as "above regional average" or "below regional average" only if tests of statistical significance show that there is a less than 5 percent chance that their true performance is at the regional average. In other words, the probability of misclassifying an average provider as above or below average is limited to no more than 5 percent.
The Healthy Memphis Common Table (http://www.healthymemphis.org) calculates a 95 percent confidence interval around providers' scores on performance measures. For each provider, the upper limit of the confidence interval determines which category of performance is reported (i.e., how many stars are reported).
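A significance-based threshold of the kind described above might be sketched as follows. This is not the method used by either organization named in the examples. It assumes a pass/fail measure and a simple normal-approximation (Wald) confidence interval, and the function name and inputs are hypothetical. The approximation is poor for very small N or for rates near 0 or 100 percent.

```python
from math import sqrt


def classify_vs_reference(passes, n_obs, reference_rate, z=1.96):
    """Place a provider in one of three categories relative to a reference.

    z=1.96 gives an approximate 95 percent confidence interval, which
    limits Type I error to about 5 percent (illustrative assumption).
    """
    rate = passes / n_obs
    margin = z * sqrt(rate * (1 - rate) / n_obs)
    if rate - margin > reference_rate:
        return "above regional average"
    if rate + margin < reference_rate:
        return "below regional average"
    return "at regional average"


# Same observed rate (90 percent), different numbers of observations.
large_sample = classify_vs_reference(90, 100, reference_rate=0.75)
small_sample = classify_vs_reference(9, 10, reference_rate=0.75)
```

With 100 observations the hypothetical provider is reported above the regional average, but with 10 observations the same observed rate is reported at the regional average. This is the Type II error tradeoff described above.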
Option 3c: Introduce a "zone of uncertainty" around performance cutpoints.31 As a provider's performance gets closer to a classification threshold, the risk of misclassification due to chance becomes greater. Using a "zone of uncertainty" (also known as a "buffer zone") typically means giving providers the benefit of the doubt when they are just below a performance threshold by reporting them in the performance category that is above this threshold. This option decreases one kind of misclassification risk: the risk of reporting providers in too low a class. Adjusting the width of the "zone of uncertainty" can limit the risk of this type of misclassification to any value that is desired. However, this approach simultaneously increases the risk of reporting providers in too high a class.
- Reduces one kind of misclassification risk: the risk of reporting providers in too low a performance class.
- May address provider concerns about being misclassified into a category that is lower than their true performance.
- Increases another kind of misclassification risk: the risk of reporting providers in too high a performance class.
Examples: Using a "zone of uncertainty"
The California Chartered Value Exchange uses a "buffer zone" in determining the performance categories of medical groups (http://opa.ca.gov/report_card/doctors.aspx) in its Doctors and Medical Groups Quality Report Card. This CVE reports aggregated composite scores on technical quality and patient experience using 4-star scales, so there are four categories of performance on each composite measure. Any group whose overall performance is less than 0.5 percent below the next highest performance category is reported in the higher category. This 0.5 percent zone is the "buffer zone."
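A buffer zone rule of this general kind can be sketched in a few lines. The cutpoints below are invented for illustration and are not the California CVE's actual thresholds. Only the 0.5 point buffer width is taken from the example above.

```python
def star_rating(score, cutpoints, buffer=0.0):
    """Return a star rating from 1 to len(cutpoints) + 1.

    cutpoints: ascending minimum scores for 2 stars, 3 stars, and so on.
    buffer: a score less than `buffer` below a cutpoint is treated as
            having reached it (the benefit of the doubt).
    """
    stars = 1
    for cut in cutpoints:
        if score >= cut or (cut - score) < buffer:
            stars += 1
    return stars


hypothetical_cutpoints = (50.0, 70.0, 85.0)  # assumed, for illustration only
without_buffer = star_rating(84.7, hypothetical_cutpoints)
with_buffer = star_rating(84.7, hypothetical_cutpoints, buffer=0.5)
```

Here a score of 84.7 earns 3 stars without the buffer and 4 stars with it. Widening the buffer moves more providers up, which lowers the risk of reporting too low a category and raises the risk of reporting too high a category.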
Option 3d: Report "shrunken" performance rather than observed performance. "Shrunken" performance refers to performance estimates that are produced by special statistical techniques. These techniques are used to adjust the individual provider's observed "raw" scores by borrowing information from the entire population of providers to reduce the likelihood of misclassifying a provider. Names for these techniques include "smoothed estimates," "random intercepts," "hierarchical model estimates," and "empirical Bayes estimates."32
These shrunken estimates work by taking within-provider error into account. When within-provider error is high (e.g., because of low N), shrunken estimates "shrink" performance estimates back toward the mean of the entire provider distribution. When within-provider error is lower (e.g., large N), the shrunken estimates still pull performance back toward the mean, but the amount of this pulling is lower. In other words, higher reliability estimates borrow less from the mean performance of all providers, while lower reliability estimates borrow more (i.e., the mean performance of all providers receives greater weighting in the construction of the shrunken performance estimate). To generate these "shrunken" performance estimates, consultation with a statistician will be needed.
- Limits the risk of misclassification.
- Providers' own, independent performance is no longer the only thing that determines their performance category. Instead, the performance of the entire provider population plays a role, which may be counterintuitive. Stakeholders who prefer the exclusive use of observed (rather than shrunken) performance may object to this approach.
- "Shrunken" performance can be hard to explain to stakeholders.
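The weighting behind a shrunken estimate can be shown with a simplified sketch. It assumes that between-provider variance and error variance per observation are already known, whereas in practice they are estimated from the data with a hierarchical model. The function name and the numbers are hypothetical.

```python
def shrunken_score(observed, population_mean, between_var,
                   error_var_per_obs, n_obs):
    """Weighted average of a provider's observed score and the mean of
    all providers, with the weight set by the provider's reliability.

    All variance inputs are assumed known here for illustration.
    """
    # Reliability rises toward 1 as n_obs grows.
    weight = between_var / (between_var + error_var_per_obs / n_obs)
    return weight * observed + (1 - weight) * population_mean


# Two hypothetical providers, both observed at 50 percent,
# in a population whose mean is 80 percent.
few_patients = shrunken_score(0.50, 0.80, 0.01, 0.16, n_obs=10)
many_patients = shrunken_score(0.50, 0.80, 0.01, 0.16, n_obs=200)
```

With 10 observations the reported score is pulled most of the way toward the population mean (about 68 percent), while with 200 observations it stays close to the observed score (about 52 percent).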
Option 3e: Use a "mixed" performance classification system that accounts for both reliability and observed performance. This "mixed" option refers to a family of classification systems that allow the category of reported performance to differ from the category of observed (raw) performance, depending on the reliability of measurement. First, within-provider measurement error is calculated for each provider and used to create a "margin of error" (just like the "range of uncertainty" shown in Figure). Then, the margin of error is combined with observed performance to see whether, for each provider, this margin of error overlaps a performance category threshold (or potentially more than one threshold when the margin is large). When these overlaps occur, performance can be reported in a different category than observed performance.
Generally, reported performance will be somewhere close to the middle of the margin of error, even though observed performance may be closer to one end of the margin than the other. For example, a provider with very low observed performance on a measure might still be reported as having average performance. This reporting would occur when the margin of error for this provider is very wide, overlapping the average level of performance. This situation is especially likely when reliability is low (e.g., because of low N). In a sense, the "shrunken" performance option is just one member of this family of "mixed" classification systems. It is advisable to consult a statistician when constructing a "mixed" classification system.
- Limits the risk of misclassification due to chance.
- Simplifies performance reports, potentially making them more patient friendly, because measurement reliability and observed performance are combined into a single reported performance category for each provider.
- Mixed performance classification systems can be complex to design and hard to explain to stakeholders.
Examples: Using a mixed performance classification system
California Hospital Compare uses a "mixed" performance classification system in reporting the performance categories of hospitals. A table describing how this system works is available at the following link: http://www.calhospitalcompare.org/resources-and-tools/choosing-a-hospital/about-the-ratings.aspx.
The system has five performance categories based on three performance cutoffs. The category of performance reported for each hospital depends on the upper and lower bounds of each hospital's margin of error (rather than just depending on average performance for each hospital).
Option 4: Set a minimum reliability for reporting on each measure. Setting a minimum reliability is an approach to limiting misclassification risk currently used by some CVEs. Frequently, a minimum reliability of 0.7 is used; but again, the decision on where to set this minimum reliability depends on a value judgment about how much risk of misclassification CVE stakeholders can tolerate. Providers whose reliability is below the minimum on a given measure are excluded from reports on that measure.
Because the classification system used in performance reports also determines misclassification risk, on its own, setting a minimum reliability may not guarantee any particular limit on the risk of misclassification due to chance. But once a classification system is decided, setting a minimum reliability will limit the risk of misclassification. The amount of risk will depend on the classification system that is decided; if the classification system is changed, the minimum reliability level may not guarantee the same limits on misclassification risk.
- Coupled with a classification system, limits the risk of misclassification due to chance.
- If a CVE does not know the classification system that will be used, the range of possible misclassification risks is unknown. Although a minimum reliability of 0.7 is frequently used, this may not limit the risk of misclassification to a level that is acceptable to CVE stakeholders.
- Reliability is not an intuitively interpretable number, which may make stakeholder consensus difficult to achieve. Misclassification risk is more intuitively meaningful (it is analogous to the risk of convicting an innocent person or acquitting a guilty one; go to Appendix 2).
- Many providers and measures may be excluded.
Examples: Using a minimum reliability criterion
Massachusetts Health Quality Partners (MHQP; http://www.mhqp.org) and the California CVE (http://opa.ca.gov/report_card/doctors.aspx) both use a minimum reliability criterion for reporting performance on patient experience surveys. For survey results to be reported, both CVEs require the reliability to be more than 0.7. Because of this criterion, some practices (in Massachusetts) and medical groups (in California) have no reported results on some survey domains.
The Pacific Business Group on Health is combining a minimum reliability criterion with a "zone of uncertainty" reporting approach. Ted von Glahn, Director of Performance Information and Consumer Engagement, says that the reporting effort is aiming to achieve a less than 5 percent rate of provider misclassification due to chance.
Implications of using a minimum reliability criterion
Massachusetts Health Quality Partners (MHQP; http://www.mhqp.org) only reports provider performance on a measure when at least 50 percent of all providers meet MHQP's minimum reliability criterion (discussed in the section Task Number 3). This 50 percent requirement means that performance reports contain performance scores on most providers. However, this requirement also means that certain measures generated by MHQP are not publicly reported (even though they may be confidentially reported to the providers).
Similarly, the California Physician Performance Initiative found that of the 17 measures initially tested for public reporting, some measures did not meet the minimum reliability criterion for virtually any physician in the State. Therefore, only 10 measures will be included in the performance report being developed by one of the health plan stakeholders (for use by its members).
Option 5: Set a minimum N (number of observations). Setting a minimum number of observations is also a popular approach to addressing misclassification risk. This approach generally applies the same lower limit on N to all providers and all measures of performance. However, N is not the only thing that determines reliability, and reliability is not the only thing that determines misclassification risk. Thus, on its own, setting a minimum N does not guarantee that the risk of misclassification due to chance will be acceptable. In other words, the relationship between N and misclassification risk depends on (1) the properties of the measure, (2) the population of providers, and (3) the performance classification system being used. All other things being equal, a greater N will reduce misclassification risk. But simply specifying a minimum N without calculating the risk of misclassification in the provider population being reported can result in a very high (and unappreciated) misclassification risk.
- Limits the risk of misclassification due to chance. However, just setting a minimum N does not, on its own, determine what this limit is. The amount of misclassification is knowable only when: (1) providers' average error per observation is known, (2) between-provider variation in performance is known, and (3) the classification system is decided.
- Intuitive, computationally simple.
- Without information on providers' average error per observation, between-provider variation in performance, or the classification system that will be used, the amount of misclassification due to chance is unknown.
- If a CVE sets the same minimum N for all measures, this may actually produce different levels of misclassification risk for each measure.
- A minimum N may provide false reassurance about the risk of misclassification.
- Many providers and measures may be excluded.
Cautionary Note on Using 25 or 30 Observations as a Minimum N:
Many CVEs and other performance reporting entities have gravitated toward the numbers 25 or 30 as the minimum numbers of observations needed to report a provider's performance score (see below). While these numbers have been widely adopted, they may not be high enough to ensure a level of reliability (≥0.7) that is considered to be adequate to prevent excessive misclassification due to chance. There is no "right" amount of misclassification risk (go to the section on Task Number 1), but CVE stakeholders may benefit from knowing the implications of choosing each minimum N.
A recent paper by Sequist and colleagues presents a helpful set of tables that demonstrate the relationship between reliability and minimum N for ambulatory quality measures in a large sample of Massachusetts primary care practice sites.33 At a minimum N of 30, only 4 of the 14 measures investigated by Sequist and colleagues achieved a reliability of ≥0.7. For some measures, more than 200 observations from each site would be needed to achieve this level of reliability.
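The link between minimum N and reliability can be sketched under a common simplifying model in which reliability equals between-provider variance divided by the sum of between-provider variance and error variance per observation over N. The variance values below are invented for illustration and are not taken from Sequist and colleagues.

```python
from math import ceil


def reliability(between_var, error_var_per_obs, n_obs):
    """Share of the variation in observed scores that reflects real
    differences between providers (simplified model, assumed here)."""
    return between_var / (between_var + error_var_per_obs / n_obs)


def min_n_for_reliability(target, between_var, error_var_per_obs):
    """Smallest N that reaches the target reliability under the same model."""
    return ceil(target / (1 - target) * error_var_per_obs / between_var)


# Hypothetical measure: providers differ only modestly from one another.
between_var = 0.0025        # assumed between-provider variance
error_var_per_obs = 0.16    # assumed error variance per observation

reliability_at_30 = reliability(between_var, error_var_per_obs, 30)
needed_n = min_n_for_reliability(0.7, between_var, error_var_per_obs)
```

For this hypothetical measure, 30 observations give a reliability of roughly 0.32, and about 150 observations are needed to reach 0.7. A measure with more between-provider variation would need fewer, which is why the same minimum N can imply different risks for different measures.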
Examples: Setting a minimum N
In creating performance reports, most report sponsors use a minimum N.
| Organization | Minimum N |
| --- | --- |
| CMS Hospital Compare | 25 observations (process measures of the technical quality of care) |
| Oregon Health Care Quality Corporation (http://www.q-corp.org) | 25 observations (claims-based measures of ambulatory care quality) |
| Wisconsin Healthcare Value Exchange (http://www.wchq.org and http://www.wicheckpoint.org) [a] | 25 observations (claims- and chart review-based measures of hospital care quality); 50 observations for measures of ambulatory care quality |
| New York Quality Alliance (http://www.nyqa.org) | 30 observations (claims-based measures of ambulatory care quality) |
| Healthy Memphis Common Table (http://www.healthymemphis.org) [b] | 30 observations (claims-based measures of ambulatory care quality) |
| Aligning Forces for Quality-South Central Pennsylvania (http://www.aligning4healthpa.org) | 30 observations (chart review-based measures of ambulatory diabetes care quality) |
| Leading organizations of the Minnesota Healthcare Value Exchange (http://www.mnhealthscores.org and http://www.mnhospitalquality.org) [c] | 30-60 observations (claims- and chart review-based measures of ambulatory quality); 25 observations (chart review-based measures of hospital quality) |
| Greater Detroit Area Health Council (http://www.gdahc.org) | 50-60 observations (claims-based measures of ambulatory care quality) |
| Puget Sound Health Alliance (http://www.wacommunitycheckup.org) [d] | 160 observations (claims-based measures of ambulatory care quality); 25 observations (measures of hospital care quality) |

a. For hospital measures in Wisconsin, performance scores are generally still available even when there are fewer than 25 observations (after a mouse click). When these small-denominator scores are displayed, there is a disclaimer that the scores may have low reliability.
b. In Memphis, measures of cardiovascular care were excluded from reporting because very few providers had more than 30 observations.
c. Minnesota Healthcare Value Exchange stakeholders arrived at minimum denominators for ambulatory measures by "statistically eyeballing" the performance data; no formal assessment of misclassification risk was performed.
d. The Puget Sound Health Alliance originally had a requirement of 250 denominator observations for each of its ambulatory quality measures. This minimum N of 250 was based on analyses of reliability. However, there was pushback from clinics whose performance was not being reported due to this minimum N (and also pushback from health plans and employers). After analyses found little difference between a denominator of 250 and 160 in terms of reliability, the minimum N was changed to 160.
Option 6: Report composite performance measures.34 Composite performance measures mathematically combine provider performance data across multiple measures. While this approach increases N, the construction of composites has some important caveats. To make the best possible use of composites (especially when making a new composite that has not been previously developed by a national body), consultation with a statistician will be needed. This consultation is particularly important if the composite combines different types of measures with different types of statistical distributions. Refer to the section Task Number 6 for further discussion of how composites can be constructed.
- May limit the risk of misclassification due to chance, depending on the type of composite.
- May increase the risk of misclassification due to chance. This paradox can occur with certain combinations of measures. How can this happen? While N may increase when combining individual measures, it is possible for average error per observation to also increase because the nature of the observation changes when creating composites. The amount of between-provider variation on composite measures is also likely to differ from the amount of between-provider variation on individual measures.
- Figure 5 (Appendix 2) shows factors related to misclassification due to chance. Based on the figure, if creating a composite measure results in enough of an increase in within-provider error and decrease in between-provider variation, then this composite will have lower reliability than the individual measures. The composite will therefore carry a higher risk of misclassification due to chance. This paradoxical increase in misclassification risk is most likely to occur when combining measures that are negatively correlated with each other (i.e., when performance on some measures is high, performance on the others tends to be low).
- May limit the interpretability of reported performance. For example, it may be harder to know what is meant by a low score on a composite measure of diabetes performance. Does this mean performance is poor on all of the individual measures of diabetes care, or does it mean that performance is good on some but especially poor on others? Reporting composite performance makes it impossible to tell.
- May reduce the usability of performance data to guide provider performance improvement efforts.
- May unintentionally overemphasize certain individual measures and underemphasize others. Go to the section Task Number 6 for further discussion of how this may occur.
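The paradox described in this list can be illustrated with a deliberately simple case: an equal-weight average of two measures that have the same between-provider variance and the same error variance, with independent measurement errors. These assumptions, the function names, and the numbers are hypothetical. Real composites are rarely this tidy.

```python
def reliability(between_var, error_var):
    """Reliability of a score, given its between-provider variance and
    the error variance of the provider-level score."""
    return between_var / (between_var + error_var)


def composite_reliability(between_var, error_var, correlation):
    """Reliability of the equal-weight average of two similar measures.

    correlation: correlation of providers' true performance on the two
    measures. Measurement errors are assumed independent.
    """
    composite_between = (2 * between_var
                         + 2 * correlation * between_var) / 4
    composite_error = (2 * error_var) / 4
    return reliability(composite_between, composite_error)


single_measure = reliability(0.01, 0.0032)
positively_correlated = composite_reliability(0.01, 0.0032, correlation=0.8)
negatively_correlated = composite_reliability(0.01, 0.0032, correlation=-0.8)
```

In this sketch each individual measure has a reliability of about 0.76. The composite reaches about 0.85 when the measures are positively correlated but falls to about 0.38 when they are negatively correlated, because averaging cancels out most of the between-provider variation.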
Option 7: In the case of physician ratings, report performance for larger provider groupings. For example, a CVE may "roll up" the performance scores of individual physicians into practice sites or larger physician groups, making these larger organizations the units of reporting. This option is especially important when a CVE is thinking about reporting performance on measures that could be attributed to individual practitioners. When the risk of misclassification on a measure is higher than acceptable for a large proportion of practitioners, reporting performance at higher organizational levels (e.g., practice sites, groups, hospitals) is another way to increase N for each provider reporting unit.
- May limit the risk of misclassification due to chance, relative to reporting the performance of lower levels of organization.
- May increase the risk of misclassification due to chance, relative to reporting the performance of lower levels of organization. As with reporting composites, this paradox can occur because the nature of the observation changes when reporting on aggregations of providers. Although N may increase, the average error per observation may also increase, and between-provider variation may decrease. This paradoxical result can occur, for example, when the performance scores of individual practitioners are negatively correlated (i.e., when some practitioners do well, the others tend to do poorly).
- May limit the usefulness of performance reports to patients and other stakeholders who want performance data on individual practitioners.
- May unintentionally mask good (or poor) performance by individual practitioners or other subunits of provider organizations.
- May dilute individual practitioners' accountability for performance.
Examples: Reporting performance at higher levels of provider organization
Massachusetts Health Quality Partners (MHQP; http://www.mhqp.org) reports performance on HEDIS measures at the physician group level. The minimum group size is three physicians. Because of this criterion, solo and two-physician practices have no reported HEDIS results unless they are reported within a larger group. On the other hand, the maximum risk of performance misclassification is reduced.
The Greater Detroit Area Health Council (http://www.gdahc.org) reports performance on measures of ambulatory care quality (mostly HEDIS measures) at the physician organization level. There are 16 total physician organizations in the GDAHC's reports, some with thousands of physicians. Reporting at this level results in measure denominators far in excess of GDAHC's minimum N (50-60 observations).
Option 8: Report performance over a longer time period ("rolling average" performance). When the risk of misclassification on a given measure is higher than acceptable for a large proportion of physicians, reporting performance data accumulated over a longer period is another way to increase N for each provider reporting unit. For example, instead of reporting performance data from just the most recent available year, a CVE may report performance data aggregated over the most recent 3 years. In doing so, a CVE may decide to weight the most recent year's performance more heavily in the "rolling average" or weight each year's performance equally in the calculation. Whether this approach reduces misclassification risk depends on whether "true" performance is stable over time. If "true" performance is changing (maybe because a provider is implementing an improvement strategy), then this approach may paradoxically increase the risk of performance misclassification (because the provider's older performance does not accurately reflect current performance).
- Reduces the risk of misclassification due to chance if "true" performance does not change over time.
- May increase the risk of misclassification if "true" performance changes over time. To determine whether true performance is changing and to address this issue, consultation with a statistician may be needed.
- May limit the usefulness of performance reports to patients and other stakeholders who want only the most recent performance data.
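One way a rolling average might be computed is sketched below. It pools pass/fail counts across years and lets each year carry a weight. This is only one of several reasonable definitions, and the function name, weights, and counts are hypothetical.

```python
def rolling_rate(yearly_counts, weights=None):
    """Pool pass/fail counts across years into one weighted rate.

    yearly_counts: list of (passes, n_obs) tuples, oldest year first.
    weights: one weight per year; equal weights if omitted.
    """
    if weights is None:
        weights = [1.0] * len(yearly_counts)
    weighted_passes = sum(w * p for w, (p, _) in zip(weights, yearly_counts))
    weighted_obs = sum(w * n for w, (_, n) in zip(weights, yearly_counts))
    return weighted_passes / weighted_obs


# Hypothetical provider whose performance is improving each year.
three_years = [(20, 30), (24, 30), (27, 30)]
equal_weighting = rolling_rate(three_years)
recent_weighting = rolling_rate(three_years, weights=[1, 2, 3])
```

For this improving provider, the equally weighted rate (about 79 percent) and the recency-weighted rate (about 83 percent) both lag the most recent year's 90 percent. This shows how a rolling average can misstate current performance when true performance is changing.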
Option 9: Include more data sources for a measure. Including performance data by aggregating data from a greater number of sources—such as multiple commercial plans, Medicare, and Medicaid—is another way to increase N for each provider reporting unit (as discussed in the section Task Number 3). Aggregated multipayer data not only reduce the risk of performance misclassification due to chance, but also may reflect the care delivered to a broader patient population (compared with data from just one payer). However, the issues mentioned in the section Task Number 3 should be addressed when aggregating multipayer data. Care must be taken to avoid combining data in ways that are not valid (i.e., combining data without ensuring that the data have the same interpretation across all sources).
- May reduce the risk of misclassification due to chance.
- May produce a fuller picture of provider performance across a broader patient population.
- May increase the risk of misclassification due to chance. As with reporting composites, this paradox can occur because the nature of the observation changes when aggregating performance data generated by different patient populations. Although N may increase, the average error per observation may also increase, and between-provider variation may decrease.
- Data aggregation across sources creates the possibility of introducing statistical bias when data do not have the same interpretation across all sources (i.e., when data are not combined in a valid way). Go to the section Task Number 3 for guidance on data aggregation. In addition, case mix adjustment methodologies can be used to guard against increasing statistical bias, discussed in the section Task Number 5.
Warning about a pitfall: the "finite population correction" (also known as the "finite population sampling model"). The finite population correction refers to the practice of reporting a lower amount of uncertainty in a performance estimate by incorporating information about the overall size of the patient population from which a data sample is drawn. The practical argument in favor of this technique is: "If I know a provider took care of 10 patients in a year and I sample all 10 patients for a performance measure, then I know the provider's score in that year with complete certainty. So I don't have to worry about misclassification risk, even though the sample size is only 10 patients." A similar argument can be made to reduce the amount of reported uncertainty in performance estimates based on samples that are less than 100 percent of a provider's overall patient population (e.g., 80 percent or 50 percent of a provider's patients).
For the purposes of public reporting, the finite population correction should not be used. When patients use performance reports to choose a provider, past performance matters only because it gives some indication of what kind of care a patient will receive in the future (with some degree of uncertainty, of course). Past performance would matter on its own only if a patient had a time machine that allowed him or her to actually receive care that happened in the past (i.e., the care that generated the data used to calculate performance scores).
Because every provider has a theoretically infinite population of future patients, the finite population correction is likely to mislead patients who use public reports of provider performance. Small sample sizes, no matter how completely they capture a provider's past patient population, are likely to produce performance estimates that have high probabilities of misclassification due to chance. For example, a performance report based on 100 of a provider's patients out of a total population of 1,000 will have lower misclassification risk than a report based on all 10 of a provider's patients. A more technical explanation of what the finite population correction is and why it should not be used in performance reporting is available in a paper by Elliott, Zaslavsky, and Cleary.35
- There are no real advantages. The finite population correction appears to reduce the risk of misclassification due to chance. But this is a mirage: Misclassification risk regarding future performance is not reduced.
- High likelihood of covering up true misclassification risk. Reporting past performance that is not a good predictor of future performance is likely to mislead patients and may alienate providers.
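The comparison between a sample of all 10 patients and a sample of 100 out of 1,000 can be sketched numerically. The example below assumes a pass/fail measure, a simple random sample, the textbook standard error of a proportion, and the textbook finite population correction factor sqrt((N - n) / (N - 1)). None of these formulas is stated in this guide, and the observed pass rate of 0.8 and the function names are hypothetical.

```python
import math


def standard_error(p: float, n: int) -> float:
    """Standard error of an observed proportion p from n patients (no correction)."""
    return math.sqrt(p * (1.0 - p) / n)


def fpc_standard_error(p: float, n: int, population: int) -> float:
    """Standard error after applying the finite population correction factor."""
    if population < 2 or not 0 < n <= population:
        raise ValueError("need population >= 2 and 0 < n <= population")
    fpc = math.sqrt((population - n) / (population - 1))
    return standard_error(p, n) * fpc


p = 0.8  # hypothetical observed pass rate, the same for both providers

# Provider A: all 10 of 10 past patients sampled.
se_all_10 = standard_error(p, 10)
fpc_all_10 = fpc_standard_error(p, 10, population=10)

# Provider B: 100 of 1,000 past patients sampled.
se_100 = standard_error(p, 100)
fpc_100 = fpc_standard_error(p, 100, population=1000)

print(f"A (10 of 10):     uncorrected={se_all_10:.3f}  corrected={fpc_all_10:.3f}")
print(f"B (100 of 1,000): uncorrected={se_100:.3f}  corrected={fpc_100:.3f}")
```

Under these assumptions, the corrected standard error for the 10-of-10 sample is zero, which suggests complete certainty. The uncorrected standard error for the same sample is roughly three times that of the 100-patient sample. The uncorrected figure is the one relevant to care for future patients, so the smaller sample carries the higher misclassification risk.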
iv Case mix adjustment techniques broadly fall into two categories: adjustment based on regression models, and adjustment based on reweighting the data that are used to calculate performance scores. The technical differences between these two approaches are beyond the scope of this guide. The discussion in this section pertains to both case mix adjustment approaches.
v Go to the section Task Number 1 for more on misclassification due to chance.