Appendix 2: Performance Misclassification Due to Chance
Methodological Considerations in Generating Provider Performance Scores
This section contains background information on the concept of performance misclassification due to chance. We recommend this section to Chartered Value Exchange (CVE) stakeholders who are interested in learning why this concept is important and understanding how the following more commonly discussed topics relate to each other:
- Measurement error.
- Sample sizes.
For practical guidance on options for limiting the risk of performance misclassification, go to the section on Task Number 5. For a more detailed discussion of reliability and performance misclassification due to chance, we refer interested readers to two technical reports: The Reliability of Provider Profiling: A Tutorial by Adams,51 and Estimating Reliability and Misclassification in Physician Profiling by Adams, Mehrotra, and McGlynn.52
A. What is misclassification due to chance?
Any time performance is measured, there will always be some amount of random measurement error. The unavoidable presence of measurement error means that for every provider in a report, there is a certain probability that due to chance alone, performance is reported in the wrong class or category. In other words, any performance report that contains more than one category (i.e., a report that enables any kind of comparison between providers) will have some degree of misclassification due to chance.
However, patients, providers, and other CVE stakeholders may want to limit the amount of misclassification due to chance. Reports with too much misclassification due to chance may mislead large numbers of patients, and providers may also be concerned about the impact of misclassification, as shown in Figure 3.
Unfortunately, it is impossible to know exactly which providers are misclassified due to chance alone. On the other hand, it generally is possible to know, for each provider, the risk (i.e., the probability) that performance is misclassified.
B. Why focus on the risk of misclassification due to chance?
Some CVE stakeholders may be familiar with the statistical concept of "reliability," which is related to misclassification. Reliability, which is more formally defined later in this appendix, can be conceptualized as the "signal-to-noise ratio" in measuring performance. With larger amounts of measurement "noise" (i.e., greater quantities of random measurement error), it becomes hard to discern the "signal" in performance data (i.e., which providers are truly higher performing and which are truly lower performing).
The reason that this paper focuses on the "risk of misclassification due to chance" rather than solely focusing on reliability is that on its own, reliability does not have a direct, easily understood interpretation in performance reporting. For performance reports, the significance of reliability depends on the system for classifying performance.
The relationship between reliability and performance misclassification due to chance was originally highlighted in the 2006 work of Safran and colleagues.31 As Dr. Safran recalls:
The idea of evaluating risk of misclassification came out of a wonderful question that I was asked by a clinician, who was troubled by our reliability criterion of 70 percent. The clinician asked: "Does that mean there is a 30 percent chance that you have my score wrong? Because if that's what it means, then maybe statisticians think 70 percent is a good standard, but clinicians will find it unacceptable." This question inspired our work to elucidate the "risk of misclassification" construct and to have a methodology that allowed us to operationalize it.
In a performance report that is constructed with misclassification in mind, reliability of 70 percent might translate into a risk of misclassification that is quite low (less than 2.5 percent in the classification system for reporting patient experience results that was presented in Safran's 2006 paper).31 Recent work by Adams and colleagues also provides an example of the relationship between reliability and performance misclassification in reports of physicians' performance on cost measures.29,51-52
C. What determines the risk of misclassification due to chance?
The risk, or probability, of misclassification due to chance is directly determined by the statistical "reliability" of the provider's measured performance and the classification system that is used in the performance report. The reliability of a measure is affected by the number of observations, the average level of "error" per observation, and the amount of provider-to-provider variation in performance.51 The relationship between these factors and the risk of misclassification is illustrated in Figure 5.
To help explain Figure 5, we define the terms below.
Classification system refers to the way provider performance is presented in reports. The kind of classification system used is a value judgment, and there is no single best classification system for all purposes and users. Examples of classification systems include categorizing providers as "below average," "average," and "above average"; giving providers star ratings based on designated performance thresholds; and ranking providers according to their relative performance.
The choice of classification method will be influenced by how the results will be used. The classification system is the result of decisions about (1) whether to use performance thresholds, (2) how many thresholds to use, (3) where to set thresholds, and (4) what kind of performance scores to report ("shrunken" or "observed" performance; discussed in section on Task Number 5). Because the classification system used in a performance report is one of the key determinants of misclassification risk, it is impossible to calculate the misclassification risk for providers included in a report without first deciding on a classification system.
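To make the link between the classification system and misclassification risk concrete, the sketch below (our own illustration, with invented numbers, not part of the source methodology) considers the simplest classification system: a single "above average" threshold. Assuming normally distributed measurement error, the risk that a provider is reported on the wrong side of the threshold can be computed from the provider's "true" score, the standard error of measurement, and the threshold location.

```python
from math import erf, sqrt

def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def misclassification_risk(true_score: float, std_error: float,
                           threshold: float) -> float:
    """Probability that measured performance lands on the wrong side of a
    single classification threshold, given normal measurement error."""
    z = (threshold - true_score) / std_error
    if true_score >= threshold:
        # Truly at or above threshold: risk = P(observed falls below it)
        return normal_cdf(z)
    # Truly below threshold: risk = P(observed falls at or above it)
    return 1.0 - normal_cdf(z)

# Hypothetical provider: true score of 78 on a 0-100 scale, standard
# error of 4, and an "above average" threshold set at 75.
risk = misclassification_risk(true_score=78.0, std_error=4.0, threshold=75.0)
print(f"Risk of misclassification: {risk:.1%}")
```

With these hypothetical inputs the computed risk is roughly 23 percent; moving the threshold farther from the provider's true score, or shrinking the standard error, reduces it. This is why neither reliability nor the classification system alone determines misclassification risk.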
Reliability is a property of the performance measure, the individual provider, and the provider population being measured. Therefore, if a CVE truly wants to know the magnitude of the risk of misclassification in its reports, the CVE will need to compute reliabilities for the measures it applies within its own provider community.
Reliability is a statistical concept that describes how well one can confidently distinguish the performance of one provider from another.xiii Put another way, reliability is determined by the relative amounts of "signal" and "noise" in performance data. In Figure 5, within-provider measurement error (i.e., random measurement error) is the "noise" and between-provider variation in performance is the "signal" a CVE may want to detect. Reliability is very important to determining misclassification risk: For any given classification system, the higher the reliability, the lower the misclassification risk.
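The signal-to-noise description above can be expressed as a formula: reliability is the between-provider ("signal") variance divided by the sum of between-provider variance and within-provider error ("noise") variance. A minimal Python sketch, using hypothetical variances of our own choosing:

```python
def reliability(between_var: float, within_var: float) -> float:
    """Reliability as a signal-to-noise ratio: the share of total variance
    in a provider's score that reflects true differences between providers
    rather than random measurement error."""
    return between_var / (between_var + within_var)

# Hypothetical values: "signal" variance of 9, "noise" variance of 4.
print(round(reliability(9.0, 4.0), 2))  # 0.69
```

Note how the formula captures the statement in the text: for fixed between-provider variance, more noise lowers reliability, and vice versa.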
For some types of performance measures, each individual provider in a report may have a different level of reliability. In general, when providers can have different numbers of observations (e.g., measures of diabetes quality) or different amounts of error per observation (e.g., cost profiles), reliability can only be calculated on a provider-by-provider basis. An example of how reliability can vary by provider is presented in a recent paper by Adams and colleagues that investigates the reliability of physician cost profiles.29 The technical appendix accompanying Adams' paper contains a more detailed statistical explanation of reliability and how it can vary from provider to provider, even on the same performance measure.53
On the other hand, for measures such as patient experience ratings, reliability may be the same for all providers in a report. This can occur because the amount of error per observation is generally a property of the survey instrument (rather than the provider), and the number of observations (i.e., the number of survey responses) can be equalized across providers.
Variation in performance between providers. Performance variation matters for many reasons, including the usefulness of reports to patients choosing a provider. If all providers have exactly the same performance, or zero variation, patients cannot use the report to choose among them.xiv For misclassification risk, performance variation matters because it affects reliability. All other things being equal, the greater the performance variation between providers, the higher the reliability (i.e., the ability to discriminate performance between providers) and the lower the risk of misclassifying providers due to chance.
- Illustration: Imagine that you are trying to report the performance of providers on a measure that goes from 0 (bad performance) to 100 (good performance). Figure 6 gives an example of how performance variation might look for two populations, each containing five providers.
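The effect pictured in Figure 6 can be sketched in a few lines of Python. The two score lists and the within-provider error variance below are invented for illustration; with identical measurement noise, the population whose true scores are more spread out yields higher reliability.

```python
from statistics import pvariance

within_var = 25.0  # hypothetical within-provider error variance, same for all

# Two hypothetical populations of five providers each (cf. Figure 6),
# scored on a 0-100 measure.
clustered = [48, 49, 50, 51, 52]  # little between-provider variation
spread = [30, 40, 50, 60, 70]     # substantial between-provider variation

for scores in (clustered, spread):
    between_var = pvariance(scores)
    r = between_var / (between_var + within_var)
    print(f"between-provider variance {between_var:5.1f} -> reliability {r:.2f}")
```

With these numbers, the clustered population yields a reliability of roughly 0.07 and the spread-out population roughly 0.89, even though the measurement error is identical.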
Within-provider measurement error is a statistical term that describes the amount of uncertainty in the performance that is measured for a single provider, taking account of all the available observations for that provider. Although the word "error" is used, it does not mean a mistake is being made in performance measurement and reporting. Instead, measurement error is a natural phenomenon that occurs in all measurement processes, from taking a patient's weight and blood pressure to evaluating a provider's performance. There is a hypothetical (and unobservable) "true" value for all the things we might try to measure. Measurement allows us to determine the range in which this "true" value probably exists.
Provider performance is no different. The lower the within-provider measurement error, the more precise the estimate of "true" performance becomes. Statistical confidence intervals are one technique for calculating the range in which "true" performance probably exists (a 95% confidence interval is constructed so that, across repeated measurements, it captures "true" performance about 95% of the time).
Within-provider measurement error matters to misclassification risk because, all other things being equal, the higher the measurement error, the lower the reliability and the higher the risk of misclassification due to chance.
- Illustration: Figure 7 shows an example of two providers, displaying both the observed average performance on a single measure and the amount of uncertainty about "true" average performance. Even though the observed average performance levels are identical in both examples (so the variation in performance between providers is the same), the within-provider measurement error is different.
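The point illustrated in Figure 7 can be sketched numerically. In the hypothetical example below (all numbers invented), two providers have the same observed average but different measurement error, producing confidence intervals of very different widths.

```python
def ci95(mean: float, std_error: float) -> tuple[float, float]:
    """Approximate 95% confidence interval using the normal critical value 1.96."""
    return (mean - 1.96 * std_error, mean + 1.96 * std_error)

# Two hypothetical providers: same observed average (70 on a 0-100 scale),
# different within-provider measurement error.
low_error = ci95(70.0, std_error=2.0)   # roughly (66.1, 73.9): a narrow range
high_error = ci95(70.0, std_error=6.0)  # roughly (58.2, 81.8): a wide range
print(low_error, high_error)
```

The narrow interval pins down "true" performance much more tightly, which is why lower within-provider error translates into higher reliability and lower misclassification risk.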
Average error per observation is a statistical term that describes how much variation there is in the observation-to-observation performance of a single provider. Due to chance alone, a given provider's performance on a measure may vary from patient to patient, from day to day, from week to week, etc. The more this performance varies, the harder it is to distinguish one provider from other providers. Average error per observation matters to misclassification risk because, all other things being equal, the higher the average error, the higher the within-provider measurement error, the lower the reliability, and the higher the risk of misclassification due to chance.
- Illustration: Figure 8 shows an example of observations for two providers on a single performance measure (e.g., a measure of costs). Each provider has six observations. The average score is the same for both providers, but one has higher average error per observation than the other.
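The situation in Figure 8 can be sketched with invented data: two providers with six cost observations each and identical averages, but different observation-to-observation spread, and therefore different standard errors for their averages.

```python
from statistics import mean, stdev
from math import sqrt

# Six hypothetical cost observations per provider (cf. Figure 8):
# identical averages, different observation-to-observation spread.
provider_a = [95, 98, 100, 100, 102, 105]
provider_b = [70, 85, 95, 105, 115, 130]

for name, obs in (("A", provider_a), ("B", provider_b)):
    se = stdev(obs) / sqrt(len(obs))  # standard error of the provider's mean
    print(f"Provider {name}: mean {mean(obs):.1f}, "
          f"per-observation SD {stdev(obs):.1f}, standard error {se:.1f}")
```

Both providers average 100.0, but Provider B's larger per-observation spread inflates the uncertainty around that average, which is exactly the mechanism by which average error per observation raises misclassification risk.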
N (number of observations) refers to the number of observations a given provider has on a performance measure. For example, if the performance measure assesses hemoglobin A1c control in patients with diabetes, the number of observations for a provider will be the number of that provider's patients who have diabetes and who qualify for inclusion in the measure. The number of observations matters to performance misclassification because, all other things being equal, the higher the number of observations, the lower the within-provider measurement error, the higher the reliability, and the lower the risk of performance misclassification.
Misclassification risk is affected by factors other than the number of observations (i.e., the average error per observation and the classification system). Therefore, it is impossible to specify a minimum number of observations that will limit the risk of misclassification across all providers, all classification systems, or all measures. In fact, because the average error per observation can vary from provider to provider, different providers may need different numbers of observations to reach the same risk of misclassification.
- Illustration: Figure 9 shows an example of how differing numbers of observations affect the amount of uncertainty around the average performance for two providers. The average error per observation is the same for both providers, and both have the same observed average performance. But because one provider has more observations than the other, the range of uncertainty about the "true" average performance is smaller.
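The interplay among error per observation, number of observations, and reliability can be sketched by inverting the signal-to-noise formulation of reliability (a standard Spearman-Brown-style calculation; all variances below are hypothetical) to ask how many observations a provider needs to reach a target reliability:

```python
from math import ceil

def observations_needed(target_reliability: float,
                        error_var_per_obs: float,
                        between_var: float) -> int:
    """Minimum n such that between_var / (between_var + error_var_per_obs / n)
    meets the target reliability."""
    r = target_reliability
    return ceil((r / (1.0 - r)) * error_var_per_obs / between_var)

# Hypothetical variances: providers whose observations are noisier need
# more of them to reach the same reliability target.
print(observations_needed(0.70, error_var_per_obs=400.0, between_var=25.0))  # 38
print(observations_needed(0.70, error_var_per_obs=100.0, between_var=25.0))  # 10
```

This mirrors the statement in the text: because error per observation varies from provider to provider, no single minimum number of observations can guarantee the same misclassification risk for all providers or measures.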
xiiiReliability also describes how close the measured performance of a provider is to the true performance of that provider. Mathematically, these two definitions are identical.
xiv This uniformity of performance would not necessarily be a bad thing. If providers had uniformly high performance, then patients could choose providers based on factors such as out-of-pocket cost and convenience, resting assured that no matter what provider they chose, performance would be above an acceptable threshold.