Pay for Performance: A Decision Guide for Purchasers
Phase 4. Evaluation
P4P programs are a work in progress and, because there is little evidence about the effects of specific approaches, will need to be monitored and improved on an ongoing basis. Although evaluation will naturally follow implementation, the two questions in this section need to be asked during the design phase to ensure that the implementation of the program will support meaningful evaluation.
Learning about the impacts of a P4P program can be particularly challenging because a multitude of additional forces simultaneously affect the quality of patient care and costs. Ideally, purchasers would implement P4P in one market or sub-market and track the same performance measures on a set of comparison providers. Some large purchasers and CMS may be in a position to implement P4P in this way, but most purchasers will not design their programs as controlled trials. Therefore, some care is needed to disentangle the effects of the program from other trends.
At a minimum, purchasers should collect baseline data on the targeted quality measures (this will be a critical part of implementation too, of course, because providers without a clear understanding of their performance can hardly be expected to respond optimally to P4P). Then, as performance data are collected for payment purposes, the main effect of the program can be evaluated in terms of the change in performance, preferably compared either to a comparable but unaffected population or to the trend in performance prior to implementation.
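The comparison logic described above amounts to a simple difference-in-differences calculation: the change among P4P providers, net of the change among comparison providers over the same period. The sketch below illustrates the arithmetic with hypothetical performance rates; the function name and all figures are illustrative, not drawn from any actual program.

```python
# Illustrative difference-in-differences estimate of a P4P program's effect.
# All rates are hypothetical quality-measure scores (e.g., percent of
# eligible patients receiving a recommended service).

def did_estimate(p4p_before, p4p_after, comp_before, comp_after):
    """Change among P4P providers minus change among comparison providers."""
    return (p4p_after - p4p_before) - (comp_after - comp_before)

# Hypothetical baseline and follow-up rates (in percentage points).
effect = did_estimate(p4p_before=62.0, p4p_after=71.0,
                      comp_before=60.0, comp_after=64.0)
print(f"Estimated program effect: {effect:.1f} percentage points")
# Estimated program effect: 5.0 percentage points
```

In this hypothetical, the raw 9-point gain among P4P providers would overstate the program's effect; netting out the 4-point secular trend seen in the comparison group yields a 5-point estimate. Where no comparison group exists, the pre-implementation trend can serve the same role, though less reliably.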
Purchasers will have to decide how rigorous an evaluation needs to be to ascertain whether a program is working and how to improve it. To adhere strictly to scientific standards of evidence may be too costly and produce evidence too late to be useful for decisionmaking. On the other hand, erroneous conclusions drawn from anecdotal or incomplete information may have substantial costs as well.
In addition to the hoped-for effects of the program, purchasers will need to monitor, and try to minimize, unintended negative consequences. Three important negative effects to look for are patient selection, diversion of attention away from other important aspects of care, and widening gaps in performance among providers.
- Patient selection. Providers may avoid sicker patients in the belief that risk adjustment is not adequate and that caring for such patients will reduce their measured performance. Surveys done after New York instituted public reporting for coronary bypass found that two-thirds of cardiac surgeons admitted to avoiding the most severely ill patients.53 To minimize the potential for the P4P program to result in selection of the "easiest" patients or exclusion of high-risk or non-adherent patients, purchasers can focus on structural or process measures of quality. Risk adjustment of performance measures, particularly those that relate to patient outcomes such as complication or readmission rates, should help to minimize selection incentives as long as providers believe the risk adjustment is adequate. In addition, including explicit reporting of casemix data—which would show providers who are avoiding or accepting the more difficult cases—or providing differential rewards for meeting performance goals with more difficult patients could increase providers’ willingness to take on these cases. Another possibility would be to collect and report information about patients who change from one provider to another. A provider who was avoiding sicker patients would be identified by the high casemix scores of patients leaving his practice.
- Diverting attention from other aspects of care. Targeting specific performance measures may focus provider attention on the conditions or care processes for which there is measurement and payment, to the detriment of performance in other areas.15 At a minimum, this problem suggests the need for careful measure selection and attention to interrelationships among targeted and untargeted domains of performance. Rewarding providers for performance on some broader measures of outcome, such as patient experience or decubitus ulcer (bed sore) rates and pain scores in hospitals, would mitigate this problem as well.
- Widening performance gaps. This may be particularly likely to occur if the purchaser chooses to reward only providers that meet a high standard of performance or those that are the highest ranked among peers. If P4P results in a substantial redistribution of resources, some providers may actually worsen with respect to quality of care. This will be a particular concern if those providers serve large numbers of beneficiaries/enrollees or are part of the safety net, or if there are not enough suitable alternatives for the population that receives care from these poor-performing providers. If these adverse consequences are anticipated or noted, purchasers can consider the solutions described in Question 16.
These examples give important clues about what evidence to seek in evaluating programs for unintended consequences. Clinician feedback should be sought about unexpected problems with the measures used, including difficulties with both access to care and pressure to offer inappropriate care. Since such data would come from clinician surveys, and unhappy clinicians can be expected to be motivated to respond, gathering this feedback should not be too burdensome. Similarly, purchasers should consider tracking a set of performance indicators that are outside of the P4P program to better understand both negative and positive spillover effects from the program onto untargeted clinical domains. Finally, evaluation of the program should look not just at average performance but at the effects of P4P on different parts of the delivery system, including providers with high and low baseline performance.
A Final Note—Sustaining Quality Improvement
Even the best-designed P4P program will require maintenance. For example, if the program uses fixed targets, the targets will need to be advanced as providers improve. We note, however, that if providers see that targets are fully adjusted to reflect gains in prior year performance, incentives to improve quality in the current period may be dampened. For most measures, there are also natural "ceiling" effects that will lead to diminished opportunities to improve quality over time. As adherence rates to evidence-based guidelines approach 100 percent, the incremental cost of improving quality is likely to increase as only the cases that failed to respond to initial quality improvement efforts remain.
As clinical evidence about best practices changes, structural (e.g., information technology requirements) and process measures will also need to be updated. Purchasers will have to balance the need to keep P4P programs effective by retiring measures that are no longer useful against the need for enough stability that providers can undertake larger investments with the expectation that the reward structure will not be dramatically altered in the short run (and hence a reasonable return on investment can be expected). To this end, explicitly including providers in the decisions about measure selection and retention may be desirable. One approach that has been adopted by some programs, including the California IHA, is to commit to medium-term plans (2 or 3 years) with regard to measure sets and to introduce measures in a "testing set" prior to their full inclusion.
To the extent possible, purchasers should use their P4P programs to promote continuous innovation rather than institutionalize a single approach to delivering high-quality care. This concern might be addressed by rewarding, at least in part, outcome measures. Vigorous attempts to keep structure and process measure targets up-to-date with the latest technology will also reduce system rigidity, but political and bureaucratic barriers to change will be inherently limiting.