Chapter 4: Evaluation of Health Care Efficiency Measures
Health Care Efficiency Measures: Identification, Categorization, and Evaluation
In this section we present criteria for evaluating health care efficiency measures and discuss the degree to which existing measures meet them. Our original intention had been to rate each identified measure on the evaluation criteria, but this proved neither feasible nor meaningful because the available evidence is so sparse.
Therefore, we present our evaluation criteria and then discuss in more general terms the strengths and limitations of available measures in terms of these criteria. We conclude with a discussion of potential next steps.
We suggest that measures of health care efficiency be evaluated using the same framework as measures of quality:
- Important: is the measure assessing an aspect of efficiency that is important to providers, payers, and policymakers? Has the measure been applied at the level of interest to those planning to use it? Is there an opportunity for improvement? Is the measure under the control of the provider or health system?
- Scientifically sound: can the measure be assessed reliably and reproducibly? Does the measure appear to capture the concept of interest? Is there evidence of construct or predictive validity?
- Feasible: are the data necessary to construct this measure available? Are the cost and burden of measurement reasonable?
- Actionable: are the results interpretable? Can the intended audience use the information to make decisions or take action?
The ideal set of measures would:
- cover all of the major aspects of efficiency identified in the typology of efficiency measures presented above;
- have evidence of reliability (different analysts using the same methods obtain the same scores), of construct validity (providers judged by other means to be more efficient receive higher scores), and of predictive validity (providers' scores rise after they successfully implement changes designed to improve efficiency); and
- be calculable using existing data.
This ideal set does not exist, and therefore the selection of measures will involve tradeoffs between these desirable criteria (important, valid, feasible, actionable).
Although the "importance" of measures abstracted from peer-reviewed literature is difficult to assess, it seems that a majority of efficiency measures published in the peer-reviewed literature have not been adopted by providers, payers, and policymakers.
One aspect of efficiency that is important to stakeholders is the relative efficiency of various providers, health plans, or other units of the health system. Many of the articles reviewed did not explicitly report comparisons of the efficiency of the providers or other units of analysis studied.
Only 31 of the 158 articles reported such a comparison. The other 127 articles reported efficiency at a grouped level, often studying the effect of one or more factors on group efficiency.
For example, an article might compare the relative efficiency of non-profit versus for-profit hospitals. This type of analysis could potentially be used to answer another question of importance to stakeholders-how can efficiency be improved? Although many articles studied factors that were found to influence efficiency, it was unclear if any findings of factors associated with improved efficiency were strong enough to influence policy.
At the same time, the utility of existing efficiency measures for policy has been questioned, most explicitly by Newhouse.20 The vendor-developed measures that are most commonly used differ substantially from measures reported in the peer-reviewed literature, suggesting that stakeholders found the measures developed in the academic world inadequate for answering the questions most important to them.
We note, however, that many of the vendor-developed measures are based on methods originally developed in the academic world (e.g., Adjusted Clinical Groups). The measures developed in the academic world are more complex to implement than vendor-developed measures.
These measures often present and test sophisticated statistical or mathematical approaches for constructing a multi-input, multi-output efficiency frontier, but focus relatively little on the specification of inputs and outputs, often using whatever variables are readily available in existing data sources.
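The frontier idea behind these approaches can be illustrated without the full linear-programming machinery of DEA. The sketch below uses the related free disposal hull (FDH) estimator, which requires only pairwise comparisons; the hospitals, inputs, and outputs are entirely hypothetical.

```python
# Free disposal hull (FDH): a simple nonparametric frontier estimator
# related to DEA, computable without linear programming.
# All data below are hypothetical, for illustration only.

def fdh_input_efficiency(units):
    """Input-oriented FDH efficiency score for each unit.

    units: dict name -> (inputs tuple, outputs tuple).
    A unit's score is the smallest uniform shrinkage of its inputs
    that is still matched by some observed unit producing at least
    as much of every output. Score 1.0 = on the frontier.
    """
    scores = {}
    for name, (x_o, y_o) in units.items():
        best = 1.0
        for x_j, y_j in units.values():
            # unit j must match or exceed every output of unit o
            if all(yj >= yo for yj, yo in zip(y_j, y_o)):
                # how far o's inputs could shrink and still cover j's inputs
                ratio = max(xj / xo for xj, xo in zip(x_j, x_o))
                best = min(best, ratio)
        scores[name] = best
    return scores

# Toy hospitals: inputs = (beds, FTE staff), outputs = (discharges, visits)
hospitals = {
    "A": ((100, 300), (4000, 20000)),
    "B": ((120, 280), (4000, 22000)),
    "C": ((150, 400), (3900, 21000)),
}
print(fdh_input_efficiency(hospitals))
```

Here hospital C is dominated by B, which produces at least as much of both outputs with no more than 80 percent of C's inputs, so C scores 0.8 while A and B sit on the frontier.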
In contrast, the vendor-developed measures often include a more complex specification of the outputs used, such as episodes of care. It is not clear that either approach is necessarily superior. A critical question in evaluating the importance of a measure is whether it satisfies the intended use.
The vendor-developed measures seem to reflect areas of importance to payers, purchasers, and providers based on how they have been used. The measures have been used by payers and purchasers to profile providers to include in their networks.
In addition, a number of these measures are currently under consideration for various pay-for-performance initiatives. These measures assess efficiency both at the organizational level (e.g., hospitals or medical groups) and at the individual physician level.
They offer both a global perspective on the drivers of total costs and resource utilization and drilled-down specifics for individual clinical areas and providers. In this respect, efficiency measures commonly used by health plans and purchasers respond to perceived needs in the market.
One area of importance that is poorly reflected by existing measures is social efficiency. Despite a widespread acceptance that the allocation of resources in the current health care system is very inefficient, there appear to be no accepted measures of efficiency in this important area.
Very little research on the reliability and validity of efficiency measures has been published to date. This includes measures developed by vendors as well as those published in the peer-reviewed literature.
Of the 158 peer-reviewed articles containing efficiency measures, only three reported any evidence of the validity of the measures, and one reported evidence of reliability. It was somewhat more common for articles to test model specifications through sensitivity analyses: 59 of the 137 measures using DEA, SFA, or other regression-based approaches reported the results of sensitivity analyses.
Vendors typically supply tools (e.g., methods for aggregating claims to construct episodes of care or methods for aggregating the costs of care for a population) from which measures can be constructed; thus, the assessment of scientific soundness requires an evaluation of the application as well as the underlying tools.
Several studies have examined some of the measurement properties of vendor-developed measures, but the amount of evidence available is still limited at this time. Thomas, Grazier, and Ward58 tested the consistency of 6 groupers (some episode-based and some population-based) for measuring the efficiency of primary care physicians.
They found "moderate to high" agreement between physician efficiency rankings using the various measures (weighted kappa = .51 to .73). Thomas and Ward59 tested the sensitivity of measures of specialist physician efficiency to episode attribution methodology and cost outlier methodology.
Thomas60 also tested the effect of risk adjustment on an ETG-based efficiency measure. He found that episode risk scores were generally unrelated to costs and concluded that risk adjustment of ETG-based efficiency measures may be unnecessary.
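Agreement statistics like the weighted kappas reported in these studies can be computed from paired tier assignments. The sketch below is a minimal linear-weighted kappa; the physicians and tier assignments are invented, not data from the cited studies.

```python
# Linearly weighted kappa for agreement between two efficiency
# measures that place the same physicians into ordered tiers.
# Data are hypothetical, not drawn from the studies cited above.

def weighted_kappa(r1, r2, k):
    """Linear-weighted Cohen's kappa for paired ratings in 0..k-1."""
    n = len(r1)
    # joint distribution of the two raters' tier assignments
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(r1, r2):
        obs[a][b] += 1 / n
    p1 = [sum(obs[i][j] for j in range(k)) for i in range(k)]  # rater 1 marginals
    p2 = [sum(obs[i][j] for i in range(k)) for j in range(k)]  # rater 2 marginals
    w = lambda i, j: abs(i - j) / (k - 1)  # linear disagreement weights
    d_obs = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k))
    d_exp = sum(w(i, j) * p1[i] * p2[j] for i in range(k) for j in range(k))
    return 1 - d_obs / d_exp

# Tiers (0 = most efficient) assigned to 10 physicians by two groupers
grouper_a = [0, 0, 1, 1, 1, 2, 2, 0, 1, 2]
grouper_b = [0, 1, 1, 1, 2, 2, 2, 0, 1, 2]
print(round(weighted_kappa(grouper_a, grouper_b, k=3), 3))
```

Because disagreements are weighted by their distance in tiers, near-misses (adjacent tiers) penalize the statistic less than large discrepancies, which is why weighted kappa is the conventional choice for ordered efficiency rankings.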
MedPAC61 compared episode-based measures and population-based measures for area-level analyses and found that they can produce different results. For example, Miami was found to have lower average per-episode costs for coronary artery disease episodes than Minneapolis but higher average per-capita costs, reflecting a higher volume of episodes per beneficiary.
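This kind of divergence follows from a simple identity: per-capita cost equals per-episode cost times episode volume per capita. The toy figures below are hypothetical, chosen only to show how the two metrics can rank areas in opposite orders.

```python
# Per-capita cost decomposes into per-episode cost times episode
# volume, so an area can look efficient on one metric but not the
# other. Figures are hypothetical, not MedPAC's actual data.

areas = {
    # area: (cost per episode, episodes per 1,000 beneficiaries)
    "Area 1": (9000, 50),   # cheaper episodes, higher volume
    "Area 2": (10000, 40),  # costlier episodes, lower volume
}

for name, (per_episode, per_1000) in areas.items():
    per_capita = per_episode * per_1000 / 1000
    print(f"{name}: ${per_episode}/episode, "
          f"${per_capita:.0f} per beneficiary")

# Area 1 has the lower per-episode cost ($9,000 vs $10,000) but the
# higher per-capita cost ($450 vs $400): more episodes per person
# outweigh the cheaper price of each episode.
```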
The lack of testing of the scientific soundness of efficiency measures reflects in part the pressure to develop tools that can be used quickly and with relative ease in implementation. One major measurement problem in efficiency measures is the difficulty in observing the full range of outputs a hospital, physician, or other unit produces.
As described in the results section, many measures capture the quantity of health care delivered, but very few are able to capture the quality or outcomes of this care. Most measures are not able to capture the full range of quantities of interest. As we would expect, most measures are based on quantities that are readily observable in existing datasets: hospital days, discharges, physician hours, etc.
In some cases the adequacy of these variables as proxies for the real quantities of interest is questionable. For example, some studies use the number of beds as a proxy for capital without presenting any evidence on the correlation between the two.
A second area that concerns validity is the specification of the econometric models underlying the measures. The literature shows a wide variation here, with some articles estimating just one single model, and others estimating a whole range of models using various combinations of inputs, outputs, and methods.
At a minimum, authors have made some very basic assumptions about the existence and nature of a random component to outputs. It has been shown that efficiency ratings can be very sensitive to the model chosen.62 When there are conflicting results under different models, it is often not obvious which model and results are preferable.
A third area of potential assessment is the reliability and validity of efficiency measures when implemented in different administrative data sets. This becomes particularly challenging when data sets are aggregated or when data from different entities (e.g., health plans, hospitals) are compared for evaluative purposes.
Data sets from multiple insurers may need to be aggregated for the purposes of developing larger samples of patients. Some of the key challenges include: the effect of benefit design differences, the impact of different methods of paying physicians, use of local codes, differential use of carve out/contracted providers, missing data, and so on.
Administrative/billing data are the most common source of information for constructing efficiency measures but users should be aware of the threats to validity when comparing different entities.
A fourth area is whether the measures take into account and adjust for both case mix (i.e., the nature and volume of the types of patients being treated) and risks (i.e., severity of illness of the patients), such as other co-morbidities.
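One common way to make such adjustments is indirect standardization, in which a provider's observed costs are compared with the costs expected given its patient mix. The sketch below assumes a hypothetical three-category risk model with invented costs; real case-mix systems use far richer categorizations.

```python
# Indirect standardization: a simple form of case-mix adjustment.
# A provider's observed/expected (O/E) ratio compares its actual
# costs with the costs expected given its patient mix.
# All categories and dollar figures below are hypothetical.

# Expected cost per patient by risk category, e.g., derived from a
# reference population (illustrative values).
expected_cost = {"low": 2000, "medium": 5000, "high": 12000}

def oe_ratio(patients):
    """patients: list of (risk_category, actual_cost) pairs."""
    observed = sum(cost for _, cost in patients)
    expected = sum(expected_cost[cat] for cat, _ in patients)
    return observed / expected

# Provider A treats sicker patients but at lower cost per risk level;
# Provider B treats healthier patients at higher cost per risk level.
provider_a = [("high", 11000), ("high", 10000), ("medium", 4500)]
provider_b = [("low", 2500), ("low", 2600), ("medium", 5500)]

print(round(oe_ratio(provider_a), 2))  # below 1: cheaper than expected
print(round(oe_ratio(provider_b), 2))  # above 1: costlier than expected
```

Without the adjustment, Provider A's higher raw spending per patient would make it look less efficient, even though it spends less than expected for its sicker case mix.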
A final area revolves around the implicit assumptions about the comparability of the outputs measured, particularly with regard to quality of care. While most users of efficiency measures are likely to use separate methods for evaluating quality, the methodological work to link these two constructs has not been done.
In the absence of explicit approaches to measuring quality, the efficiency measures assume that the quality of the output is equivalent. In most cases this assumption is likely not valid.
Since most of the efficiency measures abstracted in the literature review are based on existing public-use data sources, they could feasibly be reconstructed. Most articles appeared to specify the best possible measure given the limitations of existing public-use data, rather than collect or compile data sets to construct the best possible measure.
That is, the measures in the peer-reviewed literature generally seemed primarily shaped by feasibility, and secondarily by scientific soundness.
All of the efficiency measures identified through the grey literature also rely on existing data (e.g., insurance claims). Most were developed by vendors with the feasibility of use by their clients in mind.
However, most vendor-developed measures are proprietary and may therefore impose cost barriers during implementation. In fact, one of the stakeholders interviewed specifically cited the cost of purchasing a vendor-developed product as one of the primary reasons their organization created its own efficiency measure.
However, existing public-use data sets available for research use may pose several difficulties for the specification of scientifically sound, important efficiency measures. For example, it may be difficult to assign responsibility for measures to specific providers based on claims, or to group claims into episodes or other units.
MedPAC has tested the feasibility of using episode-based efficiency measures in the Medicare program, examining MEG- and ETG-based measures using 100% Medicare claims files for 6 geographic areas.
They found that most Medicare claims could be assigned to episodes, most episodes could be assigned to physicians, and outlier physicians could be identified, although each of these processes is sensitive to the criteria used.
The percentage of claims that could be assigned to episodes and the percentage of episodes that could be assigned to physicians were consistent between the 2 measures.
Stakeholders are using efficiency measures for a variety of applications including internal quality improvement, pay-for-performance, public reporting, and construction of product lines that include differential copayments (tiering) for different providers.
Each of these applications requires that the results of the measures be transmitted in a way that facilitates both understanding and appropriate action on the part of the target audience (actionability).
However, relatively little research has been done to understand the ability of different audiences to interpret and use the information. Two examples are provided here based on interviews with stakeholders.
- Flexible pricing-measures should be flexible to allow plans or groups to add their own pricing information if the measure was originally constructed using standardized prices. In many cases, standardized prices are used instead of the actual prices paid. This approach eliminates differences in prices paid by different providers, which providers often argue are not under their control. Insurers or provider groups may also favor standardized pricing so that they do not reveal the prices they have negotiated with suppliers. However, some users may wish to apply actual prices for certain applications and desire this flexibility.
- Clinical relevance-measures need to provide actionable information to guide improvements in clinical practice. Measures cannot be a "black box" of statistics that lack transparency.
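The flexible-pricing point above amounts to keeping service quantities separable from unit prices, so the same episode can be repriced under either a standardized fee schedule or actual paid amounts. A minimal sketch, with hypothetical service codes and fees:

```python
# Repricing the same utilization with standardized vs. actual prices.
# Keeping quantities separate from unit prices lets a user swap in
# their own fee schedule. All codes and prices are hypothetical.

standard_fees = {"office_visit": 100, "mri": 800, "lab_panel": 40}

def episode_cost(services, fees):
    """services: list of (service_code, units, actual_paid_per_unit)."""
    # use the fee schedule where available, else the actual paid amount
    return sum(units * fees.get(code, paid) for code, units, paid in services)

episode = [("office_visit", 3, 130), ("mri", 1, 950), ("lab_panel", 2, 35)]

standardized = episode_cost(episode, standard_fees)
actual = episode_cost(episode, {})  # empty schedule -> fall back to paid
print(standardized, actual)
```

Under standardized prices the episode reflects only utilization differences; under actual prices it also reflects negotiated payment rates, which is exactly the distinction providers and plans argue over.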
Table 10 presents a matrix framework for evaluation of efficiency measures based on their applications and their importance, scientific soundness, and feasibility. The columns are ordered to reflect the hierarchy of decisionmaking about measures:
- Important: if the measure is not important, why go any further?
- Scientifically sound: if it is important but not sound, one cannot have confidence in the results.
- Feasible: if it is important and scientifically sound, is it feasible to implement?
- Actionable: if it is important, scientifically sound, and feasible, can the target audiences understand and act on the information provided?
Reflecting this hierarchy, these four domains are listed from left to right in the columns of the evaluation framework presented in Table 10. Some applications of measures have a stronger requirement for the availability of rigorous information in these four domains than others because of a greater possibility of unintended consequences.
The rows of Table 10 are ordered to reflect the increasing need for rigor across all four domains. When using a measure for provider network selection or tiered copayments in a health plan, it is more important to ensure that the measure is scientifically sound, actionable, etc., due to the potential effects on provider payment, patient choice, and other potential unintended consequences.
In contrast, using a measure for internal review and improvement or research has less potential for unintended consequences and thus has less stringent requirements for information on measure properties as measures are in the process of being evaluated. As measures are tested in these applications, further information on their properties will be available that can be used to assess their appropriateness in other applications.
For example, if a new measure is developed that assesses physician efficiency, it should first be used for research and possibly internal review and improvement while information on its scientific soundness is collected.
Before it is used for public reporting, pay-for-performance, or other applications, its importance and scientific soundness should be well-established, and feasibility and actionability become increasingly important.
None of the health care efficiency measures we identified met our criteria for use in public reporting, tiered network design, or pay-for-performance, since no identified measure has published evidence of sufficient scientific soundness to make it acceptable to all or even most stakeholders.
To supplement the published evidence, we explicitly requested during the peer review process that reviewers indicate which measures were acceptable for current use. The responses ranged from assessments that all current measures are acceptable for internal use but none for public use, to views that some vendor-developed measures are acceptable for use in tiered network design, to frank skepticism that any of the measures are useful.
We therefore conclude that for many of the uses proposed for efficiency measures, such as public reporting, tiered network design, and pay-for-performance, there is insufficient published evidence and stakeholder consensus for any existing measure.
We contrast this to the field of quality measures, where there exist at least a handful of measures that have broad acceptance internationally among stakeholders as being useful measures of quality, including their use for public reporting and pay-for-performance.
In terms of advancing the field of efficiency measures, measurement scientists would prefer that steps be taken to improve these metrics in the laboratory before implementing them in operational uses. Purchasers and health plans are already using vendor-developed products for a variety of applications and believe that these measures will improve with use.
Although this report will likely not change the current tension between these different stakeholders, we believe that a substantial contribution to the field could be made by investing adequate resources in testing vendor-developed measures, exploring whether academically developed measures could be made feasible and actionable for real world applications, and funding the development of new measures and measurement approaches in this area.
Such work might best be done with multistakeholder advisory groups that can help guide measurement teams to find an appropriate balance between scientific rigor and practical utility.
Table 10. Application of efficiency measures.