# CareScience Risk Assessment Model - Hospital Performance Measurement

## V. Statistical Significance

To assist users in interpreting outcome comparison reports for a targeted "analysis set" of cases, the CareScience tool provides an estimate of the **statistical significance** of each outcome deviation (actual minus expected). A "significance flag" indicates the probability that the results could have occurred randomly if there were no true underlying effect. In the front-end reports, a double asterisk (**) indicates 90% significance while a single asterisk (*) indicates 75% significance. Large deviations tend to be significant, except when great uncertainty surrounds expectations for cases in the analysis set. The choice of relatively "low" significance levels (75% and 90%, as opposed to the frequently used 95% and 99% levels) reflects the purpose of the reporting tool: it is designed to be highly sensitive rather than highly specific, tolerating false positives in order to avoid false negatives.

Statistical significance depends on the prediction error of the CareScience risk model, which derives from the properties of ordinary least squares regression analysis. Prediction error reflects how well the model fits the population on which it is calibrated. Hence, it is a **characteristic of the model calibration** and not purely a feature of the group of patients in the analysis set. As a practical matter, prediction error (hence, statistical significance) can be computed for any number of cases in the analysis set, even just one case.

The basis for computing statistical significance is to **aggregate** (as described below) the case-by-case prediction error. Just as the predicted outcome risk for each case is based on that patient's characteristics as processed by the CMS model, the model generates a prediction error for each case. The prediction error for the group of patients in the analysis set is derived by aggregating the cases in the analysis set.

This aggregation poses a challenge, because it involves combining the uncertainty around the predicted value (risk) with the imprecision of the observed outcome value, especially when the number of cases in the analysis set is small. For this reason, one must be cautious in interpreting "significant" deviations when the number of cases in the analysis set is small. In such cases, the conclusion that the deviation is "significant" is based on an **assumption that the observed rate is measured with little or no error.** Under this assumption, all uncertainty in the deviation greater than a "real" **opportunity** is attributed to the prediction error. The assumption may no longer hold when the number of cases is too small; a conventional rule of thumb sets the minimum case count at around 15 to 20. If the deviation between the raw and risk (expected) values is large relative to the prediction error, then the deviation computed from the analysis group is not likely to be due to pure chance.

Remember that the prediction error is based on a model calibration on thousands, if not tens of thousands of patient observations. On the other hand, the number of cases used in certain analysis sets (such as the number of patients treated by a chosen physician for a particular condition) can be quite small. For this type of analysis, the *raw* mean outcome is based on a relatively small sample of cases, which makes the assumption of low measurement error less plausible. In such circumstances, the user must interpret the appearance of significance flags with caution, since the cases in the analysis may be idiosyncratic.

### 5.1 Claim-level Computation

Recall that standardized measures (risks) are calculated using the ordinary least squares (OLS) regression model:

$$\hat{y}_{ijkl} = x_{ijkl}'\,\hat{\beta}_{kl}$$

The following subscripts provide clarity to the notation:

- $i$ = claim_id (each row in the patient table)
- $j$ = provider or grouping
- $k$ = ICD-9-CM principal diagnosis (3 digit)
- $l$ = outcome (mortality, length of stay, charges, cost, complications, complication morbidity)

Hence, $\hat{y}_{ijkl}$^{8} is the predicted value for each outcome, $l$, at the patient level, $i$, for each provider, $j$, and diagnosis, $k$. $x_{ijkl}$ is a vector of patient characteristics and other severity measures that are outside the provider's control, including clinical factors, patient demographic characteristics, and patient selection factors. $\hat{\beta}_{kl}$ is the marginal effect of the independent variables on the outcome measure. $x_{ijkl}$ and $\hat{\beta}_{kl}$ are of dimension $q_{kl}$, equal to the number of linearly independent regressors (including the number of distinct responses for each categorical variable).

The goal of this effort is to calculate a confidence interval around each $\hat{y}_{ijkl}$ to determine whether the observed raw value is "close" to the predicted value and to assess the effect of the risk factors on the patient outcome. The $100(1-\alpha)\%$ confidence interval around each individual patient-level predicted value $\hat{y}_{ijkl}$ is calculated as^{9}

$$\hat{y}_{ijkl} \pm t_{\alpha/2,\,n_{kl}-q_{kl}} \cdot se(\hat{y}_{ijkl}),$$

where the standard error of the predictor is

$$se(\hat{y}_{ijkl}) = s_{kl}\sqrt{1 + x_{ijkl}'\,(X_{kl}'X_{kl})^{-1}\,x_{ijkl}}$$

and

$$s_{kl} = \sqrt{\frac{SSR_{kl}}{n_{kl} - q_{kl}}}.$$

$X_{kl}$ is the matrix of regressors for the calibration observations, SSR is the sum of squared residuals from the fitted regression line, $n_{kl}$ is the number of observations for principal diagnosis $k$ for outcome $l$, and $q_{kl}$ is the number of estimated regression coefficients for diagnosis $k$ and outcome $l$.
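The patient-level interval can be illustrated with a minimal Python sketch. It assumes the standard OLS prediction-interval form (residual standard error times the square root of one plus the leverage term) and, to stay dependency-free, covers only the special case of an intercept plus one regressor (q = 2). The function name, the toy data, and the caller-supplied critical value `t_crit` are illustrative assumptions, not part of the CareScience implementation.

```python
import math


def simple_ols_prediction_interval(x, y, x0, t_crit):
    """Prediction interval at x0 for an OLS fit with intercept and one slope.

    Hypothetical helper: t_crit is the two-sided critical t value for
    n - q degrees of freedom, looked up by the caller.
    """
    n = len(x)
    q = 2  # estimated coefficients: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    intercept = y_bar - slope * x_bar

    # Sum of squared residuals and residual standard error s = sqrt(SSR / (n - q)).
    ssr = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(ssr / (n - q))

    # For q = 2 the quadratic form x0'(X'X)^{-1}x0 reduces to this expression.
    leverage = 1.0 / n + (x0 - x_bar) ** 2 / s_xx

    # The leading 1 adds the population's random error to the sampling error.
    se_pred = s * math.sqrt(1.0 + leverage)
    y_hat = intercept + slope * x0
    return y_hat, se_pred, (y_hat - t_crit * se_pred, y_hat + t_crit * se_pred)


# Toy data; 2.353 is the two-sided 90% critical value for 3 degrees of freedom.
y_hat, se_pred, interval = simple_ols_prediction_interval(
    [1, 2, 3, 4, 5], [2, 4, 5, 4, 5], x0=3, t_crit=2.353
)
print(y_hat, se_pred, interval)
```

With more regressors the same logic applies, but the leverage term requires the full matrix expression rather than the closed form used here.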

### 5.2 Aggregation

The challenge arises when aggregating the standard errors to characterize any targeted grouping of patients, such as all patients admitted by physician *j*.

The expected value for a given outcome aggregated to the provider level, $\hat{y}_{jl}$, is the average of the patient-level expected values for that provider and outcome. Mathematically, it is expressed as

$$\hat{y}_{jl} = \frac{1}{n_{jl}}\sum_{i=1}^{n_{jl}} \hat{y}_{ijkl},$$

where $n_{jl}$ is the number of patients treated by provider $j$ across all diagnoses, and $l$ is the relevant outcome.^{10} This is compared with the average of the raw values for the same grouping,

$$\bar{y}_{jl} = \frac{1}{n_{jl}}\sum_{i=1}^{n_{jl}} y_{ijkl},$$

where $y_{ijkl}$ are the actual outcome values.

The estimated variance of the provider-level estimator is

$$V(\hat{y}_{jl}) = \frac{1}{n_{jl}^{2}}\sum_{i=1}^{n_{jl}} se(\hat{y}_{ijkl})^{2},$$

assuming the patient-level predictions are independent and identically distributed (iid), so that all covariance terms are zero, and the variance of the raw outcome measure is

$$V(\bar{y}_{jl}) = \frac{1}{n_{jl}^{2}}\sum_{i=1}^{n_{jl}} V(y_{ijkl}).$$

In the CareScience suite of products, we report a deviation score that represents the difference between the average observed and expected values for each outcome, $d_{jl} = \bar{y}_{jl} - \hat{y}_{jl}$, where $d_{jl}$ is the provider-level deviation score. To determine the confidence interval around this deviation score, we estimate the variance, $V(d_{jl})$, around it. The confidence interval allows us to gauge whether the study group's deviation could possibly be zero, indicating no significant difference between the observed and expected outcomes. Given that $d_{jl} = \bar{y}_{jl} - \hat{y}_{jl}$, the variance of the deviation score is

$$V(d_{jl}) = V(\bar{y}_{jl} - \hat{y}_{jl}),$$

which can be rewritten as

$$V(d_{jl}) = V(\bar{y}_{jl}) + V(\hat{y}_{jl}) - 2\,\mathrm{Cov}(\bar{y}_{jl}, \hat{y}_{jl}).$$

However, we don't know the covariance of $\bar{y}_{jl}$ and $\hat{y}_{jl}$. We cannot calculate this in Data Manager, since $\bar{y}_{jl}$ and $\hat{y}_{jl}$ are aggregated to the provider level and are therefore provider dependent. We also cannot assume that the mean raw outcome rate is independent of the mean risk-adjusted outcome rate, which would yield a covariance of 0. Therefore, for the purposes of this analysis, we will treat $\bar{y}_{jl}$, the mean observed outcome rate for the provider, as nonstochastic, that is, a non-random variable. Under this simplification, $\bar{y}_{jl}$ is simply a point rather than an estimator with a distribution and therefore has no variance and no covariance with $\hat{y}_{jl}$, the mean risk-adjusted outcome rate for the provider.

The variance of the deviation score then reduces to $V(d_{jl}) = V(\hat{y}_{jl})$, and we can test the null hypothesis that the deviation is equal to zero, $H_0: d_{jl} = 0$, versus the alternative hypothesis $H_a: d_{jl} \neq 0$.

Rejection of the null hypothesis indicates a significant difference between the observed and expected outcome measures. Since we generally work with small sample sizes, we perform a t-test on the null hypothesis. More specifically, we calculate a *t* statistic,

$$t = \frac{d_{jl}}{\sqrt{V(d_{jl})}},$$

with $n - \sum_{k} q_{kl}$ degrees of freedom, where $n$ refers to the total number of observations in the dataset and $\sum_{k} q_{kl}$ is the total number of estimated coefficients across all diagnosis groups.

This computed statistic is compared to the critical value for the *t* distribution with $n - \sum_{k} q_{kl}$ degrees of freedom and the desired alpha level.

As a practical matter, the degrees of freedom calculated in this manner will almost always be in the hundreds or thousands, which brings the test statistic to its limiting distribution, the normal. On the other hand, the number of cases used in a narrow aggregation, such as the number of patients treated by a chosen physician for a particular condition, can be quite small. For this type of analysis, the *raw* mean outcome is based on a relatively small sample of these N cases and has a t-distribution with N-1 degrees of freedom. This becomes the relevant number of degrees of freedom in conducting a test for statistical significance.

All deviation scores carry an indicator of whether the deviation is statistically significantly different from zero. For example, a double asterisk (**), the 90% significance level, indicates that there is less than a 10% probability that the deviation (raw minus standardized) is due entirely to chance. Hence, we can reject the null hypothesis of zero deviation with a 10% chance of a type I error.
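The aggregation described in this section can be sketched in a few lines of Python. The sketch assumes that the provider-level variance is the sum of squared patient-level standard errors divided by the squared case count (independent cases, zero covariance) and that the observed mean is treated as nonstochastic, so the variance of the deviation equals the variance of the mean expected value. The function name and inputs are hypothetical.

```python
import math


def provider_deviation_test(raw_values, expected_values, case_std_errors):
    """Aggregate case-level results for one provider and outcome.

    Returns (deviation, variance, t_statistic), with the deviation defined
    as mean raw minus mean expected.
    """
    n = len(raw_values)
    if not (n == len(expected_values) == len(case_std_errors)) or n == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    mean_raw = sum(raw_values) / n
    mean_expected = sum(expected_values) / n
    deviation = mean_raw - mean_expected

    # Variance of a mean of independent predictions: sum of variances / n^2.
    variance = sum(se ** 2 for se in case_std_errors) / n ** 2

    # Observed mean treated as a fixed point, so V(d) = V(mean expected).
    t_stat = deviation / math.sqrt(variance)
    return deviation, variance, t_stat


# Two hypothetical cases for one physician.
print(provider_deviation_test([1, 0], [0.2, 0.3], [0.3, 0.4]))
```

The resulting statistic is then compared with a critical *t* value for the chosen significance level.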

### 5.3 Environmental Description

- The application reports two distinct levels of significance, 75% and 90%.
- Significance levels are not user-controlled and are therefore standard for all clients.
- Significance flags are indicated by the graphic "*."
- A single asterisk (*) indicates that a deviation is significantly different from zero at the 75% confidence level while a double asterisk (**) signals that a deviation is significantly different from zero at the 90% confidence level. Descriptions of these significance notations appear on every deviation report. Deviations that round to 0.0 do not receive a flag.
- The Data Manager program has been modified to calculate the standard error for each outcome at the patient-level predicted value, $se(\hat{y}_{ijkl})$.
- The program's front-end scripts then perform the mathematics to aggregate the standard errors for the specified grouping (e.g. provider, MDC, DRG, or CTC^{11}). The calculations used are: (a) calculate an average variance, $V(\hat{y}_{jl})$, and (b) calculate each deviation score, $d_{jl}$.
- In calculating deviation scores on reports, the program's front-end scripts use the nonrounded raw and risk values to calculate the deviation (raw-risk). The front-end report *then* rounds the raw, risk, and deviation scores to the first decimal place. The reports have a footnote stating "Raw minus risk may differ from deviation due to rounding."
- The deviation (computed from the non-rounded raw and risk values), raw standard error, and number of observations (n) generate a *t* statistic that is compared to a critical value to determine significance.
- The *t* statistic is calculated for each deviation score, at any aggregate level, using the following equation: $t = d_{jl} / \sqrt{V(\hat{y}_{jl})}$.
- In determining n for the *t* statistic calculation, only valid raw values are used (i.e. for ln values ≠ 99^{12}).
- The ccms_common.t_distribution table is populated with critical values for a two-tail *t* test.
- To determine significance, locate the critical *t* value (field name: t_value) from the ccms_common.t_distribution table having the appropriate degrees of freedom (df = n-1) and significance level (sig_level = 0.75 or 0.90). If there is not an exact match for the number of degrees of freedom, choose the closest number that is smaller than the observed number.
- If the calculated *t* statistic exceeds the t_value for a sig_level of 0.90 (α = 0.10), the deviation score receives two asterisks. If the calculated *t* statistic exceeds the t_value for a sig_level of 0.75 (α = 0.25) but is less than that for a sig_level of 0.90, the deviation receives one asterisk.
- The program's front end imposes a constraint preventing any score with a *rounded* deviation of 0.0 from receiving a significance flag (i.e. even if before rounding the deviation ≠ 0 and the score has received a significance flag, the flag will be removed).
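The flagging rules in the list above can be sketched as follows. The in-memory `T_TABLE` is a small illustrative stand-in for the t_distribution lookup table, holding approximate two-tailed critical values for a handful of degrees of freedom; the function names are hypothetical, and taking the absolute value of the statistic is an assumption that follows from the two-tail test.

```python
import math

# Illustrative stand-in for the lookup table: degrees of freedom ->
# (critical t at sig_level 0.75, critical t at sig_level 0.90), two-tailed.
T_TABLE = {
    1: (2.414, 6.314),
    2: (1.604, 2.920),
    10: (1.221, 1.812),
    30: (1.173, 1.697),
}


def critical_values(df, table=T_TABLE):
    """Use the exact df if present, else the closest smaller df in the table."""
    eligible = [k for k in table if k <= df]
    if not eligible:
        raise ValueError("no table entry at or below the requested df")
    return table[max(eligible)]


def significance_flag(deviation, variance, n, table=T_TABLE):
    """Return '**', '*', or '' for a deviation computed from n valid cases."""
    # Deviations that round to 0.0 never receive a flag.
    if round(deviation, 1) == 0.0:
        return ""
    t_stat = abs(deviation) / math.sqrt(variance)
    crit_75, crit_90 = critical_values(n - 1, table)
    if t_stat > crit_90:
        return "**"
    if t_stat > crit_75:
        return "*"
    return ""


print(repr(significance_flag(-0.5, 0.0625, 11)))
```

Checking the rounded deviation first mirrors the front-end constraint that removes a flag even when the unrounded deviation would have qualified.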

## VI. Select Practice

Select Practice is a collective name for a series of CareScience methodologies, products, and reports. It was first developed in 2002 and 2003 as an additional feature of Quality Manager. Under the Select Practice toggle, a hospital can benchmark its performance against a group of selected hospitals that deliver high-quality care efficiently. The methodology for identifying the selected hospitals soon became widely accepted and evolved into multiple applications, all of which fall under the collective name Select Practice. This documentation focuses mainly on the methodology. The mathematical details are given in Appendix C (*Select Practice Formulas*).

### 6.1 Setting

Outcome comparisons have long been viewed as a powerful way to motivate improvement in inpatient quality of care. These comparisons, often called practice profiles, outcomes reports, report cards, or scorecards, have captured the attention of not only health care providers but also payers and consumers. Although no single universally accepted quality of care measure exists, certain key elements are common.

Mortality is the most widely accepted measure of quality of care, but mortality alone cannot cover all dimensions of quality. In the CareScience model, quality is measured by the incidence of three adverse outcomes: mortality, morbidity, and complications, which are combined into a single quality measure using the preference weightings from the Corporate Hospital Rating Project.^{13}

CareScience defines a highly rated hospital as one that delivers excellent health care in an efficient way. In the CareScience rating model, efficiency is captured by length-of-stay (LOS). Length-of-stay serves as a proxy for resource usage, reflecting how efficiently hospitals allocate resources.

For each disease grouping, hospitals are ranked for quality and efficiency separately, with the highest rankings going to hospitals with the lowest risk-adjusted LOS and adverse outcome rates. To qualify as "Select Practice" for a given disease, a facility must be in the top two quintiles (top 40%) for both efficiency and quality measures. *Because the rating system is two-dimensional and takes into account both quality and efficiency, the system makes no trade-off between these two considerations.* The five by five efficiency-quality matrix is illustrated in Figure 1. For the majority of diseases, quality and efficiency rankings are weakly correlated, and the Select Practice facilities ("High") constitute 16% (40% of 40%) of all facilities that qualify for ranking. Other ranking combinations include the following: 1) placing in the bottom two quintiles of both efficiency and quality (four poor performance "Low" cells), 2) placing in the middle three groupings (five average performance "Middle" cells), 3) placing in the six low quality-high efficiency cells ("Cheap"), and 4) placing in the six high quality-low efficiency cells ("Dear").

This hospital rating system is *disease specific* for about 60 conditions (depending on the data type) that cover virtually all cases; hence, it is not explicitly a hospital-level ranking. The working hypothesis is that a high performance hospital in a given disease has a better chance of being high performance in other related diseases in the same service line. Nevertheless, extension of this disease-specific profiling system to rate hospitals as a whole can only be accomplished by assessing the distribution of the disease counts in Figure 1 for each hospital. A high performance hospital would therefore be one with a large number of "Select Practice" diseases (upper two quintiles for both quality and efficiency). Conversely, a poor performance hospital would have a large number of "Poor Performance" diseases. The other three categories are harder to apply to the hospital as a whole, since it is possible to be "average" in a number of ways, not just by having a preponderance of disease areas in the center of the grid, but also by having a great dispersion in performance across diseases. To make a practical judgment, it may be necessary to invoke an explicit trade-off between efficiency and quality and between consistency and dispersion.

### Figure 1. Identification of Five Performance Categories Based on CareScience Select Practice™—Preliminary Recommendation for CMS Study

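Efficiency | Lower Quality | Slightly Lower Quality | Average Quality | Slightly Higher Quality | Higher Quality |
---|---|---|---|---|---|
Higher Efficiency | High Efficiency & Low Quality | High Efficiency & Low Quality | High Efficiency & Low Quality | Select Practice (High) | Select Practice (High) |
Slightly Higher Efficiency | High Efficiency & Low Quality | High Efficiency & Low Quality | Average Performance (Middle) | Select Practice (High) | Select Practice (High) |
Average Efficiency | High Efficiency & Low Quality | Average Performance (Middle) | Average Performance (Middle) | Average Performance (Middle) | High Quality & Low Efficiency |
Slightly Lower Efficiency | Poor Performance (Low) | Poor Performance (Low) | Average Performance (Middle) | High Quality & Low Efficiency | High Quality & Low Efficiency |
Lower Efficiency | Poor Performance (Low) | Poor Performance (Low) | High Quality & Low Efficiency | High Quality & Low Efficiency | High Quality & Low Efficiency |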
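As an illustration of the five-by-five grid, the following Python sketch maps a pair of quintile ranks to one of the five performance categories. It assumes the layout implied by the cell counts given in Section 6.1 (four "High", four "Low", five "Middle", six "Cheap", six "Dear") and a hypothetical numbering in which quintile 5 is the best and quintile 1 the worst on each dimension.

```python
# Rows run from the highest efficiency quintile (5) down to the lowest (1);
# columns run from the lowest quality quintile (1) up to the highest (5).
GRID = [
    ["Cheap", "Cheap", "Cheap", "High", "High"],      # efficiency quintile 5
    ["Cheap", "Cheap", "Middle", "High", "High"],     # efficiency quintile 4
    ["Cheap", "Middle", "Middle", "Middle", "Dear"],  # efficiency quintile 3
    ["Low", "Low", "Middle", "Dear", "Dear"],         # efficiency quintile 2
    ["Low", "Low", "Dear", "Dear", "Dear"],           # efficiency quintile 1
]


def performance_category(efficiency_quintile, quality_quintile):
    """Return the category label for one facility in one disease."""
    if not (1 <= efficiency_quintile <= 5 and 1 <= quality_quintile <= 5):
        raise ValueError("quintiles must be integers from 1 to 5")
    return GRID[5 - efficiency_quintile][quality_quintile - 1]


# A facility in the top efficiency quintile and fourth quality quintile.
print(performance_category(5, 4))
```

Counting the labels in `GRID` reproduces the 4/4/5/6/6 split described in the text, which is a quick sanity check on the assumed layout.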

### 6.2 Methodological Details

#### 6.2.1 Data Source

The Select Practice methodology was applied to two hospital databases: (1) State hospital association all-payer patient records and (2) MedPAR patient records from the Centers for Medicare & Medicaid Services. The first database does not cover all 50 states for two reasons: some states do not provide the data, and some charge prohibitively high prices for it. The State data obtained typically contains 15 to 20 million inpatient records from 2,000 to 2,600 facilities in 14 to 20 states. Fortunately, many states with large populations are represented, including AZ, CA, FL, MA, MD, NJ, NY, PA and TX. MedPAR data contains over 12.5 million Medicare inpatient records from almost 6,200 facilities nationwide. The number of patient records in the MedPAR data has increased yearly as the aging population continues to grow. On the other hand, the number of facilities is dropping as hospital consolidation continues.

#### 6.2.2 Risk Adjustment

The databases were first processed using the CareScience risk assessment methods described in the previous sections. Risk scores were generated for each of the four outcomes: mortality, complications, major morbidity, and length of stay. A risk score represents the expected or 'standard' outcome under typical care based on a patient's health status and other characteristics. Risk scores serve as benchmarks against which the quality and efficiency of hospital services can be evaluated across facilities, regardless of case mix. If a facility's raw scores fall below its risk scores (a negative deviation), the facility is considered a better provider than the benchmark.

#### 6.2.3 Quality Index Computation

Based on an earlier developed method from the Corporate Hospital Rating Project (CHRP), risk-adjusted adverse outcome rates for mortality, morbidity, and complications are combined into a single quality measure represented by the function:

$$Q_{kh} = 0.46\,(T_{kh})^{0.96} + 0.29\,(B_{kh})^{0.91} + 0.25\,(C_{kh})^{0.94}$$

Q, T, B, C, h, and k represent the quality index, risk-adjusted mortality, risk-adjusted major morbidity, risk-adjusted complications, facility, and disease, respectively. Hospitals then are ranked according to their quality index with smaller values of Q indicating better quality.

The quality index can be normalized with the following formula:

$$\text{Normalized Index}_{kh} = \frac{Q_{kp}}{Q_{kh}} \times 100$$

where $Q_{kp}$ represents the quality index of the population ($p$) in disease $k$. After normalization, a higher score reflects higher quality.
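The quality index and its normalization can be sketched in Python as below. The weights and exponents are taken from the formula in this section; the function names are hypothetical, and the sketch assumes the three risk-adjusted rates are supplied on a common, non-negative scale (the text does not state whether rates are percentages or proportions).

```python
def quality_index(mortality, morbidity, complications):
    """Combine three risk-adjusted adverse outcome rates into one index.

    Smaller values indicate better quality.
    """
    return (
        0.46 * mortality ** 0.96
        + 0.29 * morbidity ** 0.91
        + 0.25 * complications ** 0.94
    )


def normalized_index(q_population, q_facility):
    """Rescale so that the population scores 100 and higher means better."""
    return q_population / q_facility * 100


# Hypothetical rates for one facility and for the population in one disease.
q_facility = quality_index(4.0, 6.0, 10.0)
q_population = quality_index(5.0, 7.0, 12.0)
print(normalized_index(q_population, q_facility))
```

Because the raw index falls as quality improves, dividing the population value by the facility value inverts the scale, so a facility with lower adverse outcome rates than the population scores above 100.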

The volume of discharges per hospital varies greatly across the database. Small discharge counts may not support a statistically sound analysis, which necessitates a minimum volume cutoff. The applied criterion is that a facility must have at least 100 discharges in a given disease (defined by principal diagnosis) to qualify for ranking in State data. In MedPAR data, the threshold is cut by half, to 50, to reflect its smaller size, since it covers only the population aged 65 and older. The qualified facilities are divided into five categories based on their ranks. Because all cases from a given facility must be sorted into the same quintile, the quintile boundaries cannot be drawn exactly. Each category therefore represents approximately one fifth of the total volume and of all facilities.

#### 6.2.4 Efficiency Index Computation

Length-of-stay is used as a proxy for resource usage, based on the assumption that a hospital spends more resources on patients who stay longer in the hospital for a given disease. Since length-of-stay is usually recorded very accurately for each patient, it is an ideal measurement for a patient-level model. For each disease, facilities with cases exceeding the cutoff (100 for State data and 50 for MedPAR data) are ranked according to the function,

$$RL_{kh} = \exp\{(\log \mathrm{LOS})_{kh} - (\log \mathrm{LOSrisk})_{kh}\}$$

where RL, k, and h represent ratio, disease, and facility, respectively. Lower ratios denote greater efficiency. The efficiency index can also be normalized according to the same formula that is used for the quality index. Facilities are then divided into five categories based on the same criteria used for the quality index.
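A minimal Python sketch of the length-of-stay ratio follows. It assumes that the two log terms in the formula are facility-level means of the case-level log values for one disease, which the text does not spell out; the function name and the sample values are hypothetical.

```python
import math


def los_ratio(actual_los, los_risk):
    """Efficiency ratio for one facility and disease.

    Computed as exp(mean log actual LOS - mean log LOS risk); values below 1
    mean shorter stays than expected. Averaging the logs over cases is an
    assumption of this sketch.
    """
    if len(actual_los) != len(los_risk) or not actual_los:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_log_actual = sum(math.log(v) for v in actual_los) / len(actual_los)
    mean_log_risk = sum(math.log(v) for v in los_risk) / len(los_risk)
    return math.exp(mean_log_actual - mean_log_risk)


# Two hypothetical cases whose stays are half the expected length.
print(los_ratio([2.0, 2.0], [4.0, 4.0]))
```

Under this reading the ratio equals the geometric mean of actual stays divided by the geometric mean of expected stays.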

#### 6.2.5 Cross Tabulation of Quality and Efficiency Index

Neither the quality index nor length-of-stay can alone determine the Select Practice hospitals. Our study shows that hospital rankings for quality and length-of-stay are largely independent for the majority of diseases. In other words, a hospital has roughly the same probability of falling into one of the five quality index categories regardless of its risk-adjusted length-of-stay, and vice versa. In a 5X5 cross tabulation of the quality and efficiency indices, hospitals are relatively evenly distributed in each of the 25 cells.

A Select Practice hospital is expected to deliver high quality healthcare in an efficient manner. For each disease, Select Practice hospitals are identified by choosing facilities that fall into the top 2X2 cells in the 5X5 cross-tabulation matrix. Select Practice hospitals represent roughly 16% (4/25) of cases and facilities for each disease. To prevent small samples from diluting statistical power, 200 is set as the minimum number of facilities needed for ranking. 16% out of 200 is 32. We believe that this cutoff is the minimum number of Select Practice hospitals required to keep Select Practice statistically meaningful.

No matrix is constructed if there are fewer than 200 qualifying facilities for a given disease. Diseases with low volumes are rolled up into one of the 18 major diagnosis groups (Broad Diagnosis Group), defined by the ICD9-CM classification system. For example, ICD9 codes 001 to 139 are rolled up into BDG 1 (infectious and parasitic diseases) with the exception of 038, which has sufficient volume to stand alone. All BDGs are then processed in the same manner, and a list of Select Practice hospitals is generated for each of the Broad Diagnosis Groups.

### 6.3 Scaling Factors

#### 6.3.1 Scaling Factor Calculation

After the Select Practice hospitals are identified, their performance is measured as the extent to which their performance differs from the overall level. For mortality, morbidity and complications, the overall performance level for both the Select Practice hospitals and the entire population of hospitals is captured by their case-weighted arithmetic means. By comparing the risk-adjusted outcome of the Select Practice hospitals (*Select Practice case-weighted mean deviation + Population case-weighted mean raw rate*) to the population's overall case-weighted mean raw rate, a ratio is obtained for each of the three clinical outcomes for each disease. For LOS, the ratio is simply the case-weighted RL_{kh} of the Select Practice hospitals. These ratios are called 'Scaling Factors.' They are applied to 'scale' down standard risk to Select Practice risk. Scaling factors are numbers between zero and one, and they usually fall into the range between 0.75 and 0.95. The smaller the ratio is, the greater the difference is between the performance of the Select Practice hospitals and the overall hospitals.

Scaling factors are also calculated for Cost although they are not used in Select Practice ranking. The ratio of actual cost and cost risk is first calculated for all hospitals by disease, using the following function:

$$R\$_{kh} = \exp\{(\log \mathrm{Cost})_{kh} - (\log \mathrm{Costrisk})_{kh}\}$$

where R$, k and h represent ratio, disease and facility, respectively.

The cost scaling factor is then calculated as the case-weighted $R\$_{kh}$ of the Select Practice hospitals.

#### 6.3.2 Scaling Factor Range

Mathematically, the scaling factors for mortality, morbidity and complication may be greater than one, because the quality index that combines them allows trade-offs among them.

Exceptionally strong performance in one or two outcomes may compensate for poor performance in the others. Therefore, a Select Practice hospital may theoretically perform worse than the overall population in one or two outcomes. In reality, Select Practice hospitals often have balanced performance in all three clinical outcomes. Even if a few Select Practice hospitals perform badly in one or two outcomes, the group volume (at least 32 Select Practice hospitals) reduces the outlier effect, which largely guarantees that the scaling factors of the clinical outcomes fall within the reasonable range of 0.75 to 0.95.

In a few disease groups, the mortality rate is very low (e.g., ICD9-Diag 303 - alcohol dependence). For these disease groups, complications become the dominant factor in the quality index. Because mortality is hardly relevant for these rankings, the scaling factor may actually exceed 1.0, and consequently, it is capped at 1.0. At this level, Select Practice hospitals are on par with all other hospitals.

For efficiency, LOS alone determines hospital ranking. No hospital with worse performance than the population's overall performance level can rank in the top two quintiles. Therefore the LOS scaling factors are always below 'one.' Since cost is highly correlated with LOS, the cost scaling factors often trail LOS. This, however, does not mathematically guarantee that the cost scaling factors are within the reasonable range, and consequently they are capped at 'one.'
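The scaling factor calculation for a clinical outcome, including the cap at 1.0, can be sketched in Python as follows. The sketch follows the description in Section 6.3.1 (Select Practice case-weighted mean deviation plus population case-weighted mean raw rate, divided by the population case-weighted mean raw rate); the function names and the per-hospital inputs are hypothetical, and the authoritative formulas are those in Appendix C.

```python
def case_weighted_mean(values, cases):
    """Arithmetic mean of per-hospital values weighted by case counts."""
    total_cases = sum(cases)
    if len(values) != len(cases) or total_cases <= 0:
        raise ValueError("values and cases must align and cases must be positive")
    return sum(v * c for v, c in zip(values, cases)) / total_cases


def clinical_scaling_factor(sp_deviations, sp_cases, pop_raw_rates, pop_cases, cap=1.0):
    """Scaling factor for mortality, morbidity, or complications in one disease."""
    sp_mean_deviation = case_weighted_mean(sp_deviations, sp_cases)
    pop_mean_raw = case_weighted_mean(pop_raw_rates, pop_cases)
    factor = (sp_mean_deviation + pop_mean_raw) / pop_mean_raw
    # Factors above the cap would imply Select Practice hospitals do worse
    # than the population, so they are capped.
    return min(factor, cap)


# Hypothetical data: two Select Practice hospitals, three hospitals overall.
print(clinical_scaling_factor([-1.0, -0.5], [100, 100], [8.0, 8.0, 8.0], [100, 100, 200]))
```

For LOS and cost, the text instead uses the case-weighted ratio of the Select Practice hospitals directly, which `case_weighted_mean` could compute from per-hospital ratios and case counts.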

#### 6.3.3 Scaling Factor Implementation

The detailed calculation of scaling factors and how to apply them are described in Appendix C (*Select Practice Formulas*). The following is a simplified example of how to apply the scaling factors and interpret Select Practice risk. Suppose the mortality rate for AMI patients in a given hospital is 7.5% while the standard mortality risk is 8.0%. The deviation is -0.5%. In other words, the hospital has saved slightly more patient lives than a typical hospital of average performance would have, given the same case mix. For AMI, the mortality scaling factor is 0.89. The product of the standard mortality risk and the scaling factor is 7.1%, which is called the Select Practice risk. The Select Practice risk is the predicted outcome for a Select Practice hospital given the case mix. Compared to the Select Practice risk, the hospital's deviation now becomes 0.4%, which indicates that the hospital has saved fewer patient lives than a typical hospital in the Select Practice group with the given case mix.
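The arithmetic of the AMI example can be reproduced with a short Python sketch. The function name and dictionary keys are hypothetical; the inputs are the figures quoted in the example.

```python
def select_practice_comparison(raw_rate, standard_risk, scaling_factor):
    """Compare a raw rate with the standard and Select Practice benchmarks."""
    select_practice_risk = standard_risk * scaling_factor
    return {
        "standard_deviation": raw_rate - standard_risk,
        "select_practice_risk": select_practice_risk,
        "select_practice_deviation": raw_rate - select_practice_risk,
    }


# AMI example: raw mortality 7.5%, standard risk 8.0%, scaling factor 0.89.
result = select_practice_comparison(7.5, 8.0, 0.89)
for name, value in result.items():
    print(name, round(value, 1))
```

The unrounded Select Practice risk is 7.12%, so the unrounded deviation is 0.38%, which rounds to the 0.4% quoted in the text.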

Scaling factors are created and updated by Research using the SAS language. Because State data differs from MedPAR data, the SAS program differs accordingly. The Select Practice programs cannot be executed automatically: the name and location of the database have to be changed every time a new database is processed, and some table and column names may also change as the Data Manager program evolves. The first half of the Select Practice program runs by 3-digit principal diagnosis. After the major 3-digit diagnoses are identified and processed, the second half of the program begins processing the broad diagnosis groups (BDGs). Because the volume of each diagnosis may vary yearly, the list of major diagnoses is subject to slight changes. It is therefore necessary to manually code the rolling up of minor diagnoses.

Scaling factors from the latest State data are applied to the production data. Before they are handed to DAU, they must be saved in a table in which 3-digit principal diagnosis is the primary key. Minor diagnoses that are rolled up in the same BDG share the same scaling factors. The actual application of scaling factors occurs in the front end report developed by Software Engineering and monitored by Product Management.

### 6.4 Other Implementation of Select Practice Method

Since its debut in 2002, the Select Practice method has been widely accepted for hospital ranking. It has been used as a powerful marketing tool by the Sales team. Based on MedPAR data, the CareScience Select Practice Hospital List was formally announced in 2005 and is updated annually. The Select Practice methodology has also been used in multiple consulting and research projects for care providers, regulatory agencies, and academia. Depending on the purpose of these activities, the Select Practice method is continuously updated, and the SAS programs are continuously reshaped to accommodate new requirements.

**References:**

^{8} For the sake of consistency, this description carries the full range of subscripts, since outcome measures are calculated by principal diagnosis for each patient and each outcome and then aggregated to the desired group level.

^{9} Note that we aggregate individual standard errors that include the random error in the population. Without the 1 under the square root of the standard error, the remaining expression, $s_{kl}\sqrt{x_{ijkl}'\,(X_{kl}'X_{kl})^{-1}\,x_{ijkl}}$, represents only the sampling error around the fit of the regression line, which is generally much smaller.

^{10} In general, $n_{jl} = n_{j}$ except where certain cases are excluded for a given outcome. For example, if a physician treats a certain number of cases in a period in which some of those cases were not in a diagnosis that observed at least one death, the total number of cases, $n_{j}$, will be different for that provider's LOS risk than for his mortality risk. That said, using the new method developed to fill null values with the mean risk ensures that $n_{jl}$ will always equal $n_{j}$ whenever observations are complete.

^{11} MDC = Major Diagnosis Category; DRG = Diagnosis Related Group; CTC = Common Treatment Category

^{12} Refer to N:\Analytix\develop\methodologies\Lntrans\logreq2.doc for information on column requirements.

^{13} Pauly MV, Brailer DJ, and Kroch EA, "The Corporate Hospital Rating Project: Measuring Hospital Outcomes from a Buyers Perspective," *American Journal of Medical Quality* 11(3):112-122.
