
This information is for reference purposes only. It was current when produced and may now be outdated. Archive material is no longer maintained, and some links may not work. Persons with disabilities having difficulty accessing this information should contact us at: Let us know the nature of the problem, the Web address of what you want, and your contact information.


CareScience Risk Assessment Model - Hospital Performance Measurement

Presentations from a November 2008 meeting to discuss issues related to mortality measures.

Appendix B—Semilog Modeling

Certain outcome measures, notably costs and length-of-stay (LOS), are distributed with a rightward (positive) skew, as depicted below in Figure 1(a). Applying linear regression to models with skewed dependent variables gives rise to a number of pathologies, including inefficient, often biased, parameter estimates and predictions outside logical bounds, such as negative values for LOS and costs. When outcome measures are not symmetrically distributed, analysis of performance can be disproportionately influenced by outliers and special or extreme cases. This phenomenon can require a manual procedure for identifying and removing outliers, a subjective technique at best.

A more robust solution is to take the natural log of the dependent variable, which results in an approximately symmetric distribution and contracts the outliers inward toward the center of the data, as shown in Figure 1(b). It also ensures that all predicted values will be positive. (No matter how negative the log value is, taking the anti-log to restore the values will guarantee that they are positive.)
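The positivity guarantee can be checked with a minimal sketch. The function name `back_transform` and the log-scale predictions below are hypothetical, chosen only for illustration.

```python
import math

def back_transform(log_prediction: float) -> float:
    """Convert a prediction made on the log scale back to original units."""
    return math.exp(log_prediction)

# Hypothetical log-scale predictions, including a strongly negative one.
log_predictions = [-3.0, 0.0, 6.95, 10.5]

# exp() maps every real number to a strictly positive value, so no
# back-transformed prediction can be a negative cost or length of stay.
level_predictions = [back_transform(v) for v in log_predictions]
```

However negative the fitted log value, the back-transformed prediction stays above zero, which is the property the paragraph above relies on.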

We conducted a systematic review of non-adverse outcome measures (LOS, charges, and costs) by three-digit ICD-9 code to monitor the positive skew and measure its magnitude. In symmetric distributions, two measures of central tendency, the geometric mean and the arithmetic mean (see below), are approximately equal. As the skew increases in unimodal distributions, the ratio of the arithmetic mean to the geometric mean grows from unity.

To illustrate skew: total cost is skewed right, but its natural log, ln(cost), is approximately symmetrically distributed. Using linear regression to forecast ln(cost) will therefore result in much better estimates with smaller error.

Figures 1(a) and 1(b) are line graphs of unimodal distributions: the right-skewed distribution of total cost and the approximately symmetric distribution of ln(cost). The graphs are described in the paragraph above.

A numeric illustration:
Depicted below is the total cost frequency distribution for a sample of 200 hospital discharges. It displays the characteristic positive skew (skew coefficient = 2.6).

Figure 2(a) Total cost for 200 discharges

Bar chart showing the total cost frequency distribution for a sample of 200 hospital discharges. The cost interval is shown from <$300 up to $39,000-42,000. The y axis shows the frequency. The distribution is skewed towards the lower end.

Figure 2(b) Log of cost for 200 discharges

Bar chart showing the frequency distribution of the log of total cost for a sample of 200 hospital discharges. The interval bounds (logs) run from < 5 up to > 9. The y axis shows the frequency. The distribution is concentrated around the middle.

Geometric vs. arithmetic means:

The arithmetic mean is the simple average, computed by adding up all values (xi) in the sample and dividing by the number of such values (n):

Arithmetic mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

The geometric mean follows the same principle, but the values are multiplied together rather than added, and the nth root of the product is taken rather than dividing by n:

Geometric mean: $\sqrt[n]{\prod_{i=1}^{n} x_i}$

An equivalent way to compute the geometric mean is to take advantage of natural logarithms. Defining y as the natural log of x [y = ln(x)], the geometric mean is the anti-log (exp) of the arithmetic mean of y:

Geometric mean: $\exp(\bar{y})$, where $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$

Because the geometric mean is based on log values and the log transformation tends to draw extreme values toward the center of the data, the geometric mean is more "robust" than the arithmetic mean. "Robust" here means less influenced by outliers.
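The two equivalent definitions of the geometric mean, and its lower sensitivity to outliers, can be illustrated with a short sketch. The function names and the sample values are assumptions made for this example and do not come from the CareScience data.

```python
import math

def arithmetic_mean(values):
    """Sum of the values divided by their count."""
    return sum(values) / len(values)

def geometric_mean_product(values):
    """nth root of the product of the values."""
    return math.prod(values) ** (1.0 / len(values))

def geometric_mean_logs(values):
    """Anti-log of the arithmetic mean of the natural logs."""
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)

# Hypothetical positive values; the second list adds one extreme value.
sample = [2.0, 8.0, 4.0]
with_outlier = sample + [1000.0]

# How much each mean is inflated by the single extreme value.
arithmetic_inflation = arithmetic_mean(with_outlier) / arithmetic_mean(sample)
geometric_inflation = geometric_mean_logs(with_outlier) / geometric_mean_logs(sample)
```

Both geometric-mean functions agree up to floating-point error, and the outlier inflates the arithmetic mean far more than the geometric mean. The log form is usually preferable in practice because the product of many large values can overflow.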

Back to the cost example from 200 hospital discharges:

Transforming cost from Fig. 2(a) by taking the natural log gives the frequency distribution in Fig. 2(b), which exhibits the typical symmetric bell shape of the normal distribution. The arithmetic mean cost is marked on the first (skewed) frequency histogram, which in this illustration is $1670. The mean of the log(cost) is marked on the second histogram at 6.95. Taking the anti-log of this value yields the geometric mean equal to $1043, which is much closer to the mode of the original (untransformed) histogram. The pronounced positive skew in the original cost distribution guarantees that the arithmetic mean is much larger than the geometric mean, which tends to pull back the extreme values in the upper tail. In this illustration the ratio of the arithmetic mean to the geometric mean is $1670/$1043 = 1.60.
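The arithmetic in this illustration can be reproduced directly from the two reported summary values. The variable names are hypothetical; only the figures 6.95 and $1670 come from the text.

```python
import math

mean_log_cost = 6.95            # mean of ln(cost), as reported above
arithmetic_mean_cost = 1670.0   # arithmetic mean cost in dollars, as reported above

# The geometric mean is the anti-log of the mean log cost.
geometric_mean_cost = math.exp(mean_log_cost)

# Ratio of arithmetic to geometric mean, used in the text as a gauge of skew.
ratio = arithmetic_mean_cost / geometric_mean_cost

print(f"geometric mean = {geometric_mean_cost:.0f}, ratio = {ratio:.2f}")
```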

raw arithmetic mean: $\bar{x}_{jl} = \frac{1}{n_{jl}} \sum_{i,k}^{n_{jl}} x_{ijkl}$
raw geometric mean: $\exp(\bar{y}_{jl})$, where $\bar{y}_{jl} = \frac{1}{n_{jl}} \sum_{i,k}^{n_{jl}} y_{ijkl}$
risk value: $\hat{x}_{ijkl} = \exp(\hat{y}_{ijkl})$ for all complete cases (including zeros); $\hat{x}_{ijkl} = \exp(\bar{y}_{jl})$ for all incomplete cases
arithmetic mean risk: $\bar{\hat{x}}_{jl} = \frac{1}{n_{jl}} \sum_{i,k}^{n_{jl}} \hat{x}_{ijkl}$
geometric mean risk: $\exp(\bar{\hat{y}}_{jl})$, where $\bar{\hat{y}}_{jl} = \frac{1}{n_{jl}} \sum_{i,k}^{n_{jl}} \hat{y}_{ijkl}$
where $x_{ijkl}$ = patient.total_charges, patient.comparative_costs, and patient.length_of_stay
$y_{ijkl} = \ln(x_{ijkl})$
and $\hat{y}_{ijkl}$ = ln(total_charges) risk, ln(comparative_cost) risk, and ln(length of stay) risk
i = patient (each row in the patient table)
j = provider or grouping
k = icd9 diagnosis (3 digit)
l = outcome (length of stay, charges, cost)
n = all observations including zeros
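The grouped raw means defined above can be sketched as follows. The row layout, provider labels, and cost values are hypothetical, and the sketch assumes every value is strictly positive so that the logarithm is defined. Zero values are handled separately through the '99' placeholder described in the modeling requirements.

```python
import math
from collections import defaultdict

# Hypothetical rows: (provider j, ICD-9 diagnosis k, outcome l, value x).
rows = [
    ("A", "428", "cost", 1000.0),
    ("A", "428", "cost", 4000.0),
    ("A", "486", "cost", 2000.0),
    ("B", "428", "cost", 500.0),
    ("B", "486", "cost", 8000.0),
]

def group_means(rows):
    """Raw arithmetic and geometric means per (provider, outcome) pair."""
    groups = defaultdict(list)
    for provider, _diagnosis, outcome, x in rows:
        # Aggregate over patients i and diagnoses k within each (j, l) pair.
        groups[(provider, outcome)].append(x)
    result = {}
    for key, xs in groups.items():
        ys = [math.log(x) for x in xs]  # y = ln(x); assumes x > 0
        result[key] = {
            "arithmetic": sum(xs) / len(xs),
            "geometric": math.exp(sum(ys) / len(ys)),
        }
    return result

means = group_means(rows)
```

In this toy data the two providers have very different arithmetic means but the same geometric mean, which shows how much a single large value moves the arithmetic mean.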

Modeling Requirements:

  1. Create additional columns in the patient/episode table to hold ln_x, ln_x_risk, x_risk (equal to e^(ln_x_risk)), and ln_x_stderr, where x represents each dependent variable.
  2. Populate the ln_x columns with ln(total_charge), ln(comparative_costs), and ln(ccms_length_of_stay), respectively. The ln values are populated with '99' when costs and charges are zero (see footnote 20).
  3. Regress ln(total_charge), ln(comparative_costs), and ln(length of stay) on the original vector of independent variables. Cases with a null value or a '99' for the dependent variable, as well as incomplete cases, are not included in the regression. Nevertheless, fitted values (risks) and their standard errors are generated for all complete cases (see footnote 21). We use n to designate the number of complete observations, including those with null or '99' dependent values; m indicates the number of observations included in the regression (excluding incomplete cases and those with null or '99' dependent values). To illustrate, suppose a given model stratum has 100 observations, of which 95 are complete, and of these 95, ten have cost equal to zero (and so are given a value of '99' in the log column). Then n = 95 and m = 85. The regression is run on the m = 85 cases, and fitted values (risks) together with their standard errors are generated for the n = 95 cases.
  4. The back-end values are left in log form and antilogs are applied only after aggregation on the front end.
  5. The front-end software application then performs the appropriate calculation (sum, average, etc.) on the log values to display the raw, standardized, and deviation results in the reports. (The calculations that are relevant to this conversion are found above.)
  6. Deviations are based on geometric means:
    Geometric Deviation = $\exp(\bar{y}_{jl}) - \exp(\bar{\hat{y}}_{jl})$
  7. The front-end software calculates the p-value to determine significance with all measures in logs (i.e., without converting raw values, risks, or standard errors to the original units). The calculation will not change from what is currently in use but will be based on the m nonzero cases.
    $T^{p}_{jl} = \frac{\bar{y}_{jl} - \bar{\hat{y}}_{jl}}{\sqrt{v(\bar{\hat{y}}_{jl})}}$
  8. The deviations column on all CareScience Quality Manager reports must equal the raw value minus the standardized value, up to rounding error in the first decimal place; that is, the reported deviation may differ from that difference by no more than 0.1.
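The distinction between n and m in requirement 3 can be sketched with a toy single-covariate regression. Everything here is an assumption made for illustration: the `age` covariate, the case values, and the hand-written least-squares fit stand in for the full vector of independent variables and the production regression. Standard errors are omitted for brevity.

```python
import math

SENTINEL = 99.0  # placeholder for an undefined ln(0), as described in requirement 2

def log_or_sentinel(x):
    """Return ln(x), the sentinel for zero, or None for a null value."""
    if x is None:
        return None
    return math.log(x) if x > 0 else SENTINEL

def fit_simple_ols(xs, ys):
    """Ordinary least squares for one covariate; returns (intercept, slope)."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical cases: (age covariate, or None if missing; cost in dollars).
cases = [(40, 800.0), (55, 1500.0), (70, 3000.0), (65, 0.0), (None, 1200.0), (50, 1100.0)]

# Complete cases have every independent variable present.
complete = [(age, log_or_sentinel(cost)) for age, cost in cases if age is not None]
# Only complete cases with a defined log outcome enter the regression.
usable = [(age, y) for age, y in complete if y is not None and y != SENTINEL]
n, m = len(complete), len(usable)

intercept, slope = fit_simple_ols([a for a, _ in usable], [y for _, y in usable])

# Fitted log values (risks) are produced for all n complete cases,
# including the zero-cost case that was excluded from the fit.
ln_risk = [intercept + slope * age for age, _ in complete]
```

Here the case with a missing covariate is dropped entirely, the zero-cost case is excluded from the fit but still receives a fitted value, and so n = 5 while m = 4.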

Implementation Comments:

Implementation is a combination of front-end and back-end changes. The database must hold logarithmic values—ln(total_charges), ln(comparative_costs), and ln(length of stay)—and standard errors in log form. All computations of confidence intervals and significance are in logs, including necessary aggregations. All computations on risk values are done before conversion back to "levels" (in log units), hence excluding cases with zero values in the raw data. This approach to aggregation generates geometric (not arithmetic) means. Moreover, the log transformation method guarantees that expected level values (after taking the antilog) be positive, which eliminates the need for front-end data trimming.

Comparative Costs as an example:

  1. Assignments:
    a = exp(avg(ln_comparative_costs))
    b = exp(avg(ln_comp_cost_risk))
    c = sum(decode(ln_comparative_costs,null,0,1))
    d = sqrt[ sum(ln_comp_cost_risk_stderr ^ 2) ]
    k = avg(ln_comparative_costs) - avg(ln_comp_cost_risk)
  2. Computations:
    Cost deviation = a - b
    Cost sig flags: t-value = k * c/d (with degrees of freedom: c - 1)
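The assignments and computations above can be mirrored in a short sketch. The function name, the row layout, and the sample values are hypothetical. The sketch also assumes that '99' placeholders have already been converted to None and that every aggregate is restricted to rows with a defined log cost, which is one reading of the statement that the calculation is based on the m nonzero cases.

```python
import math

def cost_deviation_and_t(rows):
    """rows: (ln_comparative_costs or None, ln_comp_cost_risk, ln_comp_cost_risk_stderr)."""
    # Assumption: keep only rows whose log cost is defined.
    used = [(y, risk, se) for y, risk, se in rows if y is not None]
    c = len(used)                                   # count of non-null log costs
    avg_ln_cost = sum(y for y, _, _ in used) / c
    avg_ln_risk = sum(risk for _, risk, _ in used) / c
    a = math.exp(avg_ln_cost)                       # raw geometric mean
    b = math.exp(avg_ln_risk)                       # geometric mean risk
    d = math.sqrt(sum(se ** 2 for _, _, se in used))
    k = avg_ln_cost - avg_ln_risk
    return {
        "deviation": a - b,
        "t_value": k * c / d,
        "degrees_of_freedom": c - 1,
    }

# Hypothetical stratum with one undefined log cost.
rows = [(7.0, 6.8, 0.1), (6.5, 6.6, 0.1), (None, 6.9, 0.1), (7.5, 7.2, 0.1)]
result = cost_deviation_and_t(rows)
```

The factor c/d arises because the standard error of the mean fitted value is d/c when the individual standard errors are treated as independent, so k divided by d/c equals k * c/d.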

Addendum on LOS

Within the CareScience database, patients discharged on the same day as they were admitted are assigned length-of-stay = 1, not 0, which conforms to most billing practices. LOS is defined as the number of days present, not including the day of discharge, with a minimum LOS of 1. This rule eliminates the possibility of an undefined value of ln(length_of_stay), which would arise if LOS = 0.
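A minimal sketch of this LOS rule follows. The function name and the example dates are hypothetical, and rejecting a discharge date earlier than the admission date is an added assumption rather than something the source specifies.

```python
import math
from datetime import date

def length_of_stay(admit: date, discharge: date) -> int:
    """Days present, excluding the day of discharge, with a minimum of 1."""
    if discharge < admit:
        raise ValueError("discharge date precedes admission date")
    return max((discharge - admit).days, 1)

same_day = length_of_stay(date(2008, 3, 1), date(2008, 3, 1))
three_nights = length_of_stay(date(2008, 3, 1), date(2008, 3, 4))

# Because LOS is at least 1, its natural log is always defined and non-negative.
ln_same_day = math.log(same_day)
```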

Footnote 20: The value 99 is a placeholder used by Data Manager to identify observations that should be excluded from the regression because the dependent variable is undefined (ln of 0 is undefined).
Footnote 21: Complete cases are defined as having values for all independent variables required for the regression.



Page last reviewed March 2009
Internet Citation: CareScience Risk Assessment Model - Hospital Performance Measurement. March 2009. Agency for Healthcare Research and Quality, Rockville, MD.

