Analysing indicators of performance, satisfaction, or safety using empirical logit transformation

Sarah Stevens, Jose M Valderas, Tim Doran, Rafael Perera, Evangelos Kontopantelis

Nuffield Department of Primary Care Health Sciences, University of Oxford, Oxford, UK
National Institute for Health Research School for Primary Care Research, Oxford, UK
APEx Collaboration for Academic Primary Care, Institute for Health Services Research, University of Exeter Medical School, University of Exeter, Exeter, UK
Department of Health Sciences, University of York, UK
Centre for Health Informatics, Institute of Population Health, University of Manchester, Manchester M13 9GB, UK

Correspondence to: E Kontopantelis e.kontopantelis@manchester.ac.uk
Cite this as: BMJ 2016;352:i1114
http://dx.doi.org/10.1136/bmj.i1114
Accepted: January 2016

Performance, satisfaction, and safety indicators are commonly measured on a percentage scale. Such indicators are often subject to ceiling or floor effects and performance may be inherently non-linear. For example, improving from 85% to 95% might be more difficult and need more effort than improving from 55% to 65%. As such, analysis of these indicators is not always straightforward and standard linear analysis could be problematic. We present the most common approach to dealing with this problem: a logit transformation of the score, following which standard linear analysis can be conducted on the transformed score. We also demonstrate how estimates can be back-transformed to percentages for easier communication of findings. In this paper, we discuss the benefits of this method, use algebra to describe the relevant steps in the transformation process, provide guidance on interpretation, and provide a tool for analysis.

In recent years, efforts to improve the quality and safety of healthcare have resulted in the introduction of systems for monitoring the performance of healthcare providers and the satisfaction and safety of patients.
New quality and performance indicators have been created, to which financial and reputational rewards for providers are often attached. Although performance indicators are measured for each patient, they are often only reported in aggregate form (eg, at the practice or hospital level). Therefore, an indicator that begins as a binary outcome (that is, the target is either met or not met for each patient) becomes a proportion (that is, the percentage of patients for whom the quality target is met).

Summary box
Performance, satisfaction, or safety indicators in healthcare are commonly measured on a percentage scale
Standard linear analysis could be problematic owing to ceiling or floor effects or non-linearity
A logit transformation of the score is the most common solution
Estimates can be back-transformed to percentages for a more intuitive interpretation

Such summary indicators are usually analysed by linear models. This is appropriate in many scenarios where the scores retain linear properties, for example, in the analysis of referral rates and their predictors. However, for aggregate analyses of performance indicators, two particular problems can emerge. Firstly, it is common for individual indicators within a set to vary in intrinsic difficulty (eg, recording blood pressure is easier than controlling blood pressure) or in the size of associated incentives. This frequently results in healthcare providers achieving targets for 100% of patients for easier indicators and, less often, for 0% of patients for more difficult indicators or for indicators with smaller incentives. Maximum (100%) and minimum (0%) scores are more common when patient groups are small (the problem of small denominators). These ceiling and floor effects can cause problems in analyses of data at the patient level, but also make the use of aggregate performance scores in linear models problematic.
This is a particular problem for prediction modelling (eg, in interrupted time series designs) where predictions might fall outside the 0-100% range. Secondly, there is inherent non-linearity in performance indicators, because the effort required by a health worker is not uniform across patients. For example, some patients might attend clinic appointments infrequently while others might be persistent in refusing a measurement or treatment. Similarly, satisfaction is subjective and different levels of effort are needed to satisfy different patients, whereas in terms of safety, risk management is inexact and some patients might be more difficult to manage clinically. Therefore, it is generally more difficult to achieve an improvement from 85% to 95% than from 55% to 65%. Analogously, an improvement from 0% to 0% should pose very little difficulty. Box 1 presents some examples of performance, satisfaction, and safety[5] indicators.

Box 1: Examples of ceiling or floor effects

Quality and Outcomes Framework performance indicator DM (2006-07 to 0-)
Measured the percentage of patients with diabetes who have a record of estimated glomerular filtration rate or serum creatinine testing in the previous 15 months. In 2006-07 and with 865 general practices reporting the indicator, one practice had a score of 0% and 700 a score of 100%. The mean score was 96.%.

GP patient satisfaction survey, from 807 general practices in 2008
Survey on how easy it was for patients to get through on the phone at own doctor's surgery (no v yes): 8 practices scored 100%, when the mean score was 87.%. Survey also asked about the ability for patients to get an appointment within two days (no v yes): 9 practices scored 100%, when the mean score was 85.7%.

Investigation of prescribing safety using the Clinical Practice Research Datalink, in 0 [5]
Investigation looked at the proportion of women with a breast cancer diagnosis who were prescribed oral or transdermal oestrogens: 9 (6%) of 5 practices had a prevalence of 0%, when the mean prevalence was .%. For patients prescribed repeated amiodarone without a thyroid function test within the recommended time period, (9%) of 505 practices had a prevalence of 100%, when the mean prevalence was %.

One potential solution to these issues is for researchers to dichotomise the indicator by classifying healthcare providers simply as high or low achievers (in terms of performance, satisfaction, or safety), based on a specified threshold of achievement. For example, assume that a healthcare provider has met a target (=1) if the relevant performance score is over 85%, and not met the target (=0) otherwise. Analyses are then possible by use of logistic models, and odds ratios would be used to quantify effects. However, odds ratios are intuitively difficult to conceptualise and are frequently interpreted as relative risks. Although such an approach could be acceptable in a scenario where few providers are low scoring (where the rare event approximation stands), more generally odds ratios overestimate the relative risk and their interpretation tends to be flawed.[6] In addition, such a simplification discards a considerable amount of information and reduces statistical power, while the choice of the threshold value used to dichotomise performance may be arbitrary.

A more suitable approach is based on a transformation that uses the logit function to linearise a performance, satisfaction, or safety indicator.
This transformation effectively takes a scale that ranges from 0 to 1 (or 0% to 100%) and expands it so that it ranges from minus (−) infinity to plus (+) infinity. The indicator can then be analysed by standard linear models or similar methods, in a frequentist or Bayesian framework. Given the particular challenges associated with this approach, in this paper we provide guidance for researchers in conducting and interpreting analyses of logit transformed performance scores of quality indicators.

For practical examples, we draw on our experience of analysing family practice performance under the United Kingdom's Quality and Outcomes Framework (QOF).[7-9] The framework is a financial incentive scheme introduced by the UK government in 2004, which rewards practices on the basis of their performance on more than 100 quality indicators related to the clinical management of chronic disease, practice organisation, and patient experience.[10] For the clinical indicators, which are regularly reviewed and could be withdrawn, practices are assessed on the basis of the percentage of eligible patients for whom each target is met (eg, the percentage of patients with coronary heart disease who have a blood pressure recording of ≤150/90 mm Hg). The main research questions in relation to performance on the QOF indicators relate to how practice performance varies between practices with different characteristics and patient profiles, and how performance changes over time.

We have focused on performance indicators in primary care to provide practical examples and place the methods into context. However, the methods are relevant for the analysis of any percentage score that aggregates binary or continuous information from a lower level unit (eg, patient) to a higher level unit (eg, general practice population) and where non-linearity is present. Examples of satisfaction and safety include the percentage of people who are happy with access to their preferred general practitioner or who are prescribed a drug that puts them at risk.
Approach

Logit transformation

The first step is to assemble practice scores (in the case of the QOF, the proportion of eligible patients for whom a given target has been achieved) and model these scores on the logit (log-odds) scale. There are two main options to do this and to achieve an expansion in the scale from [0, 1] to (−∞, +∞): simple and empirical transformation. In the simple logit transformation, the score p (0 ≤ p ≤ 1) is transformed into a log odds: $\operatorname{logit}(p) = \ln\left(p/(1-p)\right)$ (fig 1). For example, a difference in the untransformed score (p) from 0.97 to 0.98 (that is, 97% to 98%) represents the same difference in transformed score (about 0.4 on the logit scale) as a difference from 0.55 to 0.65 (that is, 55% to 65%). The transformed score can then be modelled with standard linear models, and because predictions are made on the logit scale, back-transformed predicted achievement scores always lie between 0% and 100%.

The main drawback of the simple logit transformation is that achievement scores of 0% and 100% become minus and plus infinity, respectively, following transformation. As a consequence, these observations will be interpreted as missing values by statistical software packages and removed from analyses. If there are a large number of scores at the ceiling or floor values, this effectively renders simple logit transformation ineffective.

The empirical logit transformation offers an improvement over the simple logit transformation at the ceiling and floor points by making a separate transformation at these values. For scores where p is strictly greater than 0 and less than 1, the simple logit transformation is applied as above. For scores where p is equal to 0 or 1, the empirical logit transformation is given by the formula below, where n is the number of observations over which p is calculated:[12-15]

$$\operatorname{logit}(p) = \ln\left(\frac{p + 0.5/n}{1 - p + 0.5/n}\right)$$

In the QOF setting, n would be the number of patients for which an indicator is evaluated (the denominator for the indicator), for example, the number of patients diagnosed with diabetes.
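The two transformations described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' own tool: the function names `simple_logit` and `empirical_logit` are hypothetical, and the denominator-adjusted formula is applied only when p is exactly 0 or 1, as described in the text.

```python
import math


def simple_logit(p: float) -> float:
    """Log odds of a proportion p; defined only for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("simple logit is undefined at p = 0 or p = 1")
    return math.log(p / (1.0 - p))


def empirical_logit(p: float, n: int) -> float:
    """Simple logit for 0 < p < 1; denominator-adjusted form at p = 0 or 1.

    n is the number of observations (eg, eligible patients) behind p.
    """
    if n <= 0:
        raise ValueError("n must be a positive count")
    if 0.0 < p < 1.0:
        return simple_logit(p)
    if p in (0.0, 1.0):
        return math.log((p + 0.5 / n) / (1.0 - p + 0.5 / n))
    raise ValueError("p must lie in [0, 1]")


if __name__ == "__main__":
    # Two 'unequal' improvements that are of similar size on the logit scale
    print(round(simple_logit(0.98) - simple_logit(0.97), 2))
    print(round(simple_logit(0.65) - simple_logit(0.55), 2))
    # A ceiling score maps to a finite value that depends on the denominator
    for n in (5, 10, 50, 100):
        print(n, round(empirical_logit(1.0, n), 2))
```

Raising an error for the simple logit at 0 or 1 is a design choice made here for clarity; statistical packages would typically return a missing or infinite value instead.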
Not only does this transformation overcome the problems described above, but it also has an additional benefit: scoring 100% on an indicator evaluated on a large number of people (n) is rewarded with a higher transformed score compared with scoring 100% based on a smaller n. The effect is to further expand the scale of quality scores, allowing for even greater discrimination between practices (fig 2). For example, a 100% score on five patients for a practice would be transformed to a score of 2.40 on the logit scale, whereas the same score on 10 patients for another practice would correspond to a score of 3.04 on the logit scale.

Fig 1 Simple logit transformation of a performance proportion (p): logit scale plotted against performance. Differences of 98% v 97%, 93% v 90%, and 65% v 55% each correspond to a difference of about 0.4 on the logit scale.

Although the denominator adjustment made by empirical logit transformation (where p=0 or 1) could also be useful for values strictly greater than 0 and less than 1, the justification is less clear and interpretation is problematic (eg, a score of p=0.8 (80%) and a denominator of n=35 would be equivalent to a score of p=0.85 (85%) and a denominator of n=5 on the empirical logit scale). Even so, applying the empirical logit transformation across the whole range of values in [0, 1] may be attractive. When it is applied only at the ceiling and floor, a score of 100% on one patient would be transformed to 1.1, whereas a score of 99% on 99 patients would, under simple logit transformation, be transformed to a much higher score of 4.6. Therefore, it might be reasonable to use empirical logit transformation across all scores for consistency, which would effectively act as an adjustment for the different effort needed to meet the same score across varying denominators. However, an alternative and perhaps simpler approach would be to apply both methods of logit transformation as described and then use the denominator as an additional predictor in a multiple regression model to control for effort in the analysis stage. Two other aspects should be considered.
Firstly, on the logit scale, a unit with a very high score will lose much more from the next failure than it will gain from a success (with the picture reversed for a unit with a very low score). Secondly, empirical logit transformation assumes that all data are available or that the reported denominators across the higher level units are comparable to the true denominators. In other words, the units should report a representative sample of their data, and the proportion reported needs to be similar across units. If that is not the case, units that report a smaller proportion are penalised under the empirical logit.

Fig 2 Empirical logit transformation when performance score (p) is 100%, over various denominator sizes (n): logit scale plotted against number of eligible patients within health unit/practice.

Back-transformation

The interpretation of regression coefficients on the logit scale is intuitively difficult, and hence a back-transformation of the effects to proportions (percentages) is desirable. However, the non-linear nature of the transformed scores and the resulting effects complicates the back-transformation to a linear proportion scale. As shown in figure 1, a fixed size effect on the logit scale corresponds to a smaller effect on the untransformed scale at the extremes, and hence the back-transformed effect depends on the underlying achievement score. Therefore, an anchor achievement score must be chosen on which the back-transformation is to be based. Figure 3 explains this principle formulaically.

To demonstrate this effect in practice, consider recent research on the quality of diabetes care (measured as a percentage achievement score) and the prevalence of disease at the practice level.[9] A 1% higher prevalence of diabetes at the practice level was found in regression analyses to be associated with a 0.031 lower achievement score on the logit scale.
Fig 3 Back-transformation explained

Assuming that prevalence (x%) is a predictor of achievement, the achievement score (p) for a practice on the logit scale can be expressed through a simple regression as:

$$\operatorname{logit}(p) = \alpha + \beta x,$$

where α is the intercept and β is the regression coefficient quantifying the association between x and p. Achievement on the logit scale for the same practice, if prevalence increases by 1%, becomes:

$$\operatorname{logit}(p_{new}) = \alpha + \beta(x + 1)$$

The difference gives the change in achievement (on the logit scale) for a 1% increase in prevalence (beta values in a linear regression model):

$$\operatorname{logit}(p_{new}) - \operatorname{logit}(p) = \beta$$

Achievement on the logit scale is related to a particular value of percentage achievement (as defined by the logit transformation):

$$\operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right)$$

Similarly,

$$\operatorname{logit}(p_{new}) = \ln\left(\frac{p_{new}}{1-p_{new}}\right)$$

To work backwards to get a change in percentage achievement, we assume that percentage achievement at prevalence (x+1)% is equal to the percentage achievement at prevalence x% plus a constant c, the change in percentage achievement per 1% increase in prevalence:

$$p_{new} = p + c$$

$$\operatorname{logit}(p_{new}) - \operatorname{logit}(p) = \ln\left(\frac{p+c}{1-(p+c)}\right) - \ln\left(\frac{p}{1-p}\right) = \beta$$

Solving for c, we obtain:

$$c = \frac{1}{\frac{1-p}{p}\exp(-\beta) + 1} - p$$

To obtain c, we need to assume an anchor value for p, and the average achievement score across clinical units or practices is as good an assumption as any.

Table 1 Quantification of back-transformed effects of a 1% increase in diabetes prevalence, at various anchors

Increase in prevalence (%): 1
Effect of predictor (prevalence) on logit scale (95% CI): −0.031 (−0.041 to −0.021)

Anchor achievement score (p) | Back-transformed effect of predictor on achievement score (absolute difference c) (95% CI)
0.9500 | −0.0015 (−0.0020 to −0.0010)
0.95 (median) | −0.0022 (−0.0029 to −0.0015)
0.7500 | −0.0059 (−0.0078 to −0.0040)
0.5000 | −0.0077 (0.00 to 0.005)
0.2500 | −0.0058 (−0.0076 to −0.0039)
0.0500 | −0.0015 (−0.0019 to −0.0010)

As shown in table 1, the effect of differences in prevalence on the untransformed achievement score differs, depending on the anchor achievement rate selected. Assuming that we want to quantify the effect of a 1% increase in the prevalence rate of diabetes on the back-transformed scale, we observe a larger effect for practices whose underlying achievement score is 0.5 (50%). The same difference in prevalence has a much smaller effect on achievement for practices with a median achievement score of 0.95 (9.5%) or with very low achievement scores.

Choice of anchor score

As demonstrated in table 1, while the choice of a specific anchor score over another has no bearing on the statistical significance of results, it does affect the relative clinical or practical significance of the factor of interest. The anchor value should not be arbitrary; rather, it should be based on a plausible value for the performance, satisfaction, or safety score, for example, its mean or median. Use of the median or mean achievement score is intuitively sensible if researchers want to describe the relation between achievement and other factors in the average or typical case. However, if researchers are examining these factors with a view to developing interventions for improvement, then an anchor score reflecting poor performance is a sensible choice, assuming any intervention is aimed at improving performance, satisfaction, or safety in a low achieving setting only.
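The back-transformation set out in figure 3 can be computed directly. The sketch below is illustrative only: the function name `back_transform` is hypothetical, and the coefficient β = −0.03 is an assumed value chosen for demonstration, not an estimate taken from any study. It evaluates the closed-form expression for c at several anchor scores, which makes it easy to see how the same logit-scale effect translates into different absolute changes depending on the anchor.

```python
import math


def back_transform(beta: float, anchor: float) -> float:
    """Absolute change c in a proportion for a logit-scale effect beta.

    Solves logit(anchor + c) - logit(anchor) = beta for c, where anchor is
    the assumed underlying achievement score (0 < anchor < 1).
    """
    if not 0.0 < anchor < 1.0:
        raise ValueError("anchor must lie strictly between 0 and 1")
    return 1.0 / ((1.0 - anchor) / anchor * math.exp(-beta) + 1.0) - anchor


if __name__ == "__main__":
    beta = -0.03  # assumed logit-scale effect, for illustration only
    for anchor in (0.95, 0.75, 0.50, 0.25, 0.05):
        c = back_transform(beta, anchor)
        print(f"anchor {anchor:.2f}: change of {100 * c:+.2f} percentage points")
```

Applying the same function to the lower and upper confidence limits of β gives the corresponding limits for c, because the back-transformation is monotonic in β for a fixed anchor.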
We would therefore argue that choice of an anchor score is largely at the discretion of researchers, who should use the research aims to inform their choice. However, the mean or median scores should be suitable in most scenarios and a priori justification would be needed for alternative anchor choices. Another approach could be to present back-transformed results obtained using several different anchor scores to stimulate discussion around this issue, although attention needs to be given to the interpretation of the group of results. It should also be noted that transformed scores, like percentage scores, do not account for the difficulty in meeting a specific indicator and that investigators should be careful with comparisons across indicators of varying difficulty levels. In these cases, the anchor score can be chosen to reflect the inherent difficulty for an indicator, although the relation between the anchor score and difficulty is not intuitive.

To aid researchers with use of these methods, we have made available an Excel workbook with the transformation and back-transformation formulas, given different anchor scores (available from the corresponding author on request or from his personal website (www.statanalysis.co.uk/files/logit_transformation.xlsx)).

Discussion

We have demonstrated the use of empirical logit transformation for the analysis of performance, satisfaction, or safety indicators that are subject to ceiling or floor effects. We have argued the benefits of this method, algebraically described the processes, provided guidance on interpretation, and have made available a simple tool to aid researchers in using the method. These methods have broad applicability in health services research, but can also be applied in other settings, for example, citizen satisfaction with urban services[16] or hotel websites.[17]

We thank Jill Stokes, researcher at the University of Manchester, who provided the prescription safety examples; and Ben Amies, a GP registrar, who provided feedback on the paper.

Contributorship: SS wrote the manuscript, with help from EK. JMV, TD, and RP critically edited the manuscript. SS is the guarantor of this work and, as such, had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. In this article, we present examples on applications of these methods to a range of research questions and studies as published in major clinical journals, including The BMJ. SS is an early career statistician who has recently been involved with analysing performance measures in a primary care setting. TD is a clinical researcher with interests in quality of care and experience in applying these methods. JMV is a clinician with expertise on the measurement of quality of care and in psychometric methods. RP is a medical statistician whose research program focuses, not exclusively, on monitoring in primary care. EK is a biostatistician and health services researcher who has used the reported methods to answer research questions pertaining to incentivisation in primary care.

Funding: UK Medical Research Council Health eResearch Centre grant MR/K006665/1 supported the time and facilities of EK.

Competing interests: All authors have completed the ICMJE uniform disclosure form at www.icmje.org/coi_disclosure.pdf and declare: no support from any organisation for the submitted work; no financial relationships with any organisations that might have an interest in the submitted work in the previous three years; no other relationships or activities that could appear to have influenced the submitted work.

Provenance and peer review: Not commissioned; externally peer reviewed.

1 Kontopantelis E, Reeves D, Valderas JM, Campbell S, Doran T. Recorded quality of primary care for patients with diabetes in England before and after the introduction of a financial incentive scheme: a longitudinal observational study. BMJ Qual Saf 2013;22:53-64. doi:10.1136/bmjqs-2012-001033.
2 Hippisley-Cox J, Hardy C, Pringle M, Fielding K, Carlisle R, Chilvers C. The effect of deprivation on variations in general practitioners' referral rates: a cross sectional study of computerised data on new medical and surgical outpatient referrals in Nottinghamshire. BMJ 1997;314:1458-61. doi:10.1136/bmj.314.7092.1458.
3 Doran T, Fullwood C, Gravelle H, et al. Pay-for-performance programs in family practices in the United Kingdom. N Engl J Med 2006;355:375-84. doi:10.1056/NEJMsa055505.
4 Kontopantelis E, Doran T, Springate DA, Buchan I, Reeves D. Regression based quasi-experimental approach when randomisation is not an option: interrupted time series analysis. BMJ 2015;350:h2750. doi:10.1136/bmj.h2750.
5 Stocks SJ, Kontopantelis E, Akbarov A, Rodgers S, Avery AJ, Ashcroft DM. Examining variations in prescribing safety in UK general practice: cross sectional study using the Clinical Practice Research Datalink. BMJ 2015;351:h5501. doi:10.1136/bmj.h5501.
6 Davies HTO, Crombie IK, Tavakoli M. When can odds ratios mislead? BMJ 1998;316:989-91. doi:10.1136/bmj.316.7136.989.
7 Campbell SM, Reeves D, Kontopantelis E, Sibbald B, Roland M. Effects of pay for performance on the quality of primary care in England. N Engl J Med 2009;361:368-78. doi:10.1056/NEJMsa0807651.

8 Doran T, Kontopantelis E, Valderas JM, et al. Effect of financial incentives on incentivised and non-incentivised clinical activities: longitudinal analysis of data from the UK Quality and Outcomes Framework [correction in: BMJ 0;7:f599]. BMJ 2011;342:d3590. doi:10.1136/bmj.d3590.
9 Ricci-Cabello I, Stevens S, Kontopantelis E, et al. Impact of the prevalence of concordant and discordant conditions on the quality of diabetes care in family practices in England. Ann Fam Med 2015;13:514-22. doi:10.1370/afm.1848.
10 Roland M. Linking physicians' pay to the quality of care--a major experiment in the United Kingdom. N Engl J Med 2004;351:1448-54. doi:10.1056/NEJMhpr041294.
11 Reeves D, Doran T, Valderas JM, et al. How to identify when a performance indicator has run its course. BMJ 2010;340:c1717. doi:10.1136/bmj.c1717.
12 Cox DR, Snell EJ. Analysis of binary data. 2nd ed. Chapman and Hall, 1989.
13 Abraham B, Unnikrishnan Nair N, eds. Quality improvement through statistical methods. Birkhäuser, 1998. doi:10.1007/978-1-4612-1776-3.
14 Berkson J. Maximum likelihood and minimum χ² estimates of the logistic function. J Am Stat Assoc 1955;50:130-62.
15 Anscombe FJ. On estimating binomial response relations. Biometrika 1956;43:461-4. doi:10.1093/biomet/43.3-4.461.
16 Stipak B. Citizen satisfaction with urban services: potential misuse as a performance indicator. Public Adm Rev 1979;39:46-52. doi:10.2307/3110378.
17 Chung T, Law R. Developing a performance indicator for hotel websites. Int J Hosp Manag 2003;22:119-25. doi:10.1016/S0278-4319(02)00076-2.

© BMJ Publishing Group Ltd 2016
No commercial reuse: See rights and reprints http://www.bmj.com/permissions