      Teaching old tools new tricks—preparing emergency medicine for the impact of machine learning-based risk prediction models

      editorial

          Abstract

What is machine learning?

Machine learning (ML) is a branch of artificial intelligence aimed at developing computer programs that ‘learn’ a mathematical relationship between input and output data. ML applications have been developed for use across many areas of emergency medicine, from pre-hospital to emergency department (ED) care, including staffing, triage, work-up and diagnosis, documentation, and disposition planning. There are three major categories of ML approaches: supervised learning, unsupervised learning, and reinforcement learning.

Supervised learning involves learning an input–output relationship using labelled data. Labels are either categorical (e.g., presence or absence of a disease) or continuous variables (e.g., length of stay) that the ML model aims to predict. Sax et al. tested the ability of ML models to predict 30-day adverse events (e.g., death, myocardial infarction) in patients with an ED visit for acute heart failure using over 26,000 ED encounters [1]. Their gradient boosted decision tree model achieved higher discrimination (c-statistic 0.85, 95% CI 0.83–0.86) than the baseline STRATIFY rule (c-statistic 0.76, 95% CI 0.74–0.77) and was well calibrated. Many clinical prediction rules that are well established in emergency medicine, such as the decision-tree-based Ottawa ankle rules, can be considered supervised learning models.

Unsupervised learning, commonly called “clustering”, involves deriving groups of similar observations from data without labels. White et al. used hierarchical clustering analysis to identify three distinct patterns of hemostatic response based on blood product requirement, using vital sign, injury, and coagulation testing data from 84 trauma patients [2]. Patients in cluster three were found to have high fibrinolytic activation and increased inflammatory markers, and required the highest amount of blood products.

Reinforcement learning involves training ‘agents’ to make sequential decisions that maximize a reward function within changing environments. Liu et al. used multi-agent reinforcement learning to optimize ambulance dispatch within a network of ambulance centres [3]. Using a deep Q network, they achieved lower normalized wait times and higher normalized assistance response rates than various baselines (i.e., random, location-based, time-based, and request-based allocation).

How can clinicians appraise machine learning solutions?

Clinicians in EDs must be able to critically appraise ML research and commercially available models that are integrated into EMR solutions. For example, Epic has integrated a sepsis prediction model into its EMR that uses predictors such as demographics, vital signs, and laboratory values. Although this model has become widely used in the United States, a recent study found that it had poor sensitivity and poor positive predictive value and created considerable alarm fatigue [4]. Critically appraising ML solutions builds upon the well-established principles of evidence-based medicine. The performance of ML models is measured in terms of discrimination (e.g., C-statistic/area under the curve (AUC), sensitivity, specificity, positive and negative predictive value) as well as calibration (the agreement between observed and predicted risk, e.g., calibration curves), metrics that are familiar to ED physicians. ML papers may refer to positive predictive value as “precision” and sensitivity as “recall”. The relative importance of these metrics may vary based on the task at hand. For example, a model for detecting patients at risk of adverse events due to heart failure may prioritize high sensitivity, so that as few patients as possible are missed, and high negative predictive value, so that patients the model does not flag are truly at low risk. Just as with traditional risk prediction models, clinicians must ask whether ML models are generalizable to their patients and whether using them is likely to improve patient outcomes.
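To make these metrics concrete, the sketch below computes discrimination (c-statistic/AUC, sensitivity, specificity), predictive values, and a simple calibration summary for a hypothetical risk model using scikit-learn. It is a minimal illustration on synthetic data, not code from any of the cited studies; the simulated predictions and the 0.2 decision threshold are assumptions made only for demonstration.

```python
# Minimal sketch (synthetic data, illustrative threshold): discrimination and
# calibration metrics for a hypothetical 30-day adverse-event risk model.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)

# Synthetic stand-ins: true outcomes (1 = adverse event) and predicted risks.
y_true = rng.binomial(1, 0.15, size=1000)
y_prob = np.clip(0.15 + 0.5 * (y_true - 0.15) + rng.normal(0, 0.15, 1000), 0, 1)

# Discrimination: c-statistic / area under the ROC curve.
auc = roc_auc_score(y_true, y_prob)

# Threshold-dependent metrics at an illustrative cut-off of 0.2.
y_pred = (y_prob >= 0.2).astype(int)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # called "recall" in many ML papers
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)           # called "precision" in many ML papers
npv = tn / (tn + fn)

# Calibration: observed vs. predicted risk across quantile bins.
obs, pred = calibration_curve(y_true, y_prob, n_bins=10, strategy="quantile")

print(f"AUC={auc:.2f}  Sens={sensitivity:.2f}  Spec={specificity:.2f}  "
      f"PPV={ppv:.2f}  NPV={npv:.2f}")
print("Calibration (predicted -> observed):",
      [f"{p:.2f}->{o:.2f}" for p, o in zip(pred, obs)])
```

A well-calibrated model shows predicted and observed risks that track closely across bins, which is the same information a calibration curve displays graphically.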
Four areas warrant special attention. First, while ML algorithms are powerful because they can learn non-linear relationships in data, they can “overfit” the data used during development and form spurious correlations. This is especially true of models developed using datasets that are small or drawn from one or a few sites. For example, a model trained to detect pneumonia on chest x-rays may learn a site’s prevalence of pneumonia by recognizing site-specific markers placed by x-ray technicians during image acquisition rather than learning to identify areas of opacification [5]. Second, systemic and structural inequities in health systems can permeate the data used to develop ML models and ultimately contribute to bias [6, 7]. For example, Indigenous patients in Canada often face stereotyping and systemic racism in EDs and have been found to have lower odds of receiving high-acuity triage scores across a range of presentations [6]. Model bias frequently manifests as gaps in subgroup performance based on demographics such as age, sex, and race. While traditional clinical risk prediction models may also be biased (e.g., race correction in eGFR calculations), it may be much more difficult to detect and correct biases in ML models. Third, exogenous events (e.g., the emergence of new SARS-CoV-2 variants) that are not captured during ML model development may compromise the relationship a model learns between input and output data. This phenomenon, known as dataset shift, can reduce ML model performance in sudden and significant ways. Finally, ML’s ability to learn non-linear relationships among a wide range of predictors often contributes to high empirical performance but also to an inability to discern how and why certain predictions are made. Researchers have therefore developed both ML models that are inherently “interpretable” and methods to create post hoc “explanations” of “black box” model behaviour. While interpretability and explainability are areas of active debate, explainability methods may be more appropriate during the model development stage, to ensure models are generally behaving appropriately, than for explaining individual predictions at the time of inference. For example, Singh et al. used Shapley values to confirm that their model for automating the ordering of diagnostic tests at triage for paediatric patients was leveraging predictors in intuitive and defensible ways (e.g., the model associated concern for appendicitis with the need to order an abdominal ultrasound) [8]. While explainability and interpretability can help build trust in ML models, clinicians should not value them over robust, empirical evidence of clinical benefit or harm from prospective, external validation studies and randomized trials.
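One practical way to probe the subgroup performance gaps described above is to stratify standard metrics by demographic group. The sketch below is a minimal example on synthetic data, assuming hypothetical groups “A” to “C” and an illustrative decision threshold; it is not drawn from any cited study, and a real audit would use locally defined subgroups, outcomes, and thresholds agreed with clinical and equity stakeholders.

```python
# Minimal sketch (hypothetical data): auditing model performance by subgroup
# to look for bias, e.g., gaps in AUC or sensitivity across demographic groups.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, recall_score

rng = np.random.default_rng(1)
n = 2000

# Synthetic cohort: subgroup label, true outcome, and model-predicted risk.
df = pd.DataFrame({
    "group": rng.choice(["A", "B", "C"], size=n),
    "y_true": rng.binomial(1, 0.12, size=n),
})
# Simulate a model whose predictions are noisier for group "C".
noise = np.where(df["group"] == "C", 0.30, 0.15)
df["y_prob"] = np.clip(
    0.12 + 0.5 * (df["y_true"] - 0.12) + rng.normal(0, noise), 0, 1
)

threshold = 0.2  # illustrative decision threshold
rows = []
for g, sub in df.groupby("group"):
    y_pred = (sub["y_prob"] >= threshold).astype(int)
    rows.append({
        "group": g,
        "n": len(sub),
        "prevalence": sub["y_true"].mean(),
        "auc": roc_auc_score(sub["y_true"], sub["y_prob"]),
        "sensitivity": recall_score(sub["y_true"], y_pred),
    })

audit = pd.DataFrame(rows)
print(audit.round(3))
# Large gaps between groups (e.g., in AUC or sensitivity) warrant investigation
# before deployment and continued monitoring afterwards.
```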
What is needed to translate machine learning solutions into clinical practice?

Prior to the widespread adoption of electronic medical records (EMRs), traditional clinical prediction models were commonly developed using data abstracted from paper charts or collected prospectively. The time-consuming nature of this process often resulted in models containing a handful of predictors that were causally associated with an outcome of interest. The resulting models were then implemented as nomograms or points-based risk scores with which clinicians could quickly determine a given patient’s likelihood of having a condition (diagnostic models) or experiencing an outcome of interest (prognostic models, e.g., MEWS for clinical deterioration, Canadian Syncope Risk Score). In contrast, contemporary ML models often require much more data than traditional clinical prediction models. Hospitals and health systems have been developing data warehouses for their electronic medical record, demographic, laboratory, and imaging data. If data are cleaned and organized in a reliable and recurring manner, these warehouses can enable the development of ML models across a wide range of situations. Strong collaboration with information technology departments as well as data governance and data engineering teams is an important institutional prerequisite for translating ML into practice.

Once a model is developed, it should be evaluated in a “silent trial”, in which the model generates predictions but these predictions are not acted upon by end-users [9]. Silent trials can ensure that performance is robust in a cohort of patients separate from those used to develop the model. Performance deficits at this stage may indicate the need to revisit model development with more data, more predictors, or a different modelling approach altogether. Because ML models can involve large numbers of predictors that are combined in non-linear ways, it is infeasible for end-users to manually compute their outputs and integrate them into clinical practice. Thus, the successful translation of ML models also requires attention to user interfaces and experiences as models are embedded into existing tools and workflows. For example, Verma et al. notified physicians of which patients were at high risk of deterioration using the hospital’s existing electronic sign-out tool and text paging alerts [9]. Healthcare workflows involve a wide range of professions, including emergency physicians and nurses, radiologists, medical imaging technicians, social workers, and more. Interdisciplinary collaboration among all stakeholders affected by a machine learning model, throughout the stages of problem identification, solution development, and evaluation and deployment, is vital to maximize trust in the developed model and the likelihood of success. Ultimately, the translation of ML models into practice is a highly iterative process, with many models requiring revisits to data curation, model evaluation, or interface design throughout their life cycle. Quality improvement approaches such as “plan-do-study-act” cycles and statistical process control charting are increasingly being explored to translate ML models into practice [9].
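At its simplest, the “silent trial” described above [9] amounts to logging a model’s predictions without surfacing them to clinicians, then auditing those predictions once outcomes accrue. The sketch below is a hypothetical illustration: the CSV log, encounter identifiers, and the AUC bar of 0.75 are assumptions made for demonstration, and a real silent trial would route predictions and outcomes through governed institutional data systems.

```python
# Minimal sketch (hypothetical interfaces): a "silent trial" in which predictions
# are logged but never shown to clinicians, then audited once outcomes are known.
import csv
from datetime import datetime, timezone

from sklearn.metrics import roc_auc_score

LOG_PATH = "silent_trial_log.csv"  # assumed local file; in practice a governed database

def log_silent_prediction(encounter_id: str, risk: float) -> None:
    """Record a prediction without acting on it (no alerts, no display)."""
    with open(LOG_PATH, "a", newline="") as f:
        csv.writer(f).writerow(
            [encounter_id, f"{risk:.4f}", datetime.now(timezone.utc).isoformat()]
        )

def audit_silent_trial(outcomes: dict, min_auc: float = 0.75) -> bool:
    """Once outcomes accrue, check whether performance meets a pre-specified bar."""
    y_true, y_prob = [], []
    with open(LOG_PATH) as f:
        for encounter_id, risk, _timestamp in csv.reader(f):
            if encounter_id in outcomes:  # outcome known (e.g., 30-day status)
                y_true.append(outcomes[encounter_id])
                y_prob.append(float(risk))
    auc = roc_auc_score(y_true, y_prob)
    print(f"Silent-trial AUC on {len(y_true)} encounters: {auc:.2f}")
    return auc >= min_auc  # below the bar: revisit data, predictors, or approach

# Toy usage: log a few predictions, then audit once outcomes are available.
log_silent_prediction("enc-001", 0.82)
log_silent_prediction("enc-002", 0.10)
log_silent_prediction("enc-003", 0.07)
ready = audit_silent_trial({"enc-001": 1, "enc-002": 0, "enc-003": 0})
```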
A summary of recommendations for clinicians to consider prior to implementing ML models into practice is presented in Table 1.

Table 1 Summary of key considerations for clinicians considering adopting an ML model into practice at their institution

1. Is the model needed for the outcome of interest?
   - Is there a perceived need among frontline clinicians to use a prediction model in this context?
   - Is the problem sufficiently common and challenging, and does it have enough practice variation to warrant the use of an ML model? (a)

2. Was the model robustly developed?
   - What predictors were used to develop the model? Are they clinically plausible or sensible? (b) Are these predictors difficult to capture (e.g., family history of sudden death)?
   - Are the reported performance metrics appropriate? Keep in mind that the choice of performance metric is task specific and requires clinical insight.
   - Has the model demonstrated potential to improve meaningful clinical outcomes (e.g., morbidity, mortality, length of stay)?
   - Checklists such as TRIPOD-AI, STARD-AI, DECIDE-AI, and SPIRIT/CONSORT-AI can help prospective end-users appraise studies of ML models.

3. Is there sufficient technical capacity available to support potential deployment?
   - Consult existing organizational information technology and data governance groups regarding feasibility and how the model would be integrated within existing information technology systems (e.g., the electronic medical record, cloud computing systems, etc.).
   - If the model is commercially available, determine what supports the vendor can provide throughout the process.

4. Is model performance robust in the target patient population?
   - Conduct a silent trial at the site of potential deployment, whereby the model generates predictions but these predictions are not used to impact patient care. (c)
   - The length of the silent trial depends on the prevalence of the condition of interest (e.g., rarer diseases need more time) and the duration needed for outcomes to accrue (e.g., 30-day mortality).
   - Use results from the silent trial to determine whether the model needs to be retrained or predictors need to be updated. (c)
   - Define a cut-off for determining poor model performance; this is highly task specific and may involve measuring the baseline performance of clinicians on that task. (c)

5. Does the model appear to be trustworthy (i.e., in terms of bias, interpretability/explainability)?
   - Conduct a bias assessment in key population subgroups (e.g., by age, sex, race if those data are available, socioeconomic status, etc.). (c)
   - Audit the predictors used in the model to determine which are most important; are they clinically plausible or sensible? (b, c)
   - Are there any signs that may suggest overfitting? Audit model false positives and false negatives; would clinicians make similar errors on these examples? (c)

6. Have all relevant stakeholders considered how the model will impact clinical workflows?
   - Create an interdisciplinary working group to explore how to navigate concerns and potential barriers.
   - Ensure that team members have expertise in overlooked areas, such as human factors and user interface/experience design.

7. Is there a means of continually monitoring model performance once it has been deployed?
   - Keep in mind that dataset shifts can cause deterioration in model performance once a model is deployed.
   - Create automated alerts and a protocol to revert to baseline operating procedures if model performance drops below a clinically significant threshold (see the monitoring sketch following this table). (c)

Note: Following some of these recommendations may require collaborating with local information technology departments, those with relevant scholarly expertise (e.g., computer science, epidemiology, quality improvement, implementation science, human factors, etc.), or the vendors of ML solutions.

(a) ML algorithms are often well suited to problems with high signal-to-noise ratios (i.e., those where the data are images, text, or waveforms), very large cohort sizes, or strong temporal patterns. If input data are primarily tabular and represent one time point (e.g., predictors are from one ED visit), traditional regression-based methods can yield comparable performance with less complexity (Christodoulou et al., J Clin Epidemiol 2019).

(b) There is considerable debate about whether predictors need to be clinically plausible in ML models, as a pure “data mining” or “big data” approach would include a wide range of predictors without a priori selection based on causal criteria. However, it is certainly still possible to overfit ML models, and pruning correlated or otherwise non-influential features can improve the likelihood that a given model will generalize to other sites (Sanchez et al., Royal Society Open Science 2022). Moreover, using post hoc explainability methods (e.g., Shapley values) to affirm that the most important predictors are used sensibly can be an important robustness check (Singh et al., JAMA Network Open 2022; Ghassemi et al., Lancet Digital Health 2021).

(c) Requires collaboration with groups with data access, local information technology departments, those with scholarly expertise, or the ML solution vendor.
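For consideration 7 in the table above, a minimal version of continual monitoring is to compute a rolling metric over recent encounters and alert when it falls below a pre-specified bar. The sketch below assumes an illustrative window of 500 encounters and an AUC threshold of 0.75; production monitoring would typically layer statistical process control rules and a pre-agreed fallback protocol on top of something like this.

```python
# Minimal sketch (illustrative window and threshold): monitoring a deployed
# model's performance and alerting if it drops below a clinically agreed bar.
from collections import deque
from sklearn.metrics import roc_auc_score

class RollingPerformanceMonitor:
    def __init__(self, window: int = 500, min_auc: float = 0.75):
        self.pairs = deque(maxlen=window)  # (true outcome, predicted risk)
        self.min_auc = min_auc

    def record(self, y_true: int, y_prob: float) -> None:
        """Add an encounter once its outcome (e.g., 30-day status) is known."""
        self.pairs.append((y_true, y_prob))

    def check(self) -> bool:
        """Return True if rolling performance is acceptable; alert otherwise."""
        y_true = [y for y, _ in self.pairs]
        y_prob = [p for _, p in self.pairs]
        if len(set(y_true)) < 2:  # need both outcomes present to compute AUC
            return True
        auc = roc_auc_score(y_true, y_prob)
        if auc < self.min_auc:
            # In practice: notify the clinical informatics team and revert to
            # baseline operating procedures per the pre-agreed protocol.
            print(f"ALERT: rolling AUC {auc:.2f} below threshold {self.min_auc}")
            return False
        return True
```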
How can clinicians prepare for the growing impact of machine learning?

All clinicians in emergency medicine must have the skills to critically appraise ML solutions, while a subset of clinicians will also be interested in gaining the skills to champion translation efforts. Educational offerings at different stages of training can foster these competencies. Clinical informatics, including familiarity with ML, has been added as a learning objective for the Medical Council of Canada Qualifying Examination Part I as of 2022. As such, while undergraduate medical programs should provide all learners with a foundational understanding of key concepts, curricula in postgraduate medical education can be focused more closely on each discipline and its ultimate scope of practice. Ehrmann et al. identified four key curricular objectives: foundational ML concepts from development to deployment, ethical and legal considerations, proper usage of EMR and biomedical data, and the critical appraisal of ML systems [10]. These objectives can be delivered by leveraging existing educational strategies such as seminars during academic half-days (e.g., an interdisciplinary panel with ethicists and informaticians), journal clubs (e.g., critical appraisal), and even simulation if sites have models in deployment.

In summary, emergency medicine has a long track record of success in developing and translating clinical prediction models into practice. As machine learning presents some unique considerations and challenges, fostering collaborations between clinicians, informaticians, computer scientists, quality improvement leaders, and knowledge translation specialists is vital to responsibly leverage machine learning to improve outcomes for acutely ill patients and the health systems that care for them.

Most cited references (6)


          Dissecting racial bias in an algorithm used to manage the health of populations

          Health systems rely on commercial prediction algorithms to identify and help patients with complex health needs. We show that a widely used algorithm, typical of this industry-wide approach and affecting millions of patients, exhibits significant racial bias: At a given risk score, Black patients are considerably sicker than White patients, as evidenced by signs of uncontrolled illnesses. Remedying this disparity would increase the percentage of Black patients receiving additional help from 17.7 to 46.5%. The bias arises because the algorithm predicts health care costs rather than illness, but unequal access to care means that we spend less money caring for Black patients than for White patients. Thus, despite health care cost appearing to be an effective proxy for health by some measures of predictive accuracy, large racial biases arise. We suggest that the choice of convenient, seemingly effective proxies for ground truth can be an important source of algorithmic bias in many contexts.

            Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study

Background
There is interest in using convolutional neural networks (CNNs) to analyze medical imaging to provide computer-aided diagnosis (CAD). Recent work has suggested that image classification CNNs may not generalize to new data as well as previously believed. We assessed how well CNNs generalized across three hospital systems for a simulated pneumonia screening task.

Methods and findings
A cross-sectional design with multiple model training cohorts was used to evaluate model generalizability to external sites using split-sample validation. A total of 158,323 chest radiographs were drawn from three institutions: National Institutes of Health Clinical Center (NIH; 112,120 from 30,805 patients), Mount Sinai Hospital (MSH; 42,396 from 12,904 patients), and Indiana University Network for Patient Care (IU; 3,807 from 3,683 patients). These patient populations had an age mean (SD) of 46.9 years (16.6), 63.2 years (16.5), and 49.6 years (17) with a female percentage of 43.5%, 44.8%, and 57.3%, respectively. We assessed individual models using the area under the receiver operating characteristic curve (AUC) for radiographic findings consistent with pneumonia and compared performance on different test sets with DeLong’s test. The prevalence of pneumonia was high enough at MSH (34.2%) relative to NIH and IU (1.2% and 1.0%) that merely sorting by hospital system achieved an AUC of 0.861 (95% CI 0.855–0.866) on the joint MSH–NIH dataset. Models trained on data from either NIH or MSH had equivalent performance on IU (P values 0.580 and 0.273, respectively) and inferior performance on data from each other relative to an internal test set (i.e., new data from within the hospital system used for training data; P values both <0.001). The highest internal performance was achieved by combining training and test data from MSH and NIH (AUC 0.931, 95% CI 0.927–0.936), but this model demonstrated significantly lower external performance at IU (AUC 0.815, 95% CI 0.745–0.885, P = 0.001). To test the effect of pooling data from sites with disparate pneumonia prevalence, we used stratified subsampling to generate MSH–NIH cohorts that only differed in disease prevalence between training data sites. When both training data sites had the same pneumonia prevalence, the model performed consistently on external IU data (P = 0.88). When a 10-fold difference in pneumonia rate was introduced between sites, internal test performance improved compared to the balanced model (10× MSH risk P < 0.001; 10× NIH P = 0.002), but this outperformance failed to generalize to IU (MSH 10× P < 0.001; NIH 10× P = 0.027). CNNs were able to directly detect hospital system of a radiograph for 99.95% NIH (22,050/22,062) and 99.98% MSH (8,386/8,388) radiographs. The primary limitation of our approach and the available public data is that we cannot fully assess what other factors might be contributing to hospital system–specific biases.

Conclusion
Pneumonia-screening CNNs achieved better internal than external performance in 3 out of 5 natural comparisons. When models were trained on pooled data from sites with different pneumonia prevalence, they performed better on new pooled data from these sites but not on external data. CNNs robustly identified hospital system and department within a hospital, which can have large differences in disease burden and may confound predictions.

              External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients


                Author and article information

                Contributors
                v.harish@mail.utoronto.ca
                Journal
CJEM
Springer International Publishing (Cham)
                1481-8035
                1481-8043
                18 March 2023
Pages: 1-5
                Affiliations
[1] GRID grid.17063.33, ISNI 0000 0001 2157 2938, Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada
[2] GRID grid.17063.33, ISNI 0000 0001 2157 2938, Institute of Health Policy, Management and Evaluation, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
[3] GRID grid.17063.33, ISNI 0000 0001 2157 2938, Temerty Centre for Artificial Intelligence Research and Education in Medicine (T-CAIREM), Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada
[4] GRID grid.494618.6, Vector Institute for Artificial Intelligence, Toronto, ON, Canada
[5] GRID grid.17063.33, ISNI 0000 0001 2157 2938, Division of Emergency Medicine, Department of Medicine, Temerty Faculty of Medicine, University of Toronto, Toronto, ON, Canada
[6] GRID grid.492573.e, ISNI 0000 0004 6477 6457, Schwartz/Reisman Emergency Medicine Institute, Sinai Health, Toronto, ON, Canada
[7] GRID grid.415502.7, Li Ka Shing Knowledge Institute, St. Michael’s Hospital, Unity Health Toronto, Toronto, ON, Canada
[8] GRID grid.415502.7, Data Science and Advanced Analytics, St. Michael’s Hospital, Unity Health Toronto, Toronto, ON, Canada
[9] GRID grid.28046.38, ISNI 0000 0001 2182 2255, Department of Emergency Medicine, University of Ottawa, Ottawa, ON, Canada
[10] GRID grid.412687.e, ISNI 0000 0000 9606 5108, Clinical Epidemiology Program, Ottawa Hospital Research Institute, Ottawa, ON, Canada
[11] GRID grid.28046.38, ISNI 0000 0001 2182 2255, School of Epidemiology and Public Health, University of Ottawa, Ottawa, ON, Canada
                Author information
                http://orcid.org/0000-0001-6364-2439
                Article
480
DOI: 10.1007/s43678-023-00480-8
PMCID: PMC10024279
PMID: 36933121
6f4a6f6c-c4bb-4523-954b-b201415af72d
                © The Author(s), under exclusive licence to Canadian Association of Emergency Physicians (CAEP)/ Association Canadienne de Médecine d'Urgence (ACMU) 2023

                This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.

                History
Received: 22 September 2022
Accepted: 24 February 2023
                Funding
Funded by: Canadian Institutes of Health Research
Award ID: Banting and Best CGS D
                Award Recipient :
                Funded by: FundRef http://dx.doi.org/10.13039/501100019117, Vector Institute;
                Award ID: Postgraduate Affiliate Award
                Award Recipient :
                Funded by: Dalla Lana School of Public Health
                Award ID: Graduate Award in Data Science for Public Health
                Award ID: Health Systems
                Award Recipient :
                Categories
                Commentary
