What is machine learning?
Machine learning (ML) is a branch of artificial intelligence aimed at developing computer
programs that ‘learn’ a mathematical relationship between input and output data. ML
applications have been developed across many areas of emergency medicine, from pre-hospital to emergency department (ED) care, including staffing, triage, workup and diagnosis, documentation, and disposition planning. There are three major categories
of ML approaches: supervised learning, unsupervised learning, and reinforcement learning.
Supervised learning involves learning an input–output relationship using labelled
data. Labels are either categorical (e.g., presence or absence of a disease) or continuous
variables (e.g., length of stay) that the ML model aims to predict. Sax et al. tested
the ability of ML models to predict 30-day adverse events (e.g., death, myocardial
infarction) in patients presenting to the ED with acute heart failure, using data from
over 26,000 ED encounters [1]. Their gradient boosted decision tree model achieved higher discrimination
(c-statistic 0.85, 95% CI 0.83–0.86) relative to the baseline STRATIFY rule (c-statistic
0.76, 95% CI 0.74–0.77) and was well-calibrated. Many clinical prediction rules that
are well established in emergency medicine, such as the decision-tree-based Ottawa
ankle rules, can be considered supervised learning models.
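As a concrete sketch, a supervised model of this kind can be trained and assessed in a few lines. Everything below is illustrative: the data are synthetic and the three predictors are invented, not those used by Sax et al.

```python
# A minimal supervised learning sketch: fit a gradient boosted model on
# labelled (synthetic) data and measure discrimination with the c-statistic.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Illustrative predictors: age, systolic blood pressure, creatinine
X = np.column_stack([
    rng.normal(70, 10, n),    # age (years)
    rng.normal(130, 20, n),   # systolic BP (mmHg)
    rng.normal(1.1, 0.4, n),  # creatinine (mg/dL)
])
# Synthetic label: a 30-day adverse event loosely tied to the predictors
logit = 0.05 * (X[:, 0] - 70) - 0.02 * (X[:, 1] - 130) + 1.5 * (X[:, 2] - 1.1) - 1.5
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])  # c-statistic
print(f"c-statistic: {auc:.2f}")
```

The same pattern (labelled data in, a held-out discrimination estimate out) underlies the models described above, although real studies also assess calibration and validate externally.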
Unsupervised learning involves finding structure in data without labels; its most common
form, clustering, derives groups of similar observations. White et al. used hierarchical clustering
analysis to identify three distinct patterns of hemostatic responses based on blood
product requirement using vital signs, injury, and coagulation testing data from 84
trauma patients [2]. Patients in cluster three were found to have high fibrinolytic
activation, increased inflammatory markers, and required the highest amount of blood
products.
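A minimal sketch of this kind of analysis follows, using synthetic data with two invented features rather than the actual trauma dataset from White et al.

```python
# Unsupervised (hierarchical) clustering sketch: derive groups of similar
# "patients" from unlabelled, synthetic vital-sign-like data.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Three synthetic patient groups with different heart rate / fibrinogen profiles
group_means = [(80, 3.0), (110, 2.0), (130, 1.0)]
X = np.vstack([rng.normal(m, (8, 0.3), size=(28, 2)) for m in group_means])

X_scaled = StandardScaler().fit_transform(X)   # put features on a common scale
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_scaled)
print(np.bincount(labels))   # size of each derived cluster
```

Note that no outcome labels are used; the clusters are derived purely from similarity, and characterizing them clinically (as White et al. did) is a separate step.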
Reinforcement learning involves training ‘agents’ to make sequential decisions to
maximize a reward function within changing environments. Liu et al. used multi-agent
reinforcement learning to optimize ambulance dispatch within a network of ambulance
centres [3]. Using a deep Q network, they achieved lower normalized wait times and
higher normalized assistance response rates than various baselines (i.e., random,
location-based, time-based, and request-based allocation).
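The core reinforcement learning loop can be illustrated with tabular Q-learning on a toy environment (far simpler than the deep Q network above; the environment, states, and rewards here are all invented):

```python
# Toy reinforcement learning: a tabular Q-learning agent on a line of states
# learns, by trial and error, to walk right toward a rewarded goal state.
import numpy as np

n_states, actions = 5, [-1, +1]         # states 0..4; actions: move left/right
Q = np.zeros((n_states, len(actions)))  # state-action value table
alpha, gamma, eps = 0.5, 0.9, 0.2       # learning rate, discount, exploration
rng = np.random.default_rng(2)

for _ in range(500):                    # training episodes
    s = int(rng.integers(n_states - 1))       # random non-terminal start
    while s != n_states - 1:                  # episode ends at the goal state
        a = int(rng.integers(2)) if rng.random() < eps else int(Q[s].argmax())
        s_next = min(max(s + actions[a], 0), n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0   # reward only at the goal
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

policy = Q.argmax(axis=1)[:-1]   # learned action for each non-terminal state
print(policy)                    # the agent should always move right
```

The same reward-driven update underlies deep Q networks, which replace the table with a neural network so that much larger state spaces (e.g., an ambulance network) become tractable.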
How can clinicians appraise machine learning solutions?
Clinicians in EDs must be able to critically appraise ML research and commercially
available models that are integrated into electronic medical record (EMR) solutions. For
example, Epic has integrated a sepsis prediction model into its EMR that uses predictors such as demographics,
vital signs, and lab values. Although this model has become widely used in the United
States, a recent study found that it had poor sensitivity, poor positive predictive
value, and created considerable alarm fatigue [4]. Critically appraising ML solutions
builds upon the well-established principles of evidence-based medicine. The performance
of ML models is measured in terms of discrimination (e.g., C-statistic/area under
the curve (AUC), sensitivity, specificity, positive and negative predictive value)
as well as calibration (the agreement between observed and predicted risk, e.g., calibration
curves), which are familiar to ED physicians. ML papers may refer to positive predictive
value as “precision” and to sensitivity as “recall”. The relative importance of these
metrics may vary based on the task at hand. For example, a model for detecting patients
at risk of adverse events due to heart failure may prioritize high sensitivity, so that
as few at-risk patients as possible are missed, and high negative predictive value, so
that patients classified as low risk truly are at low risk. Just as with traditional
risk prediction models, clinicians must ask if ML models are generalizable to their
patients and if using them is likely to improve patient outcomes.
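All of these classification metrics follow directly from a 2x2 confusion matrix; the counts below are invented for illustration.

```python
# Computing sensitivity, specificity, PPV, and NPV from a 2x2 confusion matrix.
# Invented counts: 1000 ED patients, 100 of whom have the outcome.
tp, fp, fn, tn = 80, 40, 20, 860

sensitivity = tp / (tp + fn)   # called "recall" in ML papers
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)           # called "precision" in ML papers
npv = tn / (tn + fn)

print(f"sens={sensitivity:.2f} spec={specificity:.2f} ppv={ppv:.2f} npv={npv:.2f}")
# prints: sens=0.80 spec=0.96 ppv=0.67 npv=0.98
```

Note how a modest number of false positives drags the PPV well below the sensitivity when the outcome is uncommon; this is why the choice of metric must match the clinical task.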
Four areas warrant special attention. First, while ML algorithms are powerful because
they can learn non-linear relationships in data, they can “overfit” data used during
development and form spurious correlations. This is especially true with models developed
using datasets that are small or from one/few sites. For example, a model trained
to detect pneumonia on chest x-rays may learn a site’s prevalence of pneumonia from
recognizing site-specific markers placed by x-ray technicians during image acquisition
rather than learning to identify areas of opacification [5].
Second, systemic and structural inequities in health systems can permeate the data
used to develop ML models and ultimately contribute to bias [6, 7]. For example, Indigenous
patients in Canada often face stereotyping and systemic racism in EDs and have been
found to have lower odds of receiving high-acuity triage scores across a range of
presentations [6]. Model bias frequently manifests as gaps in subgroup performance
based on demographics such as age, sex, and race. While traditional clinical risk
prediction models may also be biased (e.g., race correction in eGFR calculations)
it may be much more difficult to detect and correct biases in ML models.
Third, exogenous events (e.g., the emergence of new SARS-CoV-2 variants) that are
not captured during ML model development may compromise the relationship a model learns
between input and output data. This phenomenon, known as dataset shift, can reduce
ML model performance in sudden and significant ways.
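One simple way to watch for dataset shift is to compare the distribution of a key input between the development cohort and recent encounters, for example with a two-sample Kolmogorov-Smirnov test. The predictor, cohort sizes, and alert threshold below are all illustrative.

```python
# Sketch of input-drift monitoring for dataset shift: compare a predictor's
# distribution at development time against recent (synthetic, shifted) data.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(3)
dev_resp_rate = rng.normal(18, 4, 5000)     # respiratory rates at development
recent_resp_rate = rng.normal(22, 5, 500)   # recent encounters, shifted upward

stat, p = ks_2samp(dev_resp_rate, recent_resp_rate)
if p < 0.01:
    print("possible dataset shift: input distribution has changed")
```

Input-distribution checks like this are only one signal; monitoring the model's actual performance as outcomes accrue remains essential, since the learned input-output relationship itself can break.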
Finally, ML’s ability to learn non-linear relationships among a wide range of predictors
often yields high empirical performance but makes it difficult to discern how and
why certain predictions are made. Researchers have thus developed both ML models that
are inherently “interpretable” as well as methods to create post hoc “explanations”
of this “black box” model behaviour. While interpretability and explainability are
areas of active debate, explainability methods may be more appropriate during the
model development stage to ensure models are generally behaving appropriately as opposed
to explaining individual predictions at the time of inference. For example, Singh
et al. used Shapley values to confirm that their model to automate the ordering of
diagnostic tests at triage for paediatric patients was leveraging predictors in intuitive
and defensible ways (e.g., the model associated concern for appendicitis with the
need to order an abdominal ultrasound) [8]. While explainability and interpretability
can facilitate building trust in ML models, clinicians should not value them over
robust, empirical evidence of clinical benefit or harm from prospective, external
validation studies and randomized trials.
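The idea behind Shapley values can be shown exactly on a toy additive risk model. The coefficients, patient, and the convention of "removing" a feature by substituting its cohort mean are assumptions of this sketch; real analyses, such as Singh et al.'s, use dedicated tooling (e.g., the shap package).

```python
# Exact Shapley values for a toy 3-predictor risk model: each predictor's
# value is its average marginal contribution over all orders of "arrival".
from itertools import permutations

means = {"age": 70.0, "sbp": 130.0, "creat": 1.1}      # cohort means (invented)
patient = {"age": 82.0, "sbp": 100.0, "creat": 1.8}    # one patient (invented)

def model(x):
    # Toy additive risk score with illustrative coefficients
    return 0.02 * x["age"] - 0.01 * x["sbp"] + 0.5 * x["creat"]

features = list(means)
shapley = {f: 0.0 for f in features}
for order in permutations(features):        # all 3! = 6 feature orderings
    x = dict(means)                         # start from the "average" patient
    prev = model(x)
    for f in order:                         # add this patient's features one by one
        x[f] = patient[f]
        shapley[f] += (model(x) - prev) / 6
        prev = model(x)

print(shapley)
# For an additive model each value equals coef * (patient - mean):
# age: 0.02*12 = 0.24, sbp: -0.01*(-30) = 0.30, creat: 0.5*0.7 = 0.35
```

For non-additive models (e.g., gradient boosted trees) the contributions depend on the ordering, which is exactly why the permutation average is needed and why efficient approximations exist in practice.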
What is needed to translate machine learning solutions into clinical practice?
Prior to the widespread adoption of electronic medical records (EMRs), traditional
clinical prediction models were commonly developed using data abstracted from paper
charts or collected prospectively. The time-consuming nature of this process often
resulted in models containing a handful of predictors that were causally associated
with an outcome of interest. Resulting models were then implemented as nomograms or
points-based risk scores where clinicians could quickly determine a given patient’s
likelihood of having a condition (diagnostic models) or experiencing an outcome of
interest (prognostic models, e.g., MEWS for clinical deterioration, Canadian Syncope
Risk Score).
In contrast, contemporary ML models often require much more data than traditional
clinical prediction models. Hospitals and health systems have been developing warehouses
for their electronic medical records, demographic, laboratory, and imaging data. If
data are cleaned and organized in a reliable and recurring manner, these warehouses
can enable the development of ML models across a wide range of situations. Strong
collaboration with information technology departments as well as data governance and
data engineering teams is an important institutional prerequisite for translating ML
into practice.
Once a model is developed, it should be evaluated in a “silent trial”, where the model
generates predictions but these predictions are not acted upon by end users [9].
Silent trials can ensure that performance is robust in a cohort of patients separate
from those used to develop the model. The existence of performance deficits at this
stage may indicate the need to revisit model development with more data, more predictors,
or a different modelling approach altogether.
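A silent trial's logged predictions can later be scored against the outcomes that accrue. Everything below (cohort, logged risk scores, the development benchmark, and the 0.05 tolerance) is invented for illustration.

```python
# Sketch of scoring a silent trial: compare discrimination on logged,
# un-acted-upon predictions against a pre-specified development benchmark.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)
outcomes = rng.integers(0, 2, 400)              # outcomes observed after the fact
# Logged model risk scores, deliberately only weakly related to the outcomes
preds = outcomes * 0.1 + rng.random(400) * 0.9

silent_auc = roc_auc_score(outcomes, preds)
development_auc, tolerance = 0.85, 0.05         # pre-specified expectations
if silent_auc < development_auc - tolerance:
    print("performance deficit: revisit data, predictors, or modelling approach")
```

In practice the comparison would also cover calibration and subgroup performance, and the acceptable tolerance should be set clinically before the trial begins.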
As ML models can involve large numbers of predictors that are combined in non-linear
ways, it is infeasible for end-users to manually compute their outputs and integrate
them into clinical practice. Thus, the successful translation of ML models also requires
attention to user interfaces and experiences as models are embedded into existing
tools and workflows. For example, Verma et al. notified physicians of which patients
were at high risk of deterioration using the hospital’s existing electronic sign out
tool and text paging alerts [9].
Healthcare workflows involve a wide range of professions, including emergency physicians
and nurses, radiologists, medical imaging technicians, social workers, and more. Interdisciplinary
collaboration among all stakeholders who are impacted by a machine learning model,
throughout the stages of problem identification, solution development, and evaluation
and deployment, is vital to maximize trust in the developed model and the likelihood
of success. Ultimately, the translation of ML models into practice is a highly iterative
process, with many models requiring revisiting data curation, model evaluation, or
interface design throughout their life cycle. Quality improvement approaches such
as “plan-do-study-act” cycles and statistical process control charting are increasingly
being explored to translate ML models into practice [9]. A summary of recommendations
for clinicians to consider prior to implementing ML models into practice is presented
in Table 1.
Table 1
Summary of key considerations and recommendations for clinicians considering adopting
an ML model into practice at their institution
1. Is the model needed for the outcome of interest?
Is there a perceived need among frontline clinicians to use a prediction model in
this context?
Is the problem sufficiently common and challenging, and is there enough practice
variation, to warrant the use of an ML model? (a)
2. Was the model robustly developed?
What predictors were used to develop the model? Are they clinically plausible or sensible? (b)
Are these predictors difficult to capture? (e.g., family history of sudden death)
Are the reported performance metrics appropriate? Keep in mind the choice of performance
metric is task specific and requires clinical insight
Has the model demonstrated potential to improve meaningful clinical outcomes? (e.g.,
morbidity, mortality, length-of-stay)
Checklists such as TRIPOD-AI, STARD-AI, DECIDE-AI, and SPIRIT/CONSORT-AI can help
prospective end-users appraise studies of ML models
3. Is there sufficient technical capacity available to support potential deployment?
Consult existing organizational information technology and data governance groups
regarding feasibility and how the model would be integrated within existing information
technology systems (e.g., the electronic medical record, cloud computing systems,
etc.)
If the model is commercially available, determine what supports the vendor can provide
through the process
4. Is model performance robust in the target patient population?
Conduct a silent trial at the site of potential deployment whereby the model is generating
predictions, but these predictions are not being used to impact patient care (c)
The length of the silent trial is dependent on the prevalence of the condition of
interest (e.g., rarer diseases need more time) and duration needed for outcomes to
accrue (e.g., 30-day mortality)
Use results from the silent trial to determine if the model needs to be retrained or if
predictors need to be updated (c)
Define a cut-off for determining poor model performance; this is highly task specific
and may involve measuring the baseline performance of clinicians on that task (c)
5. Does the model appear to be trustworthy (i.e., in terms of bias, interpretability/explainability)?
Conduct a bias assessment in key population subgroups (e.g., by age, sex, race if
those data are available, socioeconomic status, etc.) (c)
Audit the predictors used in the model to determine which are most important; are
they clinically plausible or sensible? (b, c) Are there any signs that may suggest overfitting?
Audit model false positives and negatives; would clinicians make similar errors on
these examples? (c)
6. Have all relevant stakeholders considered how the model will impact clinical workflows?
Create an interdisciplinary working group to explore how to navigate concerns and
potential barriers
Ensure that team members have expertise in overlooked areas, such as human factors
and user interface/experience design
7. Is there a means of continually monitoring model performance once it has been deployed?
Keep in mind that dataset shifts can cause deterioration in model performance once
a model is deployed
Create automated alerts and a protocol to revert to baseline operating procedures
if model performance drops below a clinically significant threshold (c)
Note that following some of these recommendations may require collaborating with local
information technology departments, those with relevant scholarly expertise (e.g.,
computer science, epidemiology, quality improvement, implementation science, human
factors, etc.) or the vendors of ML solutions
(a) ML algorithms are often well-suited for problems with high signal-to-noise ratios
(i.e., those where the data are images, text, or waveforms), very large cohort sizes,
or strong temporal patterns. If input data are primarily tabular and represent one
time point (e.g., predictors are from one ED visit), traditional, regression-based
methods can yield comparable performance with less complexity (Christodoulou et al.
J Clin Epidemiol 2019)
(b) There is considerable debate about whether predictors need to be clinically plausible
in ML models, as a pure “data mining” or “big data” approach would include a wide
range of predictors without a priori selection based on causal criteria. However,
it is certainly still possible to overfit ML models, and pruning correlated or
otherwise non-influential features can improve the likelihood that a given model will
generalize in performance to other sites (Sanchez et al. Royal Society Open Science
2022). Moreover, using post-hoc explainability methods (e.g., Shapley values) to affirm
that the most important predictors are used sensibly can be an important robustness
check (Singh et al. JAMA Network Open 2022, Ghassemi et al. Lancet Digital Health
2021)
(c) Requires collaboration with groups with data access, local information technology
departments, those with scholarly expertise, or the ML solution vendor
How can clinicians prepare for the growing impact of machine learning?
All clinicians in emergency medicine must have the skills to critically appraise ML
solutions, while a subset of clinicians will also be interested in gaining the skills
to champion translation efforts. Educational offerings at different stages of training
can foster these competencies. Clinical informatics, including familiarity with ML,
has been added as a learning objective for the Medical Council of Canada Qualifying
Exam Part 1 as of 2022. As such, while undergraduate medical programs should provide
all learners with a foundational understanding of key concepts, postgraduate medical
education curricula can be focused towards each discipline and its ultimate scope
of practice. Ehrmann et al. identified four key curricular objectives: foundational
ML concepts from development to deployment, ethical and legal considerations, proper
usage of EMR and biomedical data, and the critical appraisal of ML systems [10]. These
objectives can be delivered by leveraging existing educational strategies such as
seminars during academic half day (e.g., interdisciplinary panel with ethicists and
informaticians), journal club (e.g., critical appraisal), and even simulation if sites
have models in deployment.
In summary, emergency medicine has a long track record of success in developing and
translating clinical prediction models into practice. As machine learning offers some
unique considerations and challenges, fostering collaborations between clinicians,
informaticians, computer scientists, quality improvement leaders, and knowledge translation
specialists is vital to responsibly leverage machine learning for improving outcomes
in acutely ill patients and the health systems that care for them.