ml in medicine
some rough notes on ml in medicine
1.8. ml in medicine#
1.8.1. datasets#
mimic-iv
clinicaltrials.gov - has thousands of active trials with long plain-text descriptions
fairhealth - (paid) custom datasets that can include elements such as patients’ age and gender distribution, ICD-9 and ICD-10 procedure codes, geographic locations, professionals’ specialties and more
other claims data is available but not clean
prospero - website for registering systematic reviews / meta-analyses
1.8.1.1. nlp#
MedNLI - NLI task grounded in patient history (romanov & shivade, 2018)
derived from Mimic, but expertly annotated
i2b2 named entity recognition tasks
i2b2 2006, 2010, 2012, 2014
CASI dataset - collection of abbreviations and acronyms (short forms) with their possible senses (long forms), along with other corresponding information about these terms
some extra annotations by agrawal…sontag, 2022
PMC-Patients - open-source patient snippets, but no ground-truth labels besides age, gender
EBM-NLP - annotates PICO (Participants, Interventions, Comparisons and Outcomes) spans in clinical trial abstracts
task - identify the spans that describe the respective PICO elements
review paper on clinical IE (wang…liu, 2017)
mimic-iv-benchmark (xie…liu, 2022)
3 tabular datasets derived from MIMIC-IV ED EHR
hospitalization (versus discharge) - admission to an inpatient care site immediately following the ED visit
critical - inpatient mortality / transfer to an ICU within 12 hours
reattendance - patient’s return visit to the ED within 72 hours
preprocessing for outliers / missing values (extended descriptions of variables here)
patient history
past ed visits, hospitalizations, icu admissions, comorbidities
ICD codes give patients’ comorbidities (CCI: Charlson comorbidity index, ECI: Elixhauser comorbidity index)
info at triage
temp., heart rate, pain scale, ESI, …
Emergency severity index (ESI) - 5-level triage system assigned by nurse based on clinical judgments (1 is highest priority)
top 10 chief complaints
No neurological features (e.g. GCS)
info before discharge
vitalsigns
edstays
medication prescription
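The three outcome labels above reduce to simple rules over an ED-stay record. A minimal sketch follows; the field names (`disposition`, `icu_in_hours`, `died_inpatient`, `next_ed_visit_hours`) are hypothetical stand-ins, not the real MIMIC-IV-ED schema:

```python
def make_labels(stay):
    """Derive the three mimic-iv-benchmark binary outcomes for one ED stay.
    All keys below are hypothetical stand-ins, not actual MIMIC-IV-ED columns."""
    # hospitalization: admitted to an inpatient care site right after the ED visit
    hospitalization = stay["disposition"] == "ADMITTED"
    # critical: inpatient mortality OR transfer to an ICU within 12 hours
    icu_hours = stay.get("icu_in_hours")          # hours from ED arrival to ICU, or None
    critical = stay.get("died_inpatient", False) or (
        icu_hours is not None and icu_hours <= 12
    )
    # reattendance: return visit to the ED within 72 hours
    next_visit = stay.get("next_ed_visit_hours")  # hours to next ED visit, or None
    reattendance = next_visit is not None and next_visit <= 72
    return {"hospitalization": hospitalization,
            "critical": critical,
            "reattendance": reattendance}

labels = make_labels({"disposition": "ADMITTED", "icu_in_hours": 6.0,
                      "died_inpatient": False, "next_ed_visit_hours": None})
# → {'hospitalization': True, 'critical': True, 'reattendance': False}
```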
1.8.1.2. CDI bias#
Race/sex overviews
Hidden in Plain Sight — Reconsidering the Use of Race Correction in Clinical Algorithms (vyas, eisenstein, & jones, 2020)
Now is the Time for a Postracial Medicine: Biomedical Research, the National Institutes of Health, and the Perpetuation of Scientific Racism (2017)
A Systematic Review of Barriers and Facilitators to Minority Research Participation Among African Americans, Latinos, Asian Americans, and Pacific Islanders (george, duran, & norris, 2014)
The Use of Racial Categories in Precision Medicine Research (callier, 2019)
Field Synopsis of Sex in Clinical Prediction Models for Cardiovascular Disease (paulus…kent, 2016) - supports the use of sex in predicting CVD, but not all CDIs use it
Race Corrections in Clinical Models: Examining Family History and Cancer Risk (zink, obermeyer, & pierson, 2023) - family history variables mean different things for different groups depending on how much healthcare history their family had
ML papers
When Personalization Harms Performance: Reconsidering the Use of Group Attributes in Prediction (suriyakumar, ghassemi, & ustun, 2023) - group attributes to improve performance at a population level but often hurt at a group level
Coarse race data conceals disparities in clinical risk score performance (movva…pierson, 2023)
CDI guidelines
Reporting and Methods in Clinical Prediction Research: A Systematic Review (Bouwmeester…moons, 2012) - review publications in 2008, mostly about algorithmic methodology
Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD Statement (collins…moons, 2015)
Framework for the impact analysis and implementation of Clinical Prediction Rules (CPRs) (IDAPP group, 2011) - stress validating old rules
Predictability and stability testing to assess clinical decision instrument performance for children after blunt torso trauma (kornblith…yu, 2022) - stress the use of stability, application to IAI
Methodological standards for the development and evaluation of clinical prediction rules: a review of the literature (cowley…kemp, 2019)
Predictably unequal: understanding and addressing concerns that algorithmic clinical prediction may increase health disparities (paulus & kent, 2020)
Translating Clinical Research into Clinical Practice: Impact of Using Prediction Rules To Make Decisions (reilly & evans, 2006)
Individual CDIs
Reconsidering the Consequences of Using Race to Estimate Kidney Function. (eneanya, yang, & reese, 2019)
Dissecting racial bias in an algorithm used to manage the health of populations (obermeyer et al. 2019) - for one algorithm, at a given risk score, Black patients are considerably sicker than White patients, as evidenced by signs of uncontrolled illnesses
Race, Genetic Ancestry, and Estimating Kidney Function in CKD (CRIC, 2021)
Prediction of vaginal birth after cesarean delivery in term gestations: a calculator without race and ethnicity (grobman et al. 2021)
LLM bias
Coding Inequity: Assessing GPT-4’s Potential for Perpetuating Racial and Gender Biases in Healthcare (zack…butte, alsentzer, 2023)
biased outcomes
On the Inequity of Predicting A While Hoping for B (mullainathan & obermeyer, 2021)
Algorithm was specifically trained to predict health-care costs
Because of structural biases and differential treatment, Black patients with similar needs to white patients have long been known to have lower costs
real goal was to “determine which individuals are in need of specialized intervention programs and which intervention programs are likely to have an impact on the quality of individuals’ health.”
1.8.1.3. ucsf de-id data#
black-box
predict postoperative delirium (bishara, …, donovan, 2022)
interpretable
predict multiple sclerosis by incorporating domain knowledge into a biomedical knowledge graph (nelson, …, baranzini, 2022)
predict mayo endoscopic subscores from colonoscopy reports (silverman, …, 2022)
3 types
disease and patient categorization (e.g. classification)
fundamental biological study
treatment of patients
philosophy
want to focus on problems doctors can’t do
alternatively, focus on automating screening that parents can do at home in a cost-effective way
pathology - branch of medicine where you take some tissue from a patient (e.g. a tumor), look at it under a microscope, and make an assessment of what the disease is
websites are often easier than apps for patients
The clinical artificial intelligence department: a prerequisite for success (cosgriff et al. 2020) - we need designated departments for clinical ai so we don’t have to rely on 3rd-party vendors and can test for things like distr. shift
challenges in ai healthcare (news)
adversarial examples
data can’t always be fully de-identified
algorithms / data can be biased
correlation / causation get confused
healthcare is 20% of US GDP
prognosis is a guess as to the outcome of treatment
diagnosis is actually identifying the problem and giving it a name, such as depression or obsessive-compulsive disorder
AI is a technology, but it’s not a product
health economics incentives align with health incentives: catching tumor early is cheaper for hospitals
1.8.1.4. high-level#
focus on building something you want to deploy
clinically useful - more efficient, cutting costs?
effective - does it improve the current baseline
focused on patient care - what are the unintended consequences
need to think a lot about regulation
USA: FDA
Europe: CE (more convoluted)
intended use
very specific and well-defined
1.8.2. medical system#
1.8.2.1. evaluation#
doctors are evaluated infrequently (and things like personal traits are often included)
US has pretty good care but it is expensive per patient
expensive things (e.g. Da Vinci robot)
even if ml is not perfect, it may still outperform some doctors
The impact of inconsistent human annotations on AI driven clinical decision making (sylolypavan…sim, 2023) - labels / majority vote are often very inconsistent
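Annotation inconsistency like this is usually quantified with chance-corrected agreement. A minimal Cohen’s kappa for two annotators (a generic sketch, not tied to that paper’s exact setup):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label sequences.
    1.0 = perfect agreement, 0.0 = no better than chance."""
    n = len(labels_a)
    assert n == len(labels_b) and n > 0
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # expected agreement if each annotator labeled at random from their own marginals
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b.get(k, 0) for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

cohens_kappa(["sick", "well", "sick", "well"],
             ["sick", "sick", "well", "well"])  # → 0.0 (agreement is pure chance)
```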
1.8.2.2. medical education#
rarely textbooks (often just slides)
1-2% miss rate for diagnosis can be seen as acceptable
2 years: memorizing facts about physiology, pharmacology, and pathology
2 years learning practical applications for this knowledge, such as how to decipher an EKG and how to determine the appropriate dose of insulin for a diabetic
little emphasis on the mental logic for making a correct diagnosis and avoiding mistakes
see work by pat croskerry
there is limited data on misdiagnosis rates
representativeness error - thinking is overly influenced by what is typically true
availability error - tendency to judge the likelihood of an event by the ease with which relevant examples come to mind
common infections tend to occur in epidemics, afflicting large numbers of people in a single community at the same time
confirmation bias
affective error - decisions based on what we wish were true (e.g. caring too much about patient)
See one, do one, teach one - teaching axiom
1.8.2.3. political elements#
big pharma
day-to-day
Doctors now face a burnout epidemic: thirty-five per cent of them show signs of high depersonalization
according to one recent report, only thirteen per cent of a physician’s day, on average, is spent on doctor-patient interaction
one study found that during an average eleven-hour workday, six hours are spent at the keyboard maintaining electronic health records
medicare’s RVU (relative value unit) system - changes how doctors are reimbursed, emphasizing procedural over cognitive work
ai could help - make simple diagnoses faster, reduce paperwork, help patients manage their own diseases like diabetes
ai could also make things worse - hospitals are mostly run by business people
1.8.3. medical communication#
1.8.3.1. “how do doctors think?”#
easy to misinterpret things to be causal
often no intuition for even relatively simple engineered features, such as averages
doctors require context for features (e.g. this feature is larger than the average)
often have some rules memorized (otherwise memorize what needs to be looked up)
unclear how well doctors follow rules
some rules are 1-way (e.g. only follow it if it says there is danger, otherwise use your best judgement)
2-way rules are better
without proper education 1-way rules can be dangerously used as 2-way rules
doesn’t make sense to judge 1-way rules on both specificity and sensitivity
rules are often ambiguous (e.g. what constitutes vomiting)
doctors adapt to personal experience - may be unfair to evaluate them on larger dataset
sometimes said that doctors know 10 medications by heart
Overconfidence in Clinical Decision Making (croskerry 2008)
most uncertainty: family medicine [FM] and emergency medicine [EM]
some uncertainty: internal medicine
little uncertainty: specialty disciplines
2 systems at work: intuitive (uses context, heuristics) vs analytic (systematic, rule-based)
a combination of both performs best
doctors are often black boxes as well - validated infrequently, unclear how closely they follow rules
doctors adapt to local conditions - should be evaluated only on local dataset
potential liabilities for physicians using ai (price et al. 2019)
What’s the trouble. How doctors think. New Yorker. 2007
[TRIPOD 22 points paper](https://www.tripod-statement.org/Portals/0/Tripod Checklist Prediction Model Development and Validation PDF.pdf)
How to Read Articles That Use Machine Learning: Users’ Guides to the Medical Literature (liu et al. 2019)
Carmelli et al. 2018 - primer for CDRs but also a good example of what sort of article I have envisioned creating.
Looking through the retrospectoscope: reducing bias in emergency medicine chart review studies. (kaji et al. 2018)
1.8.3.2. communicating findings#
internal/external validity = training/testing error (though external validation typically uses a different patient population, so it is a stronger check than a held-out test split)
model -> fitted model
retrospective study (looks back in time, more prone to confounding) vs prospective study
sensitivity = recall, but specificity is not precision: specificity = TN/(TN+FP), while precision (PPV) = TP/(TP+FP)
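A confusion-matrix sketch makes these mappings concrete: sensitivity is the same quantity as recall, but specificity and precision are different ratios (illustrative numbers only):

```python
def clf_metrics(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)  # of truly sick patients, fraction flagged = recall
    specificity = tn / (tn + fp)  # of truly healthy patients, fraction cleared
    precision = tp / (tp + fp)    # of flagged patients, fraction truly sick (PPV)
    return sensitivity, specificity, precision

# 10 sick, 90 healthy; the test flags 8 of the sick and 12 of the healthy
sens, spec, prec = clf_metrics(tp=8, fp=12, fn=2, tn=78)
# sensitivity (= recall) is 0.8, but specificity ≈ 0.867 while precision is only 0.4
```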
1.8.4. examples#
1.8.4.1. successful examples of ai in medicine#
ECG (NEJM, 1991)
EKG printouts include a short automated interpretation
there used to be bayesian networks / expert systems but they went away…
1.8.4.2. icu interpretability example#
goal: explain the model not the patient (that is the doctor’s job)
want to know interactions between features
some features are difficult to understand
e.g. the max over a time window might seem misleadingly high to a doctor unless they consider how it was computed
some features don’t really make sense to change (e.g. was this thing measured)
doctors like to see trends - patient health changes over time and must include history
feature importance under intervention
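One common way to get feature importance “under intervention” is permutation importance: shuffle one feature column (an intervention that breaks its link to the outcome while preserving its marginal distribution) and measure the drop in performance. A generic sketch for any black-box model; all names here are illustrative:

```python
import random

def permutation_importance(predict, X, y, feature_idx, metric, n_repeats=10, seed=0):
    """Mean drop in `metric` after shuffling column `feature_idx` of X.
    X is a list of feature tuples; `predict` maps one tuple to a prediction."""
    rng = random.Random(seed)
    baseline = metric(y, [predict(x) for x in X])
    drops = []
    for _ in range(n_repeats):
        column = [x[feature_idx] for x in X]
        rng.shuffle(column)  # intervene: decouple this feature from the outcome
        X_perm = [x[:feature_idx] + (v,) + x[feature_idx + 1:]
                  for x, v in zip(X, column)]
        drops.append(baseline - metric(y, [predict(x) for x in X_perm]))
    return sum(drops) / n_repeats

# toy model that only uses feature 0
accuracy = lambda truth, pred: sum(t == p for t, p in zip(truth, pred)) / len(truth)
X = [(0, 1), (1, 0), (0, 0), (1, 1)] * 10
y = [x[0] for x in X]
predict = lambda x: x[0]
# shuffling feature 0 hurts accuracy; shuffling the unused feature 1 changes nothing
```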
1.8.4.3. high-performance ai studies#
chest-xray: chexnet
echocardiograms: madani, ali, et al. 2018
skin: esteva, andre, et al. 2017
pathology: campanella, gabriele, et al. 2019
mammogram: kerlikowske, karla, et al. 2018
1.8.5. medical imaging#
Medical Imaging and Machine Learning
medical images often have multiple channels / are 3d - closer to video than images
1.8.6. improving medical studies#
Machine learning methods for developing precision treatment rules with observational data (Kessler et al. 2019)
goal: find precision treatment rules
problem: need large sample sizes but can’t obtain them in RCTs
recommendations
screen important predictors using large observational medical records rather than RCTs
important to do matching / weighting to account for bias in treatment assignments
alternatively, can look for a natural experiment / instrumental variable / discontinuity analysis, which has many benefits
modeling: should use ensemble methods rather than individual models
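The matching/weighting recommendation can be sketched with inverse-propensity weighting (IPW): reweight each patient by the inverse of the estimated probability of receiving the treatment they actually got. Toy data below, with propensities assumed known rather than fitted:

```python
def ipw_ate(treated, outcome, propensity):
    """Inverse-propensity-weighted estimate of the average treatment effect.
    propensity[i] = estimated P(treated | covariates of patient i),
    e.g. from a logistic regression on the observational covariates."""
    n = len(treated)
    mean_treated = sum(y * t / p for t, y, p in zip(treated, outcome, propensity)) / n
    mean_control = sum(y * (1 - t) / (1 - p)
                       for t, y, p in zip(treated, outcome, propensity)) / n
    return mean_treated - mean_control

# two strata of patients; the sicker stratum (baseline outcome 2) is treated more often
treated    = [1, 1, 1, 1, 0,   1, 0, 0, 0, 0]
outcome    = [3, 3, 3, 3, 2,   1, 0, 0, 0, 0]  # true treatment effect is +1
propensity = [0.8] * 5 + [0.2] * 5
ipw_ate(treated, outcome, propensity)  # → 1.0, while the naive difference in means is 2.2
```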