datasets

physionet
- mimic-iv
nih datasets
mdcalc datasets
pecarn
openneuro
clinicaltrials.gov - has thousands of active trials with long plain text description
fairhealth - (paid) custom datasets that can include elements such as patients’ age and gender distribution, ICD-9 and ICD-10 procedure codes, geographic locations, professionals’ specialties and more
- other claims data is available but not clean
prospero - website for registering systematic reviews / meta-analyses

nlp

n2c2 tasks
- MedNLI - NLI task grounded in patient history (romanov & shivade, 2018)
  - derived from MIMIC, but expertly annotated
- i2b2 named entity recognition tasks
  - i2b2 2006, 2010, 2012, 2014
CASI dataset - collection of abbreviations and acronyms (short forms) with their possible senses (long forms), along with other corresponding information about these terms
- some extra annotations by agrawal…sontag, 2022
PMC-Patients - open-source patient snippets, but no groundtruth labels besides age, gender
EBM-NLP - annotates PICO (Participants, Interventions, Comparisons and Outcomes) spans in clinical trial abstracts
- task - identify the spans that describe the respective PICO elements
review paper on clinical IE (wang…liu, 2017)
mimic-iv-benchmark (xie…liu, 2022)
- 3 tabular datasets derived from MIMIC-IV ED EHR
  - hospitalization (versus discharged) - met with an inpatient care site admission immediately following an ED visit
  - critical - inpatient mortality / transfer to an ICU within 12 hours
  - reattendance - patient’s return visit to ED within 72 hours
- preprocessing for outliers / missing values (extended descriptions of variables here)
  - patient history
    - past ed visits, hospitalizations, icu admissions, comorbidities
    - ICD codes give patients’ comorbidities (CCI Charlson comorbidity index, ECI Elixhauser comorbidity index)
  - info at triage
    - temp., heart rate, pain scale, ESI, …
    - Emergency severity index (ESI) - 5-level triage system assigned by nurse based on clinical judgments (1 is highest priority)
    - top 10 chief complaints
    - No neurological features (e.g. GCS)
  - info before discharge
    - vitalsigns
    - edstays
    - medication prescription
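The three outcome definitions listed above (hospitalization, critical, reattendance) can be sketched as a small labeling function. This is only an illustration: the record keys (`ed_departure`, `admitted`, `icu_transfer`, `died_inpatient`) are hypothetical and are not the real MIMIC-IV-ED column names, and measuring the 12-hour and 72-hour windows from ED departure is an assumption rather than something stated in these notes.

```python
from datetime import datetime, timedelta


def label_ed_stay(stay, next_ed_arrival=None):
    """Return the three binary outcomes for one ED stay.

    `stay` is a dict with hypothetical keys (not the real MIMIC-IV-ED schema):
      ed_departure:   datetime the patient left the ED
      admitted:       True if admitted to inpatient care right after the ED visit
      icu_transfer:   datetime of the first ICU transfer, or None
      died_inpatient: True if the patient died during the hospital stay
    `next_ed_arrival` is the datetime of the patient's next ED visit, or None.
    """
    # Assumption: both time windows are measured from ED departure.
    icu_within_12h = (
        stay["icu_transfer"] is not None
        and stay["icu_transfer"] - stay["ed_departure"] <= timedelta(hours=12)
    )
    returned_within_72h = (
        next_ed_arrival is not None
        and next_ed_arrival - stay["ed_departure"] <= timedelta(hours=72)
    )
    return {
        "hospitalization": bool(stay["admitted"]),
        "critical": bool(stay["died_inpatient"]) or icu_within_12h,
        "reattendance": returned_within_72h,
    }


if __name__ == "__main__":
    t0 = datetime(2020, 1, 1, 8, 0)
    example = {
        "ed_departure": t0,
        "admitted": True,
        "icu_transfer": t0 + timedelta(hours=5),
        "died_inpatient": False,
    }
    print(label_ed_stay(example, next_ed_arrival=t0 + timedelta(hours=100)))
```

In a real pipeline the next ED arrival would come from sorting each patient's visits by time; that step is omitted here.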

CDI bias

Race/sex overviews
- Hidden in Plain Sight — Reconsidering the Use of Race Correction in Clinical Algorithms (vyas, eisenstein, & jones, 2020)
  - Now is the Time for a Postracial Medicine: Biomedical Research, the National Institutes of Health, and the Perpetuation of Scientific Racism (2017)
- A Systematic Review of Barriers and Facilitators to Minority Research Participation Among African Americans, Latinos, Asian Americans, and Pacific Islanders (george, duran, & norris, 2014)
- The Use of Racial Categories in Precision Medicine Research (callier, 2019)
- Field Synopsis of Sex in Clinical Prediction Models for Cardiovascular Disease (paulus…kent, 2016) - supports the use of sex in predicting CVD, but not all CDIs use it
- Race Corrections in Clinical Models: Examining Family History and Cancer Risk (zink, obermeyer, & pierson, 2023) - family history variables mean different things for different groups depending on how much healthcare history their family had
ML papers
- When Personalization Harms Performance: Reconsidering the Use of Group Attributes in Prediction (suriyakumar, ghassemi, & ustun, 2023) - group attributes tend to improve performance at a population level but often hurt at a group level
- Coarse race data conceals disparities in clinical risk score performance (movva…pierson, 2023)
CDI guidelines
- Reporting and Methods in Clinical Prediction Research: A Systematic Review (Bouwmeester…moons, 2012) - reviews publications from 2008, mostly about algorithmic methodology
- Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD Statement (collins…moons, 2015)
- Framework for the impact analysis and implementation of Clinical Prediction Rules (CPRs) (IDAPP group, 2011) - stress validating old rules
- Predictability and stability testing to assess clinical decision instrument performance for children after blunt torso trauma (kornblith…yu, 2022) - stress the use of stability, application to IAI
- Methodological standards for the development and evaluation of clinical prediction rules: a review of the literature (cowley…kemp, 2019)
- Predictably unequal: understanding and addressing concerns that algorithmic clinical prediction may increase health disparities (paulus & kent, 2020)
- Translating Clinical Research into Clinical Practice: Impact of Using Prediction Rules To Make Decisions (reilly & evans, 2006)
Individual CDIs
- Reconsidering the Consequences of Using Race to Estimate Kidney Function. (eneanya, yang, & reese, 2019)
- Dissecting racial bias in an algorithm used to manage the health of populations (obermeyer et al. 2019) - for one algorithm, at a given risk score, Black patients are considerably sicker than White patients, as evidenced by signs of uncontrolled illnesses
- Race, Genetic Ancestry, and Estimating Kidney Function in CKD (CRIC, 2021)
- Prediction of vaginal birth after cesarean delivery in term gestations: a calculator without race and ethnicity (grobman et al. 2021)
LLM bias
- Coding Inequity: Assessing GPT-4’s Potential for Perpetuating Racial and Gender Biases in Healthcare (zack…butte, alsentzer, 2023)
biased outcomes
- On the Inequity of Predicting A While Hoping for B (mullainathan & obermeyer, 2021)
  - Algorithm was specifically trained to predict health-care costs
    - Because of structural biases and differential treatment, Black patients with similar needs to white patients have long been known to have lower costs
  - real goal was to “determine which individuals are in need of specialized intervention programs and which intervention programs are likely to have an impact on the quality of individuals’ health.”

ucsf de-id data

black-box
- predict postoperative delirium (bishara, …, donovan, 2022)
interpretable
- predict multiple sclerosis by incorporating domain knowledge into biomedical knowledge graph (nelson, …, baranzini, 2022)
- predict mayo endoscopic subscores from colonoscopy reports (silverman, …, 2022)
3 types
- disease and patient categorization (e.g. classification)
- fundamental biological study
- treatment of patients
philosophy
- want to focus on problems doctors can’t do
- alternatively, focus on automating problems parents can do to screen people at home in cost-effective way
pathology - branch of medicine where you take some tissue from a patient (e.g. tumor), look at it under a microscope, and make an assessment of what the disease is
websites are often easier than apps for patients
The clinical artificial intelligence department: a prerequisite for success (cosgriff et al. 2020) - we need designated departments for clinical ai so we don’t have to rely on 3rd-party vendors and can test for things like distribution shift
challenges in ai healthcare (news)
- adversarial examples
- things can’t be de-identified
- algorithms / data can be biased
- correlation / causation get confused
healthcare is 20% of US GDP
prognosis is a guess as to the outcome of treatment
diagnosis is actually identifying the problem and giving it a name, such as depression or obsessive-compulsive disorder
AI is a technology, but it’s not a product
health economics incentives align with health incentives: catching a tumor early is cheaper for hospitals

high-level

focus on building something you want to deploy
- clinically useful - more efficient, cutting costs?
- effective - does it improve the current baseline
- focused on patient care - what are the unintended consequences
need to think a lot about regulation
- USA: FDA
- Europe: CE (more convoluted)
intended use
- very specific and well-defined

medical system

evaluation

doctors are evaluated infrequently (and things like personal traits are often included)
US has pretty good care but it is expensive per patient
expensive things (e.g. Da Vinci robot)
even if ml is not perfect, it may still outperform some doctors
The impact of inconsistent human annotations on AI driven clinical decision making (sylolypavan…sim, 2023) - labels / majority vote are often very inconsistent

medical education

rarely textbooks (often just slides)
1-2% miss rate for diagnosis can be seen as acceptable
how doctors think
- 2 years: memorizing facts about physiology, pharmacology, and pathology
- 2 years learning practical applications for this knowledge, such as how to decipher an EKG and how to determine the appropriate dose of insulin for a diabetic
- little emphasis on mental logic for making a correct diagnosis and avoiding mistakes
- see work by pat croskerry
- there is limited data on misdiagnosis rates
- representativeness error - thinking is overly influenced by what is typically true
- availability error - tendency to judge the likelihood of an event by the ease with which relevant examples come to mind
  - common infections tend to occur in epidemics, afflicting large numbers of people in a single community at the same time
  - confirmation bias
- affective error - decisions based on what we wish were true (e.g. caring too much about patient)
- See one, do one, teach one - teaching axiom

political elements

why doctors should organize
big pharma
day-to-day
- Doctors now face a burnout epidemic: thirty-five per cent of them show signs of high depersonalization
- according to one recent report, only thirteen per cent of a physician’s day, on average, is spent on doctor-patient interaction
- one study: during an average eleven-hour workday, six hours are spent at the keyboard, maintaining electronic health records
- medicare’s r.v.u - changes how doctors are reimbursed, emphasising procedural over cognitive things
- ai could help - make simple diagnoses faster, reduce paperwork, help patients manage their own diseases like diabetes
- ai could also make things worse - hospitals are mostly run by business people

medical communication

“how do doctors think?”

easy to misinterpret things to be causal
often no intuition for even relatively simple engineered features, such as averages
doctors require context for features (e.g. this feature is larger than the average)
often have some rules memorized (otherwise memorize what needs to be looked up)
- unclear how well doctors follow rules
- some rules are 1-way (e.g. only follow it if it says there is danger, otherwise use your best judgement)
  - 2-way rules are better
  - without proper education 1-way rules can be dangerously used as 2-way rules
  - doesn’t make sense to judge 1-way rules on both specificity and sensitivity
rules are often ambiguous (e.g. what constitutes vomiting)
doctors adapt to personal experience - may be unfair to evaluate them on larger dataset
sometimes said that doctors know 10 medications by heart
Overconfidence in Clinical Decision Making (croskerry 2008)
- most uncertainty: family medicine [FM] and emergency medicine [EM]
- some uncertainty: internal medicine
- little uncertainty: specialty disciplines
- 2 systems at work: intuitive (uses context, heuristics) vs analytic (systematic, rule-based)
  - a combination of both performs best
- doctors are often black boxes as well - validated infrequently, unclear how closely they follow rules
- doctors adapt to local conditions - should be evaluated only on local dataset
potential liabilities for physicians using ai (price et al. 2019)
What’s the trouble. How doctors think. New Yorker. 2007
JAMA Users’ Guide to the Medical Literature
TRIPOD 22 points paper
basic stats in the step1 exam
How to Read Articles That Use Machine Learning: Users’ Guides to the Medical Literature (liu et al. 2019)
Carmelli et al. 2018 - primer for CDRs but also a good example of what sort of article I have envisioned creating.
Looking through the retrospectoscope: reducing bias in emergency medicine chart review studies. (kaji et al. 2018)

communicating findings

don’t use ROC curves, use deciles
need to evaluate use, not just metric
internal/external validity = training/testing error
model -> fitted model
retrospective (more confounding, looks back) vs prospective study
internal/external validity = train/test (although external was usually using different patient population, so is stronger)
sensitivity = recall; specificity = true negative rate (the clinical analogue of precision is PPV, which is not the same as specificity)
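Two of the notes above lend themselves to a small sketch: reporting by risk decile instead of an ROC curve, and how the clinical metrics map onto the ML ones. Sensitivity is the same quantity as recall, while precision corresponds to positive predictive value (PPV) and generally differs from specificity. The function names and toy data below are illustrative only.

```python
def _ratio(numerator, denominator):
    """Safe division: NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity and precision (PPV) from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "sensitivity": _ratio(tp, tp + fn),  # identical to recall
        "specificity": _ratio(tn, tn + fp),  # true negative rate
        "precision": _ratio(tp, tp + fp),    # identical to PPV
    }


def decile_table(y_true, risk, n_bins=10):
    """Sort patients by predicted risk, split into equal-sized bins, and
    report mean predicted risk and observed event rate for each bin."""
    order = sorted(range(len(risk)), key=lambda i: risk[i])
    n = len(order)
    rows = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not idx:  # fewer patients than bins
            continue
        rows.append({
            "bin": b + 1,
            "n": len(idx),
            "mean_risk": sum(risk[i] for i in idx) / len(idx),
            "event_rate": sum(y_true[i] for i in idx) / len(idx),
        })
    return rows


if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]
    print(confusion_metrics(y_true, y_pred))
    risk = [0.05 * i for i in range(20)]
    events = [0] * 10 + [1] * 10
    for row in decile_table(events, risk):
        print(row)
```

A decile table shows how many patients fall at each risk level and how often the outcome actually occurs there, which an ROC curve does not show directly.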

examples

successful examples of ai in medicine

ECG (NEJM, 1991)
EKG has a small interpretation on it
there used to be bayesian networks / expert systems but they went away…

icu interpretability example

goal: explain the model not the patient (that is the doctor’s job)
want to know interactions between features
some features are difficult to understand
- e.g. max over this window, might seem high to a doctor unless they think about it
some features don’t really make sense to change (e.g. was this thing measured)
doctors like to see trends - patient health changes over time and must include history
feature importance under intervention

high-performance ai studies

chest-xray: chexnet
echocardiograms: madani, ali, et al. 2018
skin: esteva, andre, et al. 2017
pathology: campanella, gabriele, et al. 2019
mammogram: kerlikowske, karla, et al. 2018

medical imaging

Medical Imaging and Machine Learning
- medical images often have multiple channels / are 3d - closer to video than images

improving medical studies

Machine learning methods for developing precision treatment rules with observational data (Kessler et al. 2019)
- goal: find precision treatment rules
- problem: need large sample sizes but can’t obtain them in RCTs
- recommendations
  - screen important predictors using large observational medical records rather than RCTs
    - important to do matching / weighting to account for bias in treatment assignments
    - alternatively, can look for natural experiment / instrumental variable / discontinuity analysis
    - has many benefits
  - modeling: should use ensemble methods rather than individual models
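The weighting idea mentioned above (accounting for bias in treatment assignment in observational data) can be illustrated with a minimal inverse-probability-weighting sketch. This is a generic textbook estimator, not the specific procedure from Kessler et al. It assumes a single discrete confounder (`stratum`), estimates the propensity as the treated fraction within each stratum, and needs both treated and untreated patients in every stratum (overlap). All names and data are made up.

```python
from collections import defaultdict


def ipw_ate(records):
    """Normalized inverse-probability-weighted average treatment effect.

    Each record is a (stratum, treated, outcome) tuple, where `stratum` is a
    discrete confounder, `treated` is 0 or 1, and `outcome` is numeric.
    """
    # Estimate P(treated | stratum) as the treated fraction in each stratum.
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_treated, n_total]
    for stratum, treated, _ in records:
        counts[stratum][0] += treated
        counts[stratum][1] += 1
    propensity = {s: t / n for s, (t, n) in counts.items()}

    num1 = den1 = num0 = den0 = 0.0
    for stratum, treated, outcome in records:
        e = propensity[stratum]
        if treated:
            w = 1.0 / e          # up-weight treated patients who were unlikely to be treated
            num1 += w * outcome
            den1 += w
        else:
            w = 1.0 / (1.0 - e)  # up-weight untreated patients who were likely to be treated
            num0 += w * outcome
            den0 += w
    return num1 / den1 - num0 / den0


if __name__ == "__main__":
    # Toy data: sicker patients (stratum "B") are treated more often and have
    # higher baseline outcomes; the true treatment effect is +1 everywhere.
    records = (
        [("A", 1, 1.0)] + [("A", 0, 0.0)] * 3
        + [("B", 1, 6.0)] * 3 + [("B", 0, 5.0)]
    )
    treated = [y for _, t, y in records if t]
    control = [y for _, t, y in records if not t]
    print("naive difference:", sum(treated) / len(treated) - sum(control) / len(control))
    print("IPW estimate:", ipw_ate(records))
```

On this toy data the naive treated-versus-untreated difference is inflated by the confounder, while the weighted estimate recovers the effect built into the data. Real medical-record analyses would need a fitted propensity model and checks for overlap.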