ml in medicine

Some rough notes on ml in medicine


  • physionet
    • mimic-iv
  • nih datasets
  • mdcalc datasets
  • pecarn
  • openneuro
  • - has thousands of active trials with long plain text description
  • fairhealth - (paid) custom datasets that can include elements such as patients’ age and gender distribution, ICD-9 and ICD-10 procedure codes, geographic locations, professionals’ specialties and more
    • other claims data is available but not clean
  • prospero - website for registering systematic reviews / meta-analyses


  • n2c2 tasks
    • MedNLI - NLI task grounded in patient history (romanov & shivade, 2018)
      • derived from Mimic, but expertly annotated
    • i2b2 named entity recognition tasks
      • i2b2 2006, 2010, 2012, 2014
  • CASI dataset - collection of abbreviations and acronyms (short forms) with their possible senses (long forms), along with other corresponding information about these terms
  • PMC-Patients - open-source patient snippets, but no groundtruth labels besides age, gender
  • EBM-NLP - annotates PICO (Participants, Interventions, Comparisons and Outcomes) spans in clinical trial abstracts
    • task - identify the spans that describe the respective PICO elements
  • review paper on clinical IE (wang…liu, 2017)
  • mimic-iv-benchmark (xie…liu, 2022)
    • 3 tabular datasets derived from MIMIC-IV ED EHR
      • hospitalization (versus discharged) - met with an inpatient care site admission immediately following an ED visit
      • critical - inpatient mortality / transfer to an ICU within 12 hours
      • reattendance - patient’s return visit to ED within 72 hours
    • preprocessing for outliers / missing values (extended descriptions of variables here)
      • patient history
        • past ed visits, hospitalizations, icu admissions, comorbidities
        • ICD codes give patients' comorbidities (CCI - Charlson comorbidity index, ECI - Elixhauser comorbidity index)
      • info at triage
        • temp., heart rate, pain scale, ESI, …
        • Emergency severity index (ESI) - 5-level triage system assigned by nurse based on clinical judgments (1 is highest priority)
        • top 10 chief complaints
        • No neurological features (e.g. GCS)
      • info before discharge
        • vitalsigns
        • edstays
        • medication prescription
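
A minimal sketch of how the three outcome labels above could be derived for a single ED visit. The dict keys (`ed_end`, `admitted`, `died_in_hospital`, `icu_start`) are hypothetical and are not the real MIMIC-IV-ED column names; measuring both time windows from the end of the ED stay is also an assumption, as is reading "critical" as inpatient mortality or ICU transfer within 12 hours.

```python
from datetime import datetime, timedelta


def label_ed_visit(visit, next_visit_start=None):
    """Derive hospitalization / critical / reattendance flags for one ED visit.

    `visit` uses hypothetical keys (not the real MIMIC-IV-ED schema):
      ed_end           - datetime the ED stay ended
      admitted         - True if an inpatient admission immediately followed
      died_in_hospital - True if the patient died during that admission
      icu_start        - datetime of the first ICU transfer, or None
    `next_visit_start` is the start of the patient's next ED visit, or None.
    """
    hospitalization = bool(visit["admitted"])

    # ICU transfer counts only if it falls in the 12 hours after the ED stay.
    icu_within_12h = (
        visit["icu_start"] is not None
        and timedelta(0) <= visit["icu_start"] - visit["ed_end"] <= timedelta(hours=12)
    )
    critical = bool(visit["died_in_hospital"]) or icu_within_12h

    # Return visit to the ED within 72 hours of leaving it.
    reattendance = (
        next_visit_start is not None
        and timedelta(0) <= next_visit_start - visit["ed_end"] <= timedelta(hours=72)
    )
    return {
        "hospitalization": hospitalization,
        "critical": critical,
        "reattendance": reattendance,
    }
```

In practice these flags would be computed per row of a joined visits table; the function form just makes each definition explicit.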

CDI bias

  • Race/sex overviews
    • Hidden in Plain Sight — Reconsidering the Use of Race Correction in Clinical Algorithms (vyas, eisenstein, & jones, 2020)
      • Now is the Time for a Postracial Medicine: Biomedical Research, the National Institutes of Health, and the Perpetuation of Scientific Racism (2017)
    • A Systematic Review of Barriers and Facilitators to Minority Research Participation Among African Americans, Latinos, Asian Americans, and Pacific Islanders (george, duran, & norris, 2014)
    • The Use of Racial Categories in Precision Medicine Research (callier, 2019)
    • Field Synopsis of Sex in Clinical Prediction Models for Cardiovascular Disease (paulus…kent, 2016) - supports the use of sex in predicting CVD, but not all CDIs use it
    • Race Corrections in Clinical Models: Examining Family History and Cancer Risk (zink, obermeyer, & pierson, 2023) - family history variables mean different things for different groups depending on how much healthcare history their family had
  • ML papers
    • When Personalization Harms Performance: Reconsidering the Use of Group Attributes in Prediction (suriyakumar, ghassemi, & ustun, 2023) - group attributes can improve performance at a population level but often hurt at a group level
    • Coarse race data conceals disparities in clinical risk score performance (movva…pierson, 2023)
  • CDI guidelines
    • Reporting and Methods in Clinical Prediction Research: A Systematic Review (Bouwmeester…moons, 2012) - reviews publications from 2008, mostly about algorithmic methodology
    • Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD Statement (collins…moons, 2015)
    • Framework for the impact analysis and implementation of Clinical Prediction Rules (CPRs) (IDAPP group, 2011) - stress validating old rules
    • Predictability and stability testing to assess clinical decision instrument performance for children after blunt torso trauma (kornblith…yu, 2022) - stress the use of stability, application to IAI
    • Methodological standards for the development and evaluation of clinical prediction rules: a review of the literature (cowley…kemp, 2019)
    • Predictably unequal: understanding and addressing concerns that algorithmic clinical prediction may increase health disparities (paulus & kent, 2020)
    • Translating Clinical Research into Clinical Practice: Impact of Using Prediction Rules To Make Decisions (reilly & evans, 2006)
  • Individual CDIs
    • Reconsidering the Consequences of Using Race to Estimate Kidney Function. (eneanya, yang, & reese, 2019)
    • Dissecting racial bias in an algorithm used to manage the health of populations (obermeyer et al. 2019) - for one algorithm, at a given risk score, Black patients are considerably sicker than White patients, as evidenced by signs of uncontrolled illnesses
    • Race, Genetic Ancestry, and Estimating Kidney Function in CKD (CRIC, 2021)
    • Prediction of vaginal birth after cesarean delivery in term gestations: a calculator without race and ethnicity (grobman et al. 2021)
  • LLM bias
  • biased outcomes
    • On the Inequity of Predicting A While Hoping for B (mullainathan & obermeyer, 2021)
      • Algorithm was specifically trained to predict health-care costs
        • Because of structural biases and differential treatment, Black patients with similar needs to white patients have long been known to have lower costs
      • real goal was to “determine which individuals are in need of specialized intervention programs and which intervention programs are likely to have an impact on the quality of individuals’ health.”
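
One way to check for the pattern described above (at a given risk score, one group being sicker than another) is to bin patients by score and compare a direct health measure across groups within each bin. This is a rough sketch on hypothetical record fields (`score`, `group`, `illness`), not the analysis from the cited papers.

```python
from collections import defaultdict
from statistics import mean


def illness_by_score_bin(records, n_bins=10):
    """Mean illness burden for each (risk-score bin, group) cell.

    Each record is a dict with hypothetical keys:
      score   - algorithm risk score in [0, 1]
      group   - group label
      illness - direct health measure, e.g. number of active chronic conditions
    """
    cells = defaultdict(list)
    for r in records:
        # A score of exactly 1.0 is placed in the top bin.
        b = min(int(r["score"] * n_bins), n_bins - 1)
        cells[(b, r["group"])].append(r["illness"])
    return {key: mean(vals) for key, vals in sorted(cells.items())}
```

If one group shows a consistently higher mean illness burden within the same score bin, the score is understating that group's need, which is what one would expect when the training label (cost) diverges from the real target (health).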

ucsf de-id data

  • black-box
  • interpretable
  • 3 types
    • disease and patient categorization (e.g. classification)
    • fundamental biological study
    • treatment of patients
  • philosophy
    • want to focus on problems doctors can’t do
    • alternatively, focus on automating problems parents can do to screen people at home in cost-effective way
  • pathology - branch of medicine where you take some tissue from a patient (e.g. tumor), look at it under a microscope, and make an assessment of what the disease is
  • websites are often easier than apps for patients
  • The clinical artificial intelligence department: a prerequisite for success (cosgriff et al. 2020) - we need designated departments for clinical ai so we don’t have to rely on 3rd-party vendors and can test for things like distr. shift
  • challenges in ai healthcare (news)
    • adversarial examples
    • things can’t be de-identified
    • algorithms / data can be biased
    • correlation / causation get confused
  • healthcare is close to 20% of US GDP (roughly 17-18% in most recent years)
  • prognosis is a guess as to the outcome of treatment
  • diagnosis is actually identifying the problem and giving it a name, such as depression or obsessive-compulsive disorder
  • AI is a technology, but it’s not a product
  • health economics incentives align with health incentives: catching tumor early is cheaper for hospitals


  • focus on building something you want to deploy
    • clinically useful - more efficient, cutting costs?
    • effective - does it improve the current baseline
    • focused on patient care - what are the unintended consequences
  • need to think a lot about regulation
    • USA: FDA
    • Europe: CE (more convoluted)
  • intended use
    • very specific and well-defined

medical system


  • doctors are evaluated infrequently (and things like personal traits are often included)
  • US has pretty good care but it is expensive per patient
  • expensive things (e.g. Da Vinci robot)
  • even if ml is not perfect, it may still outperform some doctors
  • The impact of inconsistent human annotations on AI driven clinical decision making (sylolypavan…sim, 2023) - labels / majority vote are often very inconsistent

medical education

  • rarely textbooks (often just slides)
  • 1-2% miss rate for diagnosis can be seen as acceptable
  • how doctors think
    • 2 years: memorizing facts about physiology, pharmacology, and pathology
    • 2 years learning practical applications for this knowledge, such as how to decipher an EKG and how to determine the appropriate dose of insulin for a diabetic
    • little emphasis on mental logic for making a correct diagnosis and avoiding mistakes
    • see work by pat croskerry
    • there is limited data on misdiagnosis rates
    • representativeness error - thinking is overly influenced by what is typically true
    • availability error - tendency to judge the likelihood of an event by the ease with which relevant examples come to mind
      • common infections tend to occur in epidemics, afflicting large numbers of people in a single community at the same time
      • confirmation bias
    • affective error - decisions based on what we wish were true (e.g. caring too much about patient)
    • See one, do one, teach one - teaching axiom

political elements

  • why doctors should organize
  • big pharma
  • day-to-day
    • Doctors now face a burnout epidemic: thirty-five per cent of them show signs of high depersonalization
    • according to one recent report, only thirteen per cent of a physician’s day, on average, is spent on doctor-patient interaction
    • one study: during an average eleven-hour workday, six hours are spent at the keyboard, maintaining electronic health records
    • medicare's r.v.u. (relative value unit) - changes how doctors are reimbursed, emphasising procedural over cognitive things
    • ai could help - make simple diagnoses faster, reduce paperwork, help patients manage their own diseases like diabetes
    • ai could also make things worse - hospitals are mostly run by business people

medical communication

“how do doctors think?”

communicating findings

  • don’t use ROC curves, use deciles
  • need to evaluate use, not just metric
  • model -> fitted model
  • retrospective (more confounding, looks back) vs prospective study
  • internal/external validity = train/test (although external was usually using different patient population, so is stronger)
  • sensitivity = recall; specificity is the true negative rate and is not the same as precision (positive predictive value)
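
A small sketch of two points from the list above, on made-up inputs: how sensitivity, specificity, and precision come out of a confusion matrix (precision and specificity are different quantities), and what a decile table of predicted risk versus observed event rate might look like as an alternative to an ROC curve. The function names are illustrative only.

```python
def _ratio(num, den):
    """Safe division: NaN when the denominator is zero."""
    return num / den if den else float("nan")


def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, and precision for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": _ratio(tp, tp + fn),  # same as recall
        "specificity": _ratio(tn, tn + fp),  # true negative rate
        "precision": _ratio(tp, tp + fp),    # positive predictive value
    }


def decile_table(y_true, risk, n_bins=10):
    """Sort patients by predicted risk, split into equal-size bins, and
    report mean predicted risk and observed event rate for each bin."""
    order = sorted(range(len(risk)), key=lambda i: risk[i])
    n = len(order)
    rows = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not idx:
            continue  # fewer patients than bins
        rows.append({
            "decile": b + 1,
            "n": len(idx),
            "mean_predicted": sum(risk[i] for i in idx) / len(idx),
            "observed_rate": sum(y_true[i] for i in idx) / len(idx),
        })
    return rows
```

A clinician can read the decile table directly ("in the top tenth of predicted risk, what fraction of patients had the event?"), which is harder to do from an ROC curve.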


successful examples of ai in medicine

  • ECG (NEJM, 1991)
  • EKG has a small interpretation on it
  • there used to be bayesian networks / expert systems but they went away…

icu interpretability example

  • goal: explain the model not the patient (that is the doctor’s job)
  • want to know interactions between features
  • some features are difficult to understand
    • e.g. max over this window, might seem high to a doctor unless they think about it
  • some features don’t really make sense to change (e.g. was this thing measured)
  • doctors like to see trends - patient health changes over time and must include history
  • feature importance under intervention

high-performance ai studies

  • chest-xray: chexnet
  • echocardiograms: madani, ali, et al. 2018
  • skin: esteva, andre, et al. 2017
  • pathology: campanella, gabriele, et al. 2019
  • mammogram: kerlikowske, karla, et al. 2018

medical imaging

improving medical studies

  • Machine learning methods for developing precision treatment rules with observational data (Kessler et al. 2019)
    • goal: find precision treatment rules
    • problem: need large sample sizes but can’t obtain them in RCTs
    • recommendations
      • screen important predictors using large observational medical records rather than RCTs
        • important to do matching / weighting to account for bias in treatment assignments
        • alternatively, can look for natural experiment / instrumental variable / discontinuity analysis
        • has many benefits
      • modeling: should use ensemble methods rather than individual models
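
To make the weighting recommendation concrete, here is a rough sketch of inverse propensity weighting on observational data. It is one possible weighting scheme, not necessarily the approach used by Kessler et al.: it assumes a single discrete confounder (`stratum`, hypothetical) and estimates the propensity as the treated fraction within each stratum.

```python
from collections import defaultdict


def ipw_effect(rows):
    """Inverse-propensity-weighted difference in mean outcome (treated minus control).

    Each row is (stratum, treated, outcome):
      stratum - a discrete confounder (hypothetical, e.g. a severity category)
      treated - 0 or 1
      outcome - numeric outcome
    Weights are normalized within each arm (Hajek-style estimator).
    """
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_control, n_treated]
    for s, t, _ in rows:
        counts[s][t] += 1

    num = [0.0, 0.0]  # weighted outcome sums, indexed by treatment arm
    den = [0.0, 0.0]  # weight sums, indexed by treatment arm
    for s, t, y in rows:
        n0, n1 = counts[s]
        if n0 == 0 or n1 == 0:
            continue  # no overlap in this stratum, so it cannot be compared
        p_received = (n1 if t else n0) / (n0 + n1)
        w = 1.0 / p_received
        num[t] += w * y
        den[t] += w
    return num[1] / den[1] - num[0] / den[0]
```

For example, if severe patients are both more likely to be treated and have worse baseline outcomes, the naive treated-minus-control difference mixes the treatment effect with severity, while the weighted estimate compares like with like within strata. Real analyses would estimate propensities from many covariates and check overlap more carefully.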