
causal inference


Some notes on causal inference, both from introductory courses following the neyman-rubin framework (+ the textbook “What if?”) and from Judea Pearl’s ladder of causality (+ “The Book of Why”). Also includes notes from this chapter of the fairml book.



  • confounding = difference between groups other than the treatment which affects the response
    • this is the key problem when using observational (non-experimental) data to make causal inferences
    • problem occurs because we don’t get to see counterfactuals
    • ex from Pearl (figure omitted)
  • randomized control trial (RCT) - controls for any possible confounders


  • epiphenomenon - a correlated effect (not a cause)
    • a secondary effect or byproduct that arises from but does not causally influence a process
  • propensity score - probability that a subject receives the treatment, conditioned on appropriate covariates
  • case-control study - retrospective - compares “cases” (people with a disease) to controls
  • sensitivity analysis - instead of drawing conclusions by assuming the absence of certain causal relationships, challenge such assumptions and evaluate how strong alternative relationships must be in order to explain the observed data
  • regression-based adjustment - if we know the confounders, we can regress on both the confounders and the treatment; the coefficient for the treatment (the partial regression coefficient) then gives the average causal effect (see the sketch after this list)
    • works only for linear models
  • instrumental variables - variable which can be used to effectively run an RCT, because it was made random by some external factor
    • ex. army draft, john snow’s cholera study
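A minimal sketch of regression-based adjustment on simulated data (the data-generating process, numbers, and variable names here are hypothetical, chosen only to illustrate that the partial regression coefficient recovers the effect under a correct linear model):

```python
import numpy as np

# hypothetical linear data-generating process with one observed confounder z
rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)                           # observed confounder
t = (z + rng.normal(size=n) > 0).astype(float)   # treatment depends on z
y = 2.0 * t + 3.0 * z + rng.normal(size=n)       # true causal effect of t is 2

# regress y on [1, t, z]; the coefficient on t is the adjusted effect estimate
X = np.column_stack([np.ones(n), t, z])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(y[t == 1].mean() - y[t == 0].mean())  # naive diff-in-means: biased (> 2)
print(beta[1])                              # partial regression coefficient: ~2
```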


  • bradford hill criteria - some simple criteria for establishing causality (e.g. strength, consistency, specificity)
    • association is circumstantial evidence for causation
  • no causation without manipulation (Holland, 1986)
    • in this manner, something like causal effect of race/gender doesn’t make sense
    • can partially get around this by changing race $\to$ perceived race
    • weaker view (e.g. of Pearl) is that we only need to be able to understand how entities interact (e.g. write an SEM)
  • different levels
    • levels of experiment: experiment, RCT, natural experiment, observation
    • levels of evidence: marginal correlation, regression, invariance, causal
    • levels of inference (pearl’s ladder of causality): prediction/association, intervention, counterfactuals
      • kosuke imai’s levels of inference: descriptive, predictive, causal

common examples

  • HIP trial of mammography - want to compare the whole treatment group v. the whole control group (intention-to-treat), since who accepts screening is not random
  • John Snow on cholera - water
  • causes of poverty - Yule’s model, changes with lots of things
  • liver transplant
    • maximize benefit (life with - life without)
    • currently just goes to person who would die quickest without
  • monty hall problem: why you should switch (a quick simulation follows this list)

    ```mermaid
    graph LR
    A(Your Door) --> B(Door Opened)
    C(Location of Car) --> B
    ```
  • berkson’s paradox - diseases in hospitals are correlated even when they are not in the general population
    • possible explanation - only having both diseases together is strong enough to put you in the hospital
  • simpson’s paradox - trend appears in several different groups but disappears/reverses when groups are combined
    • e.g. overall men seemed to have higher acceptance rates, but in each dept. women seemed to have higher acceptance rates - explanation is that women selectively apply to harder depts.
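A quick simulation of the monty hall problem (toy code, not from any of the sources above) confirming that switching wins ~2/3 of the time:

```python
import random

def play(switch: bool) -> bool:
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    # the host opens a door that is neither your pick nor the car
    opened = random.choice([d for d in doors if d != pick and d != car])
    if switch:  # switch to the one remaining closed door
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

trials = 100_000
print("stay:  ", sum(play(False) for _ in range(trials)) / trials)  # ~1/3
print("switch:", sum(play(True) for _ in range(trials)) / trials)   # ~2/3
```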


potential outcome framework (neyman-rubin)

  • advantages over DAGs: easy to express some common assumptions, such as monotonicity / convexity
  • 3 frameworks
    1. neyman-rubin model: $Y_i = T_i a_i + (1-T_i) b_i$
      • $\widehat{ATE} = \bar{a}_A - \bar{b}_B$ (see the estimator sketch after this list)
      • $\widehat{ATE}_{adj} = [\bar{a}_A - (\bar{x}_A - \bar{x})^T \hat{\theta}_A] - [\bar{b}_B - (\bar{x}_B - \bar{x})^T \hat{\theta}_B]$
      • $\hat{\theta}_A = \text{argmin}_\theta \sum_{i \in A} (a_i - \bar{a}_A - (x_i - \bar{x}_A)^T \theta)^2$
    2. neyman-pearson
      • null + alternative hypothesis
      • null is favored unless there is strong evidence to refute it
    3. fisherian testing framework
      • small p-values are evidence against the null hypothesis
      • only a null hypothesis is specified (no explicit alternative)
  • 3 principles of experimental design: replication, randomization, conditioning
  • action = intervention, exposure, treatments
  • action $A$ and outcome $Y$
  • potential outcomes = counterfactual outcomes $Y^{a=1}, Y^{a=0}$
  • average treatment effect ATE: $E[Y^{a=1} - Y^{a=0}]$
  • key assumptions: SUTVA, consistency, ignorability
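A minimal sketch of the two estimators above (group A is treated with outcomes $a_i$, group B is control with outcomes $b_i$; the numpy-array data layout is a hypothetical choice):

```python
import numpy as np

def ate_unadjusted(a_A, b_B):
    """difference in group means: ATE-hat = mean(a_A) - mean(b_B)"""
    return a_A.mean() - b_B.mean()

def ate_adjusted(a_A, x_A, b_B, x_B):
    """regression-adjusted ATE-hat: fit a centered least-squares adjustment
    within each group, then compare both groups at the pooled covariate mean"""
    x_bar = np.vstack([x_A, x_B]).mean(axis=0)
    theta_A, *_ = np.linalg.lstsq(x_A - x_A.mean(axis=0), a_A - a_A.mean(), rcond=None)
    theta_B, *_ = np.linalg.lstsq(x_B - x_B.mean(axis=0), b_B - b_B.mean(), rcond=None)
    adj_A = a_A.mean() - (x_A.mean(axis=0) - x_bar) @ theta_A
    adj_B = b_B.mean() - (x_B.mean(axis=0) - x_bar) @ theta_B
    return adj_A - adj_B
```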

DAGs (pearl et al.)

  • advantages over potential outcomes
    • easy to make clear exactly what is independent, particularly when there are many variables
      • however, often very difficult to come up with proper causal graph
    • do-calculus allows for answering some specific questions easily
  • blog post on causal ladder
  • intro to do-calculus post and subsequent posts

causal ladder (different levels of inference)

  1. prediction/association - just need to have the joint distr. of all the variables
    • basically just $p(y \mid x)$
  2. intervention - we can change things and get conditionals based on evidence after intervention
    • $p(y \mid do(x))$ - which represents the conditional distr. we would get if we were to manipulate $x$ in a randomized trial
    • to get this, we assume the causal structure (can still kind of test it based on conditional distrs.)
    • having assumed the structure, we delete all edges going into a do operator and set the value of $x$
    • then, do-calculus yields a formula to estimate $p(y \mid do(x))$ assuming this causal structure
    • see introductory paper here, more detailed paper here (pearl 2013) - by assuming structure, we learn how large impacts are
  3. counterfactuals - we can change things and get conditionals based on evidence before intervention
    • instead of the intervention $p(y \mid do(x))$ we get $p(y^* \mid x^*, z=z)$, where $z$ represents fixing all the other variables and $y^*$ and $x^*$ are not observed
    • averaging over all data points, we’d expect to get something similar to the intervention $p(y \mid do(x))$ - a probabilistic answer to a “what would have happened if” question
    • e.g. “Given that Hillary lost and didn’t visit Michigan, would she win if she had visited Michigan?”
    • e.g. “What fraction of patients who are treated and died would have survived if they were not treated?”
    • this allows for our intervention to contradict something we condition on
    • simple matching is often not sufficient (need a very good model for how to match, hopefully a causal one) - key difference with standard intervention is that we incorporate available evidence into our calculation
    • available evidence influences exogenous variables
    • this is for a specific data point, not a randomly sampled data point like an intervention would be
    • requires SEM, not just causal graph

sem (structured equation model)

  • gives a set of variables $X_1, …, X_i$ and assignments of the form $X_i := f_i(X_{parents(i)}, \epsilon_i)$, which tell how to compute the value of each node given its parents

    • $\epsilon_i$ = noise variables = exogenous nodes - node in the network that represents all the data not collected
    • parent nodes = direct causes
  • again, fix the value of $x$ (and the values of $\epsilon$ seen in the data) and use the SEM to set all downstream variables
  • ex. SEM (figure omitted)

    • in this ex, W and H are usually correlated, so conditional distrs. are similar, but the do operator of changing W has no effect on H (and vice versa) - see the simulation sketch below
    • notation: $P(H \mid do(W:=1))$ or $P_{M[W:=1]}(h)$
    • ATE of $W$ on $H$ would be $P(H \mid do(W:=1)) - P(H \mid do(W:=0))$
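Since the original SEM figure is missing, here is a guessed stand-in structure with the same qualitative behavior (a common cause $Z$ drives both $W$ and $H$): observationally $W$ and $H$ are correlated, but $do(W:=w)$ leaves $H$ unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def sample(do_w=None):
    # assumed SEM: Z -> W, Z -> H (no edge between W and H)
    z = rng.normal(size=n)                                    # exogenous cause
    w = z + rng.normal(size=n) if do_w is None else np.full(n, do_w)
    h = z + rng.normal(size=n)                                # h ignores w
    return w, h

w, h = sample()
print(np.corrcoef(w, h)[0, 1])   # observational corr(W, H): clearly > 0
_, h1 = sample(do_w=1.0)
_, h0 = sample(do_w=0.0)
print(h1.mean() - h0.mean())     # ATE of do(W) on H: ~0
```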

causal graphs

  • common graphs
    • absence of edges often corresponds to qualitative judgements

      | forks | mediators | colliders |
      | --- | --- | --- |
      | confounder $z$, can be adjusted for | mediator $z$ can vary the causal effect | conditioning on collider $z$ can explain away a cause |
  • controlling for a variable (when we have a causal graph) - see the stratification sketch after this list:
  • $P(Y=y \mid do(X:=x)) = \sum_z \underbrace{P(Y=y \mid X=x, X_{parents}=z)}_{\text{effect for slice}} \underbrace{P(X_{parents}=z)}_{\text{weight for slice}}$
  • counterfactual - given structural causal model M, observed event E, action X:=x, target variable Y, define counterfactual $Y_{X:=x}(E)$ in 3 steps:
    • abduction - adjust noise variables to be consistent with observation
    • action - perform do-intervention
    • prediction - compute target counterfactual
    • counterfactual can be a random variable or deterministic
  • back-door criterion - establishes if 2 variables X, Y are confounded
    • more details:
    • ensure that there is no path which points to X which allows dependence between X and Y (paths which point to X are non-causal, representing confounders)
    • remember, in DAG junctions conditioning makes things independent unless it’s at a V junction (collider), where conditioning induces dependence
  • front-door criterion - want to deconfound treatment from outcome, even without info on the confounder
    • only really need to know about treatment, M, and outcome
    ```mermaid
    graph LR
    C(Confounder) --> Y(Outcome)
    C --> X(Treatment)
    X --> M
    M --> Y
    ```
  • mediation analysis - identify a mechanism through which a cause has an effect
    • if there are multiple possible paths by which a variable can exert influence, can figure out which path does what, even with just observational data
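A minimal sketch of the adjustment formula above for discrete variables (the list-of-(z, x, y)-tuples data format is a hypothetical choice; here $z$ plays the role of $X_{parents}$):

```python
def p_y_do_x(samples, y, x):
    """estimate P(Y=y | do(X:=x)) = sum_z P(Y=y | X=x, Z=z) P(Z=z)
    from a list of (z, x, y) observations, stratifying on the parent z"""
    n = len(samples)
    total = 0.0
    for z in {s[0] for s in samples}:
        n_z = sum(1 for s in samples if s[0] == z)         # weight for slice
        n_zx = sum(1 for s in samples if s[:2] == (z, x))
        if n_zx == 0:
            continue  # positivity violated in this stratum; crudely skip it
        n_zxy = sum(1 for s in samples if s == (z, x, y))  # effect for slice
        total += (n_zxy / n_zx) * (n_z / n)
    return total
```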


  • stable unit treatment value assumption (SUTVA) - the treatment one unit receives doesn’t change the effect of the action for any other unit
  • exchangeability = exogeneity: $\color{orange}{Y^{a}} \perp\!\!\!\perp A$ for all $a$ - $\textcolor{orange}{\text{the value of the counterfactuals}}$ doesn’t change based on the choice of the action
  • consistency: $Y=Y^{a=0}(1-A) + Y^{a=1}A$ - outcome agrees with the potential outcome corresponding to the treatment indicator
  • ignorability - potential outcomes are conditionally independent of treatment given some deconfounding variables
    • very hard to check!
  • background
    • very hard to decide what to include and what is irrelevant
    • ontology - study of being, concepts, categories
      • nodes in graphs must refer to stable concepts
      • ontologies are not always stable
        • world changes over time
        • “looping effect” - social categories (like race) are constantly changing, because people who putatively fall into such categories change their behavior in possibly unexpected ways
    • epistemology - theory of knowledge

modeling approaches

ladder of evidence

  • RCT
  • natural experiment
  • instrumental variable
  • discontinuity analysis - look for points near a threshold in treatment assignment (regression discontinuity)


  • Matching methods for causal inference: A review and a look forward (stuart 2010)
    • matching methods try to equate (or “balance”) the distribution of covariates in the treated and control groups
      • they do this by picking well-matched samples of the original treated and control groups
      • this may involve 1:1 matching, weighting, or subclassification
      • linear regression adjustment (so not matching) can actually increase bias in the estimated treatment effect when the true relationship between the covariate and outcome is even moderately non-linear, especially when there are large differences in the means and variances of the covariates in the treated and control groups
    • matching distance measures
      • propensity scores summarize all of the covariates into one scalar: the probability of being treated
        • defined as the probability of being treated given the observed covariates
        • propensity scores are balancing scores: at each value of the propensity score, the distribution of the covariates X defining the propensity score is the same in the treated and control groups - usually this is fit with logistic regression (a matching sketch follows this list)
        • if treatment assignment is ignorable given the covariates, then treatment assignment is also ignorable given the propensity score: propensity
      • hard constraints are called “exact matching” - can be combined with other methods
      • mahalanobis distance
    • matching methods
      • stratification = cross-tabulation - only compare samples when confounding variables have same value
      • nearest neighbor matching - we discard many samples this way (but samples are more similar, so still helpful)
        • optimal matching - consider all potential matches at once, rather than one at a time
        • ratio matching - could match many to one (especially for a rare group), although picking the number of matches can be tricky
        • with/without replacement - with seems to have less bias, but more practical issues
      • subclassification/weighting: use all the data - this is nice because we have more samples, but we also get some really poor matches
        • subclassification - stratify score, like propensity score, into groups and measure effects among the groups
        • full matching - automatically picks the number of groups
        • weighting - use the propensity score as a weight in calculating the ATE (also known as inverse probability of treatment weighting)
      • common support - want to look at points which are similar, and need to be careful with how we treat points that violate similarity
      • genetic matching - find the set of matches which minimize the discrepancy between the distribution of potential confounders
    • diagnosing matches - are covariates balanced after matching?
      • ideally we would look at all multi-dimensional histograms, but since we have few points we end up looking at 1-d summaries
      • one standard metric is difference in means of each covariate, divided by its stddev in the whole dataset
    • analysis of the outcome - can still use regression adjustment after doing the matching to clean up residual covariances
      • unclear how to propagate variance from matching to outcome analysis
  • Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching Package for R (sekhon 2011)
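A minimal sketch of 1:1 nearest-neighbor matching on the propensity score, with replacement (scikit-learn for the logistic regression; the data layout is hypothetical). Note that matching treated units to controls this way estimates the effect on the treated (ATT) rather than the ATE:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def psm_att(X, t, y):
    """X: covariate matrix, t: binary treatment array, y: outcome array"""
    # 1. propensity score = estimated P(treated | X) via logistic regression
    ps = LogisticRegression(max_iter=1000).fit(X, t).predict_proba(X)[:, 1]
    treated, control = np.where(t == 1)[0], np.where(t == 0)[0]
    # 2. match each treated unit to the control with the closest score
    diffs = []
    for i in treated:
        j = control[np.argmin(np.abs(ps[control] - ps[i]))]
        diffs.append(y[i] - y[j])
    # 3. average the matched outcome differences
    return np.mean(diffs)
```

After matching, one would still diagnose covariate balance (e.g. standardized differences in means) before trusting the estimate, as discussed above.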

weighting / regression adjustments

  • regression adjustments use models like a linear model to account for confounders (see the weighting sketch below)
  • requires unconfoundedness (i.e. no omitted-variable bias)
  • if there are no confounders, correlation is causation
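A minimal sketch of the inverse-probability-of-treatment-weighting estimator mentioned above (a fitted propensity score array `ps` is assumed as input; the clipping threshold is an arbitrary choice to stabilize weights):

```python
import numpy as np

def ipw_ate(t, y, ps, eps=1e-3):
    """t: binary treatment, y: outcome, ps: estimated P(T=1 | X) per sample"""
    ps = np.clip(ps, eps, 1 - eps)  # avoid exploding weights near 0 or 1
    # weight each unit by the inverse probability of the treatment it received
    return np.mean(t * y / ps) - np.mean((1 - t) * y / (1 - ps))
```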


classic studies

stability / invariance

  • Invariance, Causality and Robustness (buhlmann 18)
    • predict $Y^e$ given $X^e$ such that the prediction “works well” or is “robust” for all $e \in \mathcal F$ based on data from much fewer environments $e \in \mathcal E$
      • assumption: ideally $e$ changes only the distr. of $X^e$ (so it doesn’t act directly on $Y^e$ or change the mechanism between $X^e$ and $Y^e$)
      • assumption (invariance): there exists a subset of “causal” covariates - when conditioning on these covariates, the loss is the same across all environments $e$
      • when these assumptions are satisfied, then minimizing a worst-case risk over environments $e$ yields a causal parameter
    • identifiability issue: we typically can’t identify the causal variables without very many perturbations $e$
      • Invariant Causal Prediction (ICP) only identifies variables as causal if they appear in all invariant sets
    • anchor regression model helps to relax assumptions
  • Invariant Risk Minimization (arjovsky, bottou, gulrajani, & lopez-paz 2019)
    • random splitting causes problems with our data
    • want to perform well under different distributions of X, Y
    • can’t be solved via robust optimization
    • a correlation is spurious when we do not expect it to hold in the future in the same manner as it held in the past
      • i.e. spurious correlations are unstable
    • assume we have infinite data, and know what kinds of changes our distribution for the problem might have (e.g. variance of features might change)
      • make a model which has the minimum test error regardless of the distribution of the problem
    • adds a penalty inspired by invariance (which can be viewed as a stability criterion)
  • The Hierarchy of Stable Distributions and Operators to Trade Off Stability and Performance (subbaswamy, chen, & saria 2019)
    • different predictors learn different things
    • only pick the stable parts of what they learn (in a graph representation)
    • there is a tradeoff between stability to all shifts and average performance on the shifts we expect to see
    • different types of methods
      • transfer learning - given unlabelled test data, match training/testing representations
      • proactive methods - make assumptions about possible set of target distrs.
      • data-driven methods - assume independence of cause and mechanism, like ICP, and use data from different shifts to find invariant subsets
      • explicit graph methods - assume explicit knowledge of graph representing the data-generating process
    • hierarchy
      • level 1 - invariant conditional distrs. of the form $P(Y \mid \mathbf Z)$
      • level 2 - conditional interventional distrs. of the form $P(Y \mid do(\mathbf W), \mathbf Z)$
      • level 3 - distributions corresponding to counterfactuals



  • Incremental causal effects (rothenhausler & yu, 2019)
    • instead of considering a treatment, consider an infinitesimal change in a continuous treatment
    • use assumption of local independence and can prove some nice things
      • local ignorability assumption states that potential outcomes are independent of the current treatment assignment in a neighborhood of observations
  • link to iclr talk (bottou 2019)
  • The Blessings of Multiple Causes (wang & blei, 2019) - having multiple causes can help construct / find all the confounders

different problems

heterogeneous treatment effects

Heterogeneous treatment effects are effects which differ across subgroups / individuals in a population; estimating them requires more refined modeling.

  • conditional average treatment effect (CATE) - get treatment effect for each individual conditioned on its covariates
    • meta-learners - break down CATE into regression subproblems
      • e.g. T-learner (foster et al. 2011, simplest) - fit one model for the conditional expectation of each potential outcome and then subtract (see the sketch after this list)
      • e.g. X-learner (kunzel et al. 19)
      • e.g. R-learner (nie-wager, 20)
      • e.g. S-learner (hill 11)
    • tree-based methods
      • e.g. causal tree (athey-imbens, 16) - like decision tree, but change splitting criterion for differentiating 2 outcomes
      • e.g. causal forest (wager-athey, 18)
      • e.g. BART (hill, 12)
  • subgroup analysis - identify subgroups with treatment effects far from the average
    • generally easier than CATE
  • staDISC (dwivedi, tan et al. 2020) - learn stable / interpretable subgroups for causal inference
    • CATE - estimate with a bunch of different models
      • meta-learners: T/X/R/S-learners
      • tree-based methods: causal tree/forest, BART
      • calibration to evaluate subgroup CATEs
        • main difficulty: hard to do model selection / validation (especially with imbalanced data)
          • often use some kind of proxy loss function
        • solution: compare average CATE within a bin to CATE on test data in bin
          • actual CATE doesn’t seem to generalize
          • but ordering of groups seems pretty preserved
        • stability: check stability of this with many CATE estimators
    • subgroup analysis
      • use CATE as a stepping stone to finding subgroups
      • easier, but still linked to real downstream tasks (e.g. identify which subgroup to treat)
      • main difficulty: can quickly overfit
      • cell-search - sequential
        • first prune features using feature importance
        • target: maximize a cell’s true positive - false positive (subject to using as few features as possible)
        • sequentially find cell which maximizes target
          • find all cells which perform close to as good as this cell
          • remove all cells contained in another cell
          • pick one randomly, remove all points in this cell, then continue
      • stability: rerun search multiple times and look for stable cells / stable cell coverage
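A minimal sketch of the T-learner mentioned above (the simplest meta-learner; the choice of gradient boosting as the base regressor is arbitrary):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def t_learner_cate(X, t, y):
    """fit one regression per potential outcome, then subtract"""
    mu1 = GradientBoostingRegressor().fit(X[t == 1], y[t == 1])  # E[Y | X, T=1]
    mu0 = GradientBoostingRegressor().fit(X[t == 0], y[t == 0])  # E[Y | X, T=0]
    return mu1.predict(X) - mu0.predict(X)  # per-sample CATE estimates
```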

causal discovery

  • overview
    • goal of causal discovery is to identify the causal relationships (sometimes under some smoothness / independence assumptions)
      • basics: conditional indep. checks can only determine graphs up to markov equivalence
    • 2 approaches
      • test the noise distr. of the relationship in the two directions (a crude sketch appears at the end of this section)
      • check variables which reduce entropy the most
  • Discovering Causal Signals in Images (lopez-paz et al. 2017)
    • C(A, B) - count number of images in which B would disappear if A was removed
    • we say A causes B when C(A, B) is (sufficiently) greater than the converse C(B, A)
    • basics
      • given the joint distr. of (A, B), we want to know if A -> B or B -> A
        • with no assumptions, this is nonidentifiable
      • requires 2 assumptions
        • ICM: independence between cause and mechanism (i.e. the function doesn’t change based on distr. of X) - this usually gets violated in anticausal direction
        • causal sufficiency - we aren’t missing any vars
      • ex. (figure omitted)
        • here noise is indep. from x (causal direction), but can’t be independent from y (non-causal direction)
        • in (c), function changes based on input
      • can turn this into binary classification and learn w/ a network: given X, Y, does X->Y or Y->X?
    • on images, they get scores for different objects (w/ bounding boxes)
      • eval - when one thing is erased, does the other also get erased?
  • Visual Causal Feature Learning (chalupka, perona, & eberhardt, 2015)
    • assume the behavior $T$ is a function of some hidden causes $H_i$ and the image
    • Causal Coarsening Theorem - causal partition is coarser version of the observational partition
      • observational partition - divide images into a partition where each part has constant prediction $P(T \mid I)$
      • causal partition - divide images into a partition where each part has constant $P(T \mid man(I))$
        • $man(I)$ does visual manipulation which changes $I$, while keeping all $H_i$ fixed and $T$ fixed
          • ex. turn a digit into a 7 (or turn a 7 into not a 7)
    • can further simplify the problem into $P(T \mid I) = P(T \mid C, S)$
      • $C$ are the causes and $S$ are the spurious correlates
      • any other variable $X$ such that $P(T \mid I) = P(T \mid X)$ has Shannon entropy $H(X) \geq H(C, S)$ - these are the simplest descriptions of $P(T \mid I)$
    • causal effect prediction
      • first, create a causal dataset of $P(T \mid man(I))$
      • then train on this - very similar to adversarial training
  • Visual Physics: Discovering Physical Laws from Videos
    • 3 steps
      • Mask R-CNN finds bounding box of object and center of bounding box is taken to be location
      • $\beta$-VAE compresses the trajectory to some latent repr. (while also being able to predict held-out points of the trajectory)
      • Eureqa package does eq. discovery on latent repr + trajectory
        • includes all basic operations, such as addition, mult., sine function
        • R-squared value measures goodness of fit
    • see also SciNet - Discovering physical concepts with neural networks (iten et al. 2020)
    • see also the field of symbolic regression
      • genetic programming is the most prevalent method here
      • alternatives: sparse regression, dimensional function synthesis
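A crude sketch of the “test the noise in both directions” idea from the causal-discovery overview above (additive-noise-model style; the squared-residual correlation is a stand-in for a proper independence test such as HSIC):

```python
import numpy as np

def residual_dependence(x, y, deg=3):
    """fit y ~ poly(x) and measure how much the squared residuals still
    depend on x (crude proxy; a real test would use HSIC or similar)"""
    resid = y - np.polyval(np.polyfit(x, y, deg), x)
    return abs(np.corrcoef(resid ** 2, x)[0, 1])

def infer_direction(x, y):
    # under an additive noise model, the noise is independent of the input
    # only in the causal direction, so prefer the direction with less
    # residual dependence
    return "x -> y" if residual_dependence(x, y) < residual_dependence(y, x) else "y -> x"
```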