causal inference



classic studies

natural experiments

instrumental variables

  • Identification of Causal Effects Using Instrumental Variables (angrist, imbens, & rubin 1996)
    • bridges the literature of instrumental variables in econometrics and the literature of causal inference in statistics
    • applied paper with delicate statistics
    • carefully discusses the assumptions
    • instrumental variables - regression w/ constant treatment effects
    • effect of veteran status on mortality, using lottery number as instrument
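
A minimal sketch of the instrumental-variable idea in the notes above, under assumptions that are mine rather than the paper's: a simulated randomized binary instrument (playing the role of the lottery number), a constant treatment effect, and an unobserved confounder `u`. The function names and all numeric constants are hypothetical. It compares the naive difference in means with the Wald (ratio) estimate.

```python
import random


def simulate_iv(n=20000, true_effect=2.0, seed=0):
    """Simulate (z, t, y) with a randomized binary instrument and an unobserved confounder."""
    rng = random.Random(seed)
    z, t, y = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)                   # unobserved confounder of t and y
        zi = 1 if rng.random() < 0.5 else 0   # instrument, randomized like a lottery draw
        # treatment uptake depends on the instrument and on the confounder
        ti = 1 if 0.8 * zi + 0.5 * u + rng.gauss(0, 1) > 0.5 else 0
        # constant treatment effect; z affects y only through t (exclusion restriction)
        yi = true_effect * ti + 1.5 * u + rng.gauss(0, 1)
        z.append(zi)
        t.append(ti)
        y.append(yi)
    return z, t, y


def diff_in_means(values, group):
    """Mean of `values` where group == 1 minus mean where group == 0."""
    ones = [v for v, g in zip(values, group) if g == 1]
    zeros = [v for v, g in zip(values, group) if g == 0]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)


def wald_estimate(z, t, y):
    """IV (Wald) estimate: effect of z on y divided by effect of z on t."""
    return diff_in_means(y, z) / diff_in_means(t, z)


z, t, y = simulate_iv()
print("naive difference in means:", round(diff_in_means(y, t), 2))
print("Wald / IV estimate:       ", round(wald_estimate(z, t, y), 2))
```

In this simulation the naive comparison is biased upward because `u` raises both treatment uptake and the outcome, while the ratio estimate should land near `true_effect`.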

matching

“paradoxes”

  • simpson’s paradox = yule-simpson paradox - trend appears in several different groups but disappears/reverses when groups are combined
    • Sex Bias in Graduate Admissions: Data from Berkeley (bickel et al. 1975)
    • e.g. overall men seemed to have higher acceptance rates, but in each dept. women seemed to have higher acceptance rates - explanation is that women selectively apply to harder depts.
      graph LR
      A(Gender) -->B(Dept Choice)
      B --> C(Acceptance rate)
      A --> C
      
  • monty hall problem: why you should switch
    graph LR
    A(Your Door) -->B(Door Opened)
    C(Location of Car) --> B
    
  • berkson’s paradox - diseases in hospitals are correlated even when they are not in the general population
    • possible explanation - only having both diseases together is strong enough to put you in the hospital
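
Berkson's paradox can be reproduced with a small simulation. This is only a sketch: the disease prevalences and admission probabilities are made-up numbers, chosen so that having both diseases makes hospitalization far more likely, as in the explanation above.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def simulate_berkson(n=50000, seed=0):
    """Return (population correlation, in-hospital correlation) of two independent diseases."""
    rng = random.Random(seed)
    a = [1 if rng.random() < 0.2 else 0 for _ in range(n)]
    b = [1 if rng.random() < 0.2 else 0 for _ in range(n)]
    # admission is a collider: unlikely in general, very likely with both diseases
    admitted = [rng.random() < (0.95 if ai and bi else 0.05) for ai, bi in zip(a, b)]
    a_h = [ai for ai, h in zip(a, admitted) if h]
    b_h = [bi for bi, h in zip(b, admitted) if h]
    return pearson(a, b), pearson(a_h, b_h)


pop_corr, hosp_corr = simulate_berkson()
print("correlation in general population:", round(pop_corr, 3))
print("correlation among hospitalized:   ", round(hosp_corr, 3))
```

The diseases are generated independently, so any correlation inside the hospital comes purely from selecting on admission.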

problems beyond ATE

causal mechanisms

  • treatment effect variation?
  • principal stratification
  • interference

mediation analysis

Mediation analysis aims to identify a mechanism through which a cause has an effect. Direct effects measure how the outcome changes when the treatment varies while the mediators are held constant.

  • if there are multiple possible paths by which a variable can exert influence, can figure out which path does what, even with just observational data

  • cannot just condition on $M$! This can lead to spurious associations

  • which pathway do causes flow through from X to Y (direct/indirect?)

  • (figure: mediation graph)

  • consider potential outcomes with hypothetical intervention on $T$:

    • $\{M(t), Y(t)\}$
  • hypothetical intervention on $T$ and $M$:

    • $\{Y(t, m)\}$
  • hypothetical intervention on $T$ fixing $M$ to $M(t') = M_{t'}$ (nested potential outcome, robins & greenland, 1992; pearl, 2001)
    • $\{Y(t, M_{t'})\}$
    • has also been called a priori counterfactual (frangakis & rubin, 2002)
    • when $t \neq t'$, this can’t be observed and can’t be falsified
  • total effect $\tau=E\{Y(1)-Y(0)\} = \mathrm{NDE} + \mathrm{NIE}$

    • assumes composition assumption $Y(1, M_1) = Y(1)$, very reasonable
  • natural direct effect $\mathrm{NDE}=E\left\{Y\left(1, M_{0}\right)-Y\left(0, M_{0}\right)\right\}$

    • controlled direct effect $\mathrm{CDE}=E\left\{Y\left(1, m\right)-Y\left(0, m\right)\right\}$ is simpler: it sets the mediator to a fixed value $m$ rather than to $M_0$, the value the mediator would take under control
    • w/ composition: $\mathrm{NDE}=E\left\{Y\left(1, M_{0}\right)-Y\left(0\right)\right\}$
  • natural indirect effect $\mathrm{NIE}=E\left\{Y\left(1, M_1\right)-Y\left(1, M_{0}\right)\right\}$

    • w/ composition: $\mathrm{NIE}=E\left\{Y\left( 1 \right)-Y\left(1, M_0\right)\right\}$

  • mediation formula

    • can condition effects on $x$

      • $\operatorname{NDE}(x)=E\left\{Y\left(1, M_{0}\right)-Y\left(0, M_{0}\right) \mid X=x\right\}$
      • $\operatorname{NIE}(x)=E\left\{Y\left(1, M_{1}\right)-Y\left(1, M_{0}\right) \mid X=x\right\}$
    • estimators (plug these identified nested means into the definitions of NDE and NIE)

      • conditional: $E\left\{Y\left(t, M_{t'}\right) \mid X=x\right\}=\sum_{m} E(Y \mid T=t, M=m, X=x) \operatorname{pr}\left(M=m \mid T=t', X=x\right)$
      • marginal: $E\left\{Y\left(t, M_{t'}\right)\right\}=\sum_{x} E\left\{Y\left(t, M_{t'}\right) \mid X=x\right\} P(X=x)$
    • estimators depend on 4 assumptions

      1. no treatment-outcome confounding: $T \perp Y(t, m) \mid X$

      2. no mediator-outcome confounding: $M \perp Y(t, m) \mid (X, T)$

      3. no treatment-mediator confounding: $T \perp M(t) \mid X$

      4. cross-world independence between potential outcomes and potential mediators: $Y(t, m) \perp M(t') \; \forall \; t, t', m$

    • assumption notes
      • 1 + 2 are equivalent to $(T, M) \perp Y(t, m) \mid X$
      • first three essentially assume that $T$ and $M$ are both randomized
      • 1-3 are very strong but hold with sequentially randomized treatment + mediator
      • 4 cannot be verified
    • baron-kenny method (assumes linear models; figure not reproduced)
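
A rough sketch of the linear (Baron-Kenny style) decomposition, assuming a simulated linear model with no treatment-mediator interaction, a randomized treatment, and no mediator-outcome confounding. Under those assumptions the direct effect is the coefficient on $T$ in the regression of $Y$ on $T$ and $M$, and the indirect effect is the product of the $T \to M$ slope and the $M \to Y$ coefficient. The coefficients `a`, `b`, `c` and the helper names are illustrative choices, not taken from the notes.

```python
import random


def simulate_mediation(n=20000, a=1.5, b=2.0, c=0.5, seed=0):
    """Linear model: M = a*T + noise, Y = c*T + b*M + noise, with T randomized."""
    rng = random.Random(seed)
    T = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(n)]
    M = [a * t + rng.gauss(0, 1) for t in T]
    Y = [c * t + b * m + rng.gauss(0, 1) for t, m in zip(T, M)]
    return T, M, Y


def centered_cross(xs, ys):
    """Sum of products of deviations from the mean."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))


def baron_kenny(T, M, Y):
    """Return (direct, indirect, total) effect estimates from OLS fits."""
    s_tt, s_mm, s_tm = centered_cross(T, T), centered_cross(M, M), centered_cross(T, M)
    s_ty, s_my = centered_cross(T, Y), centered_cross(M, Y)
    a_hat = s_tm / s_tt                          # slope of M ~ T
    det = s_tt * s_mm - s_tm ** 2
    c_hat = (s_mm * s_ty - s_tm * s_my) / det    # coefficient on T in Y ~ T + M
    b_hat = (s_tt * s_my - s_tm * s_ty) / det    # coefficient on M in Y ~ T + M
    total = s_ty / s_tt                          # slope of Y ~ T
    return c_hat, a_hat * b_hat, total


T, M, Y = simulate_mediation()
direct, indirect, total = baron_kenny(T, M, Y)
print("direct:", round(direct, 2), "indirect:", round(indirect, 2), "total:", round(total, 2))
```

With OLS the in-sample total slope equals direct + indirect exactly, which mirrors the decomposition $\tau = \mathrm{NDE} + \mathrm{NIE}$ in the linear case.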

heterogeneous treatment effects

Heterogeneous treatment effects are effects that differ across subgroups / individuals in a population; estimating them requires more refined modeling.

  • conditional average treatment effect (CATE) - get treatment effect for each individual conditioned on its covariates $\mathbb E [y \mid x, t=1] - \mathbb E[y \mid x, t=0]$ (different from ITE $Y^{T=1}_i - Y^{T=0}_i$)
    • meta-learners - break down CATE into regression subproblems
      • e.g. S-learner (hill 11) - “S” stands for “single”: fits a single statistical model $\hat \mu(x, t)$ with the treatment as an input and estimates the effect as $\hat \mu(x, 1) - \hat \mu(x, 0)$
        • can be biased towards 0
      • e.g. T-learner (foster et al. 2011) - “T” stands for “two” because we fit 2 models: one model for conditional expectation of each potential outcome: $\hat \mu_1(x), \hat \mu_0(x)$
        • can have issues, e.g. different effects are regularized differently
        • doesn’t do well with variation in the propensity score: if $e(x)$ varies considerably, then our estimates of $\hat \mu_0$ will be driven by data in areas with many control units (i.e., with $e(x)$ closer to 0), and those of $\hat \mu_1$ by regions with more treated units (i.e., with $e(x)$ closer to 1)
      • e.g. X-learner (kunzel et al. 19) - “X” stands for crossing between estimates and conditional outcomes for each group
        • first, fit $\hat \mu_1(x), \hat \mu_0(x)$
        • second, compute imputed effects using all the data (for treated and control units respectively) $\begin{aligned} \hat{\tau}_{1, i} &=Y_{i}(1)-\hat{\mu}_{0}\left(x_{i}\right) \\ \hat{\tau}_{0, i} &=\hat{\mu}_{1}\left(x_{i}\right)-Y_{i}(0) \end{aligned}$, then regress them on $x$ to get $\hat{\tau}_{1}(x), \hat{\tau}_{0}(x)$
        • finally, combine effects $\hat{\tau}(x)=g(x) \hat{\tau}_{0}(x)+(1-g(x)) \hat{\tau}_{1}(x)$
          • $g(x)$ is weighting function, e.g. estimated propensity score
      • e.g. R-learner (robinson, 1988; nie-wager, 20) - regularized semiparametric learner
        • $\hat{\tau}_{R}(\cdot)=\operatorname{argmin}_{\hat \tau}\left\{\frac{1}{n} \sum_{i=1}^{n}\left(\underbrace{Y_{i}-\hat \mu\left(X_{i}\right)}_{\text{Y residual}}-\left(T_{i}-\hat e\left(X_{i}\right)\right) \hat \tau\left(X_{i}\right)\right)^{2} + \underbrace{\Lambda_{n}\left(\hat \tau(\cdot)\right)}_{\text{regularization}}\right\}$
          • $\hat \mu(x)$ estimates $E[Y_i \mid X_i=x]$ and $\hat e(x)$ estimates the propensity score
          • use cross-fitting to estimate the nuisance functions $\hat e$ and $\hat \mu$
          • $\hat \tau$ is restricted to some function class, e.g. a linear model fit with LASSO
    • tree-based methods
      • e.g. causal tree (athey & imbens, 16) - like decision tree, but change splitting criterion for differentiating 2 outcomes + compute effects for each leaf on out-of-sample data
      • e.g. causal forest (wager & athey, 18) - extends causal tree to forest
      • e.g. BART (hill, 12) - takes treatment as an extra input feature
    • neural-net based methods
  • validation
    • can cross-validate CATE on R-loss (sampling variability is high, but may not always be an issue (wager, 2020))
    • indirect approach - use CATE to identify subgroups, and then use out-of-sample data to evaluate these subgroups
    • fit $\hat \tau$ then rerun semiparametric model and see if coefficient for $\hat \tau$ ends up close to 1
    • more discussion in (athey & wager, 2019) and (chernozhukov et al. 2017)
  • subgroup analysis - identify subgroups with treatment effects far from the average
    • generally easier than CATE
  • staDISC (dwivedi, tan et al. 2020) - learn stable / interpretable subgroups for causal inference
    • CATE - estimate with a bunch of different models
      • meta-learners: T/X/R/S-learners
      • tree-based methods: causal tree/forest, BART
      • calibration to evaluate subgroup CATEs
        • main difficulty: hard to do model selection / validation (especially with imbalanced data)
          • often use some kind of proxy loss function
        • solution: compare average CATE within a bin to CATE on test data in bin
          • actual CATE doesn’t seem to generalize
          • but ordering of groups seems pretty preserved
        • stability: check stability of this with many CATE estimators
    • subgroup analysis
      • use CATE as a stepping stone to finding subgroups
      • easier, but still linked to real downstream tasks (e.g. identify which subgroup to treat)
      • main difficulty: can quickly overfit
      • cell-search - sequential
        • first prune features using feature importance
        • target: maximize a cell’s true positive - false positive (subject to using as few features as possible)
        • sequentially find cell which maximizes target
          • find all cells which perform nearly as well as this cell
          • remove all cells contained in another cell
          • pick one randomly, remove all points in this cell, then continue
      • stability: rerun search multiple times and look for stable cells / stable cell coverage
  • Estimating individual treatment effect: generalization bounds and algorithms (shalit, johansson, & sontag, 2017)
    • bound the ITE estimation error using (1) generalization err of the repr. and (2) the distance between the treated and control distrs., e.g. MMD
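
The T-learner and X-learner steps listed earlier in this section can be sketched with very simple base learners. This is a toy illustration under my own assumptions: one covariate, a randomized treatment with propensity 0.3, straight-line regressions as base learners, and a constant estimated propensity as the weighting function $g$. None of the names come from the cited papers.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def simulate(n=4000, seed=0):
    """Randomized treatment with propensity 0.3 and true CATE tau(x) = 1 + 2x."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        t = 1 if rng.random() < 0.3 else 0
        y = x + t * (1 + 2 * x) + rng.gauss(0, 0.5)
        data.append((x, t, y))
    return data


def t_learner(data):
    """Fit one outcome model per arm; the CATE estimate is their difference."""
    treated = [(x, y) for x, t, y in data if t == 1]
    control = [(x, y) for x, t, y in data if t == 0]
    mu1 = fit_line(*zip(*treated))
    mu0 = fit_line(*zip(*control))
    return mu0, mu1, (lambda x: mu1(x) - mu0(x))


def x_learner(data):
    """Impute unit-level effects with the other arm's model, regress, then combine."""
    mu0, mu1, _ = t_learner(data)
    treated = [(x, y - mu0(x)) for x, t, y in data if t == 1]   # tau_{1,i}
    control = [(x, mu1(x) - y) for x, t, y in data if t == 0]   # tau_{0,i}
    tau1 = fit_line(*zip(*treated))
    tau0 = fit_line(*zip(*control))
    g = sum(t for _, t, _ in data) / len(data)   # constant propensity estimate
    return lambda x: g * tau0(x) + (1 - g) * tau1(x)


data = simulate()
_, _, tau_t = t_learner(data)
tau_x = x_learner(data)
print("T-learner CATE at x=0.5:", round(tau_t(0.5), 2))
print("X-learner CATE at x=0.5:", round(tau_x(0.5), 2))
```

With unregularized linear base learners the two estimates coincide; the differences discussed in the notes only show up with regularized or nonlinear learners and unbalanced arms.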

reinforcement (policy) learning

  • rather than estimating a treatment effect, find a policy that maximizes some expected utility (e.g. can define utility as the potential outcome $\mathbb E[Y_i(\pi(X_i))]$)
  • in this case, policy is like an intervention
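
One common way to evaluate a candidate policy from logged data is inverse-propensity weighting. The notes do not specify an estimator, so this is a swapped-in technique, shown as a sketch that assumes data from a randomized experiment with known propensity 0.5; the simulated outcome model and the policy names are hypothetical.

```python
import random


def simulate_logged_data(n=20000, seed=0):
    """Logged data from a randomized experiment with known propensity 0.5."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        t = 1 if rng.random() < 0.5 else 0
        y = t * x + rng.gauss(0, 0.5)   # treatment helps when x > 0, hurts when x < 0
        data.append((x, t, y, 0.5))     # last entry: probability of the logged action
    return data


def ipw_policy_value(data, policy):
    """Estimate E[Y(pi(X))] by reweighting units whose logged action matches the policy."""
    total = 0.0
    for x, t, y, prob in data:
        if policy(x) == t:
            total += y / prob
    return total / len(data)


def treat_all(x):
    return 1


def treat_none(x):
    return 0


def treat_if_positive(x):
    return 1 if x > 0 else 0


data = simulate_logged_data()
for name, policy in [("all", treat_all), ("none", treat_none), ("x > 0", treat_if_positive)]:
    print("policy", name, "estimated value:", round(ipw_policy_value(data, policy), 3))
```

In this simulation the targeted policy should score highest, since it treats only the units that benefit.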

causal discovery

Causal discovery aims to identify causal relationships (sometimes under some smoothness / independence assumptions). This is often impossible in general. Also called causal relation learning or causal search.

  • a lot of our science does not actually rest on experiments (e.g. physics, geology)
  • constraint-based algorithms - use conditional indep. checks to determine graphs up to markov equivalence
    • faithfulness: every conditional independence that holds in the data distribution is implied by the causal graph that generated the data (no independences arise from coincidental cancellation)
    • extensions to more general distrs., unobserved confounders
    • peter-clark (PC) algorithm - first learns undirected graph, then detects edge directions and returns equivalence class
      • assumes that there is no confounder (unobserved direct common cause of two measured variables)
    • fast causal inference (FCI) (spirtes et al. 200)
      • can deal with confounders - instead of edge/no edge have 3 possibilities: edge, no edge, confounding by unobserved missing common cause (+possibly another possibility for “unknown”)
  • score-based algorithms - replace conditional indep. tests with goodness of fit tests (e.g. BIC)
    • assume there are no confounders
    • still can only determine graphs up to markov equivalence
    • optimizing goodness of fit is NP-hard, so often use heuristics such as greedy equivalence search (GES) (chickering, 2002)
  • functional causal models - assume a variable can be written as a function of its direct causes and some noise term
    • can distinguish between different DAGs in same equivalence class
    • e.g. (hyvarinen & zhang, 2016) assume additive noise and that $p(E \mid C)$ can be modeled while $p(C \mid E)$ cannot
    • e.g. LiNGAM (Shimizu et al., 2006), ICA-LINGAM - linear relations between different variables and noise
    • additive noise models ANM (Hoyer et al., 2009) relax the linear restriction
      • ANM-MM (Hu et al., 2018)
    • post-nonlinear models PNL (Zhang and Hyvarinen, 2009) expand the functional space with non-linear relations between the variables and the noise
  • many models assume the generating cause distribution $p(C)$ is in some sense “independent” of the mechanism $p(E \mid C)$
    • e.g. IGCI (Janzing et al., 2012) uses orthogonality in information space to express the independence between the two distributions
    • e.g. KCDC (Mitrovic et al., 2018) uses invariance of Kolmogorov complexity of conditional distribution
    • e.g. RECI (Blobaum et al., 2018) extends IGCI to the setting with small noise, and proceeds by comparing the regression errors in both possible directions
  • check variables which reduce entropy the most
  • Learning and Testing Causal Models with Interventions (acharya et al. 2018)

    • given DAG, want to learn distribution on interventions with minimum number of interventions, variables intervened on, and number of samples drawn per intervention
  • Discovering Causal Signals in Images (lopez-paz et al. 2017)
    • C(A, B) - count number of images in which B would disappear if A was removed
    • we say A causes B when C(A, B) is (sufficiently) greater than the converse C(B, A)
    • basics
      • given joint distr. of (A, B), we want to know if A -> B or B -> A
        • with no assumptions, this is nonidentifiable
      • requires 2 assumptions
        • ICM: independence between cause and mechanism (i.e. the function doesn’t change based on distr. of X) - this usually gets violated in anticausal direction
        • causal sufficiency - we aren’t missing any vars
      • ex. (figure not reproduced; the sub-points refer to its panels)
        • here noise is indep. from x (causal direction), but can’t be independent from y (non-causal direction)
        • in (c), function changes based on input
      • can turn this into binary classification and learn w/ network: given X, Y, does X -> Y or Y -> X?
    • on images, they get scores for different objects (w/ bounding boxes)
      • eval - when one thing is erased, does the other also get erased?
    • link to iclr talk (bottou 2019)
  • Visual Causal Feature Learning (chalupka, perona, & eberhardt, 2015)
    • assume the behavior $T$ is a function of some hidden causes $H_i$ and the image
    • Causal Coarsening Theorem - causal partition is coarser version of the observational partition
      • observational partition - divide images into partition where each partition has constant prediction $P(T \mid I)$
      • causal partition - divide images into partition where each partition has constant $P(T \mid man(I))$
        • $man(I)$ does visual manipulation which changes $I$, while keeping all $H_i$ fixed and $T$ fixed
          • ex. turn a digit into a 7 (or turn a 7 into not a 7)
    • can further simplify the problem into $P(T \mid I) = P(T \mid C, S)$
      • $C$ are the causes and $S$ are the spurious correlates
      • any other variable $X$ such that $P(T \mid I) = P(T \mid X)$ has Shannon entropy $H(X) \geq H(C, S)$ - these are the simplest descriptions of $P(T \mid I)$
    • causal effect prediction
      • first, create a causal dataset of $P(T \mid man(I))$, so the model can’t learn spurious correlations
      • then train on this dataset - very similar to adversarial training
  • Visual Physics: Discovering Physical Laws from Videos
    • 3 steps
      • Mask R-CNN finds bounding box of object and center of bounding box is taken to be location
      • $\beta$-VAE compresses the trajectory to some latent repr. (while also being able to predict held-out points of the trajectory)
      • Eureqa package does eq. discovery on latent repr + trajectory
        • includes all basic operations, such as addition, mult., sine function
        • R-squared value measures goodness of fit
    • see also SciNet - Discovering physical concepts with neural networks (iten et al. 2020)
    • see also the field of symbolic regression
      • genetic programming is the most prevalent method here
      • alternatives: sparse regression, dimensional function synthesis
  • Causal Mosaic: Cause-Effect Inference via Nonlinear ICA and Ensemble Method (wu & fukumizu, 2020)
    • focus on bivariate case
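
The conditional-independence checks used by constraint-based algorithms earlier in this section can be illustrated with partial correlation. This is a sketch for the linear-Gaussian case only (where zero partial correlation corresponds to conditional independence); the two simulated graphs and all function names are my own.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after linearly removing zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def simulate_chain(n=5000, seed=0):
    """X -> Y -> Z"""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [xi + rng.gauss(0, 1) for xi in x]
    z = [yi + rng.gauss(0, 1) for yi in y]
    return x, y, z


def simulate_collider(n=5000, seed=0):
    """X -> Y <- Z"""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    z = [rng.gauss(0, 1) for _ in range(n)]
    y = [xi + zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
    return x, y, z


for name, sim in [("chain", simulate_chain), ("collider", simulate_collider)]:
    x, y, z = sim()
    print(name, "corr(X,Z):", round(pearson(x, z), 2),
          "corr(X,Z | Y):", round(partial_corr(x, z, y), 2))
```

The chain and the collider show opposite patterns (dependence that vanishes given Y versus dependence that appears given Y), which is the kind of asymmetry that lets such algorithms orient some edges.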

stable/invariant predictors

Under certain assumptions, invariance to data perturbations (i.e. interventions) can help us identify causal effects.

invariance hierarchies

  • The Hierarchy of Stable Distributions and Operators to Trade Off Stability and Performance (subbaswamy, chen, & saria 2019)
    • different predictors learn different things
    • only pick the stable parts of what they learn (in a graph representation)
    • there is a tradeoff between stability to all shifts and average performance on the shifts we expect to see
    • different types of methods
      • transfer learning - given unlabelled test data, match training/testing representations
      • proactive methods - make assumptions about possible set of target distrs.
      • data-driven methods - assume independence of cause and mechanism, like ICP, and use data from different shifts to find invariant subsets
      • explicit graph methods - assume explicit knowledge of graph representing the data-generating process
    • hierarchy
      • level 1 - invariant conditional distrs. of the form $P(Y \mid \mathbf Z)$
      • level 2 - conditional interventional distrs. of the form $P(Y \mid do(\mathbf W), \mathbf Z)$
      • level 3 - distributions corresponding to counterfactuals
  • Causality for Machine Learning (scholkopf 19)
    • most of ml is built on the iid assumption and fails when it is violated (e.g. cow on a beach)

invariance algorithms

  • algorithms overview (see papers for more details) + implementations
    • ICP - invariant causal prediction - find feature set where, after conditioning, loss is the same for all environments
      • fails when distr. of residuals varies across environments
    • IRM - invariant risk minimization (v1) - find a feature repr. such that the optimal classifier, on top of that repr., is the identity function for all environments
    • GroupDRO - distributionally robust optimization (e.g. encourage strict equality between err of each group)
    • ERM - empirical risk minimization - minimize total training err
    • domain-adversarial techniques: find a repr. which does not differ across environments, then predict
      • fails when distr. of causes changes across environments
    • AND-mask - minimize the err only in directions where the sign of the gradient of the loss is the same for most environments
    • IGA - inter-environmental gradient alignment - ERM + reduce variance of the gradient of the loss per environment: $\lambda \operatorname{trace}\left(\operatorname{Var}\left(\nabla_{\theta} L_{\mathcal{E}}(\theta)\right)\right)$
  • Invariance, Causality and Robustness - ICP (buhlmann 18)
    • predict $Y^e$ given $X^e$ such that the prediction “works well” or is “robust” for all $e \in \mathcal F$ based on data from far fewer environments $e \in \mathcal E$
      • key assumption (invariance): there exists a subset of “causal” covariates - when conditioning on these covariates, the loss is the same across all environments $e$
      • assumption: ideally $e$ changes only the distr. of $X^e$ (so doesn’t act directly on $Y^e$ or change the mechanism between $X^e$ and $Y^e$)
      • when these assumptions are satisfied, then minimizing a worst-case risk over environments $e$ yields a causal parameter
    • identifiability issue: we typically can’t identify the causal variables without very many perturbations $e$
      • Invariant Causal Prediction (ICP) only identifies variables as causal if they appear in all invariant sets (see also Peters, Buhlmann, & Meinshausen, 2015)
      • brute-force feature selection
    • anchor regression model helps to relax assumptions
  • Invariant Risk Minimization - IRM (arjovsky, bottou, gulrajani, & lopez-paz 2019)
    • idealized formulation: $\begin{array}{ll}\min_{\Phi: \mathcal{X} \rightarrow \mathcal{H}} & \sum_{e \in \mathcal{E}_{\mathrm{tr}}} R^{e}(w \circ \Phi) \\ \text { subject to } & w \in \underset{\bar{w}: \mathcal{H} \rightarrow \mathcal{Y}}{\arg \min } \; R^{e}(\bar{w} \circ \Phi), \text { for all } e \in \mathcal{E}_{\mathrm{tr}}\end{array}$
      • $\Phi$ is repr., $w \circ \Phi$ is predictor
    • practical formulation: $\min_{\Phi: \mathcal{X} \rightarrow \mathcal{Y}} \sum_{e \in \mathcal{E}_{\mathrm{tr}}} R^{e}(\Phi)+\lambda \cdot\left\|\nabla_{w \mid w=1.0} R^{e}(w \cdot \Phi)\right\|^{2}$
    • random splitting causes problems with our data
    • want to perform well under different distributions of X, Y
    • can’t be solved via robust optimization
    • a correlation is spurious when we do not expect it to hold in the future in the same manner as it held in the past
      • i.e. spurious correlations are unstable
    • assume we have infinite data, and know what kinds of changes our distribution for the problem might have (e.g. variance of features might change)
      • make a model which has the minimum test error regardless of the distribution of the problem
    • adds a penalty inspired by invariance (which can be viewed as a stability criterion)
  • Learning explanations that are hard to vary - AND-mask (parascandolo…scholkopf, 2020)
    • basically, gradients should be consistent during learning
      • after learning, they should be consistent within some epsilon ball
    • practical algorithm: AND-mask
      • like zeroing out those gradient components with respect to weights that have inconsistent signs across environments
        • basically same complexity as normal GD
      • previous works used cosine similarity between weights in different settings
      • experiments
        • real data is spiral but each env is linearly separable - still able to learn spiral
        • cifar - with real labels, performance unaffected; with random labels, training acc drops significantly
          • each example is its own environment
          • with noisy labels, imposes good regularization
        • rl - works well on coinrun
    • propose invariant learning consistency (ILC)- measures expected consistency of the soln found by an algorithm given a hypothesis class
      • consistency is low when a minimum of the loss surface appears only when data from different envs are pooled
    • given algorithm $\mathcal A$, maximize this: $\mathrm{ILC}\left(\mathcal{A}, p_{\theta^{0}}\right):= \underbrace{-\mathbb{E}_{\theta^{0} \sim p\left(\theta^{0}\right)}}_{\text{expectation over reinits}}\left[\mathcal{I}^{\epsilon} (\underbrace{\mathcal{A}_{\infty}(\theta^{0}, \mathcal{E})}_{\hat \theta}) \right]$
      • inconsistency score for a solution $\hat \theta$ given environments $\mathcal E$: $\mathcal{I}^{\epsilon}(\hat \theta):=\overbrace{\max_{\left(e, e^{\prime}\right) \in \mathcal{E}^{2}}}^{\text{env. pairs}} \underbrace{\max_{\theta \in N_{e, \hat \theta}^{\epsilon}}}_{\text{low-loss region around } \hat \theta} \overbrace{\left|\mathcal{L}_{e^{\prime}}(\theta)-\mathcal{L}_{e}(\theta)\right|}^{\text{loss difference between envs.}}$
        • $N_{e, \hat \theta}^{\epsilon}$ is the path-connected region around $\hat \theta$ within $\left\{\theta \in \Theta \text{ s.t. } \left|\mathcal{L}_{e}(\theta)-\mathcal{L}_{e}\left(\hat \theta\right)\right| \leqslant \epsilon\right\}$
  • Invariant Risk Minimization Games (ahuja et al. 2020) - pose IRM as finding the Nash equilibrium of an ensemble game among several environments

  • Invariant Rationalization (chang et al. 2020) - identify a small subset of input features – the rationale – that best explains or supports the prediction
    • key assumption: $Y \perp E \mid Z$
    • $\max_{\boldsymbol{m} \in \mathcal{S}} I(Y ; \boldsymbol{Z}) \quad$ s.t. $\boldsymbol{Z}= \overbrace{\boldsymbol{m}}^{\text{binary mask}} \odot \boldsymbol{X}, \quad \underbrace{Y \perp E \mid \boldsymbol{Z}}_{\text{this part is invariance}}$
      • solve this via 3 nets with adv. penalty to approximate invariance
      • standard maximum mutual info objective is just $\max _{\boldsymbol{m} \in \mathcal{S}} I(Y ; \boldsymbol{Z}) \quad$ s.t. $\boldsymbol{Z}= \overbrace{\boldsymbol{m}}^{\text{binary mask}} \odot \boldsymbol{X}$ (see lei et al. 2016)
      • ex. $X$ is text reviews for beer, $Y$ is aroma, $E$ could be beer brand
  • Linear unit-tests for invariance discovery (aubin et al. 2020) - a set of 6 simple settings where current IRM procedures fail
    • test 1 (colormnist-style linear regr): $x_{inv} \to \tilde y \to x_{spu}$
      • $\tilde y \to y$
    • test 2 (cows vs camels binary classification): $y=mean(x_{inv})>0$, but $x_{inv} \propto x_{spu}$
    • test 3 (small invariant margin): $y_i\sim Bern(1/2)$, $x_{inv} = \pm 0.1$ + noise, $x_{spu} = \pm \mu^e$ + noise, where $\mu^e \sim N(0, 1)$
    • scrambling: apply random rotation matrix to inputs
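
To make the IRM penalty term from the practical formulation more concrete, here is a sketch that evaluates it for two fixed linear representations under squared-error risk. The two-environment data-generating process (an invariant feature that causes $y$, and a spurious feature that is a noisy copy of $y$ with environment-dependent noise) is an assumption of mine in the spirit of the unit tests above, not one of the paper's exact settings, and no optimization over $\Phi$ is performed.

```python
import random


def make_env(n, spurious_noise_sd, rng):
    """One environment: x_inv causes y, and x_spu is a noisy copy of y."""
    rows = []
    for _ in range(n):
        x_inv = rng.gauss(0, 1)
        y = x_inv + rng.gauss(0, 1)
        x_spu = y + rng.gauss(0, spurious_noise_sd)   # noise level changes across envs
        rows.append((x_inv, x_spu, y))
    return rows


def irm_penalty(rows, theta):
    """Squared gradient of the squared-error risk R^e(w * Phi) w.r.t. the scalar w at w = 1."""
    grad = 0.0
    for x_inv, x_spu, y in rows:
        phi = theta[0] * x_inv + theta[1] * x_spu   # linear representation Phi(x)
        grad += 2 * (phi - y) * phi                 # d/dw (w * phi - y)^2 at w = 1
    grad /= len(rows)
    return grad ** 2


def total_penalty(envs, theta):
    """Sum of per-environment penalties."""
    return sum(irm_penalty(rows, theta) for rows in envs)


rng = random.Random(0)
envs = [make_env(10000, 0.5, rng), make_env(10000, 1.5, rng)]
print("penalty, invariant feature only:", round(total_penalty(envs, (1.0, 0.0)), 4))
print("penalty, spurious feature only: ", round(total_penalty(envs, (0.0, 1.0)), 4))
```

The invariant representation is already optimally scaled in every environment, so its penalty is near zero; the spurious one would need a different rescaling per environment, which the penalty picks up.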

misc problems

  • Incremental causal effects (rothenhausler & yu, 2019)
    • instead of considering a treatment, consider an infinitesimal change in a continuous treatment
    • use assumption of local independence and can prove some nice things
      • local ignorability assumption states that potential outcomes are independent of the current treatment assignment in a neighborhood of observations
  • probability of necessity $PN(t, y) = P(Y^{T=t'}=y' \mid T=t, Y=y)$ = “probability of causation” (Robins & Greenland, 1989)
    • find the probability that $Y$ would be $y'$ had $T$ been $t'$, given that, in reality, $Y$ is actually $y$ and $T$ is $t$

    • if $Y$ is monotonic relative to $T$, i.e. $Y^{T=1}(x) \geq Y^{T=0}(x)$, then $\mathrm{PN}$ is identifiable whenever the causal effect $P(y \mid do(t))$ is identifiable and, moreover, $\mathrm{PN}=\underbrace{\frac{P(y \mid t)-P\left(y \mid t'\right)}{P(y \mid t)}}_{\text{excess risk ratio}}+\underbrace{\frac{P\left(y \mid t'\right)-P\left(y \mid do\left(t'\right)\right)}{P(t, y)}}_{\text{confounding adjustment}}$

  • causal transportability - seeks to identify conditions under which causal knowledge learned from experiments can be reused in different domains with observational data only

  • Radical Empiricism and Machine Learning Research (pearl 2021)

    • contrast the “data fitting” vs. “data interpreting” approaches to data-science
    • current research is too empiricist (data-fitting based) - should include man-made models of how data are generated
    • 3 reasons: expediency, transparency, explainability
      • expediency: we often have to act fast (e.g. covid) and use our knowledge to guide future experiments
      • transparency: need representations with the right level of abstraction
      • explainability: humans must understand inferences

transferring out-of-sample

different assumptions / experimental designs

  • unconfoundedness $T_i \perp (Y_i(0), Y_i(1)) \mid X_i$ is strong
    • sometimes we may prefer $T_i \perp (Y_i(0), Y_i(1)) \mid X_i, Z_i$ for some unobserved $Z_i$ that we need to figure out
  • The Blessings of Multiple Causes (wang & blei, 2019) - having multiple causes can help construct / find all the confounders
    • deconfounder algorithm
      • fit a factor model of causes
      • estimate a repr $Z_i$ of data point $A_i$ - $Z_i$ renders the causes conditionally independent
      • now, $Z_i$ is a substitute for unobserved confounders
    • assumptions
      • no causal arrows among causes $A_i$
      • no unobserved single-cause confounders: any missing confounder affects multiple observed variables
        • –> can check the fit of the factor model, but won’t check perfectly
      • the substitute confounder: there is enough information in the data to learn the variable $Z$ which renders causes conditionally independent
        • strong assumption!
    • thm
      • with observed covariates X, weak unconfoundedness: $A_1, \ldots, A_m \perp Y(a) \mid Z, X$ (i.e. $Z$ contains all multi-cause confounders, $X$ contains all single-cause confounders)
    • replaces an uncheckable search for possible confounders with the checkable goal of building a good factor model of observed casts.
    • controversial whether this works in general
    • Towards Clarifying the Theory of the Deconfounder (wang & blei, 2020)
    • On Multi-Cause Causal Inference with Unobserved Confounding: Counterexamples, Impossibility, and Alternatives (d’amour 2019)

synthetic control/interventions

  • synthetic control
    • Using Synthetic Controls rvw paper (abadie 2020)
      • setting: treatment occurs at time t0 on multiple units (e.g. policy in california)
      • goal: estimate effect of treatment (e.g. effect of policy in california)
      • approach: impute counterfactual (california time-series without policy) by a weighted combination of observed outcomes for other units (e.g. average other “similar” states)
        • per-observation weights are learned during pre-intervention period with cross-validation procedure
          • per-feature weights are also learned (bc matching on some features is more important than others)
    • The Economic Costs of Conflict: A Case Study of the Basque Country (Abadie & G, 2003)
    • Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program (abadie et al. 2010)
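
A stripped-down sketch of the synthetic control idea: learn nonnegative donor weights that sum to one on the pre-intervention period, then use them to impute the treated unit's counterfactual afterwards. The simplifications are mine: simulated series, weights fit by projected gradient descent on pre-period squared error, and no per-feature weights or cross-validation as described in the review paper.

```python
import random


def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, uj in enumerate(u, start=1):
        cumulative += uj
        candidate = (cumulative - 1.0) / j
        if uj - candidate > 0:
            theta = candidate
    return [max(vi - theta, 0.0) for vi in v]


def fit_weights(donors_pre, treated_pre, steps=3000, lr=0.05):
    """Projected gradient descent on mean squared pre-period error."""
    k, n = len(donors_pre), len(treated_pre)
    w = [1.0 / k] * k
    for _ in range(steps):
        resid = [sum(w[j] * donors_pre[j][i] for j in range(k)) - treated_pre[i]
                 for i in range(n)]
        grad = [2.0 / n * sum(resid[i] * donors_pre[j][i] for i in range(n))
                for j in range(k)]
        w = project_to_simplex([w[j] - lr * grad[j] for j in range(k)])
    return w


def simulate(seed=0, n_pre=30, n_post=10, effect=5.0):
    """Three donor series and a treated series built from the first two donors."""
    rng = random.Random(seed)
    n = n_pre + n_post
    params = [(1.0, 2.0), (2.0, -1.0), (0.5, 0.5)]   # (level, trend) per donor unit
    donors = [[a + b * i / n + rng.gauss(0, 0.1) for i in range(n)] for a, b in params]
    true_w = [0.6, 0.4, 0.0]
    treated = [sum(wj * d[i] for wj, d in zip(true_w, donors)) + rng.gauss(0, 0.1)
               for i in range(n)]
    for i in range(n_pre, n):
        treated[i] += effect                          # effect appears after the intervention
    return donors, treated, n_pre


def estimate_effect(donors, treated, n_pre):
    """Fit weights pre-intervention, then average the post-period gap to the synthetic unit."""
    w = fit_weights([d[:n_pre] for d in donors], treated[:n_pre])
    gaps = [treated[i] - sum(wj * d[i] for wj, d in zip(w, donors))
            for i in range(n_pre, len(treated))]
    return w, sum(gaps) / len(gaps)


donors, treated, n_pre = simulate()
weights, effect = estimate_effect(donors, treated, n_pre)
print("weights:", [round(wj, 2) for wj in weights], "estimated effect:", round(effect, 2))
```

The simplex constraint keeps the synthetic unit an interpolation of the donors rather than an extrapolation, which is the usual argument for it.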

solutions to basic problems

learning “causal representations”

limitations

  • Limits of Estimating Heterogeneous Treatment Effects: Guidelines for Practical Algorithm Design (alaa & van der schaar, 2018)
    • over enforcing balance can be harmful, as it may inadvertently remove information that is predictive of outcomes
    • analyze optimal minimax rate for ITE using Bayesian nonparametric methods
      • with small sample size: selection bias matters
      • with large sample size: smoothness and sparsity of $\mathbb{E}\left[Y_{i}^{(0)} \mid X_{i}=x\right]$ and $\mathbb{E}\left[Y_{i}^{(1)} \mid X_{i}=x\right]$
        • suggests smoothness of mean function for each group should be different, so better to approximate each individually rather than their difference directly
    • algorithm: non-stationary Gaussian process w/ doubly-robust hyperparameters