causal inference
view markdownSome notes on causal inference both from the following resources:
 introductory courses following neymanrubin framework at UC Berkeley
 the textbook What if (hernan & robins)
 the book of why
 fairml book
 inprogress book by brady neal
 course notes by stefan wager
 rebecca barter’s blog posts
 wonderful review / intro paper (guo et al. 2020)
basics
 confounding = difference between groups other than the treatment which affects the response
 this is the key problem when using observational (nonexperimental) data to make causal inferences
 problem occurs because we don’t get to see counterfactuals
 ex from Pearl where Age is the confounder
 potential outcomes = counterfactual outcomes $Y^{t=1}, Y^{t=0}$
 treatment = intervention, exposure, action
 potential outcomes are often alternatively written as $Y(1)$ and $Y(0$) or $Y_1$ and $Y_0$

alternatively, $P(Y=y do(T=1))$ and $P(Y=y do(T=0))$ or $P(Y=y set(T=1))$ and $P(Y=y set(T=0))$  treatment $T$ and outcome $Y$ (from “What If”):
 different approaches to causal analysis
 experimental design: collect data in a way that enables causal conclusions
 ex. randomized control trial (RCT)  controls for any possible confounders
 quasiexperiments: without explicit random assignment, some data pecularity approximates randomization
 ex. regression discontinuity analysis
 ex. instrumental variables  variable which can be used to effectively due a RCT because it was made random by some external factor
 posthoc analysis: by arguing that certain assumptions hold, we can draw causal conclusions from nonobservational data
 ex. regressionbased adjustment after assuming ignorability
 some assumptions are not checkable, and we can only reason about how badly they can go wrong (e.g. using sensitivity analysis)
 experimental design: collect data in a way that enables causal conclusions
 background
 very hard to decide what to include and what is irrelevant
 epiphenomenon  a correlated effect (not a cause)
 a secondary effect or byproduct that arises from but does not causally influence a process
 ontology  study of being, concepts, categories
 nodes in graphs must refer to stable concepts
 ontologies are not always stable
 world changes over time
 “looping effect”  social categories (like race) are constantly chainging because people who putatively fall into such categories to change their behavior in possibly unexpected ways
 epistemology  theory of knowledge
 clear distinction between identification and estimation (and third problem is discovery  what is the structure?)
 a causal quantity is identifiable if we can compute it from a purely statistical quantity
intuition
 bradford hill criteria  some simple criteria for establishing causality (e.g. strength, consistency, specificity)
 association is circumstantial evidence for causation
 no causation without manipulation (rubin, 1975; Holland, 1986)
 in this manner, something like causal effect of race/gender doesn’t make sense
 can partially get around this by changing race $\to$ perceived race
 weaker view (e.g. of Pearl) is that we only need to be able to understand how entities interact (e.g. write an SEM)
 different levels
 experiment: experiment, RCT, natural experiment, observation
 evidence: marginal correlation, regression, invariance, causal
 inference (pearl’s ladder of causality): prediction/association, intervention, counterfactuals
 kosuke imai’s levels of inference: descriptive, predictive, causal
measures of association
 correlation
 regression coefficient

risk difference = $P(Y=1 T=1)  P(Y=1 T=0)$ 
risk ratio = relative risk = $P(Y=1 T=1) / P(Y=1 T=0)$ 
odds ratio = $\frac{P(Y=1 T=1) / P(Y=0 T=1)}{P(Y=1 T=0) / P(Y=0 T=0)}$  measures association (1 is independent, >1 is positive association, <1 is negative association)
 odds that $P(Y=1)$ = $P(Y=1)/P(Y \neq 1)$
causal ladder (different levels of inference)

prediction/association $P(Y T)$
 only requires joint distr. of the variables

intervention $P(Y^{T=t}) = P(Y do(T=t))$
 we can change things and get conditionals based on evidence after intervention
 represents the conditional distr. we would get if we were to manipulate $t$ in a randomized trial
 to get this, we assume the causal structure (can still kind of test it based on conditional distrs.)
 having assumed the structure, we delete all edges going into a do operator and set the value of $x$

then, docalculus yields a formula to estimate $p(y do(x))$ assuming this causal structure  see introductory paper here, more detailed paper here (pearl 2013)
 by assuming structure, we learn how large impacts are

counterfactuals $P(Y^{T=t’} = y’ T=t, y=y)$

we can change things and get conditionals based on evidence before intervention

instead of intervention $p(y do(x))$ we get $p(y^* x^, u=u)$ where $u$ represents fixing all the other variables and $y^$ and $x^*$ are not observed 
averaging over all data points, we’d expect to get something similar to the intervention $p(y do(x))$  probabalistic answer to a “what would have happened if” question
 e.g. “Given that Hillary lost and didn’t visit Michigan, would she win if she had visited Michigan?”
 e.g. “What fraction of patients who are treated and died would have survived if they were not treated?”
 this allows for our intervention to contradict something we condition on
 simple matching is often not sufficient (need a very good model for how to match, hopefully a causal one)
 key difference with standard intervention is that we incorporate available evidence into our calculation
 available evidence influences exogenous variables
 this is for a specific data point, not a randomly sampled data point like an intervention would be
 requires SEM, not just causal graph
frameworks
fisher randomization test

this framework seeks evidence against the null hypothesis (e.g. that there is no causal effect)
 fisher null hypothesis: $H_{0F}: Y_i^{T=0} = Y_i^{T=1}\quad \forall i$
 also called strong null hypothesis = sharp null hypothesis (Rubin, 1980)
 weak null hypothesis would be $\bar Y_i^{T=0} = Y_i^{T=1}$
 can work for any test statistic $test$
 only randomness comes from treatment variable  this allows us to get randomization distribution for a teststatistic $test(T, Y^{T=1}, Y^{T=0})$
 this yields $p$values: $p= \frac 1 M \sum_{m=1}^M \mathbb 1 { test(\mathbf t^m, \mathbf Y) \geq test (\mathbf T, \mathbf Y) }$
 can approximate this with Monte Carlo permutation test, with $R$ random permutations of $\mathbf T$: $p= \frac 1 R \sum_r \mathbb 1 { test(\mathbf t^r, \mathbf Y) \geq test (\mathbf T, \mathbf Y) }$
 also called strong null hypothesis = sharp null hypothesis (Rubin, 1980)
 canonical choices of teststatistic
 difference in means: $\hat \tau = \hat{\bar{Y}}^{T=1}  \hat{\bar{Y}}^{T=0}$
 $= \frac 1 {n_1} \sum_i T_i Y_i  \frac 1 {n_0} \sum_i (1  T_i) Y_i$
 $= \frac 1 {n_1} \sum_{T_i=1} Y_i  \frac 1 {n_0} \sum_{T_i=0} Y_i$
 studentized statistic: \(t=\frac{\hat \tau}{\sqrt{\frac{\hat S^2(T=1)}{n_1}+\frac{\hat S^2 (T=0)}{n_0}}}\)
 allows for variance to change between groups (heteroscedasticity)
 wilcoxon rank sum: $W = \sum_i T_i R_i$, where $R_i = #{j : Y_j \leq Y_i }$ is the rank of $Y_i$ in the observed samples
 the sum of the ranks is $n(n+1)/2$, and the mean of $W$ is $n_1(n+1)/2$
 less sensitive to outliers

kolmogorovsmirnov statistic: $D = \max_y \hat F_1(y)  \hat F_0 (y) $ where $\hat F_1(y) = \frac 1 {n_1} \sum_i T_i 1(Y_i \leq y)$, $\hat F_0(y) = \frac 1 {n_0} \sum_i (1 T_i) 1(Y_i \leq y)$  measures distance between distributions of treatment outcomes and control outcomes
 difference in means: $\hat \tau = \hat{\bar{Y}}^{T=1}  \hat{\bar{Y}}^{T=0}$
 alternative neymanpearson framework has a null + alternative hypothesis
 null is favored unless there is strong evidence to refute it
potential outcome framework (neymanrubin)
In this framework, try to explicitly compute the effect
 average treatment effect ATE $\tau = E {Y^{T=1}  Y^{T=0} }$
 individual treatment effect $\Delta = Y_i^{T=1}  Y_i^{T=0}$

conditional average effect $= E{Y^{T=1}  Y^{T=0}} X$
 estimator $\hat \tau = \hat{\bar{Y}}^{T=1}  \hat{\bar{Y}}^{T=0}$
 unbiased: $E(\hat \tau) = \tau$
 $V(\hat \tau) = \underbrace{S^2(1) / n_1 + S^2(0)/n_0}_{\hat V(\tau) \text{ conservative estimator}}  S^2(\tau)/n$
 95% CI: $\hat \tau \pm 1.96 \sqrt{\hat V}$ (based on normal approximation)
 we could similarly get a pvalue testing whether $\hat \tau$ goes to 0, unclear if this is better
 key assumptions: SUTVA, consistency, ignorability
 strict randomization framework: only assume treatment assignment is take to be a random variable
 alternatively assume population distr. from which potential outcomes are drawn
 advantages over DAGs: easy to express some common assumptions, such as monotonicity / convexity
 neymanrubin model: $Y_i = \begin{cases} Y_i^{T=1}, &\text{if } T_i=1\Y_i^{T=0}, &\text{if } T_i=0 \end{cases}$
 equivalently, $Y_i = T_i Y_i^{T=1} + (1T_i) Y_i^{T=0}$  $\widehat{ATE} = \mathbb E [\hat{Y}^{T=1}  \hat{Y}^{T=0}]$  $\widehat{ATE}_{adj} = [\bar{a}_A  (\bar{x}_A  \bar{x})^T \hat{\theta}_A]  [\bar{b}_B  (\bar{x}_B  \bar{x})^T \hat{\theta}_B]$
 $\hat{\theta}A = \mathrm{argmin} \sum{i \in A} (a_i  \bar{a}_A  (x_i  \bar{x}_A)^T \theta)^2$
 neymanpearson
 null + alternative hypothesis
 can also take a bayesian perspective on the missing data
 neymanrubin model: $Y_i = \begin{cases} Y_i^{T=1}, &\text{if } T_i=1\Y_i^{T=0}, &\text{if } T_i=0 \end{cases}$
DAGs / structural causal models (pearl et al.)
In this framework, depict assumptions in a DAG / SEM and use it to reason about effects. For notes on graphical models, see here and basics of docalculus see here.
 comparison to potential outcomes
 easy to make clear exactly what is independent, particularly when there are many variables
 docalculus allows for answering some specific questions easily
 often difficult to come up with proper causal graph, although may be easier than coming up with ignorability assumptions that hold
 more flexibly extends to cases with many variables
 DAGs quick review (more details here)
 dseparation allows us to identify independence
 A set $S$ of nodes is said to block a path $p$ if either
 $p$ contains at least one arrowemitting node that is in $S$
 $p$ contains at least one collision node that is outside $S$ and has no descendant in $S$
 A set $S$ of nodes is said to block a path $p$ if either
 absence of edges often corresponds to qualitative judgements of conditional independence
 disentangled factorization represented by the graph can use far fewer params
 dseparation allows us to identify independence
forks  mediators  colliders 

confounder $z$, can be adjusted for  confounder can vary causal effect  conditioning on confounder z can explain away a cause 
 structural causal model (scm) gives a set of variables $X_1, … X_i$ and and assignments of the form $X_i := f_i(X_{parents(i)}, \epsilon_i)$, which tell how to compute value of each node given parents
 exogenous variables $\epsilon_i$ = noise variables = disturbances = errors  node in the network that represents all the data not collected
 are not influenced by other variables
 modeler decides to keep these unexplained
 not the same as the noise term in a regression  need not be uncorrelated with regressors
 direct causes = parent nodes
 the $:=$ notation signifies a direction
 controlling for a variable (when we have a causal graph):

$P(Y=y do(T:=t)) = \sum_z \underbrace{P(Y=y T=t, T_{parents}=z)}{\text{effect for slice}} \underbrace{P(T{parents}=z)}_{\text{weight for slice}}$ 
postintervention distribution $P(z, y do(t_0))$  result of setting $T = t_0$  also called “controlled” or “experimental” distr.
 counterfactual $Y^{T=t}(x)$ under SCM can be explicitly computed
 given structural causal model M, observed event x, action T:=t, target variable Y, define counterfactual $Y^{T=t}(x)$ in 3 steps:
 abduction  adjust noise variables to be consistent with $x$
 action  perform dointervention, setting $T=t$
 prediction  compute target counterfactual, which follows directly from $M_t$, model where action has been performed
 counterfactuals can be derived from the model, unlike potential outcomes framework where counterfactuals are taken as primitives (e.g. they are undefined quantities which other quantities are derived from)
 exogenous variables $\epsilon_i$ = noise variables = disturbances = errors  node in the network that represents all the data not collected
 identifiability
 effects are identifiable whenever model is Markovian
 graph is acyclic
 error terms are jointly independent
 in general, parents of $X$ are the only variables that need to be measured to estimate the causal effects of $X$

in this simple case, $E(Y do(x_0)) = E(Y X=x_0)$  can plug into the algebra of an SEM to see the effect of intervention
 alternatively, can factorize the graph, set $x=x_0$ and then marginalize over all variables not in the quantity we want to estimate
 this is what people commonly call “adjusting for confounders”
 this is the same as the Gcomputation formula (Robins, 1987), which was derived from a more complext set of assumptions
 effects are identifiable whenever model is Markovian
 rules of docalculus allow us to identify causal effects in general (e.g. in frontdoor adjustement)
 general criterion
 backdoor / frontdoor

unconfounded children criterion: if it is possible to block all backdoor paths from $T$ to all of its children that are ancestors of $Y$ with a single conditioning set $S$, then $P(Y=y do(T=t))$ is identifiable (tian & pearl, 2002)
 examples
 ex. correlate flips
 in this ex, $W$ and $H$ are usually correlated, so conditional distrs. are similar, but do operator of changing $W$ has no effect on $H$ (and vice versa)

notation: $P(H do(W:=1))$ or $P_{M[W:=1]}(h)$ 
ATE of $W$ on $H$ would be $P(H do(W:=1))  P(H do(W:=0))$
 ex. simple structure
 where $\begin{align}z &= f_Z(u_Z)\x&=f_X(z, u_X)\y&=f_Y(x, u_Y) \end{align}$
 ex. correlate flips
randomized experiments
Randomized experiments randomly assign a treatment, thus creating comparable treatment and control groups on average (i.e. controlling for any confounders).
design
stratification / matchedpairs design
 sometimes we want different strata, ex. on feature $X_i \in {1, …, K }$
 fully random experiment will not put same amount of points in each stratum + will have different treatment/control balance in each stratum
 stratification at design stage: stratify and run RCT for each stratum
 can do all the same fisher tests, usually by computing statistics within each stratum than aggregating
 sometimes bridge strata with global data
 ex. aligned rank statistic  normalize each $Y_i$ with stratum mean, then look at ranks across all data (hodges & lehmann 1962)
 can do neymanrubin analysis as well  CLT holds with a few large strata and many small strata
 can prove that variance for stratified estimator is smaller when stratification is predictive of outcome
 can do all the same fisher tests, usually by computing statistics within each stratum than aggregating
matchedpairs design  like stratification, but with only one point in each stratum (i.e. a pair)
 in each pair, one unit is given treatment and the other is not
 Fisherian inference
 $H_{0F}: Y_{i, j}^{T=1}=Y_{i, j}^{T=0} \; \underbrace{\forall i}{\text{all pairs}} \underbrace{\forall j}{\text{both units in pair}}$
 $\hat \tau_i = \underbrace{(2T_i  1)}{1\text{ or } 1}(Y{i1}  Y_{i2})$, where $T_i$ determines which of the two units in the pair was given treatment
 many other statistics…e.g. wilcoxon signrank statistic, sign statistic, mcnemar’s statistic
 $H_{0F}: Y_{i, j}^{T=1}=Y_{i, j}^{T=0} \; \underbrace{\forall i}{\text{all pairs}} \underbrace{\forall j}{\text{both units in pair}}$
 Neymanian inference
 $\hat \tau = \frac 1 n \sum_i \tau_i$
 $\hat V = \frac 1 {n(n1)} \sum_i (\hat \tau_i  \hat \tau)^2$
 can’t estimate variance as in stratified randomized experiment
 heuristically, matchedpairs design helps when matching is welldone and covariates are predictive of outcome (can’t check this at design stage though)
 can also perform covariate adjustment (on covariates that weren’t used for matching, or in case matching was inexact)
rerandomization balances covariates
rerandomization adjusts for covariate imbalance given covariate vector $\mathbf x$ (assume mean 0) at design stage
 covariate differenceinmeans: $\mathbf{\hat \tau_x} = \frac 1 {n_1} \sum_i T_i \mathbf x_i  \frac 1 {n_0} \sum_i (1  T_i)\mathbf x_i$
 this is for RCT
 asymptotically zero, but real value need not be
 $\textbf{cov}(\hat\tau_x) = \frac 1 {n_1} S_x^2 + \frac 1 {n_0}S_x^2 = \frac{n}{n_1 n_0}S_x^2$
 $M = \hat \tau_x^T \textbf{cov}(\hat \tau_x)^{1} \hat \tau_x$  Mahalanabois distance measures difference between treatment/control groups
 invariant to nondegenerate linear transformations of $\mathbf x$
 can use other covariate balance criteria
 rerandomization: discard treatment allocations when $M \leq m_{thresh}$
 $m_{thresh}=\infty$: RCT
 $m_{thresh}=0$: few possible treatment allocations  limits randomness; in practice try to choose very small $m_{thresh}$
 proposed by Cox (1982) + Morgan & Rubin (2012)

can derive asymptotic distr. for $\hat \tau$ (li, deng & rubin 2018)
 combining rerandomization + regression adjustment can achieve better results (li & ding, 2020)
posthoc analysis
statification / matching
 poststratification at analysis stage: condition on stratum
 can do conditional FRT or poststratified Neymanian analysis
 can often improve efficiency
 is limited, because eventually there are no points in certain strata
 this is also called aggregating estimators (e.g. aggregating differenceinmeans estimators)
 for continuous confounder, can also stratify on propensity score
regression adjustment balances covariates
 regression adjustment at analysis stage account for covariates x
 fisher random trial adjustment  2 strategies
 construct teststatistic based on residuals of statistical models
 regress $Y_i$ on $\mathbf x_i$ to obtain residual $e_i$  then treat $e_i$ as pseudo outcome to construct test statistics
 use regression coefficient as a test statistic
 regress $Y_i$ on $(T_i, \mathbf x_i$) to obtain coefficient of $T_i$ as the test statistic (Fisher’s ANCOVA estimator): $\hat \tau_F$
 Freedman (2008) found that this estimator had issues: biased, large variance, etc.
 Lin (2013) finds favorable properties of the $\hat \tau_F$ estimator
 can get minor improvements by instead using coefficient of $T_i$ in the OLS of $Y_i$ on $(T_i, \mathbf x_i, T_i \times \mathbf x_i$)
 regress $Y_i$ on $(T_i, \mathbf x_i$) to obtain coefficient of $T_i$ as the test statistic (Fisher’s ANCOVA estimator): $\hat \tau_F$
 construct teststatistic based on residuals of statistical models
quasiexperiments
Can also be called pseudoexperiments. These don’t explicitly randomize treatment, as in RCTs, but some property of the data allows us to control for confounders.
natural experiments
 hard to justify
 e.g. jon snow on cholera  something acts as if it were randomized
backdoor criterion: capture parents of T
 backdoor criterion  establishes if variables $T$ and $Y$ are confounded in a graph given set $S$
 sufficient set $S$ = admissible set  set of factors to calculate causal effect of $T$ on $Y$ requires 2 conditions
 no element of $S$ is a descendant of $T$
 $S$ blocks all paths from $T$ to $Y$ that end with an arrow pointing to $T$ (i.e. backdoor paths)
 intuitively, paths which point to $T$ are confounders
 if $S$ is sufficient set, in each stratum of $S$, risk difference is equal to causal effect

e.g. risk difference $P(Y=1 T=1, S=s)  P(Y=1 T=0, S=s)$ 
e.g. causal effect $P(Y=1 do(T=1), S=s)  P(Y=1 do(T=0), S=s)$ 
average outcome under treatment, $P(Y=y do(T=t)) = \sum_s P(Y=y T=t, S=s) P(S=s) $  sometimes called backdoor adjustment  really just covariate adjustment (same as we had in potential outcomes)

 there can be many sufficient sets, so we may choose the set which minimizes measurement cost or sampling variability
 propensity score methods help us better estimate the RHS of this average outcome, but can’t overcome the necessity of the backdoor criterion
frontdoor criterion: good mediator variable
 frontdoor criterion  establishes if variables $T$ and $Y$ are confounded in a graph given mediators $M$
 intuition: we have unknown confounder $U$, but can find a mediator $M$ between $T$ and $Y$ unaffected by $U$, we can still calculate the effect
 this is because we can calculate effect of $T$ on $M$, $M$ on $Y$, then multiply them
 $M$ must satisfy 3 conditions
 $M$ intercepts all directed paths from $T$ to $Y$ (i.e. frontdoor paths)
 there is no unblocked backdoor path from $T$ to $M$
 all backdoor paths from $M$ to $Y$ are blocked by $T$

when satisfied and $P(t, m) > 0$, then causal effect is identifiable as $P(y do(T=t)) = \sum_m P(m t) \sum_{t’} P(y t’, m) P(t’)$  smoking ex. (hard to come up with many)
 $T$ = whether one smokes
 $Y$ = lung cancer
 $M$ = accumulation of tar in lungs
 $U$ = condition of one’s environment
 only really need to know about treatment, M, and outcome
graph LR
U >Y
U > T
T > M
M > Y
regression discontinuity: running variable
 dates back to thistlethwaite & campbell, 1960
 treatment definition (e.g. highschool acceptance) has an arbitrary threshold (e.g. score on a test)
 comparing groups very close to the cutoff should basically control for confounding

easily satisfies unconfounding, but has no overlap at all ($P(T_i=1 Z_i=z)$ is always 0 or 1)  $\tau_{c}=\mathbb{E}\left[Y_{i}(1)Y_{i}(0) \mid Z_{i}=c\right]$ is identified via $\tau_{c}=\lim {z \downarrow c} \mathbb{E}\left[Y{i} \mid Z_{i}=z\right]\lim {z \uparrow c} \mathbb{E}\left[Y{i} \mid Z_{i}=z\right]$ under continuity assumptions
 can estimate via local linear regr. – kernel fits things on either side
 needs assumptions on the smoothness of the mean function
 alternatively, assume some unconfounded noise in the running variable (i.e. the variable being thresholded)
 results in deconvolutiontype estimators (see eckles et al. 2020)
differenceindifference
 differenceindifference is a name given to many methods for estimating effects in longitudinal data = panel data
 requires data from both groups at 2 or more time periods (at least one before treatment and one after)
 simple constanttreatment model: $Y_{i t}=Y_{i t}(0)+T_{i t} \tau,$ for all $i=1, \ldots, n, t=1, \ldots$
 assumes no effect heterogeneity
 assumes treatment at time only effects outcomes at time t  this can be weird
 one approach for estimation: assume twoway model: $Y_{i t}=\alpha_{i}+\beta_{t}+T_{i t} \tau+\varepsilon_{i t}, \quad \mathbb{E}[\varepsilon \mid \alpha, \beta, T]=0$
 estimator for differenceindifference with two timepoints

$\hat{\tau}=\frac{1}{\left \left{i: T_{i 2}=1\right}\right } \underbrace{\sum_{\left{i: T_{i 2}=1\right}}\left(Y_{i 2}Y_{i 1}\right)}_{\text{diff for treated group}}\frac{1}{\left \left{i: T_{i 2}=0\right}\right } \underbrace{\sum_{\left{i: T_{i 2}=0\right}}\left(Y_{i 2}Y_{i 1}\right)}_{\text{diff for untreated group}}$

 estimator for differenceindifference with two timepoints
 second approach for estimation of constanttreatment: interactive panel models $Y_{i t}=A_{i .} B_{t .}^{\prime}+T_{i t} \tau+\varepsilon_{i t}, \quad \mathbb{E}[\varepsilon \mid A, B, T]=0, \quad A \in \mathbb{R}^{n \times k}, B \in \mathbb{R}^{T \times k}$
 allows for rank $k$ matrix, less restrictive (no longer forces parallel trends for all units)
 can estimate with synthetic controls (abadie, diamon, & hainmueller, 2010)
 artificially reweight unexposed units (i.e. units with $T_i=0$) so their average trend matches the unweighted mean trend up to time $t_0$
 if the weights create thes parallel trends, they should alos balance the latent factors $A_i$

 trends are clearly not parallel, but after reweighting they become parallel
 many other estimators, e.g. via clustering or nuclear norm minimization
 third approach: designbased assumptions e.g. $Y_{i .}(0) \perp T_{i .} \mid S_{i}, \quad S_{i}=\sum_{t=1}^{T} T_{i t}$
 ex. $Y_{it}$ is health outcome, $T_{it}$ is medical treatment, $S_i$ is unobserved healthseeking behavior
 $\hat{\tau}=\sum_{i, t} \gamma_{i t} Y_{i t}$ where $\gamma$matrix depends only on treatment assignment
instrumental variable

graph LR I > T X > T T > Y

instrument $I$ measurable quantity that correlates with the treatment, and is $\underbrace{\color{NavyBlue}{\text{only related to the outcome via the treatment}}}_{\textbf{exclusion restriction}}$

precisely 3 conditions must hold for $I_i$:
 exogenous: $\varepsilon_{i} \perp I_{i}$
 relevant: $\operatorname{Cov}\left[T_{i}, I_{i}\right] \neq 0 $
 exclusion restriction: any effect of $I_{i}$ on $Y_{i}$ must be mediated via $T_{i}$
 uncheckable

intuitively, need to combine effect of instrument on treatment and effect of instrument on outcome (through treatment)
 in practice, often implemented in a 2stage least squares (regress $I \to T$ then $T\to Y$)
 $Y_{i}=\alpha+T_{i} \tau+\varepsilon_{i}, \quad \varepsilon_{i} \perp I_{i}$
 $T_{i}=I_{i} \gamma+\eta_{i}$
 most important point is that $\epsilon_i \perp I_i$
 Wald estimator = $\frac{Cov(Y, I)}{Cov(T, I)}$
 LATE = local average treatment effect  this estimate is only valid for the patients who were influenced by the instrument
 we may have many potential instruments
 in this case, can learn a funciton of the instruments via crossfitting as an instrument
 in practice, often implemented in a 2stage least squares (regress $I \to T$ then $T\to Y$)

examples
 $I$: stormy weather, $T$ price of fish, $Y$ demand for fish
 stormy weather makes it harder to fish, raising price but not affecting demand
 $I$: quarter of birth, $T$: schooling in years, $Y$: earnings (angrist & krueger, 1991)
 $I$: sibling sex composition, $T$: family size, $Y$: mother’s employment (angist & evans, 1998)
 $I$: lottery number, $T$: veteran status, $Y$: mortality
 $I$: geographic variation in college proximity, $T$: schooling, $Y$: wage (card, 1993)
 $I$: stormy weather, $T$ price of fish, $Y$ demand for fish
effect under noncompliance
 CACE $\tau_c$ (complier average causal effect) = LATE (local average treatement effect)

technical setting: noncompliance  sometimes treatment assigned $I$ and treatment received $T$ are different
 assumptions
 randomization = instrumental unconfoundedness: $I \perp {T^{I=1}, T^{I=0}, Y^{I=1}, Y^{I=0} }$
 randomization lets us identify ATE of $I$ on $T$ and $I$ on $Y$
 $\tau_{T}=E{T^{I=1}T^{I=0}}=E(T \mid I=1)E(T \mid I=0)$
 $\tau_{Y}=E{Y^{I=1}Y^{I=0}}=E(Y \mid I=1)E(Y \mid I=0)$
 intentiontotreat $\tau_Y$
 four possible outcomes:
 $C_{i}=\left{\begin{array}{ll}
\mathrm{a}, & \text { if } T_{i}^{I=1}=1 \text { and } T_{i}^{I=0}=1 \text{ always taker}
\mathrm{c}, & \text { if } T_{i}^{I=1}=1 \text { and } T_{i}^{I=0}=0\text{ complier}
\mathrm{d}, & \text { if } T_{i}^{I=1}=0 \text { and } T_{i}^{I=0}=1 \text{ defier}
\mathrm{n}, & \text { if } T_{i}^{I=1}=0 \text { and } T_{i}^{I=0}=0\text{ never taker} \end{array}\right.$
 $C_{i}=\left{\begin{array}{ll}
\mathrm{a}, & \text { if } T_{i}^{I=1}=1 \text { and } T_{i}^{I=0}=1 \text{ always taker}
 randomization lets us identify ATE of $I$ on $T$ and $I$ on $Y$
 exclusion restriction: $Y_i^{I=1} = Y_i^{I=0}$ for alwaystakers and nevertakers
 means treatment assignment affects outcome only if it affects the treatment
 monotonicity: $P(C=d) = 0$ or $T_i^{U=1} \geq T_i^{U=0} \; \forall i$  there are no defiers

testable implication: $P(T=1 I=1) \geq P(T=1 C=0)$


under these 3 assumptions, LATE $\tau_c = \frac{\tau_Y}{\tau_T} = \frac{E(Y \mid I=1)E(Y \mid I=0)}{E(T \mid I=1)E(T \mid I=0)}$
 this can be estimated with the WALD estimator
 basically just scales the intentiontotreat estimator, so usually similar conclusions
 if we have some confounders $X$, can adjust for them
 the smaller $\tau_T$ is (i.e. the more noncompliance there is), the worse properties the LATE estimator has

instrumental variable inequalities $E(Q I=1) \geq E(Q I=0)$ where $Q = TY, T(1Y), (1T)Y, T+Y  TY$

instrumental inequality (Pearl 1995)  a necessary condition for any variable $I$ to qualify as an instrument relative to the pair $(T, Y)$: $\max_T \sum_Y \left[ \max_I P(T,Y I) \right] \leq 1$  when all vars are binary, this reduces to the following:
 $P(Y=0, T=0 \mid I=0) + P(Y=1, T=0 \mid I=1) \leq 1$ $P(Y=0, T=1 \mid I=0)+P(Y=1, T=1 \mid I=1) \leq 1$ $P(Y=1, T=0 \mid I=0)+P(Y=0, T=0 \mid I=1) \leq 1$ $P(Y=1, T=1 \mid I=0)+P(Y=0, T=1 \mid I=1) \leq 1$
 randomization = instrumental unconfoundedness: $I \perp {T^{I=1}, T^{I=0}, Y^{I=1}, Y^{I=0} }$

instrumental variable criterion  sufficient condition for identifying the causal effect $P(y do(t))$ is that every path between $T$ and any of its children traces at least one arrow emanating from a measured variable  sometimes this is satisfied even when backdoor criterion is not
synthetic data experiments
 Towards causal benchmarking of bias in face analysis algorithms (balakrishnan et al. 2020)  use GANs to generate synthetic data where only attribute varies
 causal imputation / causal transport = mapping bbetween different domains, where only one domain at a time is ever observed
 MultiDomain Translation by Learning Uncoupled Autoencoders (yang & uhler 2019)
 learn autoencoders for different domains that all map to a shared latent space
 this allows to translate between different domains, by using one encoder and then a different decoder
 Causal Imputation via Synthetic Interventions (squires, …, uhler 2020)
 MultiDomain Translation by Learning Uncoupled Autoencoders (yang & uhler 2019)
observational analysis
The emphasis in this section is on ATE estimation, as an example of the considerations required for making causal conclusions. Observational analysis focuses on adjusting for observed confounding.
ATE estimation basics
 assume we are given iid samples of ${ X_i, T_i, Y_i^{T=1}, Y_i^{T=0} }$, and drop the index $i$
 $\tau = E{Y^{T=1}  Y^{T=0}}$: average treatment effect (ATE)  goal is to estimate this

$\tau_T =E{Y^{T=1}−Y^{T=0} T =1}$: ATE on the treated units 
$\tau_C =E{Y^{T=1}−Y^{T=0} T =0}$: ATE on the control units 
$\tau_{PF} = E[Y T=1]  E[Y T=0]$: prima facie causal effect 
$= E[Y^{T=1} T=1]  E[Y^{T=0} T=0]$  naive, but computable!
 generally biased, with selection biases:

$E[Y^{T=0} T=\textcolor{NavyBlue}1]  E[Y^{T=0} T=\textcolor{NavyBlue}0]$ 
$E[Y^{T=1} T=\textcolor{NavyBlue}1]  E[Y^{T=1} T=\textcolor{NavyBlue}0]$


 randomization tells us the treatment is independent of the outcome with/without treatment $T \perp {Y^{T=1}, Y^{T=0}}$, so the selection biases are zero (rubin, 1978)
 $\implies \tau = \tau_T = \tau_C$
 this is more important than balancing the distr. of covariates
 conditioning on observed X, selection bias terms are zero:

$E{Y^{T=0} T=1, X} = E{Y^{T=0} T=0, X}$ 
$E{Y^{T=1} T=1, X} = E{Y^{T=1} T=0, X}$  $\implies \tau(X) = \tau_T(X) = \tau_C(X) = \tau_{PF}(X)$

 ex. stratified ATE estimator (given discrete covariate)
regression adjustments
 ATE conditional outcome modeling (all assume ignorability)
 ex. $\tau = \beta_t$ in OLS

$E(Y T, X) = \beta_0 + \beta_t T + \beta_x^TX$  assumes treatment effect is same for all individuals

 ex. $\tau = \beta_t + \beta_{tx}^TE(X)$

$E(Y T, X) = \beta_0 + \beta_tT + \beta_x^TX + \beta^T_{tx} X T$  incorporates heterogeneity

 ex. $\tau = \beta_t$ in OLS

intuition
 much like imputing missing potential outcomes using a linear model
 using nearestneighbor regr. would correspond to a matchingwithoutreplacement estimator
 sometimes use propensity scores as a predictor as well

assuming linear form is relatively strong assumption compared to that made by weighting / stratification

regr. adjustments are the most popular form of adjustments

these easily generalize to when $T$ is continuous
 $\hat \tau = \frac 1 n \sum_i (\hat \mu_1(X_i)  \hat \mu_0(X_i))$

general mean functions $\hat \mu_1(x), \hat \mu_0(x)$ approximate $\mu_i =E{ Y^{T=i} X}$  consistent when $\mu_i$ functions are wellspecified

 overadjustment
 Mbias
 originally, $T \perp Y$ 😊
 after adjusting for X, $T \not \perp Y$ 🙁

graph LR U1 > T U1 > X U2 > X U2 > Y
 Zbias: after adjusting, bias is larger

graph LR X > T T > Y U > Y U > T
 Mbias
weighting methods
Weighting methods assign a different importance weight to each unit to match the covariates distributions across treatment groups after reweighting. Balance is often used as a goodness of fit check after weighting (imbens & rubin 2015).
inverse propensity weighting

propensity score $e(X, Y^{T=1}, Y^{T=0}) = P{T=1 X, Y^{T=1}, Y^{T=0}}$ 
under strong ignorability, $e(X)=P(T=1 X)$ 
Thm. if $\underbrace{T \perp { Y^{T=1}, Y^{T=0}} \color{NavyBlue} X}_{\text{strong ignorability on X}}$, then $\underbrace{T \perp { Y^{T=1}, Y^{T=0}} \textcolor{NavyBlue}{e(X)}}_{\text{strong ignorability on e(X)}}$  therefore, can stratify on $e(X)$, but still need to estimate $e(X)$, maybe bin it into K quantiles (and pick K)
 could combine this propensity score weighting with regression adjustment (e.g. within each stratum)
 assumes positivity
 intuition: $e(X)$ fully mediates path from $X$ to $T$

Thm. $T \perp X \; \; e(X)$. Moreover, for any function $h$, $E{\frac{T \cdot h(X)}{e(X)}} = E{\frac{(1T)h(X)}{1e(X)}}$  can use this result to check for covariate balance in design stage
 can view $h(X)$ as psuedo outcome and estimtate ATE
 if we specify it to something like $h(X)=X$, then it should be close to 0

 $\hat \tau_{ht} = \frac 1 n \sum_i \frac{T_iY_i}{\hat e(X_i)}  \frac 1 n \sum_i \frac{(1T_i)Y_i}{1\hat e(X_i)} $ = inverse propensity score weighting estimator = horvitzthompson estimator (horvitz & thompson, 1952)
 weight outcomes by $1/e(X)$ for treated individuals and $1/(1e(X))$ for untreated

Based on Thm. if $\underbrace{T \perp { Y^{T=1}, Y^{T=0}} X}_{\text{strong ignorability on X}}$, then:  $E{Y^{T=1}} = E \left { \frac{TY}{e(X)} \right }$
 $E{Y^{T=0}} = E\left { \frac{(1T)Y}{1e(X)} \right }$
 $\implies \tau = E \left { \frac{TY}{e(X)}  \frac{(1T)Y}{1e(X)} \right }$
 consistent when propensity scores are correctly specified
 intuition: assume we have 2 subgroups in our data
 if prob. of a sample being assigned treatment in one subgroup is low, should upweight it because this sample is rare + likely gives us more information
 this also helps balance the distribution of the treatment for each subgroup
 if we add a constant to $Y$, then this estimator changes (not good)  if we adjust to avoid this change, we get the Hajek estimator (hajek, 1971), which is often more stable
 scores near 0/1 are unstable  sometimes truncate (“trim”) or drop units with these scores
 fundamental problem is ovelap of covariate distrs. in treatment/control
 when score = 0 or 1, counterfactuals may not even be well defined
 stratified estimator can be seen as a particular case of IPW estimator
doubly robust estimator
 combines weighting and regr. adjustment
 $\hat \tau^{\text{dr}} = \hat \mu_1^{dr}  \hat \mu_0^{dr}$ = doubly robust estimator = augmented inverse propensity score weighting estimator (robins, rotnizky, & zhao 1994, scharfstein et al. 1999, bang & robins 2005)
 given $\mu_1(X, \beta_1)$, $\mu_0(X, \beta_0)$, e.g. linear
 given $e(X, \alpha)$, e.g. logistic
 $\tilde{\mu}{1}^{\mathrm{dr}} =E\left[ \overbrace{\mu{1}\left(X, \beta_{1}\right)}^{\text{outcome mean}} + \overbrace{\frac{T\left{Y\mu_{1}\left(X, \beta_{1}\right)\right}}{e(X, \alpha)}}^{\text{invprop residuals}}\right]$
 $\tilde{\mu}{0}^{\mathrm{dr}} =E\left[\mu{0}\left(X, \beta_{0}\right) + \frac{(1T)\left{Y\mu_{0}\left(X, \beta_{0}\right)\right}}{1e(X, \alpha)}\right]$
 augments the oucome regression mean with inversepropensity of residuals
 can alternatively augment inv propensity score weighting estimator by the outcome models:
 $\begin{aligned}
\tilde{\mu}{1}^{\mathrm{dr}} &=E\left[\frac{T Y}{e(X, \alpha)}\frac{Te(X, \alpha)}{e(X, \alpha)} \mu{1}\left(X, \beta_{1}\right)\right]
\tilde{\mu}{0}^{\mathrm{dr}} &=E\left[\frac{(1T) Y}{1e(X, \alpha)}\frac{e(X, \alpha)T}{1e(X, \alpha)} \mu{0}\left(X, \beta_{0}\right)\right] \end{aligned}$
 $\begin{aligned}
\tilde{\mu}{1}^{\mathrm{dr}} &=E\left[\frac{T Y}{e(X, \alpha)}\frac{Te(X, \alpha)}{e(X, \alpha)} \mu{1}\left(X, \beta_{1}\right)\right]
 consistent if either the propensity scores or mean functions are wellspecified:
 propensities wellspecified: $e(X, \alpha) = e(X)$
 mean functions wellspecified: $\left{\mu_{1}\left(X, \beta_{1}\right)=\mu_{1}(X), \mu_{0}\left(X, \beta_{0}\right)=\mu_{0}(X)\right}$
 in practice, often use crossfitting (split the data randomly into two halves $\mathcal I_1$ and $\mathcal I _2$)

$\hat{\tau}_{A I P W}=\frac{\left \mathcal{I}_{1}\right }{n} \hat{\tau}^{\mathcal{I}_{1}}+\frac{\left \mathcal{I}_{2}\right }{n} \hat{\tau}^{\mathcal{I}_{2}}$ 
$\hat{\tau}^{\mathcal{I}_{1}}=\frac{1}{\left \mathcal{I}_{1}\right } \sum_{i \in \mathcal{I}{1}}\left(\hat{\mu}{(1)}^{\mathcal{I}{2}}\left(X{i}\right)\hat{\mu}{(0)}^{\mathcal{I}{2}}\left(X_{i}\right)\right. \left.+W_{i} \frac{Y_{i}\hat{\mu}{(1)}^{\mathcal{I}{2}}\left(X_{i}\right)}{\hat{e}^{\mathcal{I}{2}}\left(X{i}\right)}\left(1W_{i}\right) \frac{Y_{i}\hat{\mu}{(0)}^{\mathcal{I}{2}}\left(X_{i}\right)}{1\hat{e}^{\mathcal{I}{2}}\left(X{i}\right)}\right)$  avoids bias due to overfitting
 allows us to ignore form of estimators $\hat \mu$ and $\hat e$ and depend only on overlap, consistency, and risk decay (so CV risk of estimators should be small)

 targeted maximum likelihood (van der laan & rubin, 2006)  more general than DRE
alternative weighting
 solutions to deal with extreme weights (from assaad et al. 2020):
 Matching Weights (Li & Greene, 2013): MW
 Truncated IPW (Crump et al., 2009): TruncIPW
 Overlap Weights (Li et al., 2018): OW  this uses (1  IPW weights), and doesn’t suffer from instability at extreme values
 when estimating propensity scores with neural nets, often overconfident
 in general, there is a more general class of weights that can be used to balance covariates (li, morgan, & zaslavsky, 2016)
 directly incorporating covariate imbalance in weight construction (Graham et al., 2012; Diamond & Sekhon, 2013)
 unifying perspective on these methods via covariatebalancing loss functions (zhao, 2019)
 in general, balancing need not directly balance the propensity scores
 instead, might find propensity weights which balance covariates along certain basis functions $\psi_j(x)$
 $\frac{1}{n} \sum_{i=1}^{n} \frac{W_{i} \psi_{j}\left(X_{i}\right)}{\hat{e}\left(X_{i}\right)} \approx \frac{1}{n} \sum_{i=1}^{n} \psi_{j}\left(X_{i}\right),$ for all $j=1,2, \ldots$
 this can be desirable in high dims, when propensity scores may be unstable
 instead, might find propensity weights which balance covariates along certain basis functions $\psi_j(x)$
 hard momentmatching conditions (Li & Fu, 2017; entropy balancing from Hainmueller, 2012; Imai & Ratkovic, 2014)
 soft momentmatching conditions (Zubizarreta, 2015)
 approximate residual balancing (athey, imbens, & wager, 2018)  combines balancing weights with a regularized regression adjustment for learning ATE from highdimensional data
 Learning Causal Effects via Weighted Empirical Risk Minimization (jung et al. 2020)  estimate any computation (which usually requires actually modeling individual conditional probabilities) using weighted empirical risk minimization
stratification / matching
Matching methods choose try to to equate (or “balance’’) the distribution of covariates in the treated and control groups by picking wellmatched samples of the original treated and control groups
 Matching methods for causal inference: A review and a look forward (stuart 2010)
 matching basics
 includes 1:1 matching, weighting, or subclassification
 linear regression adjustment (so not matching) can actually increase bias in the estimated treatment effect when the true relationship between the covariate and outcome is even moderately nonlinear, especially when there are large differences in the means and variances of the covariates in the treated and control groups
 matching distance measures
 propensity scores summarize all of the covariates into one scalar: the probability of being treated
 defined as the probability of being treated given the observed covariates
 propensity scores are balancing scores: At each value of the propensity score, the distribution of the covariates X defining the propensity score is the same in the treated and control groups – usually this is logistic regresion
 if treatment assignment is ignorable given the covariates, then treatment assignment is also ignorable given the propensity score:
 hard constraints are called “exact matching”  can be combined with other methods
 mahalanabois distance
 propensity scores summarize all of the covariates into one scalar: the probability of being treated
 matching methods
 stratification = crosstabulation  only compare samples when confounding variables have same value
 nearestneighbor matching  we discard many samples this way (but samples are more similar, so still helpful)
 optimal matching  consider all potential matches at once, rather than one at a time
 ratio matching  could match many to one (especially for a rare group), although picking the number of matches can be tricky
 with/without replacement  with seems to have less bias, but more practical issues
 subclassification/weighting: use all the data  this is nice because we have more samples, but we also get some really poor matches
 subclassification  stratify score, like propensity score, into groups and measure effects among the groups
 full matching  automatically picks the number of groups
 weighting  use propensity score as weight in calculating ATE (also know as inverse probability of treatment weighting)
 common support  want to look at points which are similar, and need to be careful with how we treat points that violate similarity
 genetic matching  find the set of matches which minimize the discrepancy between the distribution of potential confounders
 diagnosing matches  are covariates balanced after matching?
 ideally we would look at all multidimensional histograms, but since we have few points we end up looking at 1d summaries
 one standard metric is difference in means of each covariate, divided by its stddev in the whole dataset
 analysis of the outcome  can still use regression adjustment after doing the matching to clean up residual covariances
 unclear how to propagate variance from matching to outcome analysis
 matching basics
 matching does not scale well to higher dimensions (abadie & imbens, 2005), improving balance for some covariates at the expense of others
 biascorrected matching estimator averages over all matches
 if we get perfect matches on covariates $X$ (or propensity score), it is just like doing matched design
 in practice, we only get approximate matches

for a unit, we take its value and impute its counterfactual as $\frac 1 { matches } \sum_{\text{match} \in {\text{matches}}} Y_{\text{match}}$  Abadie and Imbens (2011) add bias correction term
 requires complex bootstrap procedure to obtain variance
 biascorrected matching estimator is very similar to doubly robust estimator
 $\begin{aligned}
\hat{\tau}^{\mathrm{mbc}} &=n^{1} \sum_{i=1}^{n}\left{\hat{\mu}{1}\left(X{i}\right)\hat{\mu}{0}\left(X{i}\right)\right}+n^{1} \sum_{i=1}^{n}\left{\left(1+\frac{K_{i}}{M}\right) T_{i} \hat{R}{i}\left(1+\frac{K{i}}{M}\right)\left(1T_{i}\right) \hat{R}{i}\right}
\hat{\tau}^{\mathrm{dr}} &=n^{1} \sum{i=1}^{n}\left{\hat{\mu}{1}\left(X{i}\right)\hat{\mu}{0}\left(X{i}\right)\right}+n^{1} \sum_{i=1}^{n}\left{\frac{T_{i} \hat{R}{i}}{\hat{e}\left(X{i}\right)}\frac{\left(1T_{i}\right) \hat{R}{i}}{1\hat{e}\left(X{i}\right)}\right} \end{aligned}$
 $\begin{aligned}
\hat{\tau}^{\mathrm{mbc}} &=n^{1} \sum_{i=1}^{n}\left{\hat{\mu}{1}\left(X{i}\right)\hat{\mu}{0}\left(X{i}\right)\right}+n^{1} \sum_{i=1}^{n}\left{\left(1+\frac{K_{i}}{M}\right) T_{i} \hat{R}{i}\left(1+\frac{K{i}}{M}\right)\left(1T_{i}\right) \hat{R}{i}\right}
misc methods
 double machine learning
 first
 fit a model to predict $Y$ from $X$ to get $\hat Y$
 fit a model to predict $T$ from $X$ to get $\hat T$
 next, predict $Y \hat Y$ from $T  \hat T$, which has no in some sense removed effects of $X$
 ex. Chernozhukov et al. (2018), ‘Double/debiased machine learning for treatment and structural parameters’
 ex. Felton (2018), Chernozhukov et al. on Double / Debiased Machine Learning
 ex. Syrgkanis (2019), Orthogonal/Double Machine Learning
 ex. Foster and Syrgkanis (2019), Orthogonal Statistical Learning
 first
assumptions
common assumptions
 unconfoundedness
 exchangeability $T \perp !!! \perp { Y^{t=1}, Y^{t=0}}$ = exogeneity = randomization: the value of the counterfactuals doesn’t change based on the choice of the treatment
 intuition: we could have exchanged the treatment groups and done this experiment
 there are weaker forms of this assumption, such as mean exchangeability (where only means are exchangeable)

ignorability $T \perp !!! \perp { Y^{T=1}, Y^{T=0}} X $ = unconfoundedness = selection on observables = conditional exchangeability  very hard to check!
 rubin 1978; rosenbaum & rubin 1983
 this is strong ignorability, weak ignorability is a little weaker
 this will make selection biases zero

$E[Y^{T=0} T=\textcolor{NavyBlue}1]  E[Y^{T=0} T=\textcolor{NavyBlue}0]$ 
$E[Y^{T=1} T=\textcolor{NavyBlue}1]  E[Y^{T=1} T=\textcolor{NavyBlue}0]$

 graph assumptions
 backdoor criterion, frontdoor criterion, unconfounded children criterion
 basics in a graph
 modularity assumptions
 markov assumptions
 exclusion restrictions: For every variable $Y$ having parents $PA(Y)$ and for every set of endogenous variables $S$ disjoint of $PA(Y)$, we have $Y^{PA(y)} = Y^{PA(Y), S}$
 fixing a variables parents fully determines it
 independence restrictions: variables are independent of noise variables given their markov blanket
 exclusion restrictions: For every variable $Y$ having parents $PA(Y)$ and for every set of endogenous variables $S$ disjoint of $PA(Y)$, we have $Y^{PA(y)} = Y^{PA(Y), S}$
 exchangeability $T \perp !!! \perp { Y^{t=1}, Y^{t=0}}$ = exogeneity = randomization: the value of the counterfactuals doesn’t change based on the choice of the treatment
 overlap / imbalance assumption

positivity $P(T=1 x) > 0 : \forall x$ = overlap = common support  probability of receiving any treatment is positive for every individual  without overlap, assumption of linear effect can be very strong (e.g. extrapolates out of sample)
 strong overlap assumption: propensity scores are bounded away from 0 and 1
 $0 < \alpha_L \leq e(X) \leq \alpha_U < 1$
 when $e(X)=$0 or 1, potential outcomes are not even welldefined (king & zeng, 2006)
 things can be matched
 overlap and unconfoundedness trade off (e.g. see amour et al. 2020)
 strong overlap also implies moment bounds for covariate balance

 treatment assumptions
 treatment is not ambiguous
 no interference: my outcome is unaffected by anyone else’s treatment
 $Y_i(t_1, …,t_n) = Y_i(t_i)$
 consistency: $Y=Y^{t=0}(1T) + Y^{t=1}T$  outcome agrees with the potential outcome corresponding to the treatment indicator
 this grounds the definition of the counterfactuals $Y^{t=0}, Y^{t=1}$
 alternatively, $T=t \implies Y=Y(t)$
 basically means treatment is welldefined
 stable unit treatment value assumption (SUTVA) = nointerference assumption (Rubin, 1980): $Y_i = Y_i(T_i)$
 means unit $i$’s outcome is a function of unit $i$’s treatment
 no interference
 consistency (+deterministic potential outcomes)
 is it something that could be intervened on?
 can we intervene on something like Race? Soln: intervene on perceived race
 can we intervene on BMI? many potential interventions: e.g. diet, exercise
 SCM counterfactuals further assume 2 things:
 $Y_{yz} = y \quad \forall y, \text{ subsets }Z\text{, and values } z \text{ for } Z$
 ensures interventions $do(Y=y)$ results in $Y=y$, regardless of concurrent interventions, say $do(Z=z)$
 $X_z = x \implies Y_{xz} = Y_z \quad \forall x, \text{ subsets }Z\text{, and values } z \text{ for } Z$
 generalizes above
 ensures interventions $do(X=x)$ results in appropriate counterfactual, regardless of holding a variable fixed, say $Z=z$
 $Y_{yz} = y \quad \forall y, \text{ subsets }Z\text{, and values } z \text{ for } Z$
 assumptions for ATE being identifiable: exchangeability (or ignorability) + consistency, positivity
 Independent Causal Mechanisms (ICM) Principle: The causal generative process of a system’s variables is composed of autonomous modules that do not inform or influence each other.
 In the probabilistic case, this means that the conditional distribution of each variable given its causes (i.e., its mechanism) does not inform or influence the other mechanisms.
assumption checking
check ignorability: use auxilary outcome

negative outcome  assume we have a secondary informative outcome $Y’$

$Y’$ is similar to $Y$ in terms of confounding: if we believe $T \perp Y(t) \mid X$ we also think $T \perp Y’(t) \mid X$

we know the expected effect of $T$ on $Y’$ (ex. it should be 0)

$\tau(T \to Y’) = E{Y’^{T=1}  Y’^{T=0})}$

 ex. cornfield et al. 1959  effect of smoking on car accident

graph LR X(X) >Y'(Y') X > Y X > T T > Y
 negative exposure  assume we have a secondary informative treatment $T’$
 $T’$ is similar to $T$ in terms of confounding: if we believe $T \perp Y(t) \mid X$ we also think $T’ \perp Y(t) \mid X$
 we know the expected effect of $T’$ on $Y$ (ex. it should be 0)
 $\tau(T’ \to Y) = E{Y^{T’=1}  Y^{T’=0}}$
 ex. maternal exposure $T$ and parental exposure $T’$

graph LR X >T' X > Y X > T T > Y
check ignorability impact: sensitivity analysis wrt unmeasured confounding

sensitivity analysis measures how “sensitive” a model is to changes in the value of the parameters / structure / assumptions of the model

we focus on sensitivity analyses wrt unmeasured confounding  how strong the effects of an unobserved covariate $U$ on the exposure and/or the outcome would have to be to change the study inference (estimated effect of $T$ on $Y$)

there are many types of such analyses (for a review, see liu et al. 2013)

ex. rosenbaumtype: interested in finding thresholds of the following association(s) that would render test statistic of the study inference insignificant
 e.g. true oddsratio of outcome and treatment, adjusted for $X$ and $U$
 unobserved confounder and the treatment (oddsratio $OR_{\text{UT}}$)
 and/or unobserved confounder and the outcome (oddsratio $OR_{\text{UY}}$)
 usually used after matching
 3 types
 primal sensitivity analysis: vary $OR_{\text{UT}}$ with $OR_{\text{UY}} = \infty$
 dual sensitivity analysis: vary $OR_{\text{UY}}$ with $OR_{\text{UT}}=\infty$
 simultaneous sensitivity analysis: vary both vary $OR_{\text{UT}}$ and $OR_{\text{UY}}$
 e.g. true oddsratio of outcome and treatment, adjusted for $X$ and $U$
 ex. confidenceinterval methods: quantify unobserved confounder under specifications then arrive at target of interest + confidence interval, adjusted for $U$
 first approach: use association between $T, Y, U$ to create data as if $U$ was observed

specifications: $P(U T=0$), $P(U T=1)$ (greenland, 1996)  specifications: $OR_{\text{UY}}$ and $OR_{\text{UT}}$, $U$ evenly distributed (harding, 2003)

given either of these specifications, can fill out the table of $X, Y U$, then use these weights to recreate data and fit weighted logistic regression

 second approach: use association between $T, Y, U$ to compute adjustment
 this approach can relax some assumptions, such as the nothreewayinteraction assumption

specifications: $P(U T=1), P(U T=0), OR_{\text{UY T=1}}, OR_{\text{UY T=0}}$(lin et al. 1998) 
specifications: $P(U T=1), P(U T=0), OR_{\text{UY}}$ (vanderweele & arah, 2011)
 first approach: use association between $T, Y, U$ to create data as if $U$ was observed
 ex. cornfieldtype (cornfield et al. 1959)  seminal work

ignorability does not hold $T \not \perp { Y^{T=1}, Y^{T=0}} X $ 
latent ignorability holds $T \perp { Y^{T=1}, Y^{T=0}} (X, U) $  true causal effect (risk ratio): $\mathrm{RR}_{x}^{\text {true }}=\frac{\operatorname{pr}{Y^{T=1}=1 \mid X=x}}{\operatorname{pr}{Y^{T=0}=1 \mid X=x}}$
 observed: $\mathrm{RR}_{x}^{\mathrm{obs}}=\frac{\operatorname{pr}(Y=1 \mid T=1, X=x)}{\operatorname{pr}(Y=1 \mid T=0, X=x)}$
 we can ask how strong functions of U, T, Y all given X must be to explain away an observed association
 $\mathrm{RR}_{T U \mid x}=\frac{\operatorname{pr}(U=1 \mid T=1, X=x)}{\operatorname{pr}(U=1 \mid T=0, X=x)}$
 $\mathrm{RR}_{U Y \mid x}=\frac{\operatorname{pr}(Y=1 \mid U=1, X=x)}{\operatorname{pr}(Y=1 \mid U=0, X=x)}$
 Thm: under latent ignorability, $\mathrm{RR}{x}^{\mathrm{obs}} \leq \frac{\mathrm{R} \mathrm{R}{T U \mid x} \mathrm{RR}{U Y \mid x}}{\mathrm{RR}{T U \mid x}+\mathrm{RR}_{U Y \mid x}1}$


ex. bounds on direct effects with confounded intermediate variables (Cai et al. 2008)
 bounds with no assumptions (manski 1990)
 worstcase bounds are very poor
 also bounds with monotone treatment response (manski, 1997)
 bounds with optimal treatment selection (i.e. individuals always receive the treatement that is best for them) (manski, 1990, “nonparametric bounds on treatment effects”)

Sensitivity Analysis in Observational Research: Introducing the EValue (vanderweele & ding, 2017)
 Evalue = min strength of association (risk ratio) that an unmeasured confounder would require with both $T$ and $Y$ to explain away a specific treatmentoutcome association, conditional on $X$
 higher = more causal
 Evalue = min strength of association (risk ratio) that an unmeasured confounder would require with both $T$ and $Y$ to explain away a specific treatmentoutcome association, conditional on $X$
 Interpretable Sensitivity Analysis for Balancing Weights (soriano et al. 2021)  percentile bootstrap procedure applied to balancing weights estimators