The Pragmatic Scientist's Guide to Causal Inference

My notes from the course taught by Edward Kennedy at CMU. Work in progress: feel free to email me or leave annotations with feedback.

1. Identification

Complementary readings:

Setting the Target: Precise Estimands and the Gap Between Theory and Empirics. Ian Lundberg, Rebecca Johnson, Brandon Stewart. Working Paper.
Potential Outcome and Directed Acyclic Graph Approaches to Causal Inference. Guido W. Imbens. Journal of Economic Literature.

What is an estimand \(\psi\)? An estimand is a statistical functional: a map that takes a distribution \(\mathbb{P}\) and returns a number \(\psi(\mathbb{P})\). Common examples of estimands are means and variances.
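To make the idea of a functional concrete, here is a minimal sketch in which each distribution is represented by a small made-up sample, and the functional is evaluated at that sample's empirical distribution. The names `mean_functional`, `variance_functional`, `sample_p` and `sample_q` are my own, chosen for illustration.

```python
from statistics import fmean, pvariance

def mean_functional(sample):
    """psi(P) = E_P[X], evaluated at the empirical distribution of `sample`."""
    return fmean(sample)

def variance_functional(sample):
    """psi(P) = Var_P[X], evaluated at the empirical distribution of `sample`."""
    return pvariance(sample)

# Two different distributions, represented by small hypothetical samples.
sample_p = [1.0, 2.0, 3.0, 4.0]
sample_q = [0.0, 0.0, 10.0, 10.0]

# The same functional can be applied to either distribution.
for name, sample in [("P", sample_p), ("Q", sample_q)]:
    print(name, mean_functional(sample), variance_functional(sample))
```

The point is only that the functional is defined independently of the distribution it is applied to.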

Causal inference involves 3 tasks:

Choose the target parameter: The target parameter is an estimand that involves counterfactual quantities. The choice of target parameter is motivated by a substantive question. An example of a target parameter is the average treatment effect (ATE), given by the mean of a counterfactual difference, \(\mathbb{E}[Y^1 - Y^0]\).
Prove identification: The target parameter is a statistical functional of counterfactual quantities. An identification proof connects the target parameter with observed data. This necessarily involves making additional assumptions, called identification assumptions.
Perform estimation and inference: Once an identification proof connects the target parameter to the observed data, estimating and performing inference on the target parameter is a purely statistical problem. Usually, estimation and inference involve making additional statistical assumptions.

It is important to separate the assumptions needed for identification (which are typically untestable and must be argued for using substantive knowledge) from the assumptions needed for estimation and inference (which are often testable using the observed data).

Identification also has a formal definition. Let \(\mathbb{P'}\) and \(\mathbb{Q'}\) be any two counterfactual distributions, and let \(\mathbb{P}\) and \(\mathbb{Q}\) be the observed distributions they induce. A target parameter \(\psi\) is identified iff different values of the target parameter always produce different observed distributions:

\[\psi(\mathbb{P'}) \neq \psi(\mathbb{Q'}) \implies \mathbb{P} \neq \mathbb{Q}\]

I found it easier to understand this definition by considering an example in which \(\psi\) is not identified.

Let \(\mathbb{P}\) and \(\mathbb{Q}\) be distributions of the observed outcome \(Y\), and \(\mathbb{P'}\) and \(\mathbb{Q'}\) be distributions of the counterfactual \(Y^1\). Let \(\psi\) be the mean. Then \(\psi\) is not identified if there exist \(\mathbb{P'}\) and \(\mathbb{Q'}\) with \(\psi(\mathbb{P'}) \neq \psi(\mathbb{Q'})\) but \(\mathbb{P} = \mathbb{Q}\). In that case the counterfactual mean \(\mathbb{E}[Y^1]\) differs between the two distributions, yet the observed distribution of \(Y\) (and therefore the observed mean \(\mathbb{E}[Y]\)) is identical, so no amount of observed data can tell the two apart.
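A tiny numerical version of this example may help. The sketch below builds two hypothetical two-unit "worlds" that disagree about \(\mathbb{E}[Y^1]\) but produce exactly the same observed outcomes. It assumes a binary treatment \(A\) and that we observe \(Y^1\) for treated units and \(Y^0\) otherwise; the worlds and function names are invented for illustration.

```python
from statistics import fmean

# Each unit is (A, Y0, Y1): treatment received and both counterfactual outcomes.
# Both worlds are hypothetical and chosen only for illustration.
world_p = [(1, 0, 1), (0, 0, 1)]  # Y^1 = 1 for every unit
world_q = [(1, 0, 1), (0, 0, 0)]  # Y^1 = 0 for the untreated unit

def observed_outcomes(world):
    """Observed Y: Y^1 is seen for treated units, Y^0 for untreated units."""
    return sorted(y1 if a == 1 else y0 for a, y0, y1 in world)

def counterfactual_mean_y1(world):
    """Target parameter psi = E[Y^1]; computable only if Y^1 were known for everyone."""
    return fmean([y1 for _, _, y1 in world])

same_observed = observed_outcomes(world_p) == observed_outcomes(world_q)
psi_p = counterfactual_mean_y1(world_p)
psi_q = counterfactual_mean_y1(world_q)
print(same_observed, psi_p, psi_q)
```

The two worlds differ only in the \(Y^1\) of the untreated unit, which is never observed. That is why the observed data cannot separate them without a further assumption.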

2. Simple Randomized Experiments

Data: \((A, Y) \sim \mathbb{P}\) where \(A \in \{0,1\}\) is the treatment and \(Y \in \mathbb{R}\) is the outcome.

Target parameter: \(\psi = \mathbb{E}[Y^a]\) for \(a = 0, 1\).

Identification assumptions:

Consistency: \(Y = Y^a\) whenever \(A = a\)
Randomization: \(A \perp\!\!\!\perp (Y^0, Y^1)\)

Identification proof:

\[\begin{align} \mathbb{E}[Y^a] &= \mathbb{E}[Y^a | A=a] & \textrm{(randomization)}\\ &= \mathbb{E}[Y | A=a] & \textrm{(consistency)} \end{align}\]

Hence, \(\mathbb{E}[Y^1]\) can be estimated by the sample mean of the outcome among the treated units, and \(\mathbb{E}[Y^0]\) by the sample mean of the outcome among the untreated units.
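The identification result can be checked in a simulation, where (unlike in real data) both counterfactuals are visible. This is a minimal sketch under assumptions of my own: a standard normal \(Y^0\), a constant effect of 2, and a fair coin flip for treatment. None of these choices come from the course.

```python
import random
from statistics import fmean

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 100_000

units = []
for _ in range(n):
    y0 = rng.gauss(0.0, 1.0)             # hypothetical counterfactual under control
    y1 = y0 + 2.0                        # hypothetical constant treatment effect of 2
    a = 1 if rng.random() < 0.5 else 0   # randomization: A ignores (Y^0, Y^1)
    y = y1 if a == 1 else y0             # consistency: Y = Y^a whenever A = a
    units.append((a, y, y0, y1))

# Counterfactual means: available here only because the data are simulated.
true_mean_y1 = fmean([y1 for _, _, _, y1 in units])
true_mean_y0 = fmean([y0 for _, _, y0, _ in units])

# Observed conditional means: what an analyst could actually compute.
obs_mean_treated = fmean([y for a, y, _, _ in units if a == 1])
obs_mean_control = fmean([y for a, y, _, _ in units if a == 0])

print(true_mean_y1, obs_mean_treated)
print(true_mean_y0, obs_mean_control)
```

With a large sample, the two numbers printed on each line should be close, up to sampling noise. If the coin flip were replaced by an assignment rule that depends on `y0`, the gap would generally not shrink with `n`.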

This is a good time to introduce some convenient notation. Let \(\mathbb{P}_n(X) = \frac{1}{n}\sum_{i=1}^n X_i\) denote the sample mean of \(X\) over \(n\) observations. The sample analogues of the two conditional means, written with a hat to distinguish them from the population quantities, are:

\[\begin{align} \widehat{\mathbb{E}}[Y|A=1] &= \frac{\sum_{i=1}^n Y_i A_i}{\sum_{i=1}^n A_i} = \frac{\mathbb{P}_n(YA)}{\mathbb{P}_n(A)}\\ \widehat{\mathbb{E}}[Y|A=0] &= \frac{\sum_{i=1}^n Y_i (1-A_i)}{\sum_{i=1}^n (1-A_i)} = \frac{\mathbb{P}_n(Y(1-A))}{\mathbb{P}_n(1-A)} \end{align}\]
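As a quick check on the notation, the sketch below computes the ratios \(\mathbb{P}_n(YA)/\mathbb{P}_n(A)\) and \(\mathbb{P}_n(Y(1-A))/\mathbb{P}_n(1-A)\) on a made-up data set and compares them with the subgroup means computed directly. The helper name `pn` and the toy data are assumptions for illustration.

```python
from statistics import fmean

def pn(values):
    """Empirical mean operator P_n: the average of `values` over the n observations."""
    return fmean(values)

# Hypothetical toy data: treatment indicators and outcomes for n = 6 units.
A = [1, 0, 1, 1, 0, 0]
Y = [3.0, 1.0, 4.0, 5.0, 2.0, 0.0]

# Ratio form, mirroring the notation in the text.
mean_treated = pn([y * a for y, a in zip(Y, A)]) / pn(A)
mean_control = pn([y * (1 - a) for y, a in zip(Y, A)]) / pn([1 - a for a in A])

# Direct form: average the outcome within each treatment group.
direct_treated = fmean([y for y, a in zip(Y, A) if a == 1])
direct_control = fmean([y for y, a in zip(Y, A) if a == 0])

print(mean_treated, direct_treated)
print(mean_control, direct_control)
```

The two forms agree because the factor \(1/n\) cancels in the ratio. Both require at least one unit in the relevant treatment group, otherwise the denominator is zero.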

2.1 The Difference-in-Means Estimator

Statistical assumptions:

Consistency:

Inference:

Python implementation:
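A minimal sketch of one way to implement a difference-in-means estimate of the ATE \(\mathbb{E}[Y^1 - Y^0]\), using only the standard library. The function name `difference_in_means`, the unpooled variance estimate, and the normal-approximation interval with `z = 1.96` are my assumptions, not something taken from the course; treat it as an illustration rather than a reference implementation.

```python
import math
from statistics import fmean, variance

def difference_in_means(A, Y, z=1.96):
    """Return (estimate, standard error, (ci_low, ci_high)).

    The estimate is the sample mean of Y among treated units (A == 1)
    minus the sample mean of Y among untreated units (A == 0).
    """
    treated = [y for a, y in zip(A, Y) if a == 1]
    control = [y for a, y in zip(A, Y) if a == 0]
    if len(treated) < 2 or len(control) < 2:
        raise ValueError("need at least two treated and two untreated units")

    estimate = fmean(treated) - fmean(control)

    # Assumption: unpooled variance estimate (each group's sample variance
    # divided by its group size), combined with a normal approximation.
    se = math.sqrt(variance(treated) / len(treated)
                   + variance(control) / len(control))
    return estimate, se, (estimate - z * se, estimate + z * se)

# Hypothetical toy data, for illustration only.
A = [1, 0, 1, 1, 0, 0]
Y = [3.0, 1.0, 4.0, 5.0, 2.0, 0.0]
estimate, se, ci = difference_in_means(A, Y)
print(estimate, se, ci)
```

With only six units the normal approximation is not trustworthy; the toy data are there to show the mechanics, not to produce a meaningful interval.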

2.2 The Horvitz-Thompson Estimator