Chapter 7 Introduction to Causal Inference

Jingyue Huang

In this workshop, we will introduce the basic concepts and framework in causal inference, followed by some commonly used methods, including matching, stratification, inverse probability weighting, and doubly robust estimation. Some materials can be found in the books “Causal Inference in Statistics, Social, and Biomedical Science” (Imbens and Rubin 2015) and “Causal Inference: What If” (Hernán and Robins 2020). I also made use of course slides from STAT 931 offered by Professor Yeying Zhu when I was taking this course.

7.1 Basic Concepts

For causal inference, we are interested in causation. There are two notions of causation. One is the causes of an outcome, such as “what causes lung cancer?” The other one is the effect of a cause. For example, we may ask: “Does smoking cause lung cancer?” and “How strong is the effect?” In this workshop, we focus on the effect of a cause.

Note: Correlation/association does not imply causation.

Example: Weight and hight are associated with each other. But more weight will not cause someone higher.

7.1.1 Notation

Suppose we have data on subjects \(i=1,\dots, n\)

  • \(X_i=(X_{i1}, \dots, X_{ip})^\intercal\): covariates/potential confounders
  • \(T_i\): treatment assignment; \(T_i=1\) if treated and \(T_i=0\) if untreated (control)
  • \(Y_i\): observed outcome

Potential outcomes

  • \(Y_{1i}\): potential outcome if treated
  • \(Y_{0i}\): potential outcome if untreated
  • Note that \(Y_i=T_iY_{1i}+(1-T_i)Y_{0i}\).

7.1.2 Causal Effect

The individual-level causal effect for subject \(i\): \(Y_{1i}-Y_{0i}\)

Average treatment Effect (ATE): \(\theta=E(Y_{1i}-Y_{0i})=E(Y_{1i})-E(Y_{0i})\)

Confounders: Covariates which are associated with the treatment assignment and potential outcomes simultaneously.

Example: Gender (\(X\)), Smoking (\(T\)), Life expectancy (\(Y\))

Association: \(E(Y_{1i}\mid T_i=1)-E(Y_{0i}\mid T_i=0)\)

7.1.3 Assumptions

  • Strongly Ignorable Treatment Assignment (SITA). The treatment indicator (\(T\)) and the response variables (\(Y_{1}\), \(Y_{0}\)) are independent given the set of covariates (\(X\)); \((Y_{1}, Y_{0}) \perp T\mid X\).
  • Stable Unit Treatment Value Assumption (SUTVA). Each subject’s potential outcomes are not influenced by the actual treatment status of other subjects; \((Y_{1i}, Y_{0i}) \perp T_j\) for \(i\neq j\).
  • Positvity. \(0<P(T=1\mid X=\boldsymbol{x})<1\) for any possible value \(\boldsymbol{x}\).

7.1.4 Propensity Score

The propensity score for the treatment assignment is defined as the conditional probability of choosing treatment given the covariates and response variables, that is, \(\tau=P(T=1\mid Y_{1}, Y_{0}, X)\). Under the SITA assumption, it is true that the propensity score \(\tau=P(T=1\mid X=x)\), which is a function of \(x\).

Properties:

  • Propensity score is a balancing score; \(X \perp T\mid \tau\).
  • If the treatment is strongly ignorable given \(X\), i.e., \((Y_{1}, Y_{0}) \perp T\mid X\), then it is strongly ignorable given \(\tau\), i.e., \((Y_{1}, Y_{0}) \perp T\mid \tau\).

Propensity scores are unknown and need to be estimated. Indeed, we model \(T\) (assuming binary) as a function of \(X\). Parametric (logistic or probit regression) or nonparametric (random forest, generalized boosted model, etc.) methods can be applied.

7.2 Commonly Used Methods

The general framework for the propensity score based methods involves two steps:

  • Get the estimated propensity scores \(\hat{\tau}_i\) for \(i=1, \dots, n\) based on the available data \((T_i, X_i), i=1, \dots, n\);
  • Using \(\hat{\tau}_i\) to adjust the original sample and estimate the average treatment effect.

7.2.1 Matching

Basic idea: \((Y_{1}, Y_{0}) \perp T\mid X\).

For each subject in the treated group, if we can find an untreated subject with the same (or similar) covariates (\(X\)), they can form a matched dataset where the property \((Y_{1}, Y_{0}) \perp T\) holds. This means we can estimate the average treatment effect as in a randomized study.

Problem: If the size of \(X\) is moderate or high-dimensional, it is hard to get a matched dataset (curse of dimensionality).

Solution: Use \(\tau(X)\) instead of \(X\) because \((Y_{1}, Y_{0}) \perp T\mid \tau\).

Algorithm: One-to-one nearest available matching on estimated propensity scores.

  • Randomly order the treated and untreated (control) subjects;
  • Select the first treated subject and find the control subject with the closest propensity score;
  • Both subjects are then removed from the pool, and then repeat the second step until all the treated subjects are matched;
  • Estimate the causal effects as in a randomized study, e.g., \[ \hat{\theta}_{\mathrm{M}}=\frac{\sum_{i\in M}T_i Y_{1i}}{\sum_{i\in M} T_i}-\frac{\sum_{i\in M}(1-T_i) Y_{0i}}{\sum_{i\in M} (1-T_i)}\, , \] where \(M\) denotes the matched dataset.

There are other versions of the matching algorithm: one-to-one versus one-to-m, with replacement versus without replacement, etc.

7.2.2 Stratification

Basic idea: Consider strata, \(S_1, \dots, S_K\), \[ E(Y_1-Y_0)=\sum_{k=1}^KE(Y_1-Y_0\mid X\in S_k)P(X\in S_k)\, , \] where we have balance within each stratum.

  • \(E(Y_1-Y_0\mid X\in S_k)\) can be estimated as in a randomized study using data in \(S_k\);
  • \(P(X\in S_k)\) can be approximated by (number of subjects in \(S_k\))\(/n\).

We may encounter the same problem as before when \(X\) is high-dimensional. Again, we can solve this problem by replacing \(X\) with the propensity score \(\tau(X)\): \[ E(Y_1-Y_0)=\sum_{k=1}^KE(Y_1-Y_0\mid \tau(X)\in S_k)P(\tau(X)\in S_k)\, , \] where individuals have similar, but not identical, values of \(\tau\), in each stratum.

Algorithm:

  • Divide the units into \(K\) subclasses based on the quantiles of \(\hat{\tau}_i\)’s (\(i=1,...,n\));
  • Estimate the average treatment effect within each subclass as in a randomized study and take the average of the estimated values across all the subclasses; mathematically, the estimate for the average treatment effect is given by \[ \hat{\theta}_{\mathrm{S}}=\frac{1}{K}\sum_{ k=1}^{K}\left(\frac{\sum_{i\in S_k} Y_{1i}T_i}{\sum_{i\in S_k} T_i}-\frac{\sum_{i\in S_k} Y_{0i}(1-T_i)}{\sum_{i\in S_k} (1-T_i)}\right)\, , \] where \(S_k\) means the \(k\)-th subclass.

How to choose \(K\)? Usually 5.

7.2.3 Inverse Probability Weighting

Basic idea:

\[ E\left\{\frac{TY}{\tau(X)}\right\}=E(Y_1) \text{ and } E\left\{\frac{(1-T)Y}{1-\tau(X)}\right\}=E(Y_0)\, . \]

The inverse probability weighted estimator of the average treatment effect is defined as \[ \hat{\theta}_{\mathrm{IPW}_1}=\frac{1}{n} \sum_{i=1}^n \frac{T_iY_i}{\hat{\tau}_i}-\frac{1}{n} \sum_{i=1}^n\frac{(1-T_i)Y_i}{1-\hat{\tau}_i}\, . \]

We give each subject a weight \(w_i\), where \(w_i=\hat{\tau}_i^{-1}\) for those in the treatment group and \(w_i=(1-\hat{\tau}_i)^{-1}\) for those in the control group. By weighting, each subject is replicated \(w_i\) times. This creates a psuedo-population in which \(T\) and \(X\) are not associated anymore (no confounding).

A more efficient estimator is

\[ \hat{\theta}_{\mathrm{IPW}_2}=\left(\sum_{i=1}^n\frac{T_i}{\hat{\tau}_i}\right)^{-1} \sum_{i=1}^n \frac{T_iY_i}{\hat{\tau}_i}-\left(\sum_{i=1}^n\frac{1-T_i}{1-\hat{\tau}_i}\right)^{-1} \sum_{i=1}^n\frac{(1-T_i)Y_i}{1-\hat{\tau}_i}\, . \]

7.2.4 Doubly Robust Estimation

Problem: If the propensity score model is incorrect, Matching, Stratfication and IPW estimators will be biased.

Solution: Combine IPW with the regression modeling approach to protect against model misspecification.

Note that we now have two sets of models:

  • The model for the propensity score;
  • The models for the potential outcomes:

\[ \begin{aligned} m_1(X;{\beta}_1)&={E}(Y_{1}\mid X)={E}(Y_{1}\mid X, T=1)\\ m_0(X;{\beta}_0)&=E(Y_{0}\mid X)={E}(Y_{0}\mid X, T=0) \end{aligned} \] The so-called doubly robust estimators are consistent if one of the two sets of models is correctly specified.

The doubly robust estimator for the average treatment effect is \[ \hat{\theta}_{\mathrm{DR}}=\frac{1}{n} \sum_{i=1}^{n} \frac{T_{i} Y_{1i}-\left(T_i-\hat{\tau}_i\right)\hat{m}_{1i}}{\hat{\tau}_{i}}-\frac{1}{n} \sum_{i=1}^{n} \frac{(1-T_{i}) Y_{0i}+\left(T_i-\hat{\tau}_i\right)\hat{m}_{0i}}{1-\hat{\tau}_{i}}\, . \]

We can see that this estimator consists of the inverse probability weighted estimator and a second term used for “augmenting.” So, the doubly robust estimators are also called the augmented inverse probability weighted (AIPW) estimators.