A walkthrough of how Causal Forest 🌲 works

Introduction

In this post, I will go over how causal forest works, based on the tutorial in the grf R package. Causal forests offer a flexible, data-driven approach to estimating heterogeneous treatment effects, bridging machine learning and causal inference.

Common Setting

Suppose we are working with an observational study and have the following data:

  • Outcome variable: $Y_i$

  • Binary treatment indicator: $W_i \in \{0, 1\}$

  • A set of covariates: $X_i$

Let’s assume that the following conditions hold:

  1. (Assumption 1) $W_i$ is unconfounded given $X_i$ (i.e. treatment is as good as random given covariates).

$$\{Y_i(0), Y_i(1)\} \perp W_i | X_i$$

  2. (Assumption 2) The confounders $X_i$ have a linear effect on $Y_i$.
  3. (Assumption 3) The treatment effect $\tau$ is constant.

Then we could run a regression of the type

$$ Y_i = \tau W_i + \beta X_i + \epsilon_i $$

and interpret the estimate $\hat{\tau}$ as an estimate of the average treatment effect (ATE) $\tau = \mathbb{E}(Y_i(1) - Y_i(0))$.
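Here is a minimal R sketch of this setting (the data-generating process and variable names are illustrative assumptions, not from the grf tutorial): when both modeling assumptions hold, a plain `lm()` fit recovers the ATE.

```r
# Minimal simulation: linear confounding (Assumption 2) and a constant
# treatment effect tau = 2 (Assumption 3).
set.seed(1)
n <- 2000
X <- rnorm(n)
W <- rbinom(n, 1, plogis(X))    # treatment probability depends on X (confounding)
Y <- 2 * W + 3 * X + rnorm(n)   # linear outcome model with tau = 2

fit <- lm(Y ~ W + X)
coef(summary(fit))["W", ]       # estimate of tau (close to 2) with its standard error
```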

Relaxing Assumptions

  • Assumption 1 is an “identifying” assumption we have to live with.

  • Assumption 2 and Assumption 3 are modeling assumptions that we can question.

Relaxing Assumption 2: Partially Linear Model (PLR)

Assumption 2 is a strong parametric modeling assumption that requires the confounders to have a linear effect on the outcome, and it is one we should be able to relax by relying on semi-parametric statistics.

We can instead posit the partially linear model:

$$ Y_i = \tau W_i + f(X_i) + \epsilon_i, \ \ \ \mathbb{E}(\epsilon_i | X_i, W_i) = 0 $$

How do we get around estimating $\tau$ when we do not know $f(X_i)$?

Define the propensity score as $$e(x) = \mathbb{E}(W_i | X_i = x),$$

and the conditional mean of $Y$ as

$$m(x) = \mathbb{E}(Y_i | X_i = x) = f(x) + \tau e(x).$$

By Robinson (1988), we can rewrite the above equation in “centered” form:

$$Y_i - m(X_i) = \tau \cdot [W_i - e(X_i)] + \epsilon_i$$

This formulation has great practical appeal, as it means $\tau$ can be estimated by residual-on-residual regression.
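To make this concrete, here is a hedged R sketch of Robinson's residual-on-residual regression (the simulated data are an illustrative assumption; the nuisance estimates come from grf regression forests, though any flexible learner would do).

```r
# Partially linear model with a nonlinear f(X) and constant tau = 2.
library(grf)

set.seed(1)
n <- 2000
X <- matrix(rnorm(n * 5), n, 5)
W <- rbinom(n, 1, plogis(X[, 1]))                      # e(x) depends on X1
Y <- 2 * W + pmax(X[, 1], 0) + abs(X[, 2]) + rnorm(n)  # nonlinear f(X), tau = 2

# Nuisance estimates m.hat ~ E[Y | X] and e.hat ~ E[W | X].
m.hat <- predict(regression_forest(X, Y))$predictions
e.hat <- predict(regression_forest(X, W))$predictions

# Residual-on-residual regression: the slope estimates tau.
summary(lm(I(Y - m.hat) ~ I(W - e.hat)))$coefficients
```

Note that grf's predictions on the training sample are out-of-bag by default, so observation $i$'s own outcome is not used when forming $\hat{m}(X_i)$ and $\hat{e}(X_i)$; this already anticipates the cross-fitting point discussed below.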

Good properties 😀: Robinson (1988) shows that this approach yields root-n consistent estimates of $\tau$, even if the estimates of $m(x)$ and $e(x)$ converge at a slower rate (fourth-root, in particular). This property is often referred to as orthogonality; it is a desirable property that essentially tells you that, given noisy “nuisance” estimates ($\hat{m}(x)$ and $\hat{e}(x)$), you can still recover good estimates of your target parameter ($\tau$). For more details, please refer to Lecture 3 of Stefan Wager's “STATS 361: Causal Inference”.

But how to estimate $m(x)$ and $e(x)$?

  • Use modern machine learning models! One could use boosting, random forests, etc., to estimate $m(x)$ and $e(x)$, because all we need are “reasonably accurate” predictions, i.e.

$$\mathbb{E}\left[(\hat{m}(X)-m(X))^2\right]^{\frac{1}{2}}, \mathbb{E}\left[(\hat{e}(X)-e(X))^2\right]^{\frac{1}{2}}=o_P\left(\frac{1}{n^{1 / 4}}\right)$$

  • Issue with Direct Plug-in of Estimates: Directly plugging in $\hat{m}(x)$ and $\hat{e}(x)$ into the residual-on-residual regression typically leads to bias because modern ML methods regularize to trade off bias and variance.

  • Solution via Cross-Fitting: Cross-fitting, where the prediction for observation $i$ is obtained without using unit $i$ for estimation, can help overcome this bias (Chernozhukov et al. 2018); a sketch follows below.
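A minimal two-fold cross-fitting sketch (the function, fold scheme, and learners below are illustrative assumptions; Chernozhukov et al. 2018 describe the general K-fold recipe):

```r
# Two-fold cross-fitting: nuisance models are trained on one fold and used to
# form residuals on the other fold, so no prediction uses its own outcome.
library(grf)

cross_fit_tau <- function(X, Y, W) {
  n <- nrow(X)
  fold <- sample(rep(1:2, length.out = n))
  m.hat <- numeric(n)
  e.hat <- numeric(n)
  for (k in 1:2) {
    train <- which(fold != k)
    test  <- which(fold == k)
    m.hat[test] <- predict(regression_forest(X[train, , drop = FALSE], Y[train]),
                           X[test, , drop = FALSE])$predictions
    e.hat[test] <- predict(regression_forest(X[train, , drop = FALSE], W[train]),
                           X[test, , drop = FALSE])$predictions
  }
  # Residual-on-residual regression on the cross-fitted residuals.
  coef(lm(I(Y - m.hat) ~ I(W - e.hat)))[2]
}
```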

Recap: We have a way to use the modern ML toolkit to non-parametrically control for confounding when estimating an ATE, while still retaining desirable statistical properties such as unbiasedness and consistency.

Relaxing Assumption 3: Non-constant treatment effects

Non-constant treatment effects occur when the impact of a treatment varies across different subgroups or individuals. This concept relaxes the assumption of homogeneous treatment effects, where the treatment is assumed to have the same impact on all units.

We could specify certain subgroups, run a separate regression for each, and obtain a different estimate of $\tau$ per subgroup. To avoid false discoveries, this approach would require us to specify the potential subgroups without looking at the data. How can we instead use the data to inform us of potential subgroups?

Let’s define,

$$ Y_i=\textcolor{red}{ \tau\left(X_i\right) } W_i+f\left(X_i\right)+\epsilon_i, \quad \mathbb{E}\left[\epsilon_i \mid X_i, W_i\right]=0, $$

where $\textcolor{red}{ \tau\left(X_i\right) }$ is the conditional average treatment effect (CATE), i.e., $\textcolor{red}{ \tau\left(X_i\right) } :=\mathbb{E}\left[Y_i(1)-Y_i(0) \mid X_i\right]$. How do we estimate this?

Idea: If we imagine we had access to some neighborhood $\mathcal{N}(x)$ where $\tau$ was constant, we could proceed exactly as before, by doing a residual-on-residual regression on the samples belonging to $\mathcal{N}(x)$, i.e.:

$$ \tau(x) := \operatorname{lm}\left(Y_i - \hat{m}^{(-i)}(X_i) \sim W_i - \hat{e}^{(-i)}(X_i), \text{ weights } = 1\{X_i \in \mathcal{N}(x)\} \right) $$

This is conceptually what Causal Forest does: it estimates the treatment effect $\tau(x)$ for a target sample $X_i = x$ by running a weighted residual-on-residual regression, placing more weight on training samples that are similar to $x$ in the way that matters for treatment effect heterogeneity.
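Here is a hedged sketch of that oracle-neighborhood idea with a single covariate (the data-generating process and the fixed-radius neighborhood are illustrative assumptions; the forest will later replace this hand-picked neighborhood with data-driven weights):

```r
# Heterogeneous effect: tau(x) = 0 for x < 0 and tau(x) = 2 for x > 0.
library(grf)

set.seed(1)
n <- 4000
X <- matrix(runif(n, -1, 1), n, 1)
W <- rbinom(n, 1, 0.5)
Y <- 2 * (X[, 1] > 0) * W + cos(3 * X[, 1]) + rnorm(n)

m.hat <- predict(regression_forest(X, Y))$predictions
e.hat <- predict(regression_forest(X, W))$predictions

# Residual-on-residual regression restricted (via 0/1 weights) to a small
# neighborhood around the target point x0 = 0.5, where tau(x0) = 2.
x0 <- 0.5
in.nbhd <- as.numeric(abs(X[, 1] - x0) < 0.1)
coef(lm(I(Y - m.hat) ~ I(W - e.hat), weights = in.nbhd))[2]
```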

Recap: Causal Forest runs a “forest”-localized version of Robinson's regression.

These weights play a crucial role, so how does grf 📦 find them?

Random forest as an adaptive neighborhood finder

Breiman's random forest for predicting the conditional mean $\mu(x) = \mathbb{E}(Y_i | X_i = x)$ can be briefly summarized in two steps:

  1. Building phase: Build $B$ trees that greedily place covariate splits maximizing the squared difference in subgroup means

$$ n_L \cdot n_R \cdot ( \bar{y}_L - \bar{y}_R )^{2} $$

  2. Prediction phase: Aggregate each tree's prediction to form the final point estimate by averaging the outcomes $Y_i$ of training samples that fall into the same terminal leaf $L_b(x)$ as the target sample $x$: $$\begin{align} \hat{\mu}(x) &= \frac{1}{B} \sum_{b=1}^B \sum_{i=1}^n Y_i \frac{1\left\{X_i \in L_b(x)\right\}}{\left|L_b(x)\right|} \tag{1} \\ & =\sum_{i=1}^n Y_i \textcolor{blue}{\alpha_i(x)} \tag{2}. \end{align}$$ Note that this procedure is a double summation, first over trees, then over training samples (see equation (1)). We can swap the order of summation to obtain $\textcolor{blue}{\alpha_i(x)} = \frac{1}{B} \sum_{b=1}^B \frac{1\left\{X_i \in L_b(x)\right\}}{\left|L_b(x)\right|}$ in equation (2): the (leaf-size-adjusted) frequency with which the $i$-th training sample falls into the same leaf as $x$.

The following image illustrates how the $\textcolor{blue}{\alpha_i(x)}$ are calculated: some dots are larger because those training samples share a leaf with the target sample in many trees, while others are smaller because they do so in only a few trees.

Figure 1: Weights from Causal Forests
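A toy numeric illustration of the swap between equations (1) and (2) (the tiny “forest” below is entirely made up for illustration):

```r
# Three toy trees over five training samples: leaf_mates[[b]] lists which
# training samples share the target point x's terminal leaf in tree b.
leaf_mates <- list(c(1, 2), c(2, 3, 4), c(2, 5))
n <- 5
B <- length(leaf_mates)

alpha <- numeric(n)
for (b in 1:B) {
  ids <- leaf_mates[[b]]
  alpha[ids] <- alpha[ids] + 1 / (B * length(ids))  # contribution of tree b to alpha_i(x)
}

alpha       # sample 2 shares x's leaf in every tree, so it gets the largest weight
sum(alpha)  # the weights sum to 1
```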

Causal Forest

Causal Forest essentially combines Breiman (2001) and Robinson (1988) by modifying the steps above to:

  1. Building phase: Greedily place covariate splits that maximize the squared difference in subgroup treatment effects $$n_L \cdot n_R \cdot ( \hat{\tau}_L - \hat{\tau}_R )^{2},$$ where $\hat{\tau}$ is obtained by running Robinson's residual-on-residual regression for each possible split point.

  2. Prediction phase: Use the resulting forest weights $\textcolor{blue}{\alpha_i(x)}$ to estimate $$\tau(x):=\operatorname{lm}\left(Y_i-\hat{m}^{(-i)}\left(X_i\right) \sim W_i-\hat{e}^{(-i)}\left(X_i\right), \text { weights }=\textcolor{blue}{\alpha_i(x)}\right),$$ where $\textcolor{blue}{\alpha_i(x)}$ capture how “similar” a target sample $x$ is to each of the training samples $X_i$.

That is, Causal Forest is running a “forest”-localized version of Robinson's regression. This adaptive weighting (instead of leaf-averaging), coupled with some other forest construction details known as “honesty” and “subsampling”, can be used to give asymptotic guarantees for estimation and inference with random forests (Wager & Athey, 2018).
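Putting this together with the grf package, here is a hedged end-to-end sketch (the simulated data-generating process is an illustrative assumption):

```r
# Fit a causal forest and predict CATEs at new covariate values. grf estimates
# the nuisance functions internally (out-of-bag) before running the
# forest-weighted residual-on-residual regression.
library(grf)

set.seed(1)
n <- 4000
p <- 5
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, plogis(X[, 1]))   # confounded treatment assignment
tau <- pmax(X[, 2], 0)              # heterogeneous treatment effect
Y <- tau * W + X[, 1] + rnorm(n)

cf <- causal_forest(X, Y, W)

# Evaluate tau(x) along X2 while holding the other covariates at 0.
X.test <- matrix(0, 11, p)
X.test[, 2] <- seq(-2, 2, length.out = 11)
tau.hat <- predict(cf, X.test)$predictions
cbind(x2 = X.test[, 2], tau.hat = tau.hat)   # should roughly track pmax(x2, 0)
```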

Efficiently estimating summaries of the CATEs

What about estimating summaries of $\tau(x)$, such as the average treatment effect (ATE) or the best linear projection (BLP), that come with a guaranteed $\sqrt{n}$ rate of convergence and asymptotically valid confidence intervals?

For estimating ATE, there are more efficient methods than simply averaging individual CATE estimates. Robins, Rotnitzky & Zhao (1994) showed that the so-called Augmented Inverse Probability Weighted (AIPW) estimator is asymptotically optimal for $\tau$ (meaning that among all non-parametric estimators, it has the lowest variance).

$$\begin{gathered} \hat{\tau}_{AIPW}=\frac{1}{n} \sum_{i=1}^n\left(\hat{\mu}_{(1)}\left(X_i\right)-\hat{\mu}_{(0)}\left(X_i\right)+\frac{W_i}{\hat{e}\left(X_i\right)}\left(Y_i-\hat{\mu}_{(1)}\left(X_i\right)\right)\right. \\ \left.-\frac{1-W_i}{1-\hat{e}\left(X_i\right)}\left(Y_i-\hat{\mu}_{(0)}\left(X_i\right)\right)\right), \end{gathered}$$

where $$\mu_{(w)}(x)=\mathbb{E}\left[Y_i \mid X_i=x, W_i=w\right]$$ and $$e(x)=\mathbb{P}\left[W_i=1 \mid X_i=x\right]$$
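In grf, doubly robust summaries of this form are available directly from a fitted causal forest; a hedged sketch (reusing a simulated design of the same shape as before):

```r
# ATE and best-linear-projection summaries from a fitted causal forest.
library(grf)

set.seed(1)
n <- 4000
p <- 5
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, plogis(X[, 1]))
Y <- pmax(X[, 2], 0) * W + X[, 1] + rnorm(n)

cf <- causal_forest(X, Y, W)
average_treatment_effect(cf)           # doubly robust ATE estimate with standard error
best_linear_projection(cf, X[, 1:2])   # best linear projection of tau(X) onto X1 and X2
```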

To interpret the AIPW estimator $\hat{\tau}_{AIPW}$, it is helpful to decompose it into two components: let $\hat{\tau}_{AIPW} = A + B$, where

$$ A=\frac{1}{n} \sum_{i=1}^n\left(\hat{\mu}_{(1)}\left(X_i\right)-\hat{\mu}_{(0)}\left(X_i\right)\right) $$ $$ B=\frac{1}{n} \sum_{i=1}^n\left(\frac{W_i}{\hat{e}\left(X_i\right)}\left(Y_i-\hat{\mu}_{(1)}\left(X_i\right)\right)-\frac{1-W_i}{1-\hat{e}\left(X_i\right)}\left(Y_i-\hat{\mu}_{(0)}\left(X_i\right)\right)\right) $$

  • $A$ represents the outcome regression adjustment estimator using $\hat{\mu}_{(w)}$

  • $B$ is an inverse propensity score weighting (IPW) estimator applied to the residuals $Y_i-\hat{\mu}_{\left(W_i\right)}\left(X_i\right)$

  • The AIPW estimator uses propensity score weighting on the residuals to debias the direct regression-adjustment estimate $A$

  • One key property of the AIPW estimator is its “double robustness”: the estimator remains consistent if either the outcome model or the propensity score model is correctly specified, even when the other is misspecified. For a proof, please refer to Stefan Wager's Lecture 3 notes in “STATS 361: Causal Inference”.
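For intuition, here is a hedged by-hand version of the $A + B$ decomposition (the nuisance models and data are illustrative assumptions, and cross-fitting details are glossed over; in practice one would rely on grf's average_treatment_effect()):

```r
# Hand-rolled AIPW estimate mirroring the A + B decomposition above.
library(grf)

set.seed(1)
n <- 4000
p <- 5
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, plogis(X[, 1]))
Y <- pmax(X[, 2], 0) * W + X[, 1] + rnorm(n)

e.hat   <- predict(regression_forest(X, W))$predictions                       # e(x)
mu1.hat <- predict(regression_forest(X[W == 1, ], Y[W == 1]), X)$predictions  # mu_(1)(x)
mu0.hat <- predict(regression_forest(X[W == 0, ], Y[W == 0]), X)$predictions  # mu_(0)(x)

A <- mean(mu1.hat - mu0.hat)                       # regression adjustment term
B <- mean(W / e.hat * (Y - mu1.hat) -
          (1 - W) / (1 - e.hat) * (Y - mu0.hat))   # IPW applied to the residuals
A + B                                              # AIPW estimate of the ATE
```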

References

  • Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” The Econometrics Journal 21 (1): C1–68. https://doi.org/10.1111/ectj.12097.

  • Athey, Susan, Julie Tibshirani, and Stefan Wager. 2019. “Generalized Random Forests.” The Annals of Statistics 47 (2): 1148–78. https://doi.org/10.1214/18-AOS1709.

  • Robinson, Peter M. 1988. “Root-N-Consistent Semiparametric Regression.” Econometrica 56 (4): 931–54.

  • Wager, Stefan, and Susan Athey. 2018. “Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests.” Journal of the American Statistical Association 113 (523): 1228–42.

  • Wager, Stefan. 2022. “STATS 361: Causal Inference.” Lecture notes, Stanford University. https://web.stanford.edu/~swager/stats361.pdf.
