Notes on Causal Survival Forest 🛟🌲

In this post, I provide summary notes on the paper “Estimating Heterogeneous Treatment Effects with Right-Censored Data via Causal Survival Forests” by Cui et al. (2023).

Motivation

How to estimate heterogeneous treatment effects with right-censored data?
  1. Heterogeneous treatment effect (HTE) estimation plays a central role in data-driven personalization

  2. Existing methods often can’t handle censored survival outcomes, common in medical/business applications

Causal Survival Forests (CSF)

To address this challenge, the paper proposes causal survival forests (CSF)

  • An adaptation of the causal forest algorithm of Athey et al. (2019)

  • It adjusts for censoring using doubly robust estimating equations developed in the survival analysis literature

Advantages

  • Robust, computationally tractable, and outperforms the available baselines in the authors' experiments

  • Good statistical properties – UCAN

Statistical Setting

Assume i.i.d. tuples $\{X_i, T_i, C_i, W_i\}$, where

  • $X_i \in \mathcal{X}$ denotes the covariates
  • $T_i \in \mathbb{R}_+$ is the survival time of the $i$-th unit
  • $C_i \in \mathbb{R}_+$ is the censoring time (the time at which the $i$-th unit gets censored)
  • $W_i \in \{0, 1\}$ denotes the treatment assignment

Using the potential outcomes framework, we posit potential outcomes $\{T_i(1), T_i(0)\}$ such that $T_i = T_i(W_i)$. The goal is to estimate the conditional average treatment effect (CATE)

$$\tau(x) = E\left[y(T_i(1)) - y(T_i(0)) \mid X_i = x\right],$$

where $y(\cdot)$ is the outcome transformation, e.g.

  • $y(T) = T$

  • $y(T) = T \wedge h$ for the restricted mean survival time (RMST); here $h$ is some chosen maximum considered time

  • $y(T) = 1\{T \ge h\}$ for the survival probability
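The transformations above can be sketched in a few lines of NumPy. This is only an illustration: the function names `rmst_transform` and `survival_transform` are made up for this note and are not part of any library.

```python
import numpy as np

def rmst_transform(t, h):
    """y(T) = min(T, h): survival time truncated at the horizon h (RMST target)."""
    return np.minimum(np.asarray(t, dtype=float), h)

def survival_transform(t, h):
    """y(T) = 1{T >= h}: indicator of surviving up to the horizon h."""
    return (np.asarray(t, dtype=float) >= h).astype(float)

# Hypothetical survival times and horizon, chosen only for illustration.
t = np.array([0.5, 2.0, 3.0, 7.5])
h = 3.0
y_rmst = rmst_transform(t, h)
y_surv = survival_transform(t, h)
```

Both transformations are constant for survival times beyond $h$, which is what makes a finite horizon useful when follow-up is limited.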

To estimate $\tau(x)$, the main challenge is that $T_i$ is not always observable. We only observe:

  • censored survival time: $U_i = T_i \wedge C_i$

  • non-censoring indicator: $\Delta_i = 1\{T_i \le C_i\}$

Based on Assumption 1 (see later), we define the effective non-censoring indicator as follows:

$$\Delta_i^h = 1\{(T_i \wedge h) \le C_i\} = \Delta_i \vee 1\{U_i \ge h\} \tag{2}$$

Note that everything on the right-hand side of eqn (2) is observed. We can regard an observation with $\Delta_i^h = 1$ as a complete observation.
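As a quick numerical check of eqn (2), the sketch below simulates survival and censoring times and confirms that the effective non-censoring indicator can be computed from observed quantities only. The exponential distributions, the horizon, and the helper name `observe` are assumptions made for illustration.

```python
import numpy as np

def observe(t, c, h):
    """Return (U, Delta, Delta^h from latent T and C, Delta^h from observed data)."""
    t = np.asarray(t, dtype=float)
    c = np.asarray(c, dtype=float)
    u = np.minimum(t, c)                                   # censored survival time
    delta = (t <= c).astype(int)                           # non-censoring indicator
    delta_h_latent = (np.minimum(t, h) <= c).astype(int)   # definition, uses latent T
    delta_h_observed = np.maximum(delta, (u >= h).astype(int))  # uses only (U, Delta)
    return u, delta, delta_h_latent, delta_h_observed

rng = np.random.default_rng(0)
t = rng.exponential(scale=2.0, size=1000)   # assumed survival distribution
c = rng.exponential(scale=3.0, size=1000)   # assumed censoring distribution
u, delta, delta_h_latent, delta_h_observed = observe(t, c, h=1.5)
```

The two versions of the indicator agree on every unit, and the number of complete observations under the horizon is at least the number of uncensored ones.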

Assumptions

To identify treatment effects, we rely on two sets of assumptions.

  • Assumptions 2-4 enable us to identify the causal effect of $W_i$ on $T_i$ without censoring

  • Assumptions 5-6 guarantee that censoring due to $C_i$ does not break the identification results

Assumption 1 (Finite Horizon)

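$$y(t) = y(h), \quad \forall\, t \ge h, \quad 0 < h < \infty$$

Assumption 2 (Potential Outcomes)

$$\exists\, \{T_i(1), T_i(0)\} \ \text{s.t.} \ T_i = T_i(W_i) \ \text{a.s.}$$

Assumption 3 (Ignorability)

$$\{T_i(1), T_i(0)\} \perp W_i \mid X_i$$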

Assumption 4 (Overlap)

The propensity score $e(x) = P(W_i = 1 \mid X_i = x)$ is uniformly bounded away from 0 and 1,

$$\eta_e \le e(x) \le 1 - \eta_e, \quad 0 < \eta_e \le \tfrac{1}{2}$$

Assumption 5 (Ignorable censoring)

Censoring is independent of survival time conditionally on treatment and covariates,

$$T_i \perp C_i \mid W_i, X_i$$

Assumption 6 (Positivity)

$$P(C_i < h \mid W_i, X_i) \le 1 - \eta_c, \quad 0 < \eta_c \le 1$$

Causal Forests Without Censoring

How does causal forest work?

Essentially, we are running a “forest”-localized version of Robinson’s regression

$$\tau(x) := \mathrm{lm}\left(Y_i - \hat\mu^{(-i)}(X_i) \sim W_i - \hat e^{(-i)}(X_i),\ \text{weights} = \alpha_i(x)\right),$$

where the weights $\alpha_i(x)$ capture how “similar” a target sample $x$ is to each of the training samples $X_i$

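Using the notation of the previous section, we estimate $\tau(x)$ by solving the following equation,

$$\sum_{i=1}^{n} \alpha_i(x)\, \psi^{(c)}_{\hat\tau(x)}(X_i, y(T_i), W_i; \hat e, \hat m) = 0 \tag{cf}$$

where,

$$\psi^{(c)}_{\tau}(X_i, y(T_i), W_i; \hat e, \hat m) = \left[W_i - \hat e(X_i)\right] \times \left[y(T_i) - \hat m(X_i) - \tau\,(W_i - \hat e(X_i))\right]$$

is the orthogonal complete-data score function (denoted by the superscript $(c)$),

  • $e(x) = P(W_i = 1 \mid X_i = x)$

  • $m(x) = E[y(T_i) \mid X_i = x]$

  • $\hat e(X_i)$ and $\hat m(X_i)$ are estimates derived via cross-fitting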
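Because the complete-data score is linear in $\tau$, equation (cf) has a closed-form solution: a weighted residual-on-residual regression. The sketch below illustrates this on noise-free toy data. The Gaussian kernel weights are only a stand-in for forest weights $\alpha_i(x)$, and the function name `solve_complete_score` is made up for this note; this is not how grf computes its weights.

```python
import numpy as np

def solve_complete_score(alpha, y, w, e_hat, m_hat):
    """Solve sum_i alpha_i * (W_i - e_i) * (y_i - m_i - tau * (W_i - e_i)) = 0 for tau."""
    w_res = w - e_hat          # treatment residual
    y_res = y - m_hat          # outcome residual
    return np.sum(alpha * w_res * y_res) / np.sum(alpha * w_res ** 2)

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(-1.0, 1.0, size=n)
w = rng.binomial(1, 0.5, size=n).astype(float)
tau_true = 2.0                               # assumed constant treatment effect
e_hat = np.full(n, 0.5)                      # true propensity score in this toy setup
m_hat = x + tau_true * e_hat                 # E[y | X] when y = x + tau * w
y = x + tau_true * w                         # noise-free transformed outcome

# Assumption: kernel weights around x0 = 0 stand in for forest weights alpha_i(x0).
alpha = np.exp(-0.5 * (x / 0.3) ** 2)
alpha /= alpha.sum()

tau_hat = solve_complete_score(alpha, y, w, e_hat, m_hat)
```

With noise-free outcomes and exact nuisance functions, the solution recovers the assumed effect up to floating-point error, whatever the weights are.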

Adjusting for Censoring via Weighting

In the presence of censoring, $T_i$ in equation (cf) is no longer always observable.

Simply ignoring censoring and fitting models on only the complete observations (i.e. $\Delta_i^h = 1$) would lead to bias.

Simple Censoring Adjustment via IPCW

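Define the conditional survival function of the censoring process as

$$S_w^C(s \mid x) = P\left[C_i \ge s \mid W_i = w, X_i = x\right]$$

We then have

$$P\left[\Delta_i^h = 1 \mid X_i, W_i, T_i\right] = S^C_{W_i}(T_i \wedge h \mid X_i)$$

  • the LHS is the conditional probability of observing a complete observation (i.e. $\Delta_i^h = 1$)

  • the RHS is the conditional probability that the censoring time is at least the (truncated) survival time $T_i \wedge h$

  • Note that $P(\Delta_i^h = 1 \mid \cdot)$ plays a role similar to a propensity score: it is the probability of being "selected" as a complete observation

The main idea of IPCW estimation is to consider only complete cases, but to up-weight each complete observation by $1 / S^C_{W_i}(T_i \wedge h \mid X_i)$ to compensate for censoring.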

As a result, IPCW estimators succeed in eliminating censoring bias.
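A small simulation makes the bias correction concrete. This is a simplified sketch under stated assumptions: there are no covariates and no treatment, the survival and censoring times are independent exponentials, the target is the RMST $E[T \wedge h]$, and the true censoring survival function is used in place of an estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h = 200_000, 3.0
t = rng.exponential(scale=2.0, size=n)     # latent survival times (assumed Exp, mean 2)
c = rng.exponential(scale=3.0, size=n)     # censoring times (assumed Exp, mean 3)

u = np.minimum(t, c)                       # observed time
delta_h = np.minimum(t, h) <= c            # complete-observation indicator
y = np.minimum(u, h)                       # equals T ∧ h whenever delta_h is True

truth = 2.0 * (1.0 - np.exp(-h / 2.0))     # closed form of E[T ∧ h] for Exp with mean 2

# Naive: average over complete cases only; short survival times are over-represented.
naive = y[delta_h].mean()

# IPCW: weight each complete case by 1 / P[C >= T ∧ h], here the true exp(-s / 3).
s_c = np.exp(-y / 3.0)
ipcw = np.sum(delta_h * y / s_c) / n
```

In this setup the complete-case average falls well below the true RMST, while the IPCW average sits close to it. In practice the censoring survival function has to be estimated, which is where estimation error enters.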

With IPCW, we estimate $\tau(x)$ by solving the following equation,

$$\sum_{\{i:\, \Delta_i^h = 1\}} \frac{\alpha_i(x)}{\hat S^C_{W_i}(T_i \wedge h \mid X_i)}\, \psi^{(c)}_{\hat\tau(x)}(X_i, y(T_i), W_i; \hat e, \hat m) = 0 \tag{IPCW}$$

Let’s compare equations (cf) and (IPCW):

  • For eqn (cf), we sum over all observations; for eqn (IPCW), we only sum over complete observations

  • For eqn (IPCW), we add $1 / \hat S^C_{W_i}(T_i \wedge h \mid X_i)$ as part of the weight
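The two differences listed above translate directly into code. The sketch below solves equation (IPCW) in closed form on a tiny hand-made dataset. All numbers and the function name `solve_ipcw_score` are hypothetical, and the censoring survival values are simply given rather than estimated.

```python
import numpy as np

def solve_ipcw_score(alpha, y, w, e_hat, m_hat, delta_h, s_c_hat):
    """Solve the IPCW estimating equation for tau using complete observations only."""
    keep = delta_h == 1                          # sum runs over complete observations
    ipc_weight = alpha[keep] / s_c_hat[keep]     # forest weight times 1 / S^C
    w_res = w[keep] - e_hat[keep]
    y_res = y[keep] - m_hat[keep]
    return np.sum(ipc_weight * w_res * y_res) / np.sum(ipc_weight * w_res ** 2)

# Hypothetical toy data with a constant effect tau = 1 and no noise.
alpha = np.array([0.25, 0.25, 0.25, 0.25])
w = np.array([1.0, 0.0, 1.0, 0.0])
e_hat = np.full(4, 0.5)
m_hat = np.array([1.0, 1.0, 2.0, 2.0])
y = m_hat + 1.0 * (w - e_hat)                    # transformed outcome y(T_i)
delta_h = np.array([1, 1, 0, 1])                 # third unit is censored before T ∧ h
s_c_hat = np.array([0.8, 0.5, 0.9, 0.4])         # assumed values of S^C(T_i ∧ h | X_i)

tau_ipcw = solve_ipcw_score(alpha, y, w, e_hat, m_hat, delta_h, s_c_hat)
```

The censored unit never enters the computation, which is exactly the efficiency concern with complete-case weighting.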

For more details on IPCW, please check:

  • Chapters 8 and 12 in the textbook Causal Inference: What If (Hernán and Robins, 2020). In particular, “Ch 12.6 Censoring and missing data” is very helpful.

  • Chapter 21 “Treatment Heterogeneity with Survival Outcomes” in the textbook Handbook of Matching and Weighting Adjustments for Causal Inference (Zubizarreta et al., 2023)

A Doubly Robust Correction

Two limitations of IPCW approach:

  1. It only uses complete observations; it throws away all observations with $\Delta_i^h = 0$, which may hurt efficiency

  2. IPCW-type methods are generally not robust to estimation errors; Neyman orthogonality condition does not hold (Chernozhukov et al. 2018)

CSF Method

The CSF method does not rely on IPCW. Instead, it uses a more robust way of adjusting the estimating equation for censoring.

Recall that in the simplest case (without censoring) we solve

$$\sum_{i=1}^{n} \alpha_i(x)\, \psi^{(c)}_{\hat\tau(x)}(X_i, y(T_i), W_i; \hat e, \hat m) = 0 \tag{cf}$$

Now, we estimate $\tau(x)$ by solving the following equation,

$$\sum_{i=1}^{n} \alpha_i(x)\, \psi_{\hat\tau(x)}(X_i, y(U_i), U_i \wedge h, W_i, \Delta_i^h; \hat e, \hat m, \hat\lambda_w^C, \hat S_w^C, \hat Q_w) = 0,$$

where the score function is,

[Image in the original post: definition of the score function $\psi_\tau$; see Cui et al. (2023)]

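the conditional expectation of the transformed survival time is defined as:

$$Q_w(s \mid x) = E\left[y(T_i) \mid X_i = x, W_i = w, T_i \wedge h > s\right]$$

and the associated conditional hazard function is defined as:

$$\lambda_w^C(s \mid x) = -\frac{d}{ds} \log S_w^C(s \mid x)$$

$\hat Q_w(s \mid x)$, $\hat S_w^C(s \mid x)$ and $\hat\lambda_w^C(s \mid x)$ are cross-fit nuisance parameter estimates.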
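As a small worked step (a sketch, assuming $S_w^C(0 \mid x) = 1$ and that $S_w^C$ is differentiable in $s$), integrating the hazard definition recovers the censoring survival function, so the hazard and the survival function carry the same information about the censoring process:

```latex
\lambda_w^C(s \mid x) = -\frac{d}{ds} \log S_w^C(s \mid x)
\quad \Longrightarrow \quad
S_w^C(s \mid x) = \exp\left( -\int_0^s \lambda_w^C(u \mid x) \, du \right).
% Example: a constant hazard \lambda_w^C(s \mid x) = c gives S_w^C(s \mid x) = e^{-cs}.
```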

How to understand the above score function?

The short answer is that the functional form emerges from the math (i.e., the desire for a doubly robust adjustment); unlike the basic AIPW formula, it is not immediately intuitive.[1]

💡 KEY Points

We should think in terms of the Neyman-orthogonality property. In summary, CSF alleviates the drawbacks of IPCW by taking the (complete-data) causal forest estimating equation $\psi^{(c)}_{\tau(x)}(T, W, \ldots)$ (the “R-learner”) and turning it into a censoring-robust estimating equation $\psi_{\tau(x)}(Y, W, \ldots)$ using estimates of the survival and censoring processes [2] (“…” refers to additional nuisance parameters; $Y$ is the observed, possibly censored, time, written $U_i$ above):

  • censoring process: $P[C_i > t \mid X_i = x, W_i = w]$

  • survival process: $P[T_i > t \mid X_i = x, W_i = w]$

The upshot of this “orthogonal” estimating equation is that CSF is doubly robust: the estimator is consistent if either the survival process or the censoring process is correctly specified. This is very beneficial when we want to estimate these processes with modern ML tools, such as random survival forests.

For more details, Rubin & van der Laan (2007) and the chapter on RCTs with time-to-event data in Targeted Learning (2011) give a more digestible account of doubly robust estimation with survival data.[3]

References

Athey, Susan, Julie Tibshirani, and Stefan Wager. 2019. “Generalized Random Forests.” The Annals of Statistics 47 (2): 1148–78. https://doi.org/10.1214/18-AOS1709.

Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. 2018. “Double/Debiased Machine Learning for Treatment and Structural Parameters.” The Econometrics Journal 21 (1): C1–68. https://doi.org/10.1111/ectj.12097.

Cui, Yifan, Michael R Kosorok, Erik Sverdrup, Stefan Wager, and Ruoqing Zhu. 2023. “Estimating Heterogeneous Treatment Effects with Right-Censored Data via Causal Survival Forests.” Journal of the Royal Statistical Society Series B: Statistical Methodology 85 (2): 179–211. https://doi.org/10.1093/jrsssb/qkac001.

Hernán, Miguel A., and James M. Robins. 2020. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.

Zubizarreta, José R., Elizabeth A. Stuart, Dylan S. Small, and Paul R. Rosenbaum. 2023. Handbook of Matching and Weighting Adjustments for Causal Inference. CRC Press.


  1. This was suggested by Professor Wager in an email conversation.

  2. For more, see the grf tutorial “Causal forest with time-to-event data”.

  3. Suggested by Erik Sverdrup. Many thanks!

Chen Xing
Founder & Data Scientist
