Study Notes on Bounding OVB 🌀 in Causal ML

Motivation

In empirical research, one of the challenges to causal inference is the potential presence of unobserved confounding. Even when we adjust for a wide range of observed covariates, there’s often a lingering concern: what if there are important variables we’ve failed to measure or include? This challenge necessitates a careful approach to sensitivity analysis, where we assess how strong unobserved confounders would need to be to meaningfully alter our conclusions.

[Image: the Mafūba, a technique for sealing demons away by drawing them into a container, which is then secured with a special "Demon Seal" ofuda. Source: Dragon Ball Fandom]

Questions¹

  1. How “strong” would a particular confounder (or group of confounders) have to be to change the conclusions of a study?

  2. In a worst-case scenario, how vulnerable is the study’s result to many or all unobserved confounders acting together, possibly nonlinearly?

  3. Are these confounders or scenarios plausible? How strong would they have to be, relative to observed covariates (e.g., female), to be problematic?

  4. How can we present these sensitivity results concisely for easy routine reporting?

TL;DR

The “Long Story Short” paper (Chernozhukov et al., 2024) introduces a framework for bounding omitted variable bias (OVB).

They develop a general theory applicable to a broad class of causal parameters that can be expressed as linear functionals of the conditional expectation function.

The key insight is to characterize the bias using the Riesz representation. Specifically, they express the OVB as the covariance of two error terms:

  1. The error in the outcome regression

  2. The error in the Riesz representer (RR)

This formulation leads to a bound on the squared bias that is the product of two terms:

  1. The MSE of the outcome regression
  2. The MSE of the RR
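
In symbols, with the short and long objects $g_s, g, \alpha_s, \alpha$ defined in the Theoretical Details section below, the decomposition and bound read:

$$ \theta_s - \theta = -\,\mathbb{E}\big[(g_s - g)(\alpha_s - \alpha)\big], \qquad |\theta_s - \theta|^2 \le \underbrace{\mathbb{E}\big[(g_s - g)^2\big]}_{\text{MSE of outcome regression}} \; \underbrace{\mathbb{E}\big[(\alpha_s - \alpha)^2\big]}_{\text{MSE of RR}} $$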

How to understand the Riesz representer (RR)?

  • We can derive it using the Frisch–Waugh–Lovell (FWL) theorem

  • In a special case, i.e. the average treatment effect (ATE) with binary treatment, the Riesz representer (RR) “looks like” the weights in Inverse Propensity Score weighting (IPW)

Theoretical Details

Recall the classical omitted-variable-bias result for linear regression:

“Short equals long plus the effect of omitted times the regression of omitted on included.” – Angrist and Pischke, Mostly Harmless Econometrics.

  1. In the ideal case:

    • “long” regression: $Y = \theta D + f(X, A) + \epsilon$

    • Define $W := (D, X, A)$ as the “long” list of regressors

    • $A$ is unobserved vector of covariates

    • Assume $\mathbb{E}(\epsilon \mid D, X, A) = 0$

  2. In the practical case:

    • “short” regression: $Y = \theta_sD + f_s(X) + \epsilon_s$
    • Define $W_s := (D,X)$ as the “short” list of observed regressors because $A$ is unmeasured/unobservable
    • Use $\theta_s$ to approximate $\theta$; need to bound $\theta_s - \theta$
  3. Conditional expectation functions (CEFs):

    • “long” CEF: $g(W) := \mathbb{E}(Y \mid D, X, A) = \theta D + f(X, A)$

    • “short” CEF: $g_s(W_s) := \theta_s D + f_s(X)$

  4. Riesz representers (RR):

    • “long” RR: $$ \alpha := \alpha(W) := \frac{D-\mathbb{E}[D \mid X, A]}{\mathbb{E}\left[(D-\mathbb{E}[D \mid X, A])^2\right]} $$

      By the FWL theorem, writing $\tilde{D} := D - \mathbb{E}[D \mid X, A]$, we have $$ \theta = \frac{\mathrm{Cov}(Y, \tilde{D})}{\mathrm{Var}(\tilde{D})} = \mathbb{E}(Y\alpha) $$

    • “short” RR: $$ \alpha_s := \alpha_s\left(W_s\right):=\frac{D-\mathbb{E}[D \mid X]}{\mathbb{E}\left[(D-\mathbb{E}[D \mid X])^2\right]} $$

    • In the special case of a binary treatment (the ATE), one can show: $$ \alpha(W)=\frac{D}{P(D=1 \mid X, A)}-\frac{1-D}{P(D=0 \mid X, A)}, $$

    $$ \alpha_s(W)=\frac{D}{P(D=1 \mid X)}-\frac{1-D}{P(D=0 \mid X)}, $$

    The RR “looks like” the weights in Inverse Propensity Score weighting (IPW). For example,

    $$ \begin{aligned} \alpha_s(W) &= \frac{D}{P(D=1 \mid X)}-\frac{1-D}{P(D=0 \mid X)} \\& = \frac{D}{e(X)} - \frac{1-D}{1-e(X)}, \ \, \text{where } e(X) = P(D=1 \mid X) \end{aligned} $$

    Recalling the identification property of the IPW estimator, we have:

    $$ \theta_s \overset{ipw}{=} \mathbb{E}\left[\frac{YD}{e(X)} - \frac{Y(1-D)}{1-e(X)}\right] = \mathbb{E}(Y\alpha_s) $$
  5. Note that, $\theta = \mathbb{E}(g \alpha)$. Why? $$ \begin{aligned} \mathbb{E}(g \alpha) & = \mathbb{E}\left\{ \mathbb{E}(Y \mid D, X, A) \frac{D-\mathrm{E}[D \mid X, A]}{\mathrm{E}(D-\mathrm{E}[D \mid X, A])^2} \right\} \\ &= \mathbb{E}\left\{ [\theta D + f(X, A)] \frac{D-\mathrm{E}[D \mid X, A]}{\mathrm{E}(D-\mathrm{E}[D \mid X, A])^2} \right\} \\ &= \theta \end{aligned} $$

    For the last equality, note that the residual $D-\mathrm{E}[D \mid X, A]$ is uncorrelated with any function of $(X, A)$, in particular with $\mathrm{E}[D \mid X, A]$ and $f(X, A)$; hence $\mathrm{E}\big[D\,(D-\mathrm{E}[D \mid X, A])\big] = \mathrm{E}\big[(D-\mathrm{E}[D \mid X, A])^2\big]$, which cancels the denominator.

We have $\theta = \mathbb{E}(Y\alpha) = \mathbb{E}(g \alpha)$ and $\theta_s = \mathbb{E}(Y\alpha_s) = \mathbb{E}(g_s \alpha_s)$.
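
To make these identities concrete, here is a minimal simulation sketch for the binary-treatment (ATE) case. The probit data-generating process, coefficients, and variable names are illustrative assumptions (not from the paper); the point is only that averaging $Y$ against the long IPW-style weights recovers the causal parameter, while the short weights recover a confounded one.

```python
# Minimal check that theta = E[Y * alpha] and theta_s = E[Y * alpha_s] in the
# binary-treatment (ATE) case, where the RRs are IPW-type weights.
# The probit data-generating process below is an illustrative assumption.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 500_000

X = rng.normal(size=n)                      # observed covariate
A = rng.normal(size=n)                      # unobserved confounder (visible only to the simulator)
D = (0.4 * X + 0.5 * A + rng.normal(size=n) > 0).astype(float)
Y = 2.0 * D + 1.0 * X + 1.5 * A + rng.normal(size=n)    # true ATE = 2

e_long = norm.cdf(0.4 * X + 0.5 * A)                 # P(D = 1 | X, A)
e_short = norm.cdf(0.4 * X / np.sqrt(1.0 + 0.5**2))  # P(D = 1 | X), A integrated out

alpha = D / e_long - (1 - D) / (1 - e_long)          # long RR = IPW weights (long propensity)
alpha_s = D / e_short - (1 - D) / (1 - e_short)      # short RR = IPW weights (short propensity)

print("theta   = E[Y*alpha]   ~", round(float(np.mean(Y * alpha)), 3))    # ~ 2.0
print("theta_s = E[Y*alpha_s] ~", round(float(np.mean(Y * alpha_s)), 3))  # > 2, confounded
```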

Key Results

[Figure: theorem from the paper stating the OVB decomposition and the bias bound]
  1. OVB equals $-\,\mathbb{E}(\text{regression error} \cdot \text{RR error})$, i.e., $\theta_s - \theta = -\,\mathbb{E}[(g_s - g)(\alpha_s - \alpha)]$

  2. Squared bias is bounded by $\text{MSE(regression)} \cdot \text{MSE(RR)}$, i.e., $|\theta_s - \theta|^2 \le \mathbb{E}[(g_s - g)^2]\,\mathbb{E}[(\alpha_s - \alpha)^2]$

Idea of Proof:

  1. Write $\theta = \mathbb{E}(g \alpha)$ and $\theta_s = \mathbb{E}(g_s \alpha_s)$

  2. Use the orthogonality (projection) facts $\mathbb{E}[\alpha_s\,(g-g_s)] = 0$ and $\mathbb{E}[g_s\,(\alpha - \alpha_s)] = 0$, which hold because $g_s = \mathbb{E}[g \mid W_s]$ and $\alpha_s = \mathbb{E}[\alpha \mid W_s]$

Now we know the omitted variable bias, $\theta_s - \theta = -\,\mathbb{E}[(g_s - g)(\alpha_s - \alpha)]$, is (minus) the covariance between the regression error, $g_s - g = \mathbb{E}(Y \mid W_s) - \mathbb{E}(Y \mid W)$, and the RR error, $\alpha_s - \alpha$, that is, the “weights in IPW” calculated using the short list of regressors minus the “weights in IPW” using the long list of regressors.
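
A quick numerical illustration of this identity and of the MSE-product bound, in a partially linear model with one unobserved confounder $A$. The data-generating process and the closed-form expressions for $g, g_s, \alpha, \alpha_s$ below are illustrative assumptions made for this sketch, chosen so that everything can be written down exactly.

```python
# Numerical check of the OVB identity theta_s - theta = -E[(g_s - g)(alpha_s - alpha)]
# and of the Cauchy-Schwarz bound, in a linear-Gaussian partially linear model.
# DGP: D = 0.5*X + 0.8*A + u,  Y = 2*D + X + 1.5*A + eps,  all shocks N(0, 1).
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
X, A, u, eps = rng.normal(size=(4, n))

D = 0.5 * X + 0.8 * A + u
Y = 2.0 * D + 1.0 * X + 1.5 * A + eps                    # true theta = 2

# Long objects (condition on X and A); closed forms because the DGP is known.
g = 2.0 * D + 1.0 * X + 1.5 * A                          # E[Y | D, X, A]
alpha = D - (0.5 * X + 0.8 * A)                          # residual variance of D is 1

# Short objects (condition on X only); A is integrated out analytically.
var_s = 0.8**2 + 1.0                                     # Var(D - E[D | X]) = 1.64
alpha_s = (D - 0.5 * X) / var_s
g_s = 2.0 * D + 1.0 * X + 1.5 * (0.8 / var_s) * (D - 0.5 * X)   # E[Y | D, X]

theta = np.mean(Y * alpha)                               # ~ 2.00
theta_s = np.mean(Y * alpha_s)                           # ~ 2.73, confounded
ovb = theta_s - theta
cov_err = -np.mean((g_s - g) * (alpha_s - alpha))        # identity: ~ ovb
bound = np.sqrt(np.mean((g_s - g) ** 2) * np.mean((alpha_s - alpha) ** 2))

print(f"theta_s - theta          = {ovb:.4f}")
print(f"-E[(g_s - g)(a_s - a)]   = {cov_err:.4f}")       # matches the OVB
print(f"sqrt(MSE_g * MSE_alpha)  = {bound:.4f}")         # population bound; essentially tight here
```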

As you can see, this covariance of errors is not easy to interpret directly, so the following corollary provides a more intuitive reading in terms of the familiar $R^2$.

[Figure: corollary from the paper restating the bias bound in terms of $S$, $C_Y$, and $C_D$]
  1. $S$ is the scale of the bias and is identified from the data; the confounding-strength parameters $C_Y$ and $C_D$ must be restricted by the analyst.

  2. $C_{Y}^2$ measures the proportion of residual variation of the outcome explained by latent confounders; in short, the strength of unmeasured confounding in the outcome equation

  3. $C_{D}^2$ measures the proportion of residual variation of the treatment explained by latent confounders; in short, the strength of unmeasured confounding in the treatment equation
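
To see where a bound of this form can come from, here is a sketch assuming the definitions $C_Y^2 := \mathbb{E}[(g_s - g)^2]/\mathbb{E}[(Y - g_s)^2]$, $C_D^2 := 1 - \mathbb{E}[\alpha_s^2]/\mathbb{E}[\alpha^2]$, and $S^2 := \mathbb{E}[(Y - g_s)^2]\,\mathbb{E}[\alpha_s^2]$ (check the paper for its exact parametrization):

$$ |\theta_s - \theta|^2 \;\le\; \mathbb{E}[(g_s - g)^2]\;\mathbb{E}[(\alpha_s - \alpha)^2] \;=\; \Big(C_Y^2\,\mathbb{E}[(Y - g_s)^2]\Big)\Big(\mathbb{E}[\alpha_s^2]\,\tfrac{C_D^2}{1 - C_D^2}\Big) \;=\; S^2\, C_Y^2\, \frac{C_D^2}{1 - C_D^2} $$

The middle step uses $\mathbb{E}[(\alpha_s - \alpha)^2] = \mathbb{E}[\alpha^2] - \mathbb{E}[\alpha_s^2]$, which holds because $\alpha_s$ is the projection of $\alpha$ onto functions of $W_s$. Note that $S^2$ involves only short (observable) objects, which is why $S$ is identified from the data, while $C_Y$ and $C_D$ involve the unobservable long objects and must be supplied by the analyst.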

[Figure: slide from Victor Chernozhukov's tutorial at the Chamberlain Seminar]

Empirical Challenge

How do we determine the plausible strength of these unobserved confounders? In other words, how should we set $C_{Y}^2$ and $C_{D}^2$?

This is a critical question, as setting these values arbitrarily could lead to overly conservative or overly optimistic interpretations of our results. We need a principled approach to guide our choices.

Benchmarking

One solution is benchmarking. This approach leverages the observed data to inform our judgments about unobserved confounders. The basic idea is the following: we can use the impact of observed covariates as a reference point for the potential impact of unobserved ones. For instance, if we’ve measured income, arguably the most important observed factor, and found that it explains 15% of the variation in our outcome, we might reason that an unobserved confounder is unlikely to have an even larger effect.

In practice, benchmarking can take several forms. We might purposely omit a known important covariate, refit our model, and observe the change. This gives us a concrete example of how omitting an important variable affects our estimates. Alternatively, we could express the strength of unobserved confounders relative to observed ones. For example, we might consider scenarios where an unobserved confounder is as strong as income, or perhaps 25% as strong.

Recent research (Chernozhukov et al., 2024) has demonstrated the utility of this approach. In a study on 401(k) eligibility, they used the observed impact of income, IRA participation, and two-earner status as benchmarks for potential unobserved firm characteristics. Similarly, in a study on gasoline demand, the known impact of income brackets informed the choice of sensitivity parameters for potential remnant income effects. This benchmarking workflow is also available off the shelf, as sketched below.
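
For routine reporting, this whole pipeline (estimation, bias bounds, and benchmarking) is implemented in the DoubleML Python package. The sketch below assumes a recent version of the package (its sensitivity interface appeared around version 0.7); the dataset generator and method names follow its documentation, but exact signatures and defaults should be verified against your installed version.

```python
# Sketch: OVB-style sensitivity analysis and benchmarking with the DoubleML package.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from doubleml import DoubleMLPLR
from doubleml.datasets import make_plr_CCDDHNR2018

np.random.seed(3141)
dml_data = make_plr_CCDDHNR2018(alpha=0.5, n_obs=1000, dim_x=20)  # simulated PLR data

ml_l = RandomForestRegressor(n_estimators=200)   # learner for E[Y | X]
ml_m = RandomForestRegressor(n_estimators=200)   # learner for E[D | X]
dml_plr = DoubleMLPLR(dml_data, ml_l=ml_l, ml_m=ml_m)
dml_plr.fit()

# Bound the bias under a hypothesized confounding scenario; cf_y and cf_d play
# the role of C_Y^2 and C_D^2 from the corollary above.
dml_plr.sensitivity_analysis(cf_y=0.03, cf_d=0.03, rho=1.0)
print(dml_plr.sensitivity_summary)

# Benchmarking: treat X1 as if it were unobserved to calibrate plausible
# (cf_y, cf_d) values for a confounder "as strong as X1".
print(dml_plr.sensitivity_benchmark(benchmarking_set=["X1"]))
```

The sensitivity summary then reports bounds on the estimate under the chosen scenario, which is exactly the kind of concise, routine reporting asked for in Question 4 above.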

Robustness Value

Another useful concept is the Robustness Value (RV), which represents the minimum strength of confounding required to change a study’s conclusions. This provides a clear threshold for evaluating the robustness of results. For instance, if a study finds that confounders would need to explain more than 5% of residual variation to alter the conclusions, researchers can debate whether such strong confounding is plausible in their context.
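
One way to formalize this, building on the (sketched) bound above and assuming equal confounding strength in both equations, $C_Y^2 = C_D^2 = RV$: the point-estimate robustness value is the smallest $RV$ solving

$$ S\,\frac{RV}{\sqrt{1 - RV}} \;=\; |\hat{\theta}_s| $$

i.e., the confounding strength at which the bias bound is just large enough to drag the estimate to zero. A confidence-interval version replaces $|\hat{\theta}_s|$ with the relevant confidence bound.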

By grounding our sensitivity analyses in observed data and domain knowledge, we can provide more meaningful and defensible assessments of the potential impact of unobserved confounding. This approach moves us beyond arbitrary choices and towards a more rigorous and context-specific evaluation of causal claims.

References

Angrist, Joshua D., and Jörn-Steffen Pischke. Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press, 2009.

Chernozhukov, Victor, Carlos Cinelli, Whitney Newey, Amit Sharma, and Vasilis Syrgkanis. “Long Story Short: Omitted Variable Bias in Causal Machine Learning.” arXiv, May 26, 2024. https://doi.org/10.48550/arXiv.2112.13398.

Cinelli, Carlos, and Chad Hazlett. “Making Sense of Sensitivity: Extending Omitted Variable Bias.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82.1 (2020): 39–67.

Sensitivity analysis in the DoubleML package (documentation).

sensemakr: Sensitivity Analysis Tools for OLS in R and Stata.


¹ These questions are in the slides of this tutorial.
