What is the Critical Radius in Riesz Regression?


TL;DR

In the paper Automatic Debiased Machine Learning via Riesz Regression by Chernozhukov et al. (2024), the critical radius is a key concept for understanding how accurately machine learning (ML) can estimate the Riesz representer—a function used to debias ML estimates of economic parameters like treatment effects.

In Simple Terms: The critical radius measures how “hard” it is to learn a function (like the Riesz representer) from data.

The critical radius depends on:

  • Model Complexity: More flexible ML models (e.g., deep neural nets) have a larger critical radius, meaning they need more data to estimate accurately.
  • Sample Size: More data reduces the critical radius, making estimates more precise.

Intuition: Think of the critical radius as a speed limit on how fast your machine learning estimator can converge to the true function. If you’re estimating a complicated function (like a highly nonlinear Riesz representer) with a flexible model (e.g., a deep neural net), the critical radius will be larger, meaning you need more data to get a good estimate. If the function is simpler or you use a less flexible model (e.g., a linear model), the critical radius is smaller, and you need less data.

In the paper, the critical radius matters because it determines whether the Riesz regression estimator is reliable enough to debias the final parameter estimate (e.g., the ATE). If the critical radius is too large (due to an overly complex model or too little data), the error in estimating $\alpha_0$ can undermine the debiasing step, leading to biased or imprecise results.

Why It Matters: The critical radius helps bound the error in estimating the Riesz representer (see Theorem 2.1). A smaller radius means lower error, ensuring reliable debiasing of ML estimates. For neural nets, for example, the radius grows with network size and shrinks with more data, roughly $$\delta_n \propto \sqrt{\frac{\text{network size}}{n}}$$
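To make this scaling concrete, here is a minimal back-of-the-envelope sketch in Python. The network sizes and the unit proportionality constant are made up purely for illustration, and log factors are ignored:

```python
import numpy as np

# Back-of-the-envelope scaling: delta_n ~ sqrt(network_size / n),
# ignoring constants and log factors (a sharper neural-net bound
# appears later in this post). All numbers are illustrative.
def critical_radius(network_size: int, n: int) -> float:
    return np.sqrt(network_size / n)

for n in [1_000, 10_000, 100_000]:
    print(f"n={n:>7}: small net (100 params) delta ~ {critical_radius(100, n):.4f} | "
          f"large net (10k params) delta ~ {critical_radius(10_000, n):.4f}")
```

The printout makes the trade-off visible: at a fixed sample size the larger model has a bigger radius, and for a fixed model the radius shrinks like $1/\sqrt{n}$.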

Practical Takeaway: Choose ML models with complexity suited to your sample size to keep the critical radius small, ensuring accurate estimates. The paper’s simulations show better performance with larger samples (e.g., $n=10,000$), where the critical radius is smaller, leading to robust results.

The critical radius in the context of the paper Automatic Debiased Machine Learning via Riesz Regression by Chernozhukov et al. is a statistical concept from learning theory, but its name does indeed suggest a geometric intuition. Below, I’ll explain whether the critical radius has a geometric meaning, why it’s called a “radius,” and how this relates to the paper, keeping the explanation concise and accessible for a PhD-level economist familiar with econometrics but not necessarily with advanced statistical learning theory.


Does the Critical Radius Have a Geometric Meaning?

The critical radius has a geometric interpretation, though it’s rooted in the abstract geometry of function spaces rather than physical space. In statistical learning, the critical radius is related to the complexity of a set of functions (e.g., the class of neural nets or random forests used to estimate the Riesz representer). Geometrically, you can think of it as a measure of the “size” or “spread” of this function class in a high-dimensional space, which determines how hard it is to learn a specific function (like the Riesz representer $\alpha_0$) from data.

Here’s the intuition:

  • Imagine the set of possible functions $\mathcal{A}_n$ (e.g., all possible neural nets with a given architecture) as a “cloud” of points in a function space, where each point is a function $\alpha(x)$.

  • The critical radius ($\delta_n$) is like the radius of a ball around the true function ($\alpha_0$) that captures how spread out or complex the function class is. A larger radius means the function class is more complex (e.g., deeper neural nets with many parameters), making it harder to pinpoint the true function with limited data.

  • This “ball” isn’t in physical space but in a mathematical space where distance is measured by mean square error: $$\|\alpha - \alpha_0\|^2 = \mathbb{E}[(\alpha(X) - \alpha_0(X))^2]$$ A quick Monte Carlo sketch of this distance follows below.
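As a concrete (toy) illustration, the following sketch estimates $\|\alpha - \alpha_0\|^2$ by Monte Carlo for two made-up functions standing in for a candidate representer and the true one; nothing here comes from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration of the mean-square distance between two functions:
# ||alpha - alpha_0||^2 = E[(alpha(X) - alpha_0(X))^2], estimated by
# Monte Carlo. Both functions below are invented for illustration.
alpha_0 = lambda x: np.sin(x)        # stand-in for the true Riesz representer
alpha_hat = lambda x: x - x**3 / 6   # a crude polynomial approximation

X = rng.uniform(-1.0, 1.0, size=100_000)  # draws from the covariate distribution
distance_sq = np.mean((alpha_hat(X) - alpha_0(X)) ** 2)
print(f"||alpha_hat - alpha_0||^2 ≈ {distance_sq:.2e}")
```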

In the paper, the critical radius is used in Theorem 2.1 to bound the error of the Riesz regression estimator:

$$ \|\hat{\alpha} - \alpha_0\|^2 \leq C \left( M \delta_n^2 + \|\alpha^* - \alpha_0\|^2 + \frac{M \ln(1/\zeta)}{n} \right), $$

where $\delta_n$ is the critical radius of the function class $\mathcal{A}_n$ and $\alpha^*$ is the best approximation to $\alpha_0$ within that class. Geometrically, $\delta_n$ quantifies the “width” of the function class, affecting how close the estimated function $\hat{\alpha}$ can get to the true $\alpha_0$.
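To see how the three terms of the bound trade off, here is a toy calculation with hypothetical values for the unspecified constants $C$ and $M$; none of these numbers come from the paper:

```python
import math

# Plugging hypothetical numbers into the Theorem 2.1-style bound. C and M
# are unspecified constants in the theorem; every value here is made up
# purely to show how the three terms trade off.
C, M, n, zeta = 2.0, 1.0, 10_000, 0.05
delta_n = math.sqrt(100 / n)   # toy critical radius for a moderate-size model
approx_sq = 1e-3               # ||alpha* - alpha_0||^2: approximation error

estimation = M * delta_n**2              # shrinks as the critical radius shrinks
confidence = M * math.log(1 / zeta) / n  # cost of holding with probability 1 - zeta
bound = C * (estimation + approx_sq + confidence)

print(f"estimation term    : {estimation:.5f}")
print(f"approximation term : {approx_sq:.5f}")
print(f"confidence term    : {confidence:.5f}")
print(f"total bound        : {bound:.5f}")
```

With these toy values the $\delta_n^2$ term dominates, which is exactly why keeping the critical radius small is the focus of the theory.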


Why Is It Called a “Radius”?

The term “radius” comes from its connection to Rademacher complexity or similar complexity measures in statistical learning theory, which often have a geometric flavor. Here’s why:

  • Rademacher Complexity: The critical radius is closely tied to the Rademacher complexity of a function class, which measures the ability of the class to fit random noise. It’s like asking, “How big is the set of functions in terms of their ability to wiggle around and fit data?” This is often visualized as the radius of a ball in function space that encloses the class’s variability. (A small empirical sketch of this idea appears after this list.)

  • Covering Numbers: Another related concept is the covering number, which counts how many small balls (of radius $\delta$) are needed to cover the function class. The critical radius is the smallest $\delta_n$ where the function class’s complexity balances with the sample size $n$, resembling the radius of these covering balls.

  • Historical Naming: The term “radius” is borrowed from empirical process theory (e.g., Foster and Syrgkanis, 2019, cited in the paper), where it describes the scale of stochastic fluctuations in a function class. It’s called a radius because it behaves like the size of a region in function space where the estimator is likely to lie.
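To make the “fitting random noise” idea tangible, the sketch below computes the empirical Rademacher complexity of a deliberately simple toy class, linear functions $f(x) = wx$ with $|w| \leq B$, for which the supremum has a closed form. The class and the bounds $B$ are illustrative choices, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Empirical Rademacher complexity of a toy class: linear functions
# f(x) = w * x with |w| <= B. For this class the supremum over w is
# available in closed form:
#   sup_{|w|<=B} (1/n) sum_i sigma_i * w * x_i = B * |(1/n) sum_i sigma_i * x_i|.
def empirical_rademacher(x: np.ndarray, B: float, n_draws: int = 2_000) -> float:
    n = len(x)
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))  # random Rademacher signs
    return float(np.mean(B * np.abs(sigma @ x) / n))

x = rng.normal(size=1_000)
for B in [1.0, 10.0]:  # a larger B means a "wider" class and a larger complexity
    print(f"B={B:>4}: empirical Rademacher complexity ≈ {empirical_rademacher(x, B):.4f}")
```

Widening the class (larger $B$) inflates its ability to correlate with pure noise, which is the same force that inflates the critical radius.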

In the paper, for a neural net with width $K$ and depth $m$, the critical radius is bounded as: $$ \delta_n \leq C \sqrt{\frac{K^2 m^2 \ln(K^2 m) \ln(n)}{n}}. $$ This reflects the geometric idea that a more complex model (larger $K$ or $m$) has a larger “radius,” requiring more data ($n$) to shrink the error.
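Plugging numbers into this bound, with the unspecified constant $C$ set to 1 purely for illustration, shows how the radius grows with width and depth and shrinks with the sample size:

```python
import math

# The neural-net bound above with the unspecified constant C set to 1,
# just to see how delta_n moves with width K, depth m, and sample size n.
def nn_critical_radius(K: int, m: int, n: int, C: float = 1.0) -> float:
    return C * math.sqrt(K**2 * m**2 * math.log(K**2 * m) * math.log(n) / n)

for K, m in [(10, 2), (50, 4)]:
    for n in [10_000, 1_000_000]:
        print(f"K={K:>3}, m={m}, n={n:>9}: delta_n <= {nn_critical_radius(K, m, n):.4f}")
```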


Summary

The critical radius has a geometric meaning as the “size” of a function class in an abstract space, reflecting its complexity and the data needed to learn a function accurately. It’s called a “radius” due to its connection to Rademacher complexity and covering numbers, which describe the spread of functions like a ball’s radius. In the paper, it quantifies the error in estimating the Riesz representer, ensuring effective debiasing of ML-based economic parameters.

Reference

Chernozhukov, V., Newey, W. K., Quintas-Martinez, V., & Syrgkanis, V. (2024). Automatic debiased machine learning via Riesz regression. https://arxiv.org/abs/2104.14737
