Notes on Instrumental Variables

Here are my notes on instrumental variables from Stefan’s lecture materials.

Motivation

How can we identify the causal effect of W on Y in the presence of unobserved confounding U?

[Figure: causal graph with treatment W, outcome Y, and an unobserved confounder U affecting both.]

One popular way is to find and use instrumental variables.

Partially Linear IV Models

When instrumental variables are available, it becomes possible to point identify causal effects in partially linear models and certain types of causal effects in nonlinear models.

The key phrase for this section is linear, constant-effect IV.

Here we begin with partially linear models.

Assumption 1 (partially linear).
Assume the structural equation for Y is linear:

$$Y = f_Y(W, U, \varepsilon_Y) = \alpha + W\tau + \varepsilon, \tag{PLM}$$

where $\varepsilon$ is an error term that captures the contribution of both $U$ and $\varepsilon_Y$.

This is a semiparametric specification: we impose a linear relationship between W and Y but leave the rest non-parametric. Notice that, besides linearity, we also assume a constant treatment effect $\tau$, which is itself a strong assumption.

Remark 1 (partialling out X).
I ignore the role of covariates X and extensions such as $\varepsilon_i \perp Z_i \mid X_i$, because we can recover the same DAG after partialling out the observed confounders X. As an illustration, we can work with $\tilde{Y}, \tilde{W}, \tilde{Z}$, where $\tilde{V} = V - E[V \mid X]$.
Remark 2 (relaxing the linearity assumption).
The key limitation here is that we assume a linear structure and a constant treatment effect. Later, we will consider IV estimators without the linearity assumption.

When the constant treatment effect model (PLM) doesn't hold, the average treatment effect $\tau_{ATE} = E[Y_i(1) - Y_i(0)]$ is NOT identified without more data, because we never observe, e.g., treated never-takers. Without linearity, the estimator $\hat\tau_{IV}$ still converges to a large-sample limit $\tau_{LATE}$, the local average treatment effect (LATE).

Identifying assumptions

There are three main identifying assumptions that must be satisfied for a variable Z to serve as an instrument.

Assumption 2 (exogeneity).
The instrument $Z_i$ must be exogenous, which here means $\varepsilon_i \perp Z_i$.
Assumption 3 (relevance).
The instrument $Z_i$ must be relevant, such that $Cov[W_i, Z_i] \neq 0$.

Graphically, the relevance assumption corresponds to the existence of an active edge from Z to W in the causal graph.

Assumption 4 (exclusion restriction).
The instrument $Z_i$ must satisfy the exclusion restriction, meaning that any effect of $Z_i$ on $Y_i$ must be fully mediated by the treatment $W_i$.

Graphically, this means that we’ve excluded enough potential edges between variables in the causal graph so that all causal paths from Z to Y go through W.

Case I (Easiest)

Consider the fully linear version,

$$Y = \alpha + W\tau + \varepsilon, \quad \varepsilon \perp Z, \quad W = Z\gamma + \eta. \tag{C1}$$

Then,

$$Cov[Y, Z] = Cov[\tau W + \varepsilon, Z] = \tau\, Cov[W, Z] \quad\Longrightarrow\quad \tau = \frac{Cov[Y, Z]}{Cov[W, Z]}.$$

This suggests the sample-analog estimator

$$\hat\tau_{IV} = \frac{\widehat{Cov}[Y_i, Z_i]}{\widehat{Cov}[W_i, Z_i]}.$$
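As a minimal sketch in Python (the simulated data-generating process, coefficient values, and variable names below are my own illustration, not from the lecture), the estimator is just a ratio of sample covariances, and comparing it to OLS shows the confounding bias that IV removes:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Simulate from the fully linear model (C1) with an unobserved confounder U.
U = rng.normal(size=n)                      # unobserved confounder
Z = rng.normal(size=n)                      # instrument, independent of U
W = 0.8 * Z + U + rng.normal(size=n)        # treatment: relevant (gamma = 0.8), confounded by U
tau = 2.0
Y = 1.0 + tau * W + U + rng.normal(size=n)  # outcome; error term is U + noise

# Naive OLS slope is biased because W is correlated with the error term via U.
tau_ols = np.cov(Y, W)[0, 1] / np.var(W, ddof=1)

# IV estimator: ratio of sample covariances of (Y, Z) and (W, Z).
tau_iv = np.cov(Y, Z)[0, 1] / np.cov(W, Z)[0, 1]

print(f"OLS: {tau_ols:.3f}   IV: {tau_iv:.3f}   true tau: {tau}")
```

In this design the OLS slope is pulled above the true $\tau = 2$ because U raises both W and Y, while the covariance ratio recovers $\tau$.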

Case II (More general: optimal instruments)

Case I assumes (1) a linear relationship between Y and W, and (2) a linear relationship between W and Z. This may be too restrictive. How can we extend it to a more general specification? What should we do if

  • we have multiple instruments

  • or we believe that the instrument may act non-linearly

Consider the following,

$$Y = \tau W + \varepsilon, \quad \varepsilon \perp Z, \quad Y, W \in \mathbb{R}, \; Z \in \mathcal{Z}, \tag{C2}$$

where $\mathcal{Z}$ can be a high-dimensional space. Define a function $w$ that maps $Z_i$ to the real line, $w : \mathcal{Z} \to \mathbb{R}$.

Then, by the same argument as in Case I (note: we can regard $w(Z)$ as a "pseudo-$Z$"),

$$\tau = \frac{Cov[Y, w(Z)]}{Cov[W, w(Z)]},$$

provided the denominator is non-zero, resulting in the feasible estimator

$$\hat\tau_{IV} = \frac{\widehat{Cov}[Y_i, w(Z_i)]}{\widehat{Cov}[W_i, w(Z_i)]} = \frac{\frac{1}{n}\sum_{i=1}^n (Y_i - \bar{Y})\,\big(w(Z_i) - \overline{w(Z)}\big)}{\frac{1}{n}\sum_{i=1}^n (W_i - \bar{W})\,\big(w(Z_i) - \overline{w(Z)}\big)}.$$
Theorem 1.
Suppose $(X_i, W_i, Y_i, Z_i)$ are IID draws from a distribution satisfying (C2), and let $w : \mathcal{Z} \to \mathbb{R}$ be such that $Cov[W, w(Z)] \neq 0$. Then $\hat\tau_{IV}$ as given above is consistent for $\tau$, and

$$\sqrt{n}\,(\hat\tau_{IV} - \tau) \Rightarrow \mathcal{N}(0, V_w), \qquad V_w = \frac{Var[\varepsilon_i]\, Var[w(Z_i)]}{Cov[W_i, w(Z_i)]^2}.$$
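For inference, Theorem 1 suggests a plug-in standard error. Here is a minimal sketch with the simple choice $w(z) = z$ (the simulated data and names are again my own illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
U = rng.normal(size=n)
Z = rng.normal(size=n)
W = 0.8 * Z + U + rng.normal(size=n)
Y = 1.0 + 2.0 * W + U + rng.normal(size=n)

w_Z = Z                                                # chosen instrument transform w(z) = z
tau_hat = np.cov(Y, w_Z)[0, 1] / np.cov(W, w_Z)[0, 1]  # IV point estimate
resid = (Y - Y.mean()) - tau_hat * (W - W.mean())      # plug-in estimate of the errors eps_i
V_w = np.var(resid) * np.var(w_Z) / np.cov(W, w_Z)[0, 1] ** 2   # variance formula from Theorem 1
se = np.sqrt(V_w / n)
print(f"tau_hat = {tau_hat:.3f}, 95% CI = [{tau_hat - 1.96 * se:.3f}, {tau_hat + 1.96 * se:.3f}]")
```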

What is the best function $w$, say $w^*(z)$, that minimizes the variance $V_w$? It turns out that the optimal instrument is the best prediction of $W_i$ from $Z_i$:

$$w^*(z) = E[W_i \mid Z_i = z].$$

How do we estimate it? Use cross-fitting. Why? Because

$\varepsilon_i \not\perp W_i$, and $W_i$ is used when fitting $\hat{w}$.

So, if $\hat{w}$ is estimated in-sample, we no longer have $\hat{w}(Z_i) \perp \varepsilon_i$. Therefore, we need cross-fitting to address this issue.
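Here is a minimal sketch of a cross-fitted optimal-instrument estimator (the random-forest choice for $\hat{w}$, the function name, and the simulated data are my own illustrative choices, not the lecture's implementation):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def cross_fit_optimal_iv(Y, W, Z, n_splits=5, seed=0):
    """Estimate tau using an estimate of the optimal instrument w*(z) = E[W | Z = z],
    with cross-fitting so that w_hat(Z_i) is never fit on observation i."""
    Z = np.asarray(Z).reshape(len(Y), -1)        # allow multi-dimensional Z
    w_hat = np.empty(len(Y))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train, test in kf.split(Z):
        model = RandomForestRegressor(n_estimators=200, random_state=seed)
        model.fit(Z[train], W[train])
        w_hat[test] = model.predict(Z[test])     # out-of-fold predictions only
    # Plug w_hat in as the "pseudo Z" in the covariance-ratio estimator.
    return np.cov(Y, w_hat)[0, 1] / np.cov(W, w_hat)[0, 1]

# Illustrative example: two instruments acting non-linearly on the treatment.
rng = np.random.default_rng(1)
n = 5_000
U = rng.normal(size=n)                            # unobserved confounder
Z = rng.normal(size=(n, 2))
W = np.sin(Z[:, 0]) + 0.5 * Z[:, 1] ** 2 + U + rng.normal(size=n)
Y = 1.0 + 2.0 * W + U + rng.normal(size=n)        # true tau = 2.0
print(f"cross-fitted IV estimate: {cross_fit_optimal_iv(Y, W, Z):.3f}")
```

Any flexible regression of W on Z can play the role of $\hat{w}$; the point is that each $\hat{w}(Z_i)$ is computed from folds that exclude observation i.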

Case III (Much more general: non-parametric IV regression)

A version more general than Case II is the following:

$$Y_i = \alpha + g(W_i) + \varepsilon_i, \quad Z_i \perp \varepsilon_i, \quad Y_i, W_i \in \mathbb{R}, \; Z_i \in \mathcal{Z}, \tag{C3}$$

where $g(\cdot)$ is some generic smooth function we want to estimate. Note that:

  • (C3) still requires the effect of $W_i$ on $Y_i$ to be additive

  • however, unlike (C2), it now allows this additive effect to be modified by a non-linearity $g(\cdot)$

Now,

$$E[Y_i \mid Z_i = z] = E[\alpha + g(W_i) + \varepsilon_i \mid Z_i = z] = \alpha + E[g(W_i) \mid Z_i = z] = \alpha + \int_{\mathbb{R}} g(w)\, f(w \mid z)\, dw,$$

where $f(w \mid z)$ is the conditional density of $W_i$ given $Z_i = z$.

There are two steps for learning $g(\cdot)$ (a rough sketch follows below):

  1. fit a non-parametric model $\hat{f}(w \mid z)$ of the conditional density, using cross-fitting

  2. estimate $g(\cdot)$ by empirical minimization

For more details, see Section 9.2 in Stefan’s lecture.
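The conditional-density route is beyond these notes, but as a rough stand-in here is a sieve-style (2SLS-like) sketch: model $g$ with a polynomial basis in W and instrument it with a polynomial basis in Z. This is my own simplified illustration under strong assumptions, not the procedure from Section 9.2:

```python
import numpy as np

def nonparametric_iv(Y, W, Z, degree=3):
    """Sieve-style IV sketch: approximate g(w) by a polynomial basis in W and
    instrument each basis function with its best linear prediction from a
    polynomial basis in Z (a hypothetical illustration, not the lecture's method)."""
    B_w = np.column_stack([W ** k for k in range(1, degree + 1)])                        # basis of W
    B_z = np.column_stack([np.ones_like(Z)] + [Z ** k for k in range(1, degree + 1)])    # basis of Z
    # First stage: predict each basis function of W from the Z basis.
    coef, *_ = np.linalg.lstsq(B_z, B_w, rcond=None)
    B_w_hat = B_z @ coef
    # Second stage: regress Y on an intercept and the predicted basis.
    X = np.column_stack([np.ones_like(Y), B_w_hat])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta[0], beta[1:]        # alpha_hat, coefficients of g in the W basis

# Example with a quadratic g and an unobserved confounder U.
rng = np.random.default_rng(2)
n = 20_000
U = rng.normal(size=n)
Z = rng.normal(size=n)
W = Z + U + 0.3 * rng.normal(size=n)
Y = 1.0 + W + 0.5 * W ** 2 + U + rng.normal(size=n)   # true g(w) = w + 0.5 * w**2
alpha_hat, g_coefs = nonparametric_iv(Y, W, Z)
print("basis coefficients of g:", np.round(g_coefs, 2))  # approx. (1.0, 0.5, 0.0) for large n
```

In practice the bases would be chosen more carefully and regularized; the snippet only illustrates how instrumenting basis functions of W can recover a non-linear $g$.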

Local Average Treatment Effects

Motivation

  • IV without the linearity assumption: One may doubt the validity of the linearity and constant-treatment-effect assumptions in the previous section. What about non-parametric identification using IV?

  • Encouragement Design and Noncompliance: Noncompliance is a common problem in encouragement designs involving human beings as experimental units. In those cases, the experimenters cannot force the units to take the treatment but rather only encourage them to do so. Heterogeneous effects should be allowed.

Setup

Consider a randomized experiment:

  • Let $Z_i \in \{0, 1\}$ be the treatment assigned

  • Let $W_i \in \{0, 1\}$ be the treatment received

  • When $Z_i \neq W_i$, the noncompliance problem arises

  • Potential treatments $\{W_i(1), W_i(0)\}$ s.t. $W_i = W_i(Z_i)$

  • Potential outcomes $\{Y_i(w, z)\}_{(w, z) \in \{0, 1\}^2}$ s.t. $Y_i = Y_i(W_i, Z_i)$

Identifying assumptions

[Screenshot from the lecture notes listing the LATE identifying assumptions, including the exclusion restriction and monotonicity used in the proof below.]

LATE Theorem

[Screenshot from the lecture notes stating the LATE theorem: the IV estimand identifies the average treatment effect for compliers.]

Idea of proof:

  1. Start with $Cov(Y, Z) = E[YZ] - E[Y]E[Z]$, then apply the law of iterated expectations by conditioning on Z

  2. Derive $Cov(W, Z)$ similarly, then take the ratio

  3. Decompose the ATE on Y, $E[Y(1) - Y(0)]$, into four terms (always-takers, compliers, defiers, never-takers) using the law of total probability

  4. By the exclusion restriction and the monotonicity assumption, only the "complier" term remains
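As a sketch of steps 1 and 2: because $Z_i$ is binary and randomized, $Cov[V_i, Z_i] = P(Z_i = 1)\,(1 - P(Z_i = 1))\,\big(E[V_i \mid Z_i = 1] - E[V_i \mid Z_i = 0]\big)$ for any variable $V_i$, so the covariance ratio reduces to the Wald estimand; steps 3 and 4 then show it equals the effect among compliers:

$$\frac{Cov[Y_i, Z_i]}{Cov[W_i, Z_i]} = \frac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{E[W_i \mid Z_i = 1] - E[W_i \mid Z_i = 0]} = E\big[\,Y_i(1) - Y_i(0) \mid W_i(1) > W_i(0)\,\big] = \tau_{LATE}.$$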

Multiple instruments

We may have access to data from multiple randomized trials that can be used to study a treatment effect via a non-compliance analysis.

Marketing example:

  • goal: study the effect of subscription to a loyalty program ($W_i$) on long-term customer lifetime value ($Y_i$)

  • randomized trial 1: offering discounts for joining the loyalty program ($Z_i = 1(\{\text{customer received a discount}\})$)

  • randomized trial 2: showing advertisements ($Z_i = 1(\{\text{customer was shown an ad for the program}\})$)

Previously, under the linear treatment effect model, multiple instruments could be combined into a single optimal instrument, and the optimal instrument corresponds to the summary of all the instruments that best predicts the treatment.

Without the linear treatment effect model, however, we caution that no such result is available. Different instruments may induce different compliance patterns, and so the LATEs identified by different instruments may not be the same.

In the marketing example, the ATE for customers who respond to a discount may be different from the ATE for customers who respond to an advertisement.

Reference

Wager, S. (2024). Causal inference: A statistical learning approach. https://web.stanford.edu/~swager/causal_inf_book.pdf
