Notes on Control Function Method
Motivation
- Endogeneity: When an explanatory variable (treatment) is correlated with the unobserved error term (e.g., due to omitted variables, measurement error, or simultaneity).
- Consequence: Standard regression (e.g., OLS) yields biased and inconsistent estimates of the treatment effect.
- Goal: The control function (CF) approach “purges” endogeneity by modeling the correlation between the treatment and unobservables.
The CF approach does this by explicitly modelling the source of endogeneity and then partialling it out. In linear models the resulting estimate of the treatment effect is numerically identical to 2SLS; the real power of CF is that it extends to nonlinear or limited-dependent-variable settings where 2SLS cannot be applied directly.
How It Works
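Outcome model: $Y = \beta D + X'\gamma + U$
- $D$: Endogenous treatment
- $X$: Exogenous controls
- $U$: Unobservables (correlated with $D$)
First-stage model: $D = \pi Z + X'\delta + V$
- $Z$: Instrumental variable (IV)
- $V$: First-stage error
CF Insight: If $E[U \mid Z, X, V] = \rho V$, then conditioning on $V$ removes the correlation between $D$ and $U$, so $V$ can serve as a control in the outcome model.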
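The reasoning behind this insight can be written out in one substitution. The display below is a sketch under the linear control-function assumption; the symbols ($Y$ outcome, $D$ treatment, $X$ controls, $Z$ instrument, $U$ outcome error, $V$ first-stage error, $\rho$, $\varepsilon$) are notation assumed here for illustration.

```latex
\begin{aligned}
U &= \rho V + \varepsilon, \qquad E[\varepsilon \mid Z, X, V] = 0 \\
Y &= \beta D + X'\gamma + U \\
  &= \beta D + X'\gamma + \rho V + \varepsilon, \qquad E[\varepsilon \mid D, X, V] = 0
\end{aligned}
```

The last condition holds because $D$ is a function of $(Z, X, V)$, so conditioning on $(D, X, V)$ uses no information beyond $(Z, X, V)$.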
Estimation
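- First Stage:
  - Regress the treatment $D$ on the instrument $Z$ and controls $X$: $D = \pi Z + X'\delta + V$
  - Obtain the residual: $\hat{V} = D - \hat{\pi} Z - X'\hat{\delta}$.
- Second Stage:
  - Add $\hat{V}$ to the outcome model: $Y = \beta D + X'\gamma + \rho \hat{V} + \varepsilon$
  - Estimate via OLS.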
Why this works:
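- The first-stage residual $\hat{V}$ “controls for” the part of the treatment $D$ correlated with the outcome error $U$.
- Once $\hat{V}$ is included, $D$ becomes exogenous in the modified model ($E[\varepsilon \mid D, X, \hat{V}] = 0$ for the remaining error $\varepsilon$).
- $\hat{\beta}$, the coefficient on $D$, is consistent for the causal effect.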
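The two-step procedure, and the claim that adding the residual restores consistency, can be checked on simulated data. This is a minimal sketch rather than a reference implementation: the data-generating process, the coefficient values, and the helper name `ols` are all assumptions made for the illustration. Standard errors are not computed; because the residual is itself estimated, naive second-stage OLS standard errors would generally need adjusting.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
beta = 2.0  # true causal effect (assumed for the simulation)

# Simulated data: the outcome error u is correlated with the first-stage error v.
x = rng.normal(size=n)              # exogenous control
z = rng.normal(size=n)              # instrument
v = rng.normal(size=n)              # first-stage error
u = 0.8 * v + rng.normal(size=n)    # outcome error with E[u | v] = 0.8 v
d = 1.0 * z + 0.5 * x + v           # endogenous treatment
y = beta * d + 1.0 * x + u          # outcome


def ols(regressors, target):
    """Return OLS coefficients (intercept first) and residuals."""
    design = np.column_stack([np.ones(len(target)), *regressors])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef, target - design @ coef


# Naive OLS ignores the endogeneity of d.
naive_coef, _ = ols([d, x], y)

# First stage: regress the treatment on instrument and controls, keep residuals.
_, v_hat = ols([z, x], d)

# Second stage: add the first-stage residual to the outcome regression.
cf_coef, _ = ols([d, x, v_hat], y)

beta_naive = naive_coef[1]
beta_cf = cf_coef[1]
rho_hat = cf_coef[3]  # coefficient on the control function
print(f"naive OLS: {beta_naive:.3f}  control function: {beta_cf:.3f}  rho: {rho_hat:.3f}")
```

In this setup the naive coefficient should sit well above the true value of 2, while the control-function coefficient should be close to it; the coefficient on the residual estimates the strength of the endogeneity.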
Key Assumptions
- Instrument Validity:
  - The instrument $Z$ is relevant: $\mathrm{Cov}(Z, D) \neq 0$ (strong first stage).
  - $Z$ is exogenous: $\mathrm{Cov}(Z, U) = 0$.
- Correct Functional Form:
  - Linearity in the first stage and control function (e.g., $E[U \mid V] = \rho V$).
- Exclusion Restriction:
  - $Z$ affects the outcome $Y$ only through the treatment $D$.
CF vs. Other Methods
Method | Key Difference |
---|---|
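2SLS | Uses fitted values ($\hat{D}$) from the first stage in place of the treatment |
Control Function | Uses residuals ($\hat{V}$) from the first stage as an additional regressor |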
Advantage of CF:
- Directly models the endogeneity structure (via the first-stage residual $\hat{V}$).
- Extends to non-additive errors, discrete outcomes, and heteroscedastic settings.
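The linear-case equivalence between the two rows of the table can be verified numerically. The sketch below uses simulated data with assumed coefficients and a hypothetical helper `fit`; it runs one shared first stage and then compares the 2SLS coefficient (fitted values substituted for the treatment) with the control-function coefficient (residual added as a regressor).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Simulated data (all values are assumptions made for the illustration).
x = rng.normal(size=n)              # exogenous control
z = rng.normal(size=n)              # instrument
v = rng.normal(size=n)              # first-stage error
u = 0.5 * v + rng.normal(size=n)    # outcome error, correlated with v
d = 0.7 * z + 0.3 * x + v           # endogenous treatment
y = 1.5 * d - 1.0 * x + u           # outcome, true effect 1.5

ones = np.ones(n)


def fit(design, target):
    """OLS coefficients for a given design matrix."""
    return np.linalg.lstsq(design, target, rcond=None)[0]


# Shared first stage: treatment on intercept, instrument, and control.
first_stage = np.column_stack([ones, z, x])
d_hat = first_stage @ fit(first_stage, d)   # fitted values
v_hat = d - d_hat                           # residuals

# 2SLS: replace the treatment by its fitted value.
beta_2sls = fit(np.column_stack([ones, d_hat, x]), y)[1]

# Control function: keep the treatment and add the residual.
beta_cf = fit(np.column_stack([ones, d, x, v_hat]), y)[1]

print(beta_2sls, beta_cf, np.isclose(beta_2sls, beta_cf))
```

The two coefficients agree up to floating-point error because the residual is orthogonal to every first-stage regressor, so adding it does not change the coefficient attached to the fitted part of the treatment.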
Example
Problem: Estimate returns to education ($D$).
- IV: Distance to college ($Z$).
- Steps:
  - Regress: education ~ distance + controls → get residuals $\hat{V}$.
  - Regress: wages ~ education + controls + $\hat{V}$.
- Result: The coefficient on education is a consistent estimate of the causal return, provided the instrument is valid.
Generalization
Setting | Control-function term | Typical reference |
---|---|---|
Binary/probit | Include first-stage residual $\hat{V}$ | Rivers & Vuong (1988) |
Heckman sample selection | Inverse Mills ratio | Heckman (1979) |
Count models (Poisson, NB) | Nonparametric sieve | Wooldridge (2015) |
Semiparametric/ML partially-linear | Learnt nuisance | Chernozhukov et al. (2018) |
The principle is identical: obtain a residual that captures unobserved heterogeneity driving the endogenous regressor, then condition on it in the outcome equation.
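For the binary/probit row, the form of the second step can be sketched as follows, assuming a latent-index outcome with a continuous endogenous treatment and jointly normal errors (the setup of Rivers and Vuong). The symbols are notation chosen here for illustration: $Y$ outcome, $D$ treatment, $X$ controls, $Z$ instrument, $U$ and $V$ the outcome and first-stage errors, and $e$ with standard deviation $\sigma_e$ the part of $U$ not explained by $V$.

```latex
\begin{aligned}
Y &= \mathbf{1}\{\beta D + X'\gamma + U > 0\}, \qquad D = \pi Z + X'\delta + V \\
U &= \rho V + e, \qquad e \mid Z, X, V \sim N(0, \sigma_e^2) \\
P(Y = 1 \mid D, X, V) &= \Phi\!\left(\frac{\beta D + X'\gamma + \rho V}{\sigma_e}\right)
\end{aligned}
```

A probit of $Y$ on $D$, $X$, and the first-stage residual $\hat{V}$ therefore recovers the coefficients only up to the scale factor $1/\sigma_e$.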
Why This Matters
- CF is essential when:
  - You have a valid IV and suspect omitted variable bias.
  - You work with nonlinear models (e.g., binary/duration outcomes).
- Software Implementation:
  - Stata: `ivregress 2sls` (equivalent to CF in linear cases) or `cmp` for nonlinear.
  - R: `ivreg` (linear), `controlfunction` package.
Critical Caveats
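- Weak Instruments: If the instrument $Z$ is weak, the first-stage residual $\hat{V}$ is noisy → bias.
- Functional Form Misspecification: If $E[U \mid V] \neq \rho V$, CF fails.
- No Magic Bullet: The exogeneity of $Z$ cannot be tested directly and must be justified theoretically.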
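The weak-instrument caveat is the one item on this list that can be checked from the data, usually through the first-stage F-statistic on the instrument. The sketch below is an illustration under assumptions: the function name `first_stage_f`, the simulated data, and the coefficient values are hypothetical, and the statistic assumes homoskedastic errors and a single instrument. A common rule of thumb treats a first-stage F below about 10 as a warning sign.

```python
import numpy as np


def first_stage_f(d, z, x):
    """F-statistic for excluding a single instrument z from the first stage."""
    n = len(d)
    ones = np.ones(n)
    restricted = np.column_stack([ones, x])        # first stage without the instrument
    unrestricted = np.column_stack([ones, x, z])   # first stage with the instrument

    def rss(design):
        coef = np.linalg.lstsq(design, d, rcond=None)[0]
        resid = d - design @ coef
        return resid @ resid

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = n - unrestricted.shape[1]
    # One restriction is tested, so the numerator degrees of freedom equal 1.
    return (rss_r - rss_u) / (rss_u / dof)


rng = np.random.default_rng(2)
n = 2_000
x = rng.normal(size=n)   # exogenous control
z = rng.normal(size=n)   # instrument
v = rng.normal(size=n)   # first-stage error

f_strong = first_stage_f(1.0 * z + 0.5 * x + v, z, x)    # instrument moves the treatment
f_weak = first_stage_f(0.02 * z + 0.5 * x + v, z, x)     # instrument barely matters
print(f"strong instrument: F = {f_strong:.1f}   weak instrument: F = {f_weak:.1f}")
```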
Bottom Line
The CF approach harnesses IV residuals to “control” for endogeneity, converting an endogenous variable into a conditionally exogenous one. It blends IV intuition with regression control and is powerful when its assumptions hold.