Instrumental Variables without Traditional Instruments

Typically, regression models in empirical economic research suffer from at least one form of endogeneity bias.

The classic example is economic returns to schooling, where researchers want to know how much increased levels of education affect income. Estimation using a simple linear model, regressing income on schooling alongside a set of control variables, will typically not yield education's true effect on income. The problem here is one of omitted variables – most notably unobserved ability. People who are more educated may be more motivated, or may have other unobserved characteristics, which simultaneously affect schooling and future lifetime earnings.

Endogeneity bias plagues empirical research. However, there are solutions, the most common being instrumental variables (IVs). Unfortunately, the exclusion restrictions needed to justify the use of traditional IV methodology may be impossible to find.

So, what if you have an interesting research question and some data, but endogeneity with no IVs? You should give up, right? Wrong. According to Lewbel (forthcoming in the Journal of Business and Economic Statistics), it is possible to overcome the endogeneity problem without the use of a traditional IV approach.

Lewbel’s paper demonstrates how higher-order moment restrictions can be used to tackle endogeneity in triangular systems. Without going into too much detail (interested readers can consult Lewbel’s paper), this method is like the traditional two-stage instrumental variable approach, except the first-stage exclusion restriction is generated from the exogenous (control) variables, provided the first-stage errors are heteroskedastic with respect to those variables (interested practitioners can test for this in the usual way, e.g. with a White or Breusch-Pagan test).
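To fix ideas, here is a minimal sketch of the simple two-stage version of this idea. The toy data generating process, variable names, and the use of the AER package and the Breusch-Pagan test are my own choices for illustration, not Lewbel's notation: regress the endogenous regressor on the exogenous variables, form instruments by interacting the demeaned exogenous variables with the first-stage residuals, and run standard 2SLS.

```r
library(AER)   # provides ivreg(); also attaches lmtest for bptest()

set.seed(1)
n  <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)
u  <- rnorm(n)                    # common shock creating endogeneity
e1 <- u + exp(x1) * rnorm(n)      # heteroskedastic error, outcome equation
e2 <- u + exp(-x1) * rnorm(n)     # heteroskedastic error, first stage
y1 <- 1 + x1 + x2 + e2            # endogenous regressor (first stage)
y2 <- 1 + x1 + x2 + y1 + e1       # outcome

# check for heteroskedasticity in the first stage (Breusch-Pagan here;
# a White test works the same way with squares and cross-products added)
first <- lm(y1 ~ x1 + x2)
print(bptest(first))

# generated instruments: (z - mean(z)) * first-stage residuals
res1 <- residuals(first)
iv1  <- (x1 - mean(x1)) * res1
iv2  <- (x2 - mean(x2)) * res1

# 2SLS using the generated instruments
fit <- ivreg(y2 ~ x1 + x2 + y1 | x1 + x2 + iv1 + iv2)
coef(fit)
```

With heteroskedastic first-stage errors these generated instruments have explanatory power; with homoskedastic errors they would be useless, which is why testing for heteroskedasticity matters.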

In the code below, I demonstrate how one could employ this approach in R using the GMM framework outlined by Lewbel. My code relates only to a simple example with one endogenous variable and two exogenous variables; however, it would be easy to modify this code depending on the model.

rm(list=ls())
library(gmm)
# GMM moment function for 1 endogenous variable with 2 heteroskedastic
# exogenous variables
# outcome in the first column of 'datmat', endogenous variable in second
# constant and exogenous variables in the next three
# heteroskedastic exogenous variables in the last two (i.e. no constant)
g1 <- function(theta, datmat) {
# set up data
y1 <- matrix(datmat[,1], ncol=1)
y2 <- matrix(datmat[,2], ncol=1)
x1 <- matrix(datmat[,3:5], ncol=3)
z1 <- matrix(datmat[,4:5], ncol=2)
# if the variable in the 4th column were not heteroskedastic,
# this could be modified to:
# z1 <- matrix(datmat[,5], ncol=1)

# set up moment conditions
# theta is ordered to match the initial values vector used below:
# theta[1:4] = outcome equation (constant, x1, x2, y1)
# theta[5:7] = first-stage equation (constant, x1, x2)
# first-stage residual
in1 <- y1 - theta[5]*x1[,1] - theta[6]*x1[,2] - theta[7]*x1[,3]
# outcome equation residual
in2 <- y2 - theta[1]*x1[,1] - theta[2]*x1[,2] - theta[3]*x1[,3] - theta[4]*y1
M <- NULL
for(i in 1:ncol(z1)){
M <- cbind(M, z1[,i] - mean(z1[,i]))
}
for(i in 1:ncol(x1)){M <- cbind(M, in1*x1[,i])}
for(i in 1:ncol(x1)){M <- cbind(M, in2*x1[,i])}
# Lewbel's generated instruments: (z - mean(z)) * first-stage residual
for(i in 1:ncol(z1)){M <- cbind(M, in2*(z1[,i] - mean(z1[,i]))*in1)}
return(M)
}
# so estimation is easy
# gmm(moment function, data matrix, initial values vector)
# e.g.: gmm(g1, x = as.matrix(dat), t0 = c(1,1,1,1,1,1,1))

I also tested the performance of Lewbel's GMM estimator against a mis-specified OLS estimator. In the code below, I perform 500 simulations of a triangular system containing an omitted variable. For the GMM estimator, it is useful to have good initial starting values; in this simple example, I use the OLS coefficients. In more complicated settings, it is advisable to use estimates from the 2SLS procedure outlined in Lewbel's paper. The distributions of the coefficient estimates are shown in the plot below. The true value, indicated by the vertical line, is one. It is pretty evident that the Lewbel approach works very well. I think this method could be very useful in a number of research disciplines.

beta1 <- beta2 <- NULL
for(k in 1:500){
#generate data (including intercept)
x1 <- rnorm(1000,0,1)
x2 <- rnorm(1000,0,1)
u <- rnorm(1000,0,1)
s1 <- rnorm(1000,0,1)
s2 <- rnorm(1000,0,1)
ov <- rnorm(1000,0,1)
e1 <- u + exp(x1)*s1 + exp(x2)*s1
e2 <- u + exp(-x1)*s2 + exp(-x2)*s2
y1 <- 1 + x1 + x2 + ov + e2
y2 <- 1 + x1 + x2 + y1 + 2*ov + e1
x3 <- rep(1,1000)
dat <- cbind(y1,y2,x3,x1,x2)

# record OLS estimate of the endogenous (y1) coefficient
beta1 <- c(beta1, coef(lm(y2~x1+x2+y1))[4])
# initial values for iv-gmm
init <- c(coef(lm(y2~x1+x2+y1)), coef(lm(y1~x1+x2)))
# record GMM estimate of the endogenous (y1) coefficient
beta2 <- c(beta2, coef(gmm(g1, x = as.matrix(dat), t0 = init))[4])
}

library(sm)
d <- data.frame(rbind(cbind(beta1,"OLS"),cbind(beta2,"IV-GMM")))
d$beta1 <- as.numeric(as.character(d$beta1))
d$V2 <- factor(d$V2)
sm.density.compare(d$beta1, d$V2, xlab = "Endogenous Coefficient")
title("Lewbel and OLS Estimates")
legend("topright", levels(d$V2), lty = c(1,2,3), col = c(2,3,4), bty = "n")
abline(v=1)

18 thoughts on “Instrumental Variables without Traditional Instruments”

1. Vincent

This is great! Thanks for sharing.

• diffuseprior

Thanks!

2. Andres Trujillo-Barrera

Nice job!

• diffuseprior

Thanks, I hope you found it useful!

3. cpeter9

How does this compare to Amemiya and MaCurdy’s method? Sounds similar.

• diffuseprior

AFAIK A & McC is a panel method. I would have to check that though, so I am willing to be corrected. Thanks for the comment!

4. matthieu

Great to see these econometric topics discussed in R, thanks!

5. Nils

Thanks a lot for the illustrative example. I have two questions:

1/

Say, I wanted to estimate the correct OLS (including the omitted variable), just for the sake of curiosity. So I replace line 18 in the second code panel with

beta1 <- c(beta1,coef(lm(y2~x1+x2+y1+ov)))

(Note the additional regressor "ov" in the equation.) Why are the results for y1 still upward biased? Actually, they do not differ significantly from the restricted case at all.

2/

I tried to compare Lewbel's method to the Klein & Vella (2010) estimator (JEconometrics 154 (2),154-164). It also uses heteroskedasticity for identification. In their paper they do MC simulations with the following data generating process:

x1 <- rnorm(1000,0,1)
x2 <- rnorm(1000,0,1)
vstar <- rnorm(1000,0,1)
ustar <- .33*vstar+rnorm(1000,0,1)
u <- 1 + exp(.2*x1+.6*x2)*ustar
v <- 1 + exp(.6*x1+.2*x2)*vstar
y1 <- 1 + x1 + x2 + v
y2 <- 1 + x1 + x2 + y1 + u

In this DGP endogeneity doesn't result from omitted variables but from the correlated error terms, u and v, with a degree of .33. This corresponds to their "multiplicative error structure", for which they prove their estimator to be consistent. They discuss omitted variables in the sense that these can be included as a factor in the composite error terms:

u=S_u * w * u^star
v=S_v * w * v^star

where the S terms are heteroskedasticity functions that "scale" the homoskedastic (and correlated) errors, u^star and v^star, into heteroskedastic errors. "w" is meant to be a common element, representing omitted variables like, for example, "ability" in wage regressions.

Using this data generating process in your simulation example shows that Lewbel's estimator performs very poorly. (You can see this by just replacing the DGP in the code provided above.)

What went wrong? Any ideas? Does Lewbel's estimator not allow for such DGPs?

• diffuseprior

Hey Nils,

Thanks for the detailed comment.

I will try and answer as concisely as possible.

On 1/, you are right. This problem seems to be caused by the error terms, the distributions for which are far from normal. However, when I ran the MC analysis the OLS estimate that includes the omitted variable is 1.049 and the OLS without is 1.140 (as in the example). So the upward bias is still there, but not as bad. Using robust regressions (rlm in the MASS package) reduces this bias further to 1.03919.

Your point on 2/ is something I have been thinking about. I can’t give you a full reply now. What I want to do is write up a function that estimates Klein and Vella’s method, then compare both the KV and Lewbel approaches using the MC simulations undertaken in both the KV and the Lewbel papers. I think this would be a useful and informative exercise. Hopefully, I get some time to do this in the coming weeks. Bear with me on this!

• Nils

Thanks for your reply. I just found a typo in my program that caused Lewbel to go wild, so “very poor performance” was too harsh a judgement.

Instead I find that Lewbel reduces the bias, but not all of it. The betas’ distributions center at roughly 1.12, which is closer to the true value than OLS.

I am also working on testing Klein & Vella in comparison to Lewbel. Though I start with the parametrized version of Farré, Klein & Vella (2010, EmpEconomics). I keep you updated.

• diffuseprior

Great. Keep me posted, and I will do likewise!

6. ajeya

Great, thanks for the post.

7. Catalina

Thanks a lot for your post. It is very helpful. In order to fully understand what the Lewbel IV does, I am trying to replicate the Stata results from ivreg2h for my data and cannot manage to get the same numbers. What would your code look like without the gmm black box? Have you already done that? Thanks a lot for your help!

• diffuseprior

Hi Catalina,

Thanks for the question. I am actually in the process of writing up some functions that perform the Lewbel calculation. I will email you and let you know when they are ready to go.

I am very busy at the moment, so it might be a while before I get around to it. However, it is worth pointing out that the Stata GMM estimator is not the GMM estimator that Lewbel describes in his paper. The Stata command is essentially a wrapper for ivreg2 with the gmm option, where the generated IVs are simply included as regular instruments, so it corresponds to the simple estimator proposed by Lewbel. The Stata routine ignores the uncertainty that comes from estimating the model with generated instruments, so its standard errors may be wrong (correct me if I am mistaken; I am also not sure how much difference this makes in practice). The upside is that you should be able to use the gmm or AER packages in R to produce estimates and standard errors comparable to the Stata version.
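For what it's worth, a rough sketch of that replication in R, under my reading of how ivreg2h constructs its instruments (the simulated data here are just placeholders for real data, and this is a sketch rather than an exact reproduction of the Stata routine):

```r
library(AER)   # attaches lmtest and sandwich for coeftest()/vcovHC()

# toy data with the same flavour as the simulation in the post
set.seed(2)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n); u <- rnorm(n)
e1 <- u + exp(x1) * rnorm(n)
e2 <- u + exp(-x1) * rnorm(n)
y1 <- 1 + x1 + x2 + e2
y2 <- 1 + x1 + x2 + y1 + e1

# generated instruments: demeaned exogenous variables times
# the first-stage residuals
res1 <- residuals(lm(y1 ~ x1 + x2))
z1 <- (x1 - mean(x1)) * res1
z2 <- (x2 - mean(x2)) * res1

# 2SLS with the generated instruments; heteroskedasticity-robust SEs
fit <- ivreg(y2 ~ x1 + x2 + y1 | x1 + x2 + z1 + z2)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))
```

As noted above, these standard errors do not account for the instruments being estimated, so treat them with the appropriate caution.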

Send me an email if you need any further help.

• Catalina

Thanks a lot for your reply, it is really helpful! I am now trying to create a code in R. I would like to be able to play around with the parameters to see how the bias changes and up to what extent Lewbel works or not with credible assumptions for my data. I will take into account the gmm or AER packages for getting the right SE. I will let you know if I have questions and if I get something interesting! Thanks again!