The ivlewbel Package. A new way to Tackle Endogenous Regressor Models.

In April 2012, I wrote this blog post demonstrating an approach proposed in Lewbel (2012) that identifies endogenous regressor coefficients in a linear triangular system. Now I am happy to announce the release of the ivlewbel package, which contains a function through which Lewbel’s method can be applied in R. This package is now available to download on the CRAN.

Please see the example from the previous blog post replicated in the below. Additionally, it would be very helpful if people could comment on bugs and additional features they would like to add to the package. My contact details are in the about section of the blog.



beta1 <- beta2 <- NULL
for(k in 1:500){
  #generate data (including intercept)
  x1 <- rnorm(1000,0,1)
  x2 <- rnorm(1000,0,1)
  u <- rnorm(1000,0,1)
  s1 <- rnorm(1000,0,1)
  s2 <- rnorm(1000,0,1)
  ov <- rnorm(1000,0,1)
  e1 <- u + exp(x1)*s1 + exp(x2)*s1
  e2 <- u + exp(-x1)*s2 + exp(-x2)*s2
  y1 <- 1 + x1 + x2 + ov + e2 
  y2 <- 1 + x1 + x2 + y1 + 2*ov + e1
  x3 <- rep(1,1000)
  dat <- data.frame(y1,y2,x3,x1,x2)
  #record ols estimate
  beta1 <- c(beta1,coef(lm(y2~x1+x2+y1))[4])
  #init values for iv-gmm
  beta2 <- c(beta2,lewbel(formula = y2 ~ y1 | x1 + x2 | x1 + x2, data = dat)$coef.est[1,1])

d <- data.frame(rbind(cbind(beta1,"OLS"),cbind(beta2,"IV-GMM")))
d$beta1 <- as.numeric(as.character(d$beta1))$beta1, d$V2,xlab=("Endogenous Coefficient"))
title("Lewbel and OLS Estimates")
legend("topright", levels(d$V2),lty=c(1,2,3),col=c(2,3,4),bty="n")

IV Estimates via GMM with Clustering in R

In econometrics, generalized method of moments (GMM) is one estimation methodology that can be used to calculate instrumental variable (IV) estimates. Performing this calculation in R, for a linear IV model, is trivial. One simply uses the gmm() function in the excellent gmm package like an lm() or ivreg() function. The gmm() function will estimate the regression and return model coefficients and their standard errors. An interesting feature of this function, and GMM estimators in general, is that they contain a test of over-identification, often dubbed Hansen’s J-test, as an inherent feature. Therefore, in cases where the researcher is lucky enough to have more instruments than endogenous regressors, they should examine this over-identification test post-estimation.

While the gmm() function in R is very flexible, it does not (yet) allow the user to estimate a GMM model that produces standard errors and an over-identification test that is corrected for clustering. Thankfully, the gmm() function is flexible enough to allow for a simple hack that works around this small shortcoming. For this, I have created a function called gmmcl(), and you can find the code below. This is a function for a basic linear IV model. This code uses the gmm() function to estimate both steps in a two-step feasible GMM procedure. The key to allowing for clustering is to adjust the weights matrix after the second step. Interested readers can find more technical details regarding this approach here. After defining the function, I show a simple application in the code below.

gmmcl = function(formula1, formula2, data, cluster){
  library(plyr) ; library(gmm)
  # create data.frame
  data$id1 = 1:dim(data)[1]
  formula3 = paste(as.character(formula1)[3],"id1", sep=" + ")
  formula4 = paste(as.character(formula1)[2], formula3, sep=" ~ ")
  formula4 = as.formula(formula4)
  formula5 = paste(as.character(formula2)[2],"id1", sep=" + ")
  formula6 = paste(" ~ ", formula5, sep=" ")
  formula6 = as.formula(formula6)
  frame1 = model.frame(formula4, data)
  frame2 = model.frame(formula6, data)
  dat1 = join(data, frame1, type="inner", match="first")
  dat2 = join(dat1, frame2, type="inner", match="first")
  # matrix of instruments
  Z1 = model.matrix(formula2, dat2)
  # step 1
  gmm1 = gmm(formula1, formula2, data = dat2, 
             vcov="TrueFixed", weightsMatrix = diag(dim(Z1)[2]))
  # clustering weight matrix
  cluster = factor(dat2[,cluster])
  u = residuals(gmm1)
  estfun = sweep(Z1, MARGIN=1, u,'*')
  u = apply(estfun, 2, function(x) tapply(x, cluster, sum))  
  S = 1/(length(residuals(gmm1)))*crossprod(u)
  # step 2
  gmm2 = gmm(formula1, formula2, data=dat2, 
             vcov="TrueFixed", weightsMatrix = solve(S))

# generate data.frame
n = 100
z1 = rnorm(n)
z2 = rnorm(n)
x1 = z1 + z2 + rnorm(n)
y1 = x1 + rnorm(n)
id = 1:n

data = data.frame(z1 = c(z1, z1), z2 = c(z2, z2), x1 = c(x1, x1),
                  y1 = c(y1, y1), id = c(id, id))

summary(gmmcl(y1 ~ x1, ~ z1 + z2, data = data, cluster = "id"))

Detecting Weak Instruments in R

Weak Instruments

Weak Instruments

Any instrumental variables (IV) estimator relies on two key assumptions in order to identify causal effects:

  1. That the excluded instrument or instruments only effect the dependent variable through their effect on the endogenous explanatory variable or variables (the exclusion restriction),
  2. That the correlation between the excluded instruments and the endogenous explanatory variables is strong enough to permit identification.

The first assumption is difficult or impossible to test, and shear belief plays a big part in what can be perceived to be a good IV. An interesting paper was published last year in the Review of Economics and Statistics by Conley, Hansen, and Rossi (2012), wherein the authors provide a Bayesian framework that permits researchers to explore the consequences of relaxing exclusion restrictions in a linear IV estimator. It will be interesting to watch research on this topic expand in the coming years.

Fortunately, it is possible to quantitatively measure the strength of the relationship between the IVs and the endogenous variables. The so-called weak IV problem was underlined in paper by Bound, Jaeger, and Baker (1995). When the relationship between the IVs and the endogenous variable is not sufficiently strong, IV estimators do not correctly identify causal effects.

The Bound, Jaeger, and Baker paper represented a very important contribution to the econometrics literature. As a result of this paper, empirical studies that use IV almost always report some measure of the instrument strength. A secondary result of this paper was the establishment of a literature that evaluates different methods of testing for weak IVs. Staiger and Stock (1997) furthered this research agenda, formalizing the relevant asymptotic theory and recommending the now ubiquitous “rule-of-thumb” measure: a first-stage partial-F test of less than 10 indicates the presence of weak instruments.

In the code below, I have illustrated how one can perform these partial F-tests in R. The importance of clustered standard errors has been highlighted on this blog before, so I also show how the partial F-test can be performed in the presence of clustering (and heteroskedasticity too). To obtain the clustered variance-covariance matrix, I have adapted some code kindly provided by Ian Gow. For completeness, I have displayed the clustering function at the end of the blog post.

# load packages
library(AER) ; library(foreign) ; library(mvtnorm)
# clear workspace and set seed

# number of observations
n = 1000
# simple triangular model:
# y2 = b1 + b2x1 + b3y1 + e
# y1 = a1 + a2x1 + a3z1 + u
# error terms (u and e) correlate 
Sigma = matrix(c(1,0.5,0.5,1),2,2)
ue = rmvnorm(n, rep(0,2), Sigma)
# iv variable
z1 = rnorm(n)
x1 = rnorm(n)
y1 = 0.3 + 0.8*x1 - 0.5*z1 + ue[,1]
y2 = -0.9 + 0.2*x1 + 0.75*y1 +ue[,2]
# create data
dat = data.frame(z1, x1, y1, y2)

# biased OLS
lm(y2 ~ x1 + y1, data=dat)
# IV (2SLS)
ivreg(y2 ~ x1 + y1 | x1 + z1, data=dat)

# do regressions for partial F-tests
# first-stage:
fs = lm(y1 ~ x1 + z1, data = dat)
# null first-stage (i.e. exclude IVs):
fn = lm(y1 ~ x1, data = dat)

# simple F-test
waldtest(fs, fn)$F[2]
# F-test robust to heteroskedasticity
waldtest(fs, fn, vcov = vcovHC(fs, type="HC0"))$F[2]

# now lets get some F-tests robust to clustering

# generate cluster variable
dat$cluster = 1:n
# repeat dataset 10 times to artificially reduce standard errors
dat = dat[rep(seq_len(nrow(dat)), 10), ]

# re-run first-stage regressions
fs = lm(y1 ~ x1 + z1, data = dat)
fn = lm(y1 ~ x1, data = dat)

# simple F-test
waldtest(fs, fn)$F[2]
# ~ 10 times higher!
# F-test robust to clustering
waldtest(fs, fn, vcov = clusterVCV(dat, fs, cluster1="cluster"))$F[2]
# ~ 10 times lower than above (good)

Further “rule-of-thumb” measures are provided in a paper by Stock and Yogo (2005) and it should be noted that whole battery of weak-IV tests exist (for example, see the Kleinberg-Paap rank Wald F-statistic and Anderson-Rubin Wald test) and one should perform these tests if the presence of weak instruments represents a serious concern. 

# R function adapted from Ian Gows' webpage:
clusterVCV <- function(data, fm, cluster1, cluster2=NULL) {
  # Calculation shared by covariance estimates <- estfun(fm)
  inc.obs <- complete.cases(data[,names(fm$model)])
  # Shared data for degrees-of-freedom corrections
  N  <- dim(fm$model)[1]
  K  <- fm$rank
  # Calculate the sandwich covariance estimate
  cov <- function(cluster) {
    cluster <- factor(cluster)
    # Calculate the "meat" of the sandwich estimators
    u <- apply(, 2, function(x) tapply(x, cluster, sum))
    meat <- crossprod(u)/N
    # Calculations for degrees-of-freedom corrections, followed 
    # by calculation of the variance-covariance estimate.
    # NOTE: NROW/N is a kluge to address the fact that sandwich uses the
    # wrong number of rows (includes rows omitted from the regression).
    M <- length(levels(cluster))
    dfc <- M/(M-1) * (N-1)/(N-K)
    dfc * NROW/N * sandwich(fm, meat=meat)
  # Calculate the covariance matrix estimate for the first cluster.
  cluster1 <- data[inc.obs,cluster1]
  cov1  <- cov(cluster1)
  if(is.null(cluster2)) {
    # If only one cluster supplied, return single cluster
    # results
  } else {
    # Otherwise do the calculations for the second cluster
    # and the "intersection" cluster.
    cluster2 <- data[inc.obs,cluster2]
    cluster12 <- paste(cluster1,cluster2, sep="")
    # Calculate the covariance matrices for cluster2, the "intersection"
    # cluster, then then put all the pieces together.
    cov2   <- cov(cluster2)
    cov12  <- cov(cluster12)
    covMCL <- (cov1 + cov2 - cov12)
    # Return the output of coeftest using two-way cluster-robust
    # standard errors.

Endogenous Spatial Lags for the Linear Regression Model

Over the past number of years, I have noted that spatial econometric methods have been gaining popularity. This is a welcome trend in my opinion, as the spatial structure of data is something that should be explicitly included in the empirical modelling procedure. Omitting spatial effects assumes that the location co-ordinates for observations are unrelated to the observable characteristics that the researcher is trying to model. Not a good assumption, particularly in empirical macroeconomics where the unit of observation is typically countries or regions.

Starting out with the prototypical linear regression model: y = X \beta + \epsilon, we can modify this equation in a number of ways to account for the spatial structure of the data. In this blog post, I will concentrate on the spatial lag model. I intend to examine spatial error models in a future blog post.

The spatial lag model is of the form: y= \rho W y + X \beta + \epsilon, where the term \rho W y measures the potential spill-over effect that occurs in the outcome variable if this outcome is influenced by other unit’s outcomes, where the location or distance to other observations is a factor in for this spill-over. In other words, the neighbours for each observation have greater (or in some cases less) influence to what happens to that observation, independent of the other explanatory variables (X). The W matrix is a matrix of spatial weights, and the \rho parameter measures the degree of spatial correlation. The value of \rho is bounded between -1 and 1. When \rho is zero, the spatial lag model collapses to the prototypical linear regression model.

The spatial weights matrix should be specified by the researcher. For example, let us have a dataset that consists of 3 observations, spatially located on a 1-dimensional Euclidean space wherein the first observation is a neighbour of the second and the second is a neighbour of the third. The spatial weights matrix for this dataset should be a 3 \times 3 matrix, where the diagonal consists of 3 zeros (you are not a neighbour with yourself). Typically, this matrix will also be symmetric. It is also at the user’s discretion to choose the weights in W. Common schemes include nearest k neighbours (where k is again at the users discretion), inverse-distance, and other schemes based on spatial contiguities. Row-standardization is usually performed, such that all the row elements in W sum to one. In our simple example, the first row of a contiguity-based scheme would be: [0, 1, 0]. The second: [0.5, 0, 0.5]. And the third: [0, 1, 0].

While the spatial-lag model represents a modified version of the basic linear regression model, estimation via OLS is problematic because the spatially lagged variable (Wy) is endogenous. The endogeneity results from what Charles Manski calls the ‘reflection problem’: your neighbours influence you, but you also influence your neighbours. This feedback effect results in simultaneity which renders bias on the OLS estimate of the spatial lag term. A further problem presents itself when the independent variables (X) are themselves spatially correlated. In this case, completely omitting the spatial lag from the model specification will bias the \beta coefficient values due to omitted variable bias.

Fortunately, remedying these biases is relatively simple, as a number of estimators exist that will yield an unbiased estimate of the spatial lag, and consequently the coefficients for the other explanatory variables—assuming, of course, that these explanatory variables are themselves exogenous. Here, I will consider two: the Maximum Likelihood estimator (denoted ML) as described in Ord (1975), and a generalized two-stage least squares regression model (2SLS) wherein spatial lags, and spatial lags lags (i.e. W^{2} X) of the explanatory variables are used as instruments for Wy. Alongside these two models, I also estimate the misspecified OLS both without (OLS1) and with (OLS2) the spatially lagged dependent variable.

To examine the properties of these four estimators, I run a Monte Carlo experiment. First, let us assume that we have 225 observations equally spread over a 15 \times 15 lattice grid. Second, we assume that neighbours are defined by what is known as the Rook contiguity, so a neighbour exists if they are in the grid space either above or below or on either side. Once we create the spatial weight matrix we row-standardize.

Taking our spatial weights matrix as defined, we want to simulate the following linear model: y = \rho Wy + \beta_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \epsilon, where we set \rho=0.4 , \beta_{1}=0.5, \beta_{2}=-0.5, \beta_{3}=1.75. The explanatory variables are themselves spatially autocorrelated, so our simulation procedure first simulates a random normal variable for both x_{2} and x_{3} from: N(0, 1), then assuming a autocorrelation parameter of \rho_{x}=0.25, generates both variables such that: x_{j} = (1-\rho_{x}W)^{-1} N(0, 1) for j \in \{ 1,2 \}. In the next step we simulate the error term \epsilon. We introduce endogeneity into the spatial lag by assuming that the error term \epsilon is a function of a random normal v so \epsilon = \alpha v + N(0, 1) where v = N(0, 1) and \alpha=0.2, and that the spatial lag term includes v. We modify the regression model to incorporate this: y = \rho (Wy + v) + \beta_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \epsilon. From this we can calculate the reduced form model: y = (1 - \rho W)^{-1} (\rho v + \beta_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \epsilon), and simulate values for our dependent variable y.

Performing 1,000 repetitions of the above simulation permits us to examine the distributions of the coefficient estimates produced by the four models outlined in the above. The distributions of these coefficients are displayed in the graphic in the beginning of this post. The spatial autocorrelation parameter \rho is in the bottom-right quadrant. As we can see, the OLS model that includes the spatial effect but does not account for simultaneity (OLS2) over-estimates the importance of spatial spill-overs. Both the ML and 2SLS estimators correctly identify the \rho parameter. The remaining quadrants highlight what happens to the coefficients of the explanatory variables. Clearly, the OLS1 estimator provides the worst estimate of these coefficients. Thus, it appears preferable to use OLS2, with the biased autocorrelation parameter, than the simpler OLS1 estimator. However, the OLS2 estimator also yields biased parameter estimates for the \beta coefficients. Furthermore, since researchers may want to know the marginal effects in spatial equilibrium (i.e. taking into account the spatial spill-over effects) the overestimated \rho parameter creates an additional bias.

To perform these calculations I used the spdep package in R, with the graphic created via ggplot2. Please see the R code I used in the below.

library(spdep) ; library(ggplot2) ; library(reshape)

n = 225
data = data.frame(n1=1:n)
# coords
data$lat = rep(1:sqrt(n), sqrt(n))
data$long = sort(rep(1:sqrt(n), sqrt(n)))
# create W matrix
wt1 = as.matrix(dist(cbind(data$long, data$lat), method = "euclidean", upper=TRUE))
wt1 = ifelse(wt1==1, 1, 0)
diag(wt1) = 0
# row standardize
rs = rowSums(wt1)
wt1 = apply(wt1, 2, function(x) x/rs)
lw1 = mat2listw(wt1, style="W")

rx = 0.25
rho = 0.4
b1 = 0.5
b2 = -0.5
b3 = 1.75
alp = 0.2
inv1 = invIrW(lw1, rho=rx, method="solve", feasible=NULL)
inv2 = invIrW(lw1, rho=rho, method="solve", feasible=NULL)

sims = 1000
beta1results = matrix(NA, ncol=4, nrow=sims)
beta2results = matrix(NA, ncol=4, nrow=sims)
beta3results = matrix(NA, ncol=4, nrow=sims)
rhoresults = matrix(NA, ncol=3, nrow=sims)

for(i in 1:sims){
  u1 = rnorm(n)
  x2 = inv1 %*% u1
  u2 = rnorm(n)
  x3 = inv1 %*% u2
  v1 = rnorm(n)
  e1 = alp*v1 + rnorm(n)  
  data1 = data.frame(cbind(x2, x3),lag.listw(lw1, cbind(x2, x3)))
  names(data1) = c("x2","x3","wx2","wx3")
  data1$y1 = inv2 %*% (b1 + b2*x2 + b3*x3 + rho*v1 + e1)
  data1$wy1 = lag.listw(lw1, data1$y1)
  data1$w2x2 = lag.listw(lw1, data1$wx2)
  data1$w2x3 = lag.listw(lw1, data1$wx3)
  data1$w3x2 = lag.listw(lw1, data1$w2x2)
  data1$w3x3 = lag.listw(lw1, data1$w2x3)

  m1 = coef(lm(y1 ~ x2 + x3, data1))
  m2 = coef(lm(y1 ~ wy1 + x2 + x3, data1))
  m3 = coef(lagsarlm(y1 ~ x2 + x3, data1, lw1))
  m4 = coef(stsls(y1 ~ x2 + x3, data1, lw1))
  beta1results[i,] = c(m1[1], m2[1], m3[2], m4[2])
  beta2results[i,] = c(m1[2], m2[3], m3[3], m4[3])
  beta3results[i,] = c(m1[3], m2[4], m3[4], m4[4])
  rhoresults[i,] = c(m2[2],m3[1], m4[1])  

apply(rhoresults, 2, mean) ; apply(rhoresults, 2, sd)
apply(beta1results, 2, mean) ; apply(beta1results, 2, sd)
apply(beta2results, 2, mean) ; apply(beta2results, 2, sd)
apply(beta3results, 2, mean) ; apply(beta3results, 2, sd)

colnames(rhoresults) = c("OLS2","ML","2SLS")
colnames(beta1results) = c("OLS1","OLS2","ML","2SLS")
colnames(beta2results) = c("OLS1","OLS2","ML","2SLS")
colnames(beta3results) = c("OLS1","OLS2","ML","2SLS")

rhoresults = melt(rhoresults)
rhoresults$coef = "rho"
rhoresults$true = 0.4

beta1results = melt(beta1results)
beta1results$coef = "beta1"
beta1results$true = 0.5

beta2results = melt(beta2results)
beta2results$coef = "beta2"
beta2results$true = -0.5

beta3results = melt(beta3results)
beta3results$coef = "beta3"
beta3results$true = 1.75

data = rbind(rhoresults,beta1results,beta2results,beta3results)
data$Estimator = data$X2

ggplot(data, aes(x=value, colour=Estimator, fill=Estimator)) + 
  geom_density(alpha=.3) + 
  facet_wrap(~ coef, scales= "free") +
  geom_vline(aes(xintercept=true)) + 
  scale_y_continuous("Density") +
  scale_x_continuous("Effect Size") +
  opts(legend.position = 'bottom', legend.direction = 'horizontal')

The Frisch–Waugh–Lovell Theorem for Both OLS and 2SLS

The Frisch–Waugh–Lovell (FWL) theorem is of great practical importance for econometrics. FWL establishes that it is possible to re-specify a linear regression model in terms of orthogonal complements. In other words, it permits econometricians to partial out right-hand-side, or control, variables. This is useful in a variety of settings. For example, there may be cases where a researcher would like to obtain the effect and cluster-robust standard error from a model that includes many regressors, and therefore a computationally infeasible variance-covariance matrix.

Here are a number of practical examples. The first just takes a simple linear regression model, with two regressors: x1 and x2. To partial out the coefficients on the constant term and x2, we first regress x2 on y1 and save the residuals. We then regress x2 on x1 and save the residuals. The final stage regresses the second residuals on the first. The following code illustrates how one can obtain an identical coefficient on x1 by applying the FWL theorem.

x1 = rnorm(100)
x2 = rnorm(100)
y1 = 1 + x1 - x2 + rnorm(100)

r1 = residuals(lm(y1 ~ x2))
r2 = residuals(lm(x1 ~ x2))
# ols
coef(lm(y1 ~ x1 + x2))
# fwl ols
coef(lm(r1 ~ -1 + r2))

FWL is also relevant for all linear instrumental variable (IV) estimators. Here, I will show how this extends to the 2SLS estimator, where slightly more work is required compared to the OLS example in the above. Here we have a matrix of instruments (Z), exogenous variables (X), and endogenous variables (Y1). Let us imagine we want the coefficient on one endogenous variable y1. In this case we can apply FWL as follows. Regress X on each IV in Z in separate regressions, saving the residuals. Then regress X on y1, and X on y2, saving the residuals for both. In the last stage, perform a two-stage-least-squares regression of the X on y2 residuals on the X on y2 residuals using the residuals from X on each Z as instruments. An example of this is shown in the below code.


ov = rnorm(100)
z1 = rnorm(100) 
z2 = rnorm(100)
y1 = rnorm(100) + z1 + z2 + 1.5*ov
x1 = rnorm(100) + 0.5*z1 - z2
x2 = rnorm(100) 
y2 = 1 + y1 - x1 + 0.3*x2 + ov + rnorm(100)

r1 = residuals(lm(z1 ~ x1 + x2))
r2 = residuals(lm(z2 ~ x1 + x2))
r3 = residuals(lm(y1 ~ x1 + x2))
r4 = residuals(lm(y2 ~ x2 + x2))

# biased coef on y1 as expected for ols
# 2sls
# fwl 2sls

The FWL can also be extended to cases where there are multiple endogenous variables. I have demonstrated this case by extending the above example to model x1 as an endogenous variable.

# 2 endogenous variables
r5 = residuals(lm(z1 ~ x2))
r6 = residuals(lm(z2 ~ x2))
r7 = residuals(lm(y1 ~ x2))
r8 = residuals(lm(x1 ~ x2))
r9 = residuals(lm(y2 ~ x2))

# 2sls coefficients
p1 = fitted.values(lm(y1~z1+z2+x2))
p2 = fitted.values(lm(x1~z1+z2+x2))
lm(y2 ~ p1 + p2 + x2)

# 2sls fwl coefficients
p3 = fitted.values(lm(r7~-1+r5+r6))
p4 = fitted.values(lm(r8~-1+r5+r6))
lm(r9 ~ p3 + p4)

Minimizing Bias in Observational Studies

Measuring the effect of a binary treatment on a measured outcome is one of the most common tasks in applied statistics. Examples of these applications abound, like the effect of smoking on health, or the effect of low birth weight on cognitive development. In an ideal world we would like to be able to assign one group of people to receive some form of treatment, and an identical group to not receive this treatment. In this world, the average treatment effect (ATE) is the difference in outcomes between the treatment and control groups.

However, the real world is far from ideal, and the problematic nature of measuring causal effects has (justifiably) spawned a wide literature. Many solutions have been proposed to this problem. The simplest one involves controlling for the pre-treatment characteristics defining both the treatment and control groups. For example, in the case of low birth weight, a researcher may want to adjust for variation in the parent’s socioeconomic status, as one would expect that the ‘treatment’ of low birth weight not to be randomly assigned amongst different socioeconomic strata. Once controls for various confounding variables are introduced, it is then feasible to measure the causal effect of the treatment. Methods that perform this adjustment include simple linear regression modelling, or various forms of matching estimators.

The key assumption in the above is that the researcher is able to separate out differences between the treatment and control groups based on observable characteristics (selection on observables). However, there are many cases, especially in the social sciences, where it is not unreasonable to suspect this assumption does not hold (selection based on the unobservable). In this scenario, the use of instrumental variable estimators represents a viable solution.

Unfortunately, suitable instrumental variables are commonly not available to researchers. In this instance is there anything a researcher can do? Yes, according to a recently published paper by Daniel L. Millimet and Rusty Tchernis. Millimet and Tchernis examine this problem from the point of view of a researcher trying to minimize the bias associated with selection on unobservables. Their paper demonstrates how the bias of ATE can be derived from a regression model of the probability of treatment on observable characteristics. Using this regression model, it is possible to find a bias minimizing propensity score (P*). Once this score is calculated, the researcher is able to estimate a bias reduced ATE by trimming observations outside a pre-specified neighborhood around P*.

Millimet and Tchernis propose a number of estimators that one can use to produce bias minimized ATEs. Those interested in potentially using this bias minimization strategy should refer to their paper for a more detailed examination of these estimators, and their potential uses/misuses. This paper also features a neat empirical application that suggests why their approach might be better than the more conventional methods.

In the below, I have supplied an image summarizing the output produced from a Monte Carlo exercise to highlight the efficacy of the bias corrected approach. Those interested in design of this experiment should look at the Millimet and Tchernis paper, since I have simply replicated their MC design (250 datasets with 5,000 observations in each), with the assumption that the correlation between the treatment equation error term and outcome equation is −0.6 (so a really strong correlation). Additionally, I have set the trimming parameter (theta) to 0.05, so at least 5% of the treatment and 5% of the control group are contained in the trimmed sample.

Note that full descriptions of the acronyms in this image can be found in the paper, although MB.BC and IPW.BC refer to the bias corrected measures of the ATE, and from the image it is clear to see that these estimators are much closer to the true average treatment effect of 1, albeit with a higher variance. This simulation was conducted with R using a self-written function. I have benchmarked this function against Daniel Millimet’s STATA function, and the results are identical. I hope to possibly release this function as an R package in the future, although I would be happy to supply my function’s code to anyone who is interested.

Probit Models with Endogeneity

Dealing with endogeneity in a binary dependent variable model requires more consideration than the simpler continuous dependent variable case. For some, the best approach to this problem is to use the same methodology used in the continuous case, i.e. 2 stage least squares. Thus, the equation of interest becomes a linear probability model (LPM). The advantage of this approach is its simplicity; both in terms of estimation and interpretation (a 1-unit change in x causes a \beta change in the probability of y).

The disadvantage of this approach is that the LPM may imply probabilities outside the unit interval. It is typically for this reason that generalized linear models, like probit or logit, are used to model binary dependent variables in applied research, and an approach that extends the probit model to account for endogeneity was proposed by Rivers & Vuong (1988).

The Rivers & Vuong estimator assumes the following usual triangular system:

\textbf{y}_{2i}=1(\textbf{X}_{i}\boldsymbol{\beta}+\textbf{y}_{1i}\alpha+\boldsymbol{\epsilon}_{i}>0) (1),

\textbf{y}_{1i}=\textbf{X}_{i}\boldsymbol{\gamma}+\textbf{Z}_{i}\boldsymbol{\theta}+\textbf{v}_{i}) (2),

wherein the jointly normally distributed error terms correlate, the control variables are in \textbf{X}_{i} and the instrumental variables are in \textbf{Z}_{i}. A simple procedure that estimates this model is the following two-step approach. First regress \textbf{y}_{1i} on \textbf{X}_{i} and \textbf{Z}_{i}. Collect the residuals \hat{\textbf{v}}_{i}. The second step involves estimating the probit model of interest (2), and including the first stage residuals as an additional regressor. This method has also been termed the control function approach, as the inclusion of \hat{\textbf{v}}_{i} controls for the correlation between \textbf{v}_{i} and \boldsymbol{\epsilon}_{i}.

The second step coefficient estimates are scaled versions of their real values for reasons outlined here. It is possible to get the un-scaled values using the appropriate transformation. However, for analysis it is often more useful to know the Average Structural Function (ASF), as discussed in Blundell & Powell, 2004). The ASF can be seen as the policy response measure as it illustrates the probability of the dependent variable being a one given the values of the regressors, in the absence of endogeneity. Therefore, with the ASF it is possible to trace out the effects of changes in \textbf{y}_{1i} on the probability of \textbf{y}_{2i}. Additionally, it is possible to compute average partial/marginal effects with the ASF, as described here.

Estimating the ASF involves taking the average of the normal CDF transformed predictions where the scaled coefficients are multiplied by all of the variables at some fixed value (commonly the mean) and all of the first-stage residuals are multiplied by the relevant scaled coefficient. This calculation is formalized on page 7 of this document.

To simulate the effect of changes in the endogenous variable, the ASF can be recursively estimated with different values of \textbf{y}_{1i}. I have attached an example of how this calculation can be performed for a simple simulation in R. It would also be possible to construct confidence intervals for this ASF using bootstrapping methods. As is evident from the plot, the control function estimator correctly identifies the ASF.

n <- 10000
Sigma <- matrix(c(1,0.75,0.75,1),2,2)
u <- rmvnorm(n,rep(0,2),Sigma)

x1 <- rnorm(n)
x2 <- rnorm(n)
y1 <- 1.5 + 2*x1 - 2*x2 + u[,1] 
y2 <- ifelse(-0.25 - 1.25*x1 - 0.5*y1 + u[,2] > 0, 1, 0)

#true asf
eq1 <- function(x1,y1){pnorm(-0.25 - 1.25*x1 - 0.5*y1)}
data <- data.frame(cbind(mean(x1),seq(ceiling(min(y1)),floor(max(y1)),0.2)))
names(data) <- c("x1","y1")
data$asf <- eq1(data$x1,data$y1)

# naive probit: biased regression 
r1 <- glm(y2~x1+y1,binomial(link="probit"))
dat1 <- data.frame(cbind(mean(x1),seq(ceiling(min(y1)),floor(max(y1)),0.2)))
names(dat1) <- c("x1","y1")
asf1 <- cbind(dat1$y1,pnorm(predict(r1,dat1)))

# 2 step control function approach
v1 <- (residuals(lm(y1~x1+x2)))/sd(residuals(lm(y1~x1+x2)))
r2 <- glm(y2~x1+y1+v1,binomial(link="probit"))
# proceedure to get asf
asf2 <- cbind(seq(ceiling(min(y1)),floor(max(y1)),0.2),NA)
for(i in 1:dim(asf2)[1]){
  dat2 <- data.frame(cbind(mean(x1),asf2[i,1],v1))
  names(dat2) <- c("x1","y1","v1")
  asf2[i,2] <- mean(pnorm(predict(r2,dat2)))

# get respective asfs and plot
plotdat <- data.frame(rbind(cbind(data$y1,data$asf,"TRUE ASF"),
                            cbind(data$y1,asf2[,2],"2 STEP PROBIT")))
names(plotdat) <- c("Y1","Y2","Estimator")
plotdat$Y1 <- as.numeric(as.character(plotdat$Y1))
plotdat$Y2 <- as.numeric(as.character(plotdat$Y2))

ggplot(plotdat, aes(x=Y1, y=Y2, colour = Estimator, group=Estimator)) + 
  geom_line(size=0.8) + geom_point()+
  scale_x_continuous('Y1') +
  scale_y_continuous('P(Y2)') +
  theme_bw() +
  opts(title = expression("Average Structural Function Comparison"),legend.position=c(0.8,0.8))