An ivreg2 function for R

The ivreg2 command is one of the most popular routines in Stata. The reason for this popularity is its simplicity. A one-line ivreg2 command generates not only the instrumental variable regression coefficients and their standard errors, but also a number of other statistics of interest.

I have come across a number of functions in R that calculate instrumental variable regressions. However, none appear to (and correct me if I am wrong) offer an output similar to the ivreg2 command in Stata. The function below is my first attempt to replicate Stata’s ivreg2.

ivreg2(form,endog,iv,data,digits)

There are four required arguments. The ‘form’ argument is the second stage regression, written in the same manner as any regression model in R. The ‘endog’ argument is a character object with the name of the endogenous variable. The user should specify the instrumental variable(s) with the ‘iv’ argument. These instruments should be contained in ‘data’ – a data frame object. Note, the function in its current state only allows of one endogenous variable (which is usually more than enough for the researcher to contend with). Furthermore, make sure that there are no ‘NA’ values in the data frame being passed through the function.

This function performs a 2SLS regression calculating the usual regression output, a weak identification F-statistic, the Wu-Hausman test of endogeneity, and, in the case where there is more than one-instrument, a Sargan test. The weak identification statistic is used to determine whether the instrument(s) is(are) sufficiently correlated with the endogenous variable of interest. The ‘rule-of-thumb’ critical statistic here is ten. A Wu-Hausman test examines the difference between the IV and OLS coefficients. Rejecting the null hypothesis indicates the presence of endogeneity. Finally, the Sargan over-identification test is used in the cases where there are more instruments than endogenous regressors. A rejection of the null in this test means that the instruments are not exclusively affecting the outcome of interest though the endogenous variable.

The code for this function, alongside an example with the well known Mroz data, is shown below.

> mroz <- read.dta("mroz.dta")
> mroz <- mroz[,c("hours","lwage","educ","age","kidslt6","kidsge6","nwifeinc","exper")]
> ivreg2(form=hours ~ lwage + educ + age + kidslt6 + kidsge6 + nwifeinc,
+       endog="lwage",iv=c("exper"),data=na.omit(mroz))
$results
                Coef    S.E. t-stat p-val
(Intercept) 2478.435 655.207  3.783 0.000
lwage       1772.323 594.185  2.983 0.003
educ        -201.187  69.910 -2.878 1.996
age          -11.229  10.537 -1.066 1.713
kidslt6     -191.659 195.761 -0.979 1.672
kidsge6      -37.732  63.635 -0.593 1.447
nwifeinc      -9.978   7.174 -1.391 1.836

$weakidtest
     First Stage F-test
[1,]             12.965

$endogeneity
     Wu-Hausman F-test p-val
[1,]             36.38     0

$overid
     Sargan test of over-identifying restrictions 
[1,] "No test performed. Model is just identified"
ivreg2 <- function(form,endog,iv,data,digits=3){
  # library(MASS)
  # model setup
  r1 <- lm(form,data)
  y <- r1$fitted.values+r1$resid
  x <- model.matrix(r1)
  aa <- rbind(endog == colnames(x),1:dim(x)[2])  
  z <- cbind(x[,aa[2,aa[1,]==0]],data[,iv])  
  colnames(z)[(dim(z)[2]-length(iv)+1):(dim(z)[2])] <- iv  
  # iv coefficients and standard errors
  z <- as.matrix(z)
  pz <- z %*% (solve(crossprod(z))) %*% t(z)
  biv <- solve(crossprod(x,pz) %*% x) %*% (crossprod(x,pz) %*% y)
  sigiv <- crossprod((y - x %*% biv),(y - x %*% biv))/(length(y)-length(biv))
  vbiv <- as.numeric(sigiv)*solve(crossprod(x,pz) %*% x)
  res <- cbind(biv,sqrt(diag(vbiv)),biv/sqrt(diag(vbiv)),(1-pnorm(biv/sqrt(diag(vbiv))))*2)
  res <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),res)),nrow=dim(res)[1])
  rownames(res) <- colnames(x)
  colnames(res) <- c("Coef","S.E.","t-stat","p-val")
  # First-stage F-test
  y1 <- data[,endog]
  z1 <- x[,aa[2,aa[1,]==0]]
  bet1 <- solve(crossprod(z)) %*% crossprod(z,y1)
  bet2 <- solve(crossprod(z1)) %*% crossprod(z1,y1)
  rss1 <- sum((y1 - z %*% bet1)^2)
  rss2 <- sum((y1 - z1 %*% bet2)^2)
  p1 <- length(bet1)
  p2 <- length(bet2)
  n1 <- length(y)
  fs <- abs((rss2-rss1)/(p2-p1))/(rss1/(n1-p1))
  firststage <- c(fs)
  firststage <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),firststage)),ncol=length(firststage))
  colnames(firststage) <- c("First Stage F-test")
  # Hausman tests
  bols <- solve(crossprod(x)) %*% crossprod(x,y) 
  sigols <- crossprod((y - x %*% bols),(y - x %*% bols))/(length(y)-length(bols))
  vbols <- as.numeric(sigols)*solve(crossprod(x))
  sigml <- crossprod((y - x %*% bols),(y - x %*% bols))/(length(y))
  x1 <- x[,!(colnames(x) %in% "(Intercept)")]
  z1 <- z[,!(colnames(z) %in% "(Intercept)")]
  pz1 <- z1 %*% (solve(crossprod(z1))) %*% t(z1)
  biv1 <- biv[!(rownames(biv) %in% "(Intercept)"),]
  bols1 <- bols[!(rownames(bols) %in% "(Intercept)"),]
  # Durbin-Wu-Hausman chi-sq test:
  # haus <- t(biv1-bols1) %*% ginv(as.numeric(sigml)*(solve(crossprod(x1,pz1) %*% x1)-solve(crossprod(x1)))) %*% (biv1-bols1)
  # hpvl <- 1-pchisq(haus,df=1)
  # Wu-Hausman F test
  resids <- NULL
  resids <- cbind(resids,y1 - z %*% solve(crossprod(z)) %*% crossprod(z,y1))
  x2 <- cbind(x,resids)
  bet1 <- solve(crossprod(x2)) %*% crossprod(x2,y)
  bet2 <- solve(crossprod(x)) %*% crossprod(x,y)
  rss1 <- sum((y - x2 %*% bet1)^2)
  rss2 <- sum((y - x %*% bet2)^2)
  p1 <- length(bet1)
  p2 <- length(bet2)
  n1 <- length(y)
  fs <- abs((rss2-rss1)/(p2-p1))/(rss1/(n1-p1))
  fpval <- 1-pf(fs, p1-p2, n1-p1)
  #hawu <- c(haus,hpvl,fs,fpval)
  hawu <- c(fs,fpval)
  hawu <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),hawu)),ncol=length(hawu))
  #colnames(hawu) <- c("Durbin-Wu-Hausman chi-sq test","p-val","Wu-Hausman F-test","p-val")
  colnames(hawu) <- c("Wu-Hausman F-test","p-val")  
  # Sargan Over-id test
  ivres <- y - (x %*% biv)
  oid <- solve(crossprod(z)) %*% crossprod(z,ivres)
  sstot <- sum((ivres-mean(ivres))^2)
  sserr <- sum((ivres - (z %*% oid))^2)
  rsq <- 1-(sserr/sstot)
  sargan <- length(ivres)*rsq
  spval <- 1-pchisq(sargan,df=length(iv)-1)
  overid <- c(sargan,spval)
  overid <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),overid)),ncol=length(overid))
  colnames(overid) <- c("Sargan test of over-identifying restrictions","p-val")
  if(length(iv)-1==0){
    overid <- t(matrix(c("No test performed. Model is just identified")))
    colnames(overid) <- c("Sargan test of over-identifying restrictions")
  }
  full <- list(results=res, weakidtest=firststage, endogeneity=hawu, overid=overid)
  return(full)
}

About these ads

4 thoughts on “An ivreg2 function for R

  1. Is there any valid readon for using solve() here? If you use qr(), then chol2inv() on the $qr component, you avoid numerical issues associated with variable scaling and aliasing too (if you check for rank). See also:

    @article{Rnews:Bates:2004,
    author = {Douglas Bates},
    title = {Least Squares Calculations in {R}},
    journal = {R News},
    year = 2004,
    volume = 4,
    number = 1,
    pages = {17–20},
    month = {June},
    url = {http://CRAN.R-project.org/doc/Rnews/},
    pdf = {http://CRAN.R-project.org/doc/Rnews/Rnews_2004-1.pdf}
    }

    • Hi Roger,

      Thanks for the comment. Other than ignorance, I did not have a valid reason for using of the solve() command.

      When I extend this function I will definitely take your advice, and be sure to read that paper you recommend.

      Kind regards.

  2. Look at the packages plm and Formula. They implement various form of supplying instrumental variables directly in the formula. Also I suggest splitting code for statistical test functions into separate functions. This is how usually it is implemented in R. You can print these statistics using print method, or summary. Look into print.summary.lm and summary.lm functions on how this is implemented.

    • As far as I am aware no function in plm returns the test statistics in a manner similar to ivreg2 in Stata. That is all I wanted to replicate with this simple function, which works OK (for me anyway).

      I understand that further development requires refinement, and, to this end, I appreciate your input. These comments are helpful.

      Best wishes.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s