An ivreg2 function for R

The ivreg2 command is one of the most popular routines in Stata. The reason for this popularity is its simplicity. A one-line ivreg2 command generates not only the instrumental variable regression coefficients and their standard errors, but also a number of other statistics of interest.

I have come across a number of functions in R that calculate instrumental variable regressions. However, none appear (and correct me if I am wrong) to offer output similar to the ivreg2 command in Stata. The function below is my first attempt to replicate Stata’s ivreg2.

ivreg2(form,endog,iv,data,digits)

There are four required arguments. The ‘form’ argument is the second-stage regression, written in the same manner as any regression model in R. The ‘endog’ argument is a character object with the name of the endogenous variable. The user should specify the instrumental variable(s) with the ‘iv’ argument. These instruments should be contained in ‘data’ – a data frame object. Note that the function in its current state only allows for one endogenous variable (which is usually more than enough for the researcher to contend with). Furthermore, make sure that there are no ‘NA’ values in the data frame passed to the function.

This function performs a 2SLS regression, calculating the usual regression output, a weak-identification F-statistic, the Wu-Hausman test of endogeneity, and, where there is more than one instrument, a Sargan test. The weak-identification statistic is used to determine whether the instrument(s) is (are) sufficiently correlated with the endogenous variable of interest; the ‘rule-of-thumb’ critical value here is ten. The Wu-Hausman test examines the difference between the IV and OLS coefficients; rejecting the null hypothesis indicates the presence of endogeneity. Finally, the Sargan over-identification test is used in cases where there are more instruments than endogenous regressors. A rejection of the null in this test means that the instruments are not affecting the outcome of interest exclusively through the endogenous variable.
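For comparison, recent versions of the AER package can produce the same three diagnostics (see the comment from Achim further down); a minimal sketch using the Mroz example that follows, assuming AER is installed:

library(AER)
# Cross-check with AER::ivreg; summary() gained a 'diagnostics'
# argument in later versions of the package.
fm <- ivreg(hours ~ lwage + educ + age + kidslt6 + kidsge6 + nwifeinc |
            educ + age + kidslt6 + kidsge6 + nwifeinc + exper,
            data = na.omit(mroz))
summary(fm, diagnostics = TRUE)   # weak instruments, Wu-Hausman, Sargan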

The code for this function, alongside an example using the well-known Mroz data, is shown below.

> library(foreign)  # needed for read.dta
> mroz <- read.dta("mroz.dta")
> mroz <- mroz[,c("hours","lwage","educ","age","kidslt6","kidsge6","nwifeinc","exper")]
> ivreg2(form=hours ~ lwage + educ + age + kidslt6 + kidsge6 + nwifeinc,
+       endog="lwage",iv=c("exper"),data=na.omit(mroz))
$results
                Coef    S.E. t-stat p-val
(Intercept) 2478.435 655.207  3.783 0.000
lwage       1772.323 594.185  2.983 0.003
educ        -201.187  69.910 -2.878 0.004
age          -11.229  10.537 -1.066 0.287
kidslt6     -191.659 195.761 -0.979 0.328
kidsge6      -37.732  63.635 -0.593 0.553
nwifeinc      -9.978   7.174 -1.391 0.164

$weakidtest
     First Stage F-test
[1,]             12.965

$endogeneity
     Wu-Hausman F-test p-val
[1,]             36.38     0

$overid
     Sargan test of over-identifying restrictions 
[1,] "No test performed. Model is just identified"
ivreg2 <- function(form,endog,iv,data,digits=3){
  # library(MASS)
  # model setup
  r1 <- lm(form,data)                 # OLS fit of the second-stage formula (for the design matrix)
  y <- r1$fitted.values+r1$residuals  # reconstruct the outcome vector
  x <- model.matrix(r1)
  aa <- rbind(endog == colnames(x),1:dim(x)[2])   # flag the endogenous column of x
  z <- cbind(x[,aa[2,aa[1,]==0]],data[,iv])       # instruments: included exogenous regressors plus excluded IVs
  colnames(z)[(dim(z)[2]-length(iv)+1):(dim(z)[2])] <- iv
  # iv coefficients and standard errors
  z <- as.matrix(z)
  pz <- z %*% (solve(crossprod(z))) %*% t(z)      # projection matrix onto the instrument space
  biv <- solve(crossprod(x,pz) %*% x) %*% (crossprod(x,pz) %*% y)   # 2SLS coefficients
  sigiv <- crossprod((y - x %*% biv),(y - x %*% biv))/(length(y)-length(biv))
  vbiv <- as.numeric(sigiv)*solve(crossprod(x,pz) %*% x)
  res <- cbind(biv,sqrt(diag(vbiv)),biv/sqrt(diag(vbiv)),2*(1-pnorm(abs(biv/sqrt(diag(vbiv))))))
  res <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),res)),nrow=dim(res)[1])
  rownames(res) <- colnames(x)
  colnames(res) <- c("Coef","S.E.","t-stat","p-val")
  # First-stage F-test
  y1 <- data[,endog]
  z1 <- x[,aa[2,aa[1,]==0]]
  bet1 <- solve(crossprod(z)) %*% crossprod(z,y1)
  bet2 <- solve(crossprod(z1)) %*% crossprod(z1,y1)
  rss1 <- sum((y1 - z %*% bet1)^2)
  rss2 <- sum((y1 - z1 %*% bet2)^2)
  p1 <- length(bet1)
  p2 <- length(bet2)
  n1 <- length(y)
  fs <- abs((rss2-rss1)/(p2-p1))/(rss1/(n1-p1))
  firststage <- c(fs)
  firststage <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),firststage)),ncol=length(firststage))
  colnames(firststage) <- c("First Stage F-test")
  # Hausman tests
  bols <- solve(crossprod(x)) %*% crossprod(x,y) 
  sigols <- crossprod((y - x %*% bols),(y - x %*% bols))/(length(y)-length(bols))
  vbols <- as.numeric(sigols)*solve(crossprod(x))
  sigml <- crossprod((y - x %*% bols),(y - x %*% bols))/(length(y))
  x1 <- x[,!(colnames(x) %in% "(Intercept)")]
  z1 <- z[,!(colnames(z) %in% "(Intercept)")]
  pz1 <- z1 %*% (solve(crossprod(z1))) %*% t(z1)
  biv1 <- biv[!(rownames(biv) %in% "(Intercept)"),]
  bols1 <- bols[!(rownames(bols) %in% "(Intercept)"),]
  # Durbin-Wu-Hausman chi-sq test:
  # haus <- t(biv1-bols1) %*% ginv(as.numeric(sigml)*(solve(crossprod(x1,pz1) %*% x1)-solve(crossprod(x1)))) %*% (biv1-bols1)
  # hpvl <- 1-pchisq(haus,df=1)
  # Wu-Hausman F test
  resids <- NULL
  resids <- cbind(resids,y1 - z %*% solve(crossprod(z)) %*% crossprod(z,y1))
  x2 <- cbind(x,resids)
  bet1 <- solve(crossprod(x2)) %*% crossprod(x2,y)
  bet2 <- solve(crossprod(x)) %*% crossprod(x,y)
  rss1 <- sum((y - x2 %*% bet1)^2)
  rss2 <- sum((y - x %*% bet2)^2)
  p1 <- length(bet1)
  p2 <- length(bet2)
  n1 <- length(y)
  fs <- abs((rss2-rss1)/(p2-p1))/(rss1/(n1-p1))
  fpval <- 1-pf(fs, p1-p2, n1-p1)
  #hawu <- c(haus,hpvl,fs,fpval)
  hawu <- c(fs,fpval)
  hawu <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),hawu)),ncol=length(hawu))
  #colnames(hawu) <- c("Durbin-Wu-Hausman chi-sq test","p-val","Wu-Hausman F-test","p-val")
  colnames(hawu) <- c("Wu-Hausman F-test","p-val")  
  # Sargan Over-id test
  ivres <- y - (x %*% biv)                        # 2SLS residuals
  oid <- solve(crossprod(z)) %*% crossprod(z,ivres)
  sstot <- sum((ivres-mean(ivres))^2)
  sserr <- sum((ivres - (z %*% oid))^2)
  rsq <- 1-(sserr/sstot)
  sargan <- length(ivres)*rsq                     # Sargan statistic: N times the R-squared from regressing residuals on instruments
  spval <- 1-pchisq(sargan,df=length(iv)-1)       # df = no. of instruments minus no. of endogenous regressors
  overid <- c(sargan,spval)
  overid <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),overid)),ncol=length(overid))
  colnames(overid) <- c("Sargan test of over-identifying restrictions","p-val")
  if(length(iv)-1==0){
    overid <- t(matrix(c("No test performed. Model is just identified")))
    colnames(overid) <- c("Sargan test of over-identifying restrictions")
  }
  full <- list(results=res, weakidtest=firststage, endogeneity=hawu, overid=overid)
  return(full)
}
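With more than one excluded instrument the function reports the Sargan test rather than the just-identified message. A hypothetical over-identified call (this assumes the expersq column of the Mroz data is kept when subsetting above):

ivreg2(form=hours ~ lwage + educ + age + kidslt6 + kidsge6 + nwifeinc,
       endog="lwage", iv=c("exper","expersq"), data=na.omit(mroz))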


28 thoughts on “An ivreg2 function for R”

  1. Is there any valid reason for using solve() here? If you use qr(), then chol2inv() on the $qr component, you avoid numerical issues associated with variable scaling, and aliasing too (if you check for rank). See also:

    @article{Rnews:Bates:2004,
    author = {Douglas Bates},
    title = {Least Squares Calculations in {R}},
    journal = {R News},
    year = 2004,
    volume = 4,
    number = 1,
    pages = {17–20},
    month = {June},
    url = {http://CRAN.R-project.org/doc/Rnews/},
    pdf = {http://CRAN.R-project.org/doc/Rnews/Rnews_2004-1.pdf}
    }
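
    A minimal sketch of this suggestion, using the x and y objects as constructed inside the function above:

    qx <- qr(x)                    # QR decomposition of the design matrix
    bhat <- qr.coef(qx, y)         # least-squares coefficients without forming X'X
    xtx_inv <- chol2inv(qr.R(qx))  # (X'X)^{-1}, more stable than solve(crossprod(x))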

    • Hi Roger,

      Thanks for the comment. Other than ignorance, I did not have a valid reason for using the solve() command.

      When I extend this function I will definitely take your advice, and be sure to read that paper you recommend.

      Kind regards.

  2. Look at the packages plm and Formula. They implement various forms of supplying instrumental variables directly in the formula. I also suggest splitting the code for the statistical tests into separate functions; this is how it is usually done in R. You can then print these statistics using a print or summary method. Look at the print.summary.lm and summary.lm functions to see how this is implemented.
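
    For reference, a minimal sketch of the two-part formula style these packages support, with placeholder variables (x1 endogenous, x2 exogenous, z1 and z2 the excluded instruments):

    library(AER)
    # regressors | instruments: exogenous regressors repeat after the "|"
    fit <- ivreg(y ~ x1 + x2 | x2 + z1 + z2, data = mydata)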

    • As far as I am aware no function in plm returns the test statistics in a manner similar to ivreg2 in Stata. That is all I wanted to replicate with this simple function, which works OK (for me anyway).

      I understand that further development requires refinement, and, to this end, I appreciate your input. These comments are helpful.

      Best wishes.

  3. Great code, thanks for this.

    However, I can’t specify more than one IV in the -iv=c()- option if I go iv=c("z1","z2"):

    Error in `[.data.frame`(data, , iv) : undefined columns selected

    -cbind-ing the two variables into Z and then specifying iv=c("X") produces the same result.

    This is a bit strange. What am I doing wrong?

    Clive

  4. Also, why are the coefficient, SE and p values for the endogenous variable not reported? That’s a mystery to me as well – they are in the IV routines in AER and -plm-!!

    Thanks again.

    • Hi Clive. Thanks for the comment. I wrote this blog post a while ago and haven’t updated the code since. For what it’s worth, I would use the AER or lfe packages for IV regressions in R. Hopefully, I will get the chance to work on this code at some point.

      • I now have what looks to be a *complete* solution for fitting IV models in R (particularly to panel data), very kindly provided to me in a series of private exchanges with Dr Tony Cookson of the University of Colorado, via his -tonymisc- package, which provides an -iv()- routine, and via his additional code called -laggard-, which automatically generates lags of your variables (not yet part of -tonymisc-). See below, which assumes you have -plm- already installed:

        library(plm)
        data(Grunfeld)
        grunfeld=plm.data(Grunfeld,c("firm","year"))
        Grunfeld$date = as.Date(paste(Grunfeld$year,"01","01",sep="-"), format="%Y-%m-%d" )

        [The -laggard- function definition and the model-fitting code were garbled by WordPress at this point; the corrected code is reposted in full in comment 7 below. The recoverable output from the fitted model follows.]

        fit=iv(inv~capital+value,value~value.lag1+value.lag2,data=mydat[!is.na(mydat$value.lag2),])
        sum_iv(fit,sargan=T)

        Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
        (Intercept) -47.184779  10.392510  -4.540 5.62e-06 ***
        capital       0.224308   0.027582   8.132 4.44e-16 ***
        value.hat     0.120248   0.006633  18.129  < 2e-16 ***

        Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

        Residual standard error: 95.96 on 177 degrees of freedom
        Multiple R-squared: 0.8157, Adjusted R-squared: 0.7924
        F-statistic: 342.6 on 2 and 177 DF, p-value: < 2.2e-16

        Sargan statistic and p-value:
        1 0.8040214 0.3698938

        mtable(fit)

        Calls:
        fit: lm(formula = second.form, data = data2)

        =================================
        (Intercept) -47.185***
        (10.393)
        capital 0.224***
        (0.028)
        value.hat 0.120***
        (0.007)
        ———————————
        R-squared 0.816
        F (first stage) 840.264
        Anderson CC (p-value) 0.000
        Sargan (p-value) 0.370
        F (omnibus) 342.630
        p-val (omnibus) 0.000
        N 180
        =================================

        Notice that in this admittedly toy case, the post-estimation tests reject or retain the null in all the desired directions. (However, I didn’t make all the usual pre-model checks as to whether the lags of -value- really were suitable IVs, not that it really matters here.)

        One note of caution is that -iv()- and -laggard-, like any other IV routine, will drop data if there are breaks in time for any panels as it generates additional missing values. I found this annoying and eventually settled for manually creating the lagged variables in the form I needed by cutting and pasting the data in my Excel spreadsheet(s), which isn’t really the most desirable thing to do. (Learning how to -split- and -lapply- the data back together again to achieve the same effect sounds too painful for contemplation.)

        Hopefully, this should be helpful to any of your followers who wish to fit their own IV models to panel data in R.

        Yours,
        Clive Nicholas

  5. Okay, good news! I’ve now been able to run the code successfully. I made the stupid mistake of not creating the second instrument in my spreadsheet file before loading it into R (I was relying on -attach-ing the data and then creating the variable that way, but your routine wasn’t picking up that variable).

    There are, however, two outstanding issues to which I hope you will return when you get a chance to work on this code:

    (1) the output produces no fit statistics (R-squared and friends), which is a bit odd;
    (2) it would be excellent if there were an option of presenting the output with cluster-robust standard errors. Going the way of (if I name the model object -fit-)

    library(lmtest)
    coeftest(fit,vcov=vcovHC(fit,type="HC0",group="groupvar"))

    currently produces errors.

    I hope all of this helps rather than hinders. :)

    C

  6. Oops! I forgot a third consideration:

    (3) you can’t specify factor variables as IVs (eg, -iv=c(factor(foo))-) in your code, either. Doing it that way, or after -cbind-ing a bunch of fixed effects into one after -attach-ing the data, doesn’t work. The workaround solution is to create the FE in the spreadsheet, load it into R and *then* specify each and every FE in the IV option. But if you’ve got, say, 100-odd FEs, that would be long and tedious! (An alternative is sketched below.)

    Again, I hope this helps.

    C
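
    One possible workaround (untested against the function above; foo and mydat are hypothetical) is to expand the factor into dummy columns first and pass those column names through the -iv- option:

    # Expand factor 'foo' into 0/1 dummy columns, then pass the dummy
    # names as instruments instead of the factor itself.
    fe <- model.matrix(~ factor(foo) - 1, data = mydat)
    colnames(fe) <- paste0("fe", seq_len(ncol(fe)))
    mydat2 <- cbind(mydat, fe)
    # ivreg2(form, endog="x", iv=colnames(fe), data=mydat2)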

  7. Not all of the code came out right. Here it is again.

    library(plm)
    data(Grunfeld)
    grunfeld=plm.data(Grunfeld,c("firm","year"))
    Grunfeld$date = as.Date(paste(Grunfeld$year,"01","01",sep="-"), format="%Y-%m-%d" )

    laggard = function (data, by.var, lag.vars, num.lags, direction, freq = "daily")
    {
      if (class(data$date) != "Date") {
        cat("date is not in Date format. No lagging operations performed.")
        retdat = NULL
      }
      if (class(data$date) == "Date") {
        data$year = format(data$date, "%Y")
        data$month = format(data$date, "%m")
        data$quarter = as.factor(quarters(data$date))
        buy.var = "ordermaker"
        data[, buy.var] <- as.numeric(as.factor(data[, by.var]))
        if (freq == "daily") {
          data$date.num <- as.numeric(data$date)
          bad.obs <- 4
        }
        if (freq == "monthly") {
          data$date.num <- as.numeric(data$year) * 100 + (as.numeric(data$month) * 100/12)
          bad.obs <- 100/11
        }
        if (freq == "quarterly") {
          data$date.num <- as.numeric(data$year) * 10 + (as.numeric(data$quarter) * 10/4)
          bad.obs <- 3
        }
        if (freq == "yearly") {
          data$date.num <- as.numeric(data$year)
          bad.obs <- 3
        }
        if (direction == "lag") {
          data <- data[order(data[, buy.var], data$date.num), ]
        }
        if (direction == "lead") {
          data <- data[order(data[, buy.var], data$date.num, decreasing = TRUE), ]
        }
        lags <- c(buy.var, "date.num", lag.vars)
        nums <- c(rep(max(num.lags), 2), num.lags)
        g <- data.frame(lags, nums, stringsAsFactors = FALSE)
        for (v in 1:nrow(g)) {
          var <- g[v, 1]
          print(var)
          for (n in 1:g[v, 2]) {
            # [These statements were partly eaten by WordPress in the original
            # comment; best-guess reconstruction: shift the column down by n
            # rows within the sorted data, then blank out any lag that crosses
            # a panel boundary or a gap of more than n * bad.obs in date.num.]
            shifted <- c(rep(NA, n), data[1:(nrow(data) - n), var])
            data[, paste(var, paste(direction, n, sep = ""), sep = ".")] <- shifted
            date.shift <- c(rep(NA, n), data$date.num[1:(nrow(data) - n)])
            id.shift <- c(rep(NA, n), data[1:(nrow(data) - n), buy.var])
            bad <- which(data[, buy.var] != id.shift |
                         abs(data$date.num - date.shift) > n * bad.obs)
            data[bad, paste(var, paste(direction, n, sep = ""), sep = ".")] <- NA
          }
          assign("data", data)
        }
        retdat = data[, names(data)[!grepl("date.num", names(data)) &
                                    !grepl("ordermaker", names(data))]]
      }
      return(retdat)
    }

    library("tonymisc", lib.loc="/home/clive/R/i686-pc-linux-gnu-library/3.0")
    fit=iv(inv~capital+value,value~value.lag1+value.lag2,data=mydat[!is.na(mydat$value.lag2),])
    sum_iv(fit,sargan=T) # option -first=T- also available
    mtable(fit)

  8. Hi,

    First, thanks for the code. Just one question: why can’t we input two or more endogenous variables in ivreg2? It seems to me that the Stata command has this option. For example:

    ivreg2 y (x1 x2 = z1 z2) x3 x4, ffirst endog(x1 x2) robust

    Thanks for your time and attention,

    Claudio

    • Claudio,

      I’m not sure which post you’re referring to, looking at this. Is it to my last post about -iv()- from -tonymisc-?

      If it is, the answer is that you can do that, I just didn’t demonstrate it. Just add it like so:

      fit=iv(y~x1+x2+x3,x2+x3~z1+z2,data=foo)

      Not tested, however.

      Good luck,
      C

  9. Thanks Clive. I was talking about something like:

    ivreg2 y (x z = w k), ffirst endog(x z) robust

    That is, x is instrumented by w and z is instrumented by k. I will try to test your example. Thanks again!

    • Hi Claudio, in addition to Clive’s comments, I would say that the easiest way of doing what you want would be to use the ivreg function in the AER package, or if you have fixed effects you can use the felm function in the lfe package.

      • DP,

        Thanks for the tip about -felm- in -lfe-, which I shall investigate, but does it give you all the post-estimation tests that the routine I advertised gives you?

        Clive

      • DP,

        No point looking at it then. We need an IV routine not just to fit the models right, but one which also allows us to produce the post-estimation test statistics from which we can write peer-reviewed papers that get published; it’s what journals often demand. The -twosls- routine in -zelig- has the same problem. I’ll stick with what I’m using for now, I think, until something better comes along. (It’s worth pointing out that Tony Cookson compared the results from some earlier IV code he wrote, which wasn’t even as good as -iv()- in -tonymisc-, with equivalent IV commands in Stata, and found that they matched.)

        Clive

      • I wouldn’t say there’s no point in looking at lfe. It’s a great package, and there’s no counterpart in Stata or any other stats package that runs either IV or OLS with multiple fixed effects. For example, I am using it in my research and found it pretty easy to write some additional code that estimates Angrist-Pischke weak-IV F-tests for multiple instruments. I was also able to adjust these for clustering.

        However, I agree with the general point of your comment that the infrastructure of R is not as advanced as Stata’s for IV models. Although it is very encouraging to see diagnostics being added to the excellent AER package.

  10. Hi everyone. I just noticed that the discussion was still going on… Just wanted to let you know that I have now extended the summary() method for AER::ivreg to include the diagnostics proposed in this blog post. For example, you can first run

    example("ivreg", package = "AER")

    and then

    summary(fm, diagnostics = TRUE)

    I hope this makes using AER’s ivreg() function more similar to Stata’s ivreg2.

    • Achim,

      Just tested out your -ivreg()- routine from AER. Very good! The advantage with your routine is that I can obtain robust standard errors from -coeftest-, eg:

      > coeftest(fit,vcov=vcovHC(fit,type="HC0",group="country",time="year"))

      which are, in effect, cluster-robust SEs. I readily concede that this can’t be done with Tony Cookson’s -iv()- routine (or, at least, not yet).

      Yours with thanks,
      Clive Nicholas

    • This is great news! In fact, it has already made my life easier. Still, I would make two suggestions for improving the diagnostics output: 1) it should state, either in the output or in the help page, what the null hypothesis of each test is; 2) the one thing that is missing relative to the Stata version is the partial R^2 for assessing instrument strength, which some fields require for publishing. Nothing too hard to work around, of course, but it would be neat to have it built in.
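
      For the single-instrument case, the partial R^2 can be computed by hand via the Frisch–Waugh approach; a hypothetical helper (names are illustrative):

      # Residualise the endogenous variable and the instrument on the included
      # exogenous regressors, then take the R^2 of one residual on the other.
      partial_r2 <- function(endog, iv, exog, data) {
        r_endog <- resid(lm(reformulate(exog, endog), data = data))
        r_iv <- resid(lm(reformulate(exog, iv), data = data))
        summary(lm(r_endog ~ r_iv))$r.squared
      }
      # e.g. partial_r2("lwage", "exper",
      #                 c("educ","age","kidslt6","kidsge6","nwifeinc"), na.omit(mroz))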

      • Note, however, that -ivreg- affords you the luxury of re-running the model and obtaining robust standard errors via -coeftest- from the -lmtest- package, thus:

        library(AER)
        fit=ivreg(y~x+z|w+x,data=foodata)
        summary(fit,diagnostics=T)
        library(lmtest)
        coeftest(fit,vcov=vcovHC(fit,type="HC0",group="groupvar",time="timevar"))

        for White standard errors, for example.

        -iv- from -tonymisc- does not yet allow this, regrettably.

        Yours,
        Clive Nicholas
