texreg: A package for beautiful and easily customizable LaTeX regression tables from R

There was a very informative post last week showing how the R package stargazer is used to generate nice LaTeX tables from a number of R objects. This package looks very useful. However, I would like to extol the virtues of another R package that converts model objects in R into LaTeX code: texreg.

For me, the texreg package has one very important advantage compared to stargazer, which is (and please correct me if I am wrong on this point) that the regression model’s output is very easy to customize. There are a number of examples where this feature of texreg is important, such as the inclusion of robust/cluster-robust standard errors, or changing the coefficients of a generalized linear model to read as marginal effects.

In the example below, I have replicated (with some modifications) the code used for my last blog post showing the importance of clustering standard errors. Basically this code simulates a bunch of data, estimates a regression model, then duplicates the dataset three times, which in turn reduces the standard errors of the regression model. The last step shows how the clustered standard errors will not be reduced by duplicating the dataset. Once I perform this analysis, I use the output code to generate a texreg object that gives the code needed to create the regression table in LaTeX.

In this example, I have used the texreg override commands in the texreg function so that the standard errors and p-values (needed to calculate the stars in the regression table) reflect the clustering performed with the robust.se() function. The output from this, once the code is run through a LaTeX compiler (I use TeXworks), is shown in the image below (I have also adjusted the caption so that it is on the top of the table rather than the bottom). One can download the latest version of texreg from the packages R-forge project page. This page also features a very helpful and active forum.

 texregtab

rm(list=ls())
library(lmtest) ; library(sandwich)

# data sim
library(texreg)

set.seed(1)
x <- rnorm(1000)
y <- 5 + 2*x + rnorm(1000,0,40)
# regression
m1 <- lm(y~x)
summary(m1)

# triple data
dat <- data.frame(x=c(x,x,x),y=c(y,y,y),g=c(1:1000,1:1000,1:1000))
# regressions
m2 <- lm(y~x, dat) # smaller StErrs

# cluster robust standard error function
robust.se <- function(model, cluster){
  require(sandwich)
  require(lmtest)
  M <- length(unique(cluster))
  N <- length(cluster)
  K <- model$rank
  dfc <- (M/(M - 1)) * ((N - 1)/(N - K))
  uj <- apply(estfun(model), 2, function(x) tapply(x, cluster, sum));
  rcse.cov <- dfc * sandwich(model, meat = crossprod(uj)/N)
  rcse.se <- coeftest(model, rcse.cov)
  return(list(rcse.cov, rcse.se))
}

m3 <- robust.se(m2,dat$g)[[2]] # StErrs now back to what they are

texreg(list(m1,m2,m2),
       caption="The Importance of Clustering Standard Errors",
       dcolumn=FALSE,
       model.names=c("M1","M2","M3"),
       override.se=list(summary(m1)$coef[,2],
                        summary(m2)$coef[,2],
                        m3[,2]),
       override.pval=list(summary(m1)$coef[,4],
                          summary(m2)$coef[,4],
                          m3[,4]))
Advertisements

The Cluster Bootstrap

Adjusting standard errors for clustering can be a very important part of any statistical analysis. For example, duplicating a data set will reduce the standard errors dramatically despite there being no new information.

I have previously dealt with this topic with reference to the linear regression model. However, in many cases one would like to obtain cluster-robust standard errors for more elaborate statistical analyses than a simple linear regression model. Typically, increases in model complexity are met with similar increases in the calculations required to estimate the model’s uncertainty via analytical means. Adding the required adjustment for group clustering furthers this often burdensome process.

Bootstrapping procedures offer a simple solution to this issue. Essentially, a nonparametric bootstrap procedure estimates a model for a specified number of repetitions using samples of the data frame. The simplest bootstrap draws a random sample from the data frame such that the sample data frame is the same dimension as the main data frame, although the observations are not the same because it allows for random sampling with replacement. For each repetition the main analysis is repeated on the sample data, and the estimate stored (so for a linear regression model this would be the model’s coefficients). Once all repetitions have been computed the standard errors can be calculated by taking the standard deviation of the stored model estimates.

Adapting this technique to tackle clustering is relatively straightforward. Instead of drawing the observation units with replacement, one draws the cluster units with replacement. For example, imagine we have a data frame consisting of 1,000 students from 100 schools. The non-cluster bootstrap draws 1,000 observations from the 1,000 students with replacement for each repetition. The cluster bootstrap will instead draw 100 schools with replacement.

In the below, I show how to formulate a simple cluster bootstrap procedure for a linear regression in R. In this analysis, I simulate some data and then falsely replicate the data frame three times which causes the standard errors to drop. I then show how a simple cluster-robust command solves this issue (i.e. gives standard errors like the original data frame that had not been replicated). Finally, I show how these cluster-robust standard errors can be obtained via bootstrapping. The code is anything but succinct (and therefore probably very inefficient), so I would be happy to take suggestions on how one could improve my clusbootreg() function. Nevertheless, it is clear from the output that this function is giving the right answers.

# cluster-boot
library(lmtest) ; library(sandwich)
rm(list=ls())
# data sim
x <- rnorm(1000)
y <- 5 + 2*x + rnorm(1000)
# regression
summary(lm(y~x))

# triple data
dat <- data.frame(x=c(x,x,x),y=c(y,y,y),g=c(1:1000,1:1000,1:1000))
# regressions
m1 <- lm(y~x, dat) # smaller StErrs
summary(m1) 

# cluster robust standard error function
robust.se <- function(model, cluster){
  require(sandwich)
  require(lmtest)
  M <- length(unique(cluster))
  N <- length(cluster)
  K <- model$rank
  dfc <- (M/(M - 1)) * ((N - 1)/(N - K))
  uj <- apply(estfun(model), 2, function(x) tapply(x, cluster, sum));
  rcse.cov <- dfc * sandwich(model, meat = crossprod(uj)/N)
  rcse.se <- coeftest(model, rcse.cov)
  return(list(rcse.cov, rcse.se))
}

robust.se(m1,dat$g)[[2]] # StErrs now back to what they are

# cluster bootstrap function
clusbootreg <- function(formula, data, cluster, reps=1000){
  reg1 <- lm(formula, data)
  clusters <- names(table(cluster))
  sterrs <- matrix(NA, nrow=reps, ncol=length(coef(reg1)))
  for(i in 1:reps){
    index <- sample(1:length(clusters), length(clusters), replace=TRUE)
    aa <- clusters[index]
    bb <- table(aa)
    bootdat <- NULL
    for(j in 1:max(bb)){
      cc <- data[cluster %in% names(bb[bb %in% j]),]
      for(k in 1:j){
        bootdat <- rbind(bootdat, cc)
      }
    }
    sterrs[i,] <- coef(lm(formula, bootdat))
  }
  val <- cbind(coef(reg1),apply(sterrs,2,sd))
  colnames(val) <- c("Estimate","Std. Error")
  return(val)
}
> # OUTPUT
  > summary(lm(y~x)) #correct StErrs

Call:
  lm(formula = y ~ x)

Residuals:
  Min       1Q   Median       3Q      Max 
-3.10461 -0.63043  0.02924  0.70622  2.78926 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.96345    0.03084  160.97   <2e-16 ***
x            2.02747    0.03048   66.51   <2e-16 ***
  ---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.9748 on 998 degrees of freedom
Multiple R-squared: 0.8159,  Adjusted R-squared: 0.8157 
F-statistic:  4424 on 1 and 998 DF,  p-value: < 2.2e-16 

> summary(lm(y~x, dat)) #smaller StErrs

Call:
  lm(formula = y ~ x, data = dat)

Residuals:
  Min       1Q   Median       3Q      Max 
-3.10461 -0.63043  0.02924  0.70622  2.78926 

Coefficients:
  Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.96345    0.01779   279.0   <2e-16 ***
x            2.02747    0.01759   115.3   <2e-16 ***
  ---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.9742 on 2998 degrees of freedom
Multiple R-squared: 0.8159,	Adjusted R-squared: 0.8159 
F-statistic: 1.329e+04 on 1 and 2998 DF,  p-value: < 2.2e-16 

> robust.se(m1,dat$g)[[2]] #correct StErrs

t test of coefficients:
  
  Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 4.963447   0.030810 161.099 < 2.2e-16 ***
x           2.027469   0.029347  69.086 < 2.2e-16 ***
  ---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

> clusbootreg(formula=y~x, data=dat, cluster=dat$g, reps=1000) #correct StErrs
Estimate Std. Error
(Intercept) 4.963447 0.03197737
x           2.027469 0.02920071