The Cluster Bootstrap

Adjusting standard errors for clustering can be a very important part of any statistical analysis. For example, simply duplicating a data set will reduce the standard errors dramatically despite adding no new information.

I have previously dealt with this topic with reference to the linear regression model. However, in many cases one would like to obtain cluster-robust standard errors for more elaborate statistical analyses than a simple linear regression model. Typically, increases in model complexity are met with similar increases in the calculations required to estimate the model's uncertainty via analytical means, and the adjustment required for group clustering only adds to this burden.

Bootstrapping procedures offer a simple solution to this issue. Essentially, a nonparametric bootstrap procedure estimates a model repeatedly, a specified number of times, on resamples of the data frame. The simplest bootstrap draws a random sample from the data frame with the same dimensions as the original; because the sampling is with replacement, the observations differ, with some appearing more than once and others not at all. For each repetition the main analysis is repeated on the resampled data and the estimates stored (for a linear regression model these would be the model's coefficients). Once all repetitions have been computed, the standard errors can be calculated by taking the standard deviation of the stored estimates.
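
As a quick illustration, a bare-bones (non-cluster) bootstrap of a linear regression's coefficients might look like the following minimal sketch:

# sketch: plain nonparametric bootstrap of lm coefficients
set.seed(1)
df <- data.frame(x = rnorm(1000))
df$y <- 5 + 2*df$x + rnorm(1000)
reps <- 1000
boot.coefs <- matrix(NA, nrow=reps, ncol=2)
for(i in 1:reps){
  # resample rows with replacement; same dimensions as the original data
  samp <- df[sample(nrow(df), replace=TRUE),]
  boot.coefs[i,] <- coef(lm(y ~ x, samp))
}
apply(boot.coefs, 2, sd) # bootstrap standard errors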

Adapting this technique to tackle clustering is relatively straightforward. Instead of drawing the observation units with replacement, one draws the cluster units with replacement. For example, imagine we have a data frame consisting of 1,000 students from 100 schools. The non-cluster bootstrap draws 1,000 observations from the 1,000 students with replacement for each repetition. The cluster bootstrap will instead draw 100 schools with replacement.
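
In code, the only change is what gets resampled. For the schools example, one repetition of the cluster draw might look like this sketch (assuming a hypothetical students data frame with a school identifier column):

# one cluster-bootstrap draw: sample schools, not students
schools <- unique(students$school)                     # 100 schools
draw <- sample(schools, length(schools), replace=TRUE) # schools drawn with replacement
# stack the rows of every drawn school, duplicates included
bootdat <- do.call(rbind, lapply(draw, function(s) students[students$school == s,]))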

Below, I show how to formulate a simple cluster bootstrap procedure for a linear regression in R. In this analysis, I simulate some data and then falsely replicate the data frame three times, which causes the standard errors to drop. I then show how a simple cluster-robust command solves this issue (i.e. recovers the standard errors of the original, unreplicated data frame). Finally, I show how these cluster-robust standard errors can be obtained via bootstrapping. The code is anything but succinct (and therefore probably very inefficient), so I would be happy to take suggestions on how one could improve my clusbootreg() function. Nevertheless, it is clear from the output that this function is giving the right answers.

# cluster-boot
library(lmtest) ; library(sandwich)
rm(list=ls())
# data sim
x <- rnorm(1000)
y <- 5 + 2*x + rnorm(1000)
# regression
summary(lm(y~x))

# triple data
dat <- data.frame(x=c(x,x,x),y=c(y,y,y),g=c(1:1000,1:1000,1:1000))
# regressions
m1 <- lm(y~x, dat) # smaller StErrs
summary(m1) 

# cluster robust standard error function
robust.se <- function(model, cluster){
  require(sandwich)
  require(lmtest)
  M <- length(unique(cluster)) # number of clusters
  N <- length(cluster)         # number of observations
  K <- model$rank              # number of estimated parameters
  # finite-sample degrees-of-freedom correction
  dfc <- (M/(M - 1)) * ((N - 1)/(N - K))
  # sum the estimating-function (score) contributions within each cluster
  uj <- apply(estfun(model), 2, function(x) tapply(x, cluster, sum))
  # sandwich estimator with the cluster-summed scores as the meat
  rcse.cov <- dfc * sandwich(model, meat = crossprod(uj)/N)
  rcse.se <- coeftest(model, rcse.cov)
  return(list(rcse.cov, rcse.se))
}

robust.se(m1, dat$g)[[2]] # StErrs back to what they should be

# cluster bootstrap function
clusbootreg <- function(formula, data, cluster, reps=1000){
  reg1 <- lm(formula, data) # full-sample fit for the point estimates
  clusters <- names(table(cluster))
  sterrs <- matrix(NA, nrow=reps, ncol=length(coef(reg1))) # stores bootstrap coefficients
  for(i in 1:reps){
    # draw clusters, not observations, with replacement
    index <- sample(1:length(clusters), length(clusters), replace=TRUE)
    aa <- clusters[index]
    bb <- table(aa) # how many times each cluster was drawn
    bootdat <- NULL
    # stack each sampled cluster's rows as many times as it was drawn
    for(j in 1:max(bb)){
      cc <- data[cluster %in% names(bb[bb %in% j]),]
      for(k in 1:j){
        bootdat <- rbind(bootdat, cc)
      }
    }
    sterrs[i,] <- coef(lm(formula, bootdat))
  }
  # standard errors are the SDs of the bootstrap coefficients
  val <- cbind(coef(reg1), apply(sterrs, 2, sd))
  colnames(val) <- c("Estimate","Std. Error")
  return(val)
}
> # OUTPUT
> summary(lm(y~x)) # correct StErrs

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.10461 -0.63043  0.02924  0.70622  2.78926 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.96345    0.03084  160.97   <2e-16 ***
x            2.02747    0.03048   66.51   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.9748 on 998 degrees of freedom
Multiple R-squared: 0.8159,  Adjusted R-squared: 0.8157 
F-statistic:  4424 on 1 and 998 DF,  p-value: < 2.2e-16 

> summary(lm(y~x, dat)) # smaller StErrs

Call:
lm(formula = y ~ x, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.10461 -0.63043  0.02924  0.70622  2.78926 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.96345    0.01779   279.0   <2e-16 ***
x            2.02747    0.01759   115.3   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.9742 on 2998 degrees of freedom
Multiple R-squared: 0.8159,  Adjusted R-squared: 0.8159 
F-statistic: 1.329e+04 on 1 and 2998 DF,  p-value: < 2.2e-16 

> robust.se(m1, dat$g)[[2]] # correct StErrs

t test of coefficients:

            Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 4.963447   0.030810 161.099 < 2.2e-16 ***
x           2.027469   0.029347  69.086 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

> clusbootreg(formula=y~x, data=dat, cluster=dat$g, reps=1000) # correct StErrs
            Estimate Std. Error
(Intercept) 4.963447 0.03197737
x           2.027469 0.02920071

23 thoughts on “The Cluster Bootstrap”

  1. Thanks for putting this together. I look forward to applying it in my own work. A couple of suggestions on how to improve the efficiency:

    1. Set a constant (e.g., “cluster.len”) outside of the loop to stand in for “length(clusters)”. That’s a calculation that needs to be performed only once, not 2*reps times.
    2. Using sample.int(cluster.len, replace=TRUE) instead of sample(1:cluster.len, cluster.len, replace=TRUE) would save some time, too.
    3. replicate(reps, ...) would probably run much faster than for(i in 1:reps).
    4. You might be able to work those other for loops into sapply or lapply calls.

    There might be a few other tricks, but those are the ones that jump out at me (a sketch of the first two follows below).
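
    A minimal sketch of suggestions 1 and 2, reusing the names from clusbootreg() above:

    # precompute the number of clusters once, outside the bootstrap loop
    clusters <- names(table(dat$g))
    cluster.len <- length(clusters)
    # sample.int() defaults to size = n, so this draws cluster.len indices with replacement
    index <- sample.int(cluster.len, replace=TRUE)
    aa <- clusters[index]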

  2. Pingback: texreg: A package for beautiful and easily customizable LaTeX regression tables from R « DiffusePrioR

  3. I came across this post as I was looking for a cluster bootstrap function in R. I had a quick play with the code and managed to get it down to about 30-40% of the original running time. I have put the code below if you are interested. It uses glm instead of lm, as I was fitting a logit model. I’m pretty sure you could knock some more time off somewhere. The bootstrap is itself perfectly parallelisable, so you could run it across multiple cores using the snow or foreach packages (a sketch follows after the code).

    clusterBootSE <- function(formula, data, clustervar, reps){
      reg1 <- glm(formula, data, family=binomial(link="logit"))
      # 1. generate the bootstrap dataset
      clusters <- unique(data[,paste(clustervar)])
      sterrs <- matrix(NA, nrow=reps, ncol=length(coef(reg1)))
      for(i in 1:reps){
        # draw clusters with replacement, then collect each drawn cluster's rows
        units <- sample(clusters, size=length(clusters), replace=TRUE)
        df.bs <- sapply(units, function(x) which(data[,paste(clustervar)]==x))
        df.bs <- data[unlist(df.bs),]
        sterrs[i,] <- coef(glm(formula, df.bs, family=binomial(link="logit")))
        cat("\r", i*100/reps, "% done "); flush.console()
      }
      val <- cbind(coef(reg1), apply(sterrs, 2, sd))
      rownames(val) <- names(coef(reg1)); colnames(val) <- c("Estimate","Std. Error")
      cat("\n")
      return(val)
    }
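
    For instance, a rough sketch of the parallel version using foreach with the doParallel backend (doParallel is one choice of backend; snow would also work):

    library(foreach)
    library(doParallel)
    registerDoParallel(cores=2) # adjust to the machine

    clusterBootSEpar <- function(formula, data, clustervar, reps){
      reg1 <- glm(formula, data, family=binomial(link="logit"))
      clusters <- unique(data[,paste(clustervar)])
      # repetitions are independent, so %dopar% spreads them across cores
      boot.coefs <- foreach(i=1:reps, .combine=rbind) %dopar% {
        units <- sample(clusters, size=length(clusters), replace=TRUE)
        rows <- unlist(lapply(units, function(x) which(data[,paste(clustervar)]==x)))
        coef(glm(formula, data[rows,], family=binomial(link="logit")))
      }
      val <- cbind(coef(reg1), apply(boot.coefs, 2, sd))
      colnames(val) <- c("Estimate","Std. Error")
      return(val)
    }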

  4. I can’t get clusterBootSE to work. There seems to be a problem with

    clusters<-unique(data[,paste(clustervar)])

    Is this just me, or is there a problem with your code, Sam?

    • Just some quick thoughts. The procedure in either the paper you link to or Cameron, Gelbach, and Miller (2008) should make it clear. If you want wild cluster bootstrap standard errors for every coefficient, you have to do each one separately, and you need to create a new pseudo-sample in each loop to re-estimate the parameter of interest. Suppose you are estimating a linear model with dependent variable ‘y’ and two independent variables ‘x1’ and ‘x2’, and you want the SE for b2, the coefficient on x2. In each iteration you create a new dependent variable equal to the fitted values from a restricted model (one with b2 = 0), with the residuals from that same restricted model added or subtracted (each with probability 0.5) at the cluster level. Within the loop you could have (some of the names of things are taken from my earlier post):

      for(i in 1:reps){
        ones <- sample(c(1,-1), length(clusters), replace=TRUE) # one Rademacher sign per cluster
        ones <- rep(ones, times=freq) # expand to one sign per observation (assumes data sorted by cluster)
        ds.bs <- data
        ds.bs[,'y'] <- c[[1]] + c[[2]]*ds.bs[,'x1'] + resids*ones
        b2[i] <- coef(lm(y~x1+x2, ds.bs))[[3]]
      }

      where freq is the frequency of each cluster, which you could obtain using

      freq <- as.data.frame(table(data[,paste(clustervar)]))
      freq <- freq[,'Freq']

      Here, resids are the residuals and c is the coefficients from the original restricted model:

      reg1 <- lm(y~x1, data)
      c <- reg1$coefficients # note: 'c' shadows base::c
      resids <- reg1$residuals

      obtained before the bootstrap loop, and b2 is a vector to store all the b2 estimates in.
      There are probably quicker and simpler ways, as ever, but I hope that helps.
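
      Putting those fragments together, a self-contained sketch on this post’s simulated data (with a hypothetical second regressor x2 added so that there is a b2 to bootstrap):

      # wild cluster bootstrap SE for b2: a minimal sketch
      dat2 <- dat[order(dat$g),]       # sort by cluster so the sign flips align
      dat2$x2 <- rnorm(nrow(dat2))     # hypothetical second regressor
      clusters <- unique(dat2$g)
      freq <- as.vector(table(dat2$g)) # observations per cluster
      res <- lm(y ~ x, dat2)           # restricted model: impose b2 = 0 by omitting x2
      cf <- coef(res)
      resids <- residuals(res)
      reps <- 1000
      b2 <- numeric(reps)
      for(i in 1:reps){
        ones <- sample(c(1,-1), length(clusters), replace=TRUE) # Rademacher weights
        ones <- rep(ones, times=freq)  # same sign for every observation in a cluster
        ds.bs <- dat2
        ds.bs$y <- cf[[1]] + cf[[2]]*ds.bs$x + resids*ones # rebuild y under the null
        b2[i] <- coef(lm(y ~ x + x2, ds.bs))[[3]]
      }
      sd(b2) # bootstrap standard error for b2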

  5. Dear Diffuseprior, I have a quick question: based on the example given in the article, I understand that we should sample 100 schools (the students are associated with their schools) for the cluster bootstrap regression. However, what if different schools have different numbers of students? If this is true (very likely), then the size of our bootstrapped sample at each iteration (or replication) will differ: sometimes it is above 1,000 (the original number of students), sometimes it is below 1,000. Is this a problem? Can you provide some textbook reference? Thank you very much!

  6. Thanks for this post. I have two questions:
    1. How can I proceed if I have double clusters (two levels of ID)?
    2. Do you have any reference papers for the sampling method you provide here?
    Thank you so much!

  7. Pingback: Speeding up the Cluster Bootstrap in R | DiffusePrioR

  8. Pingback: Speeding up the Cluster Bootstrap in R | A bunch of data
