The Cluster Bootstrap

Adjusting standard errors for clustering can be a very important part of any statistical analysis. For example, simply duplicating a data set will reduce the standard errors dramatically despite adding no new information.

I have previously dealt with this topic with reference to the linear regression model. However, in many cases one would like to obtain cluster-robust standard errors for more elaborate statistical analyses than a simple linear regression model. Typically, increases in model complexity are met with similar increases in the calculations required to estimate the model's uncertainty via analytical means, and the adjustment required for group clustering only adds to this burden.

Bootstrapping procedures offer a simple solution to this issue. Essentially, a nonparametric bootstrap procedure estimates a model repeatedly, a specified number of times, on resamples of the data frame. The simplest bootstrap draws a random sample from the data frame with the same dimensions as the original; because the sampling is with replacement, the observations differ, with some appearing more than once and others not at all. For each repetition the main analysis is repeated on the resampled data and the estimates stored (for a linear regression model these would be the model's coefficients). Once all repetitions have been computed, the standard errors can be calculated by taking the standard deviation of the stored estimates.
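
As a quick illustration, a bare-bones (non-cluster) bootstrap of a linear regression's coefficients might look like the following minimal sketch:

# sketch: plain nonparametric bootstrap of lm coefficients
set.seed(1)
df <- data.frame(x = rnorm(1000))
df$y <- 5 + 2*df$x + rnorm(1000)
reps <- 1000
boot.coefs <- matrix(NA, nrow=reps, ncol=2)
for(i in 1:reps){
  # resample rows with replacement; same dimensions as the original data
  samp <- df[sample(nrow(df), replace=TRUE),]
  boot.coefs[i,] <- coef(lm(y ~ x, samp))
}
apply(boot.coefs, 2, sd) # bootstrap standard errors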

Adapting this technique to tackle clustering is relatively straightforward. Instead of drawing the observation units with replacement, one draws the cluster units with replacement. For example, imagine we have a data frame consisting of 1,000 students from 100 schools. The non-cluster bootstrap draws 1,000 observations from the 1,000 students with replacement for each repetition. The cluster bootstrap will instead draw 100 schools with replacement.
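
In code, the only change is what gets resampled. For the schools example, one repetition of the cluster draw might look like this sketch (assuming a hypothetical students data frame with a school identifier column):

# one cluster-bootstrap draw: sample schools, not students
schools <- unique(students$school)                     # 100 schools
draw <- sample(schools, length(schools), replace=TRUE) # schools drawn with replacement
# stack the rows of every drawn school, duplicates included
bootdat <- do.call(rbind, lapply(draw, function(s) students[students$school == s,]))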

Below, I show how to formulate a simple cluster bootstrap procedure for a linear regression in R. In this analysis, I simulate some data and then falsely replicate the data frame three times, which causes the standard errors to drop. I then show how a simple cluster-robust command solves this issue (i.e. recovers the standard errors of the original, unreplicated data frame). Finally, I show how these cluster-robust standard errors can be obtained via bootstrapping. The code is anything but succinct (and therefore probably very inefficient), so I would be happy to take suggestions on how one could improve my clusbootreg() function. Nevertheless, it is clear from the output that this function is giving the right answers.

# cluster-boot
library(lmtest) ; library(sandwich)
rm(list=ls())
# data sim
x <- rnorm(1000)
y <- 5 + 2*x + rnorm(1000)
# regression
summary(lm(y~x))

# triple data
dat <- data.frame(x=c(x,x,x),y=c(y,y,y),g=c(1:1000,1:1000,1:1000))
# regressions
m1 <- lm(y~x, dat) # smaller StErrs
summary(m1) 

# cluster robust standard error function
robust.se <- function(model, cluster){
  require(sandwich)
  require(lmtest)
  M <- length(unique(cluster)) # number of clusters
  N <- length(cluster)         # number of observations
  K <- model$rank              # number of estimated parameters
  # finite-sample degrees-of-freedom correction
  dfc <- (M/(M - 1)) * ((N - 1)/(N - K))
  # sum the estimating-function (score) contributions within each cluster
  uj <- apply(estfun(model), 2, function(x) tapply(x, cluster, sum))
  # sandwich estimator with the cluster-summed scores as the meat
  rcse.cov <- dfc * sandwich(model, meat = crossprod(uj)/N)
  rcse.se <- coeftest(model, rcse.cov)
  return(list(rcse.cov, rcse.se))
}

robust.se(m1, dat$g)[[2]] # StErrs back to what they should be

# cluster bootstrap function
clusbootreg <- function(formula, data, cluster, reps=1000){
  reg1 <- lm(formula, data) # full-sample fit for the point estimates
  clusters <- names(table(cluster))
  sterrs <- matrix(NA, nrow=reps, ncol=length(coef(reg1))) # stores bootstrap coefficients
  for(i in 1:reps){
    # draw clusters, not observations, with replacement
    index <- sample(1:length(clusters), length(clusters), replace=TRUE)
    aa <- clusters[index]
    bb <- table(aa) # how many times each cluster was drawn
    bootdat <- NULL
    # stack each sampled cluster's rows as many times as it was drawn
    for(j in 1:max(bb)){
      cc <- data[cluster %in% names(bb[bb %in% j]),]
      for(k in 1:j){
        bootdat <- rbind(bootdat, cc)
      }
    }
    sterrs[i,] <- coef(lm(formula, bootdat))
  }
  # standard errors are the SDs of the bootstrap coefficients
  val <- cbind(coef(reg1), apply(sterrs, 2, sd))
  colnames(val) <- c("Estimate","Std. Error")
  return(val)
}
> # OUTPUT
> summary(lm(y~x)) # correct StErrs

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.10461 -0.63043  0.02924  0.70622  2.78926 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.96345    0.03084  160.97   <2e-16 ***
x            2.02747    0.03048   66.51   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.9748 on 998 degrees of freedom
Multiple R-squared: 0.8159,  Adjusted R-squared: 0.8157 
F-statistic:  4424 on 1 and 998 DF,  p-value: < 2.2e-16 

> summary(lm(y~x, dat)) # smaller StErrs

Call:
lm(formula = y ~ x, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.10461 -0.63043  0.02924  0.70622  2.78926 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.96345    0.01779   279.0   <2e-16 ***
x            2.02747    0.01759   115.3   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.9742 on 2998 degrees of freedom
Multiple R-squared: 0.8159,  Adjusted R-squared: 0.8159 
F-statistic: 1.329e+04 on 1 and 2998 DF,  p-value: < 2.2e-16 

> robust.se(m1, dat$g)[[2]] # correct StErrs

t test of coefficients:

            Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 4.963447   0.030810 161.099 < 2.2e-16 ***
x           2.027469   0.029347  69.086 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

> clusbootreg(formula=y~x, data=dat, cluster=dat$g, reps=1000) # correct StErrs
            Estimate Std. Error
(Intercept) 4.963447 0.03197737
x           2.027469 0.02920071

23 thoughts on “The Cluster Bootstrap”

  1. Thanks for putting this together. I look forward to applying it in my own work. A couple of suggestions on how to improve the efficiency:

    1. Set a constant (e.g., “cluster.len”) outside of the loop to stand in for “length(clusters)”. That’s a calculation that needs to be performed only once, not 2*reps times.
    2. Using sample.int(cluster.len, replace=TRUE) instead of sample(1:cluster.len, cluster.len, replace=TRUE) would save some time, too.
    3. replicate(reps, ...) would probably run much faster than for(i in 1:reps).
    4. You might be able to work those other for loops into sapply or lapply calls.

    There might be a few other tricks, but those are the ones that jump out at me (a sketch of the first two follows below).
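
    A minimal sketch of suggestions 1 and 2, reusing the names from clusbootreg() above:

    # precompute the number of clusters once, outside the bootstrap loop
    clusters <- names(table(dat$g))
    cluster.len <- length(clusters)
    # sample.int() defaults to size = n, so this draws cluster.len indices with replacement
    index <- sample.int(cluster.len, replace=TRUE)
    aa <- clusters[index]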

  2. Pingback: texreg: A package for beautiful and easily customizable LaTeX regression tables from R « DiffusePrioR

  3. I came across this post as I was looking for a cluster bootstrap function in R. I had a quick play with the code and managed to get it down to about 30-40% of the original running time. I have put the code below if you are interested. It uses glm instead of lm, as I was fitting a logit model. I’m pretty sure you could knock some more time off somewhere. The bootstrap is itself perfectly parallelisable, so you could run it across multiple cores using the snow or foreach packages (a sketch follows after the code).

    clusterBootSE <- function(formula, data, clustervar, reps){
      reg1 <- glm(formula, data, family=binomial(link="logit"))
      # 1. generate the bootstrap dataset
      clusters <- unique(data[,paste(clustervar)])
      sterrs <- matrix(NA, nrow=reps, ncol=length(coef(reg1)))
      for(i in 1:reps){
        # draw clusters with replacement, then collect each drawn cluster's rows
        units <- sample(clusters, size=length(clusters), replace=TRUE)
        df.bs <- sapply(units, function(x) which(data[,paste(clustervar)]==x))
        df.bs <- data[unlist(df.bs),]
        sterrs[i,] <- coef(glm(formula, df.bs, family=binomial(link="logit")))
        cat("\r", i*100/reps, "% done "); flush.console()
      }
      val <- cbind(coef(reg1), apply(sterrs, 2, sd))
      rownames(val) <- names(coef(reg1)); colnames(val) <- c("Estimate","Std. Error")
      cat("\n")
      return(val)
    }
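
    For instance, a rough sketch of the parallel version using foreach with the doParallel backend (doParallel is one choice of backend; snow would also work):

    library(foreach)
    library(doParallel)
    registerDoParallel(cores=2) # adjust to the machine

    clusterBootSEpar <- function(formula, data, clustervar, reps){
      reg1 <- glm(formula, data, family=binomial(link="logit"))
      clusters <- unique(data[,paste(clustervar)])
      # repetitions are independent, so %dopar% spreads them across cores
      boot.coefs <- foreach(i=1:reps, .combine=rbind) %dopar% {
        units <- sample(clusters, size=length(clusters), replace=TRUE)
        rows <- unlist(lapply(units, function(x) which(data[,paste(clustervar)]==x)))
        coef(glm(formula, data[rows,], family=binomial(link="logit")))
      }
      val <- cbind(coef(reg1), apply(boot.coefs, 2, sd))
      colnames(val) <- c("Estimate","Std. Error")
      return(val)
    }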

  4. I can’t get clusterBootSE to work. There seems to be a problem with

    clusters<-unique(data[,paste(clustervar)])

    Is this just me, or is there a problem with your code, Sam?

    • Just some quick thoughts. The procedure in either the paper you link to or Cameron, Gelbach, and Miller (2008) should make it clear. If you want wild cluster bootstrap standard errors for every coefficient, you have to do each one separately, and you need to create a new pseudo-sample in each loop to re-estimate the parameter of interest. Suppose you are estimating a linear model with dependent variable ‘y’ and two independent variables ‘x1’ and ‘x2’, and you want the SE for b2, the coefficient on x2. In each iteration you create a new dependent variable equal to the fitted values from a restricted model (one with b2 = 0), with the residuals from that same restricted model added or subtracted (each with probability 0.5) at the cluster level. Within the loop you could have (some of the names of things are taken from my earlier post):

      for(i in 1:reps){
        ones <- sample(c(1,-1), length(clusters), replace=TRUE) # one Rademacher sign per cluster
        ones <- rep(ones, times=freq) # expand to one sign per observation (assumes data sorted by cluster)
        ds.bs <- data
        ds.bs[,'y'] <- c[[1]] + c[[2]]*ds.bs[,'x1'] + resids*ones
        b2[i] <- coef(lm(y~x1+x2, ds.bs))[[3]]
      }

      where freq is the frequency of each cluster, which you could obtain using

      freq <- as.data.frame(table(data[,paste(clustervar)]))
      freq <- freq[,'Freq']

      Here, resids are the residuals and c is the coefficients from the original restricted model:

      reg1 <- lm(y~x1, data)
      c <- reg1$coefficients # note: 'c' shadows base::c
      resids <- reg1$residuals

      obtained before the bootstrap loop, and b2 is a vector to store all the b2 estimates in.
      There are probably quicker and simpler ways, as ever, but I hope that helps.
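
      Putting those fragments together, a self-contained sketch on this post’s simulated data (with a hypothetical second regressor x2 added so that there is a b2 to bootstrap):

      # wild cluster bootstrap SE for b2: a minimal sketch
      dat2 <- dat[order(dat$g),]       # sort by cluster so the sign flips align
      dat2$x2 <- rnorm(nrow(dat2))     # hypothetical second regressor
      clusters <- unique(dat2$g)
      freq <- as.vector(table(dat2$g)) # observations per cluster
      res <- lm(y ~ x, dat2)           # restricted model: impose b2 = 0 by omitting x2
      cf <- coef(res)
      resids <- residuals(res)
      reps <- 1000
      b2 <- numeric(reps)
      for(i in 1:reps){
        ones <- sample(c(1,-1), length(clusters), replace=TRUE) # Rademacher weights
        ones <- rep(ones, times=freq)  # same sign for every observation in a cluster
        ds.bs <- dat2
        ds.bs$y <- cf[[1]] + cf[[2]]*ds.bs$x + resids*ones # rebuild y under the null
        b2[i] <- coef(lm(y ~ x + x2, ds.bs))[[3]]
      }
      sd(b2) # bootstrap standard error for b2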

  5. Dear Diffuseprior, I have a quick question: based on the example given in the article, I understand that we should sample 100 schools (the students are associated with their schools) for the cluster bootstrap regression. However, what if different schools have different numbers of students? If this is true (very likely), then the size of our bootstrapped sample at each iteration (or replication) will differ: sometimes it is above 1,000 (the original number of students), sometimes it is below 1,000. Is this a problem? Can you provide some textbook reference? Thank you very much!

  6. Thanks for this post. I have two questions:
    1. How can I proceed if I have double clusters (two levels of ID)?
    2. Do you have any reference papers for the sampling method you provide here?
    Thank you so much!

  7. Pingback: Speeding up the Cluster Bootstrap in R | DiffusePrioR

  8. Pingback: Speeding up the Cluster Bootstrap in R | A bunch of data
