# The Cluster Bootstrap

Adjusting standard errors for clustering can be a crucial part of any statistical analysis. For example, duplicating a data set will reduce the standard errors dramatically even though no new information has been added.

I have previously dealt with this topic with reference to the linear regression model. However, in many cases one would like to obtain cluster-robust standard errors for more elaborate statistical analyses than a simple linear regression model. Typically, increases in model complexity are met with similar increases in the calculations required to estimate the model’s uncertainty via analytical means. Adding the adjustment required for group clustering compounds this already burdensome process.

Bootstrapping procedures offer a simple solution to this issue. Essentially, a nonparametric bootstrap procedure estimates a model for a specified number of repetitions using samples of the data frame. The simplest bootstrap draws a random sample from the data frame such that the sampled data frame has the same dimensions as the original, although the observations differ because sampling is with replacement. For each repetition the main analysis is repeated on the sampled data and the estimates are stored (for a linear regression model, the coefficients). Once all repetitions have been computed, the standard errors can be calculated by taking the standard deviation of the stored estimates.
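As a minimal illustration of this non-cluster version (my own sketch, separate from the analysis below), bootstrapping the coefficients of a linear regression might look like:

```r
# simple nonparametric bootstrap of lm() coefficients
set.seed(42)
x <- rnorm(1000)
y <- 5 + 2*x + rnorm(1000)
dat <- data.frame(x, y)

reps <- 200
boot.coefs <- matrix(NA, nrow=reps, ncol=2)
for(i in 1:reps){
  # draw rows with replacement; same dimensions as the original data frame
  index <- sample(nrow(dat), nrow(dat), replace=TRUE)
  boot.coefs[i,] <- coef(lm(y ~ x, data=dat[index,]))
}
apply(boot.coefs, 2, sd) # bootstrap standard errors for intercept and slope
```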

Adapting this technique to tackle clustering is relatively straightforward. Instead of drawing the observation units with replacement, one draws the cluster units with replacement. For example, imagine we have a data frame consisting of 1,000 students from 100 schools. The non-cluster bootstrap draws 1,000 observations from the 1,000 students with replacement for each repetition. The cluster bootstrap will instead draw 100 schools with replacement.
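A single repetition of that cluster draw can be sketched as follows (a toy data frame with made-up `student` and `school` columns, not the simulation used below):

```r
# one cluster-bootstrap draw: sample schools with replacement, then stack their students
set.seed(42)
dat <- data.frame(student = 1:1000,
                  school  = rep(1:100, each = 10)) # toy data: 10 students per school
schools <- unique(dat$school)
drawn <- sample(schools, length(schools), replace = TRUE) # 100 schools, some repeated
bootdat <- do.call(rbind, lapply(drawn, function(s) dat[dat$school == s, ]))
nrow(bootdat) # 1000 here, but varies when school sizes differ
```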

Below, I show how to formulate a simple cluster bootstrap procedure for a linear regression in R. In this analysis, I simulate some data and then falsely replicate the data frame three times, which causes the standard errors to drop. I then show how a simple cluster-robust command solves this issue (i.e. gives standard errors like those from the original, unreplicated data frame). Finally, I show how these cluster-robust standard errors can be obtained via bootstrapping. The code is anything but succinct (and therefore probably very inefficient), so I would be happy to take suggestions on how one could improve my clusbootreg() function. Nevertheless, it is clear from the output that this function is giving the right answers.

```r
# cluster-boot
library(lmtest); library(sandwich)
rm(list=ls())
# data sim
x <- rnorm(1000)
y <- 5 + 2*x + rnorm(1000)
# regression
summary(lm(y~x))

# triple data
dat <- data.frame(x=c(x,x,x),y=c(y,y,y),g=c(1:1000,1:1000,1:1000))
# regressions
m1 <- lm(y~x, dat) # smaller StErrs
summary(m1)

# cluster robust standard error function
robust.se <- function(model, cluster){
  require(sandwich)
  require(lmtest)
  M <- length(unique(cluster))
  N <- length(cluster)
  K <- model$rank
  dfc <- (M/(M - 1)) * ((N - 1)/(N - K))
  uj <- apply(estfun(model), 2, function(x) tapply(x, cluster, sum))
  rcse.cov <- dfc * sandwich(model, meat = crossprod(uj)/N)
  rcse.se <- coeftest(model, rcse.cov)
  return(list(rcse.cov, rcse.se))
}

robust.se(m1, dat$g)[[2]] # StErrs now back to what they were

# cluster bootstrap function
clusbootreg <- function(formula, data, cluster, reps=1000){
  reg1 <- lm(formula, data)
  clusters <- names(table(cluster))
  sterrs <- matrix(NA, nrow=reps, ncol=length(coef(reg1)))
  for(i in 1:reps){
    # draw clusters (not observations) with replacement
    index <- sample(1:length(clusters), length(clusters), replace=TRUE)
    aa <- clusters[index]
    bb <- table(aa)
    bootdat <- NULL
    # stack each drawn cluster as many times as it was drawn
    for(j in 1:max(bb)){
      cc <- data[cluster %in% names(bb[bb %in% j]),]
      for(k in 1:j){
        bootdat <- rbind(bootdat, cc)
      }
    }
    sterrs[i,] <- coef(lm(formula, bootdat))
  }
  val <- cbind(coef(reg1), apply(sterrs, 2, sd))
  colnames(val) <- c("Estimate","Std. Error")
  return(val)
}
```
```
> # OUTPUT
> summary(lm(y~x)) #correct StErrs

Call:
lm(formula = y ~ x)

Residuals:
Min       1Q   Median       3Q      Max
-3.10461 -0.63043  0.02924  0.70622  2.78926

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.96345    0.03084  160.97   <2e-16 ***
x            2.02747    0.03048   66.51   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9748 on 998 degrees of freedom
Multiple R-squared: 0.8159,  Adjusted R-squared: 0.8157
F-statistic:  4424 on 1 and 998 DF,  p-value: < 2.2e-16

> summary(lm(y~x, dat)) #smaller StErrs

Call:
lm(formula = y ~ x, data = dat)

Residuals:
Min       1Q   Median       3Q      Max
-3.10461 -0.63043  0.02924  0.70622  2.78926

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.96345    0.01779   279.0   <2e-16 ***
x            2.02747    0.01759   115.3   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9742 on 2998 degrees of freedom
Multiple R-squared: 0.8159,	Adjusted R-squared: 0.8159
F-statistic: 1.329e+04 on 1 and 2998 DF,  p-value: < 2.2e-16

> robust.se(m1, dat$g)[[2]] #correct StErrs

t test of coefficients:

Estimate Std. Error t value  Pr(>|t|)
(Intercept) 4.963447   0.030810 161.099 < 2.2e-16 ***
x           2.027469   0.029347  69.086 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> clusbootreg(formula=y~x, data=dat, cluster=dat$g, reps=1000) #correct StErrs
Estimate Std. Error
(Intercept) 4.963447 0.03197737
x           2.027469 0.02920071
```

## 23 thoughts on “The Cluster Bootstrap”

1. Ben

Thanks for putting this together. I look forward to applying it for my own work. A couple suggestions on how to improve the efficiency:

1. Set a constant (e.g., “cluster.len”) outside of the loop to stand in for “length(clusters).” That’s a calculation that needs to be performed only once, not 2*reps.
2. Using sample.int(cluster.len, replace=TRUE) instead of sample(1:cluster.len, cluster.len, replace=TRUE) would save some time, too.
3. replicate(reps) would probably run much faster than for (i in 1:reps).
4. You might be able to work those other for loops into sapply or lapply commands.

There might be a few other tricks, but those are the ones that jump out at me.
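Ben's first two suggestions can be sketched as follows (my own illustration; the `clusters` object here is just a stand-in for the cluster IDs inside clusbootreg()):

```r
# hoist the constant out of the loop and use sample.int()
clusters <- names(table(rep(1:100, each = 10))) # stand-in for the cluster IDs
cluster.len <- length(clusters)                 # computed once, outside the loop
# same draw as sample(1:cluster.len, cluster.len, replace=TRUE), but faster
index <- sample.int(cluster.len, replace = TRUE)
```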

• diffuseprior

Thanks for the comment Ben. I will try and use your suggestions.

2. Sam Watson

I came across this post as I was looking for a cluster bootstrap function in R. I had a quick play with the code and managed to get it down to about 30-40% of the time. I have put the code below if you are interested. It uses glm instead of lm as I was using logit. I’m pretty sure you could knock some more time off somewhere. The bootstrap function is itself perfectly parallelisable, so you could run it across multiple cores using the snow or foreach packages.

clusterBootSE <- function(formula, data, clustervar, reps){
  reg1 <- glm(formula, data=data, family=binomial)
  clusters <- unique(data[, paste(clustervar)])
  sterrs <- matrix(NA, nrow=reps, ncol=length(coef(reg1)))
  for(i in 1:reps){
    #1. generate the bootstrap dataset by drawing clusters with replacement
    units <- sample(clusters, size=length(clusters), replace=TRUE)
    df.bs <- sapply(units, function(x) which(data[, paste(clustervar)]==x))
    df.bs <- data[unlist(df.bs),]
    #2. re-estimate the model and store the coefficients
    sterrs[i,] <- coef(glm(formula, data=df.bs, family=binomial))
    cat("\r", i*100/reps, "% done "); flush.console()
  }
  val <- cbind(coef(reg1), apply(sterrs, 2, sd))
  rownames(val) <- names(coef(reg1)); colnames(val) <- c("Estimate","Std. Error")
  cat("\n")
  return(val)
}

• diffuseprior

Thanks for the code Sam. I might write another blog post on this.

• Sam Watson

Just replace glm with another routine then. It was just a comment about speeding up the bootstrap. I think you could still get it much further down though.

3. Jens B

I can’t get clusterBootSE to work. There seems to be a problem with

clusters<-unique(data[,paste(clustervar)])

is this just me or is there a problem with your code, Sam?

4. Jens B

Sorry about that, it was a problem on my computer. Fixed now. Thanks for the code guys!

• diffuseprior

No problem Jens. I am glad you found it useful.

• SinghS

Hi Jens – what exactly did you do? I got the same error upon using the code for a logit.
Error in `[.data.frame`(data, , paste(clustervar)) :
undefined columns selected

5. Louisp

Hello,

I was wondering how to modify this code for the wild bootstrap using the method found in Cameron, Gelbach and Miller (2008). Procedure can be found here: http://qed.econ.queensu.ca/working_papers/papers/qed_wp_1314.pdf.
Thank you so much.

• diffuseprior

Hi Louis, thanks for your comment. I am very busy at the moment, but might return to this at a later date.

• Sam Watson

Just some quick thoughts. The procedure in either the paper you link to or Cameron, Gelbach, and Miller (2008) should make it clear. If you want wild cluster bootstrap standard errors for every coefficient, you have to do each one separately, since you need to create a new pseudo-sample in each loop to re-estimate the parameter of interest. Assume you are estimating a linear model with dependent variable ‘y’ and two independent variables ‘x1’ and ‘x2’, and you want the SE for b2, the coefficient on x2. In each iteration you create a new dependent variable equal to the fitted values from a restricted model (with b2=0), with the residuals from that same restricted model added or subtracted (with probability 0.5, per cluster). Within the loop you could have (some of the names of things are taken from my earlier post):

for(i in 1:reps){
  ones <- sample(c(1,-1), length(clusters), replace=TRUE)
  ones <- rep(ones, times=freq)
  ds.bs <- data
  ds.bs[,'y'] <- c[1] + c[2]*ds.bs[,'x1'] + resids*ones
  b2[i] <- coef(lm(y~x1+x2, ds.bs))['x2']
}

where freq is the frequency of each cluster which you could obtain using

freq <- as.data.frame(table(data[,paste(clustervar)]))
freq <- freq[,'Freq']

resids are the residuals and c is the coefficients from the original restricted model

reg1 <- lm(y~x1, data)
c <- reg1$coefficients
resids <- reg1$residuals

obtained before the bootstrap loop, and b2 is a vector to store all the b2 estimates in.
There are probably quicker and simpler ways, as ever, but I hope that helps.

• Louisp

Thank Sam, that seems pretty straightforward. I will look at it more closely, to see how it works.

• diffuseprior

Thanks for the comment Sam.

6. Chang

Dear Diffuseprior, I have a quick question: based on the example given in the article, I understand that we should sample 100 schools (the students are associated with their schools) for the cluster bootstrap regression. However, what if different schools have different numbers of students? If this is true (very likely), then the size of our bootstrapped sample at each iteration (or replication) will be different: sometimes it is above 1,000 (the original number of students), sometimes below 1,000. Is this a problem? Can you provide some textbook reference? Thank you very much!

• diffuseprior

Yes, sometimes the bootstrapped sample sizes won’t all be the same size. That’s not a problem. I don’t know a textbook example off hand.

• Stat

@Chang, I think this is a problem here. Unless I’m mistaken, the code doesn’t seem to account for unequal sample sizes. This is best handled by sampling a bit differently. See the answer to this post: http://stats.stackexchange.com/questions/202916/cluster-boostrap-with-unequally-sized-clusters/202924#202924 for additional information on how this is typically handled with unequal cluster sizes.

7. Nancy

Thanks for this post. I have two questions:
1. How can I proceed if I have double clusters (two level of ID)?
2. Do you have any reference papers to the sampling method you provide here?
Thank you so much!