Adjusting standard errors for clustering can be a very important part of any statistical analysis. For example, duplicating a data set will reduce the standard errors dramatically despite there being no new information.

I have previously dealt with this topic with reference to the linear regression model. However, in many cases one would like to obtain cluster-robust standard errors for more elaborate statistical analyses than a simple linear regression model. Typically, increases in model complexity are met with similar increases in the calculations required to estimate the model’s uncertainty via analytical means. Adding the required adjustment for group clustering furthers this often burdensome process.

Bootstrapping procedures offer a simple solution to this issue. Essentially, a nonparametric bootstrap re-estimates a model for a specified number of repetitions, each time using a resampled version of the data frame. The simplest bootstrap draws a random sample of the same dimension as the original data frame; because the sampling is with replacement, the resampled observations will generally differ from the originals. For each repetition the main analysis is run on the resampled data and the estimates are stored (for a linear regression model, the coefficients). Once all repetitions have been computed, the standard errors are obtained by taking the standard deviation of the stored estimates.
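As a quick illustration of these steps, here is a minimal sketch of the simple (non-cluster) bootstrap for lm() coefficients. This is my own toy example, not code from the analysis below; names like boot_coefs are made up.

```r
# Minimal sketch: nonparametric bootstrap of lm() coefficients.
set.seed(1)
x <- rnorm(100)
y <- 5 + 2*x + rnorm(100)
dat <- data.frame(x, y)

reps <- 500
n <- nrow(dat)
boot_coefs <- matrix(NA, nrow = reps, ncol = 2)
for (i in 1:reps) {
  samp <- dat[sample(n, n, replace = TRUE), ]  # resample rows with replacement
  boot_coefs[i, ] <- coef(lm(y ~ x, samp))     # store the coefficients
}
apply(boot_coefs, 2, sd)  # bootstrap standard errors
```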

Adapting this technique to tackle clustering is relatively straightforward. Instead of drawing the observation units with replacement, one draws the cluster units with replacement. For example, imagine we have a data frame consisting of 1,000 students from 100 schools. The non-cluster bootstrap draws 1,000 observations from the 1,000 students with replacement for each repetition. The cluster bootstrap will instead draw 100 schools with replacement.
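The cluster draw itself can be sketched like this, using simulated data in the spirit of the schools example (the variable names here are my own, not from the code below):

```r
# Sketch of one cluster-bootstrap draw: resample school IDs with
# replacement, then pull every row belonging to each drawn school.
set.seed(1)
dat <- data.frame(school = rep(1:100, each = 10),  # 1,000 students, 100 schools
                  score  = rnorm(1000))

schools <- unique(dat$school)
drawn <- sample(schools, length(schools), replace = TRUE)  # 100 schools, with replacement
boot_dat <- do.call(rbind, lapply(drawn, function(s) dat[dat$school == s, ]))
nrow(boot_dat)  # 1,000 rows here, because every simulated school has 10 students
```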

Below, I show how to formulate a simple cluster bootstrap procedure for a linear regression in R. In this analysis, I simulate some data and then falsely replicate the data frame three times, which causes the standard errors to drop. I then show how a simple cluster-robust command solves this issue (i.e. gives standard errors like those from the original, unreplicated data frame). Finally, I show how these cluster-robust standard errors can be obtained via bootstrapping. The code is anything but succinct (and therefore probably quite inefficient), so I would be happy to take suggestions on how one could improve my clusbootreg() function. Nevertheless, it is clear from the output that this function is giving the right answers.

# cluster-boot
library(lmtest)
library(sandwich)
rm(list=ls())

# data sim
x <- rnorm(1000)
y <- 5 + 2*x + rnorm(1000)

# regression
summary(lm(y~x))

# triple data
dat <- data.frame(x=c(x,x,x), y=c(y,y,y), g=c(1:1000,1:1000,1:1000))

# regressions
m1 <- lm(y~x, dat) # smaller StErrs
summary(m1)

# cluster robust standard error function
robust.se <- function(model, cluster){
  require(sandwich)
  require(lmtest)
  M <- length(unique(cluster))
  N <- length(cluster)
  K <- model$rank
  dfc <- (M/(M - 1)) * ((N - 1)/(N - K))
  uj <- apply(estfun(model), 2, function(x) tapply(x, cluster, sum))
  rcse.cov <- dfc * sandwich(model, meat = crossprod(uj)/N)
  rcse.se <- coeftest(model, rcse.cov)
  return(list(rcse.cov, rcse.se))
}
robust.se(m1, dat$g)[[2]] # StErrs now back to what they are

# cluster bootstrap function
clusbootreg <- function(formula, data, cluster, reps=1000){
  reg1 <- lm(formula, data)
  clusters <- names(table(cluster))
  sterrs <- matrix(NA, nrow=reps, ncol=length(coef(reg1)))
  for(i in 1:reps){
    index <- sample(1:length(clusters), length(clusters), replace=TRUE)
    aa <- clusters[index]
    bb <- table(aa)
    bootdat <- NULL
    for(j in 1:max(bb)){
      cc <- data[cluster %in% names(bb[bb %in% j]),]
      for(k in 1:j){
        bootdat <- rbind(bootdat, cc)
      }
    }
    sterrs[i,] <- coef(lm(formula, bootdat))
  }
  val <- cbind(coef(reg1), apply(sterrs, 2, sd))
  colnames(val) <- c("Estimate","Std. Error")
  return(val)
}

> # OUTPUT
> summary(lm(y~x)) # correct StErrs

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max
-3.10461 -0.63043  0.02924  0.70622  2.78926

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.96345    0.03084  160.97   <2e-16 ***
x            2.02747    0.03048   66.51   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9748 on 998 degrees of freedom
Multiple R-squared: 0.8159, Adjusted R-squared: 0.8157
F-statistic: 4424 on 1 and 998 DF, p-value: < 2.2e-16

> summary(lm(y~x, dat)) # smaller StErrs

Call:
lm(formula = y ~ x, data = dat)

Residuals:
     Min       1Q   Median       3Q      Max
-3.10461 -0.63043  0.02924  0.70622  2.78926

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.96345    0.01779   279.0   <2e-16 ***
x            2.02747    0.01759   115.3   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9742 on 2998 degrees of freedom
Multiple R-squared: 0.8159, Adjusted R-squared: 0.8159
F-statistic: 1.329e+04 on 1 and 2998 DF, p-value: < 2.2e-16

> robust.se(m1, dat$g)[[2]] # correct StErrs

t test of coefficients:

            Estimate Std. Error t value  Pr(>|t|)
(Intercept) 4.963447   0.030810 161.099 < 2.2e-16 ***
x           2.027469   0.029347  69.086 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> clusbootreg(formula=y~x, data=dat, cluster=dat$g, reps=1000) # correct StErrs
            Estimate Std. Error
(Intercept) 4.963447 0.03197737
x           2.027469 0.02920071

Thanks for putting this together. I look forward to applying it for my own work. A couple suggestions on how to improve the efficiency:

1. Set a constant (e.g., “cluster.len”) outside of the loop to stand in for “length(clusters).” That’s a calculation that needs to be performed only once, not 2*reps.

2. Using sample.int(cluster.len, replace=TRUE) instead of sample(1:cluster.len, cluster.len, replace=TRUE) would save some time, too.

3. replicate(reps, ...) would probably run much faster than for (i in 1:reps).

4. You might be able to work those other for loops into sapply or lapply commands.

There might be a few other tricks, but those are the ones that jump out at me.
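A quick sketch of suggestions 1 and 2 applied to the sampling step (my own illustration; clusters here is just a stand-in vector):

```r
# Hoist length(clusters) out of the loop and use sample.int(),
# which defaults to size = n and skips building the 1:n vector.
clusters <- letters[1:10]          # stand-in for the cluster names
cluster.len <- length(clusters)    # computed once, outside the loop
index <- sample.int(cluster.len, replace = TRUE)  # size defaults to cluster.len
aa <- clusters[index]              # the resampled cluster names
```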

Thanks for the comment Ben. I will try and use your suggestions.

Pingback: texreg: A package for beautiful and easily customizable LaTeX regression tables from R « DiffusePrioR

I came across this post as I was looking for a cluster bootstrap function in R. I had a quick play with the code and managed to get it down to about 30-40% of the time. I have put the code below if you are interested. It uses glm instead of lm as I was using logit. I’m pretty sure you could knock some more time off somewhere. The bootstrap function is itself perfectly parallelisable so you could use multicore using the snow or foreach packages.

clusterBootSE <- function(formula, data, clustervar, reps){
  reg1 <- glm(formula, data, family=binomial(link="logit"))
  # 1. generate the bootstrap dataset
  clusters <- unique(data[,paste(clustervar)])
  sterrs <- matrix(NA, nrow=reps, ncol=length(coef(reg1)))
  for(i in 1:reps){
    units <- sample(clusters, size=length(clusters), replace=TRUE)
    df.bs <- sapply(units, function(x) which(data[,paste(clustervar)]==x))
    df.bs <- data[unlist(df.bs),]  # index into the 'data' argument, not a global object
    sterrs[i,] <- coef(glm(formula, df.bs, family=binomial(link="logit")))
    cat("\r", i*100/reps, "% done "); flush.console()
  }
  val <- cbind(coef(reg1), apply(sterrs, 2, sd))
  rownames(val) <- names(coef(reg1)); colnames(val) <- c("Estimate","Std. Error")
  cat("\n")
  return(val)
}

Thanks for the code Sam. I might write another blog post on this.

Hi Sam,

Unfortunately I believe this approach is incorrect, since glm gives inconsistent & biased coefficients at each iteration (see here: http://davegiles.blogspot.com/2013/05/robust-standard-errors-for-nonlinear.html).

Just replace glm with another routine then. It was just a comment about speeding up bootstrap. I think you could still get it much further down though.

I can’t get clusterBootSE to work. There seems to be a problem with

clusters<-unique(data[,paste(clustervar)])

is this just me or is there a problem with your code, Sam?

Sorry about that, it was a problem on my computer. Fixed now. Thanks for the code guys!

No problem Jens. I am glad you found it useful.

Hi Jens – what exactly did you do? I got the same error upon using the code for a logit.

Error in `[.data.frame`(data, , paste(clustervar)) :
  undefined columns selected

Hello,

I was wondering how to modify this code for the wild bootstrap using the method found in Cameron, Gelbach and Miller (2008). Procedure can be found here: http://qed.econ.queensu.ca/working_papers/papers/qed_wp_1314.pdf.

Thank you so much.

Hi Louis, thanks for your comment. I am very busy at the moment, but might return to this at a later date.

Just some quick thoughts. The procedure in either the paper you link to or Cameron, Gelbach, and Miller (2008) should make it clear. If you want wild cluster bootstrap standard errors for every coefficient, you have to do each one separately, creating a new pseudo-sample in each loop to re-estimate the parameter of interest. Assume you are estimating a linear model with dependent variable ‘y’ and two independent variables ‘x1’ and ‘x2’, and that you want the SE for b2, the coefficient on x2. In each iteration you need to create a new dependent variable equal to the fitted values from a restricted model (with b2=0), with the residual from that same restricted model added or subtracted (each with probability 0.5). Within the loop you could have (some of the names of things are taken from my earlier post):

for(i in 1:reps){
  ones <- sample(c(1,-1), length(clusters), replace=TRUE)
  ones <- rep(ones, times=freq)
  ds.bs <- data
  ds.bs[,'y'] <- c[[1]] + c[[2]]*ds.bs[,'x1'] + resids*ones
  b2[i] <- coef(lm(y~x1+x2, ds.bs))[[3]]
}

where freq is the frequency of each cluster, which you could obtain using

freq <- as.data.frame(table(data[,paste(clustervar)]))
freq <- freq[,'Freq']

resids are the residuals and c is the coefficients from the original restricted model

reg1 <- lm(y~x1, data)
c <- reg1$coefficients
resids <- reg1$residuals

obtained before the bootstrap loop, and b2 is a vector to store all the b2 estimates in.

There are probably quicker and simpler ways, as ever, but I hope that helps.
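Pulling the pieces above together, here is a self-contained sketch with simulated data (the simulation, the number of clusters, and reps are my own choices; it assumes the data are sorted by cluster so that resids*ones lines up row by row):

```r
# Assembled sketch of the wild cluster bootstrap fragments above.
set.seed(1)
n.clust <- 50
data <- data.frame(clustervar = rep(1:n.clust, each = 20))  # sorted by cluster
data$x1 <- rnorm(nrow(data))
data$x2 <- rnorm(nrow(data))
data$y  <- 1 + data$x1 + 0.5*data$x2 + rnorm(nrow(data))

clusters <- unique(data$clustervar)
freq <- as.data.frame(table(data$clustervar))
freq <- freq[,'Freq']

# restricted model, imposing b2 = 0
reg1 <- lm(y ~ x1, data)
c <- reg1$coefficients
resids <- reg1$residuals

reps <- 200
b2 <- numeric(reps)
for(i in 1:reps){
  ones <- sample(c(1,-1), length(clusters), replace=TRUE)  # one +/-1 draw per cluster
  ones <- rep(ones, times=freq)                            # expand to observation level
  ds.bs <- data
  ds.bs[,'y'] <- c[[1]] + c[[2]]*ds.bs[,'x1'] + resids*ones
  b2[i] <- coef(lm(y ~ x1 + x2, ds.bs))[[3]]
}
sd(b2)  # wild cluster bootstrap SE for the coefficient on x2
```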

Thanks Sam, that seems pretty straightforward. I will look at it more closely to see how it works.

Thanks for the comment Sam.

Dear Diffuseprior, I have a quick question. Based on the example given in the article, I understand that we should sample 100 schools (with the students associated with their schools) for the cluster bootstrap regression. But what if different schools have different numbers of students? If so (very likely), the size of the bootstrapped sample will differ across iterations: sometimes above 1,000 (the original number of students), sometimes below. Is this a problem? Can you provide some textbook reference? Thank you very much!

Yes, sometimes the bootstrapped sample sizes won’t all be the same size. That’s not a problem. I don’t know a textbook example off hand.

@Chang, I think this is a problem here. Unless I’m mistaken, the code doesn’t seem to account for unequal sample sizes. This is best handled by sampling a bit differently. See the answer to this post: http://stats.stackexchange.com/questions/202916/cluster-boostrap-with-unequally-sized-clusters/202924#202924 for additional information on how this is typically handled with unequal cluster sizes.

Thanks for this post. I have two questions:

1. How can I proceed if I have double clusters (two level of ID)?

2. Do you have any reference papers to the sampling method you provide here?

Thank you so much!

Pingback: Speeding up the Cluster Bootstrap in R | DiffusePrioR

@chang, see this post. It describes how to handle uneven cluster sizes. http://stats.stackexchange.com/questions/202916/cluster-boostrap-with-unequally-sized-clusters/202924#202924
