# The Cluster Bootstrap

Adjusting standard errors for clustering can be a very important part of any statistical analysis. For example, duplicating a data set will reduce the standard errors dramatically despite there being no new information.

I have previously dealt with this topic with reference to the linear regression model. However, in many cases one would like to obtain cluster-robust standard errors for more elaborate statistical analyses than a simple linear regression model. Typically, increases in model complexity are met with similar increases in the calculations required to estimate the model’s uncertainty via analytical means. Adding the required adjustment for group clustering furthers this often burdensome process.

Bootstrapping procedures offer a simple solution to this issue. Essentially, a nonparametric bootstrap procedure estimates a model for a specified number of repetitions using samples of the data frame. The simplest bootstrap draws a random sample from the data frame such that the sample data frame is the same dimension as the main data frame, although the observations are not the same because it allows for random sampling with replacement. For each repetition the main analysis is repeated on the sample data, and the estimate stored (so for a linear regression model this would be the model’s coefficients). Once all repetitions have been computed the standard errors can be calculated by taking the standard deviation of the stored model estimates.

Adapting this technique to tackle clustering is relatively straightforward. Instead of drawing the observation units with replacement, one draws the cluster units with replacement. For example, imagine we have a data frame consisting of 1,000 students from 100 schools. The non-cluster bootstrap draws 1,000 observations from the 1,000 students with replacement for each repetition. The cluster bootstrap will instead draw 100 schools with replacement.

In the below, I show how to formulate a simple cluster bootstrap procedure for a linear regression in R. In this analysis, I simulate some data and then falsely replicate the data frame three times which causes the standard errors to drop. I then show how a simple cluster-robust command solves this issue (i.e. gives standard errors like the original data frame that had not been replicated). Finally, I show how these cluster-robust standard errors can be obtained via bootstrapping. The code is anything but succinct (and therefore probably very inefficient), so I would be happy to take suggestions on how one could improve my clusbootreg() function. Nevertheless, it is clear from the output that this function is giving the right answers.

# cluster-boot
library(lmtest) ; library(sandwich)
rm(list=ls())
# data sim
x <- rnorm(1000)
y <- 5 + 2*x + rnorm(1000)
# regression
summary(lm(y~x))

# triple data
dat <- data.frame(x=c(x,x,x),y=c(y,y,y),g=c(1:1000,1:1000,1:1000))
# regressions
m1 <- lm(y~x, dat) # smaller StErrs
summary(m1)

# cluster robust standard error function
robust.se <- function(model, cluster){
require(sandwich)
require(lmtest)
M <- length(unique(cluster))
N <- length(cluster)
K <- model$rank dfc <- (M/(M - 1)) * ((N - 1)/(N - K)) uj <- apply(estfun(model), 2, function(x) tapply(x, cluster, sum)); rcse.cov <- dfc * sandwich(model, meat = crossprod(uj)/N) rcse.se <- coeftest(model, rcse.cov) return(list(rcse.cov, rcse.se)) } robust.se(m1,dat$g)[[2]] # StErrs now back to what they are

# cluster bootstrap function
clusbootreg <- function(formula, data, cluster, reps=1000){
reg1 <- lm(formula, data)
clusters <- names(table(cluster))
sterrs <- matrix(NA, nrow=reps, ncol=length(coef(reg1)))
for(i in 1:reps){
index <- sample(1:length(clusters), length(clusters), replace=TRUE)
aa <- clusters[index]
bb <- table(aa)
bootdat <- NULL
for(j in 1:max(bb)){
cc <- data[cluster %in% names(bb[bb %in% j]),]
for(k in 1:j){
bootdat <- rbind(bootdat, cc)
}
}
sterrs[i,] <- coef(lm(formula, bootdat))
}
val <- cbind(coef(reg1),apply(sterrs,2,sd))
colnames(val) <- c("Estimate","Std. Error")
return(val)
}

> # OUTPUT
> summary(lm(y~x)) #correct StErrs

Call:
lm(formula = y ~ x)

Residuals:
Min       1Q   Median       3Q      Max
-3.10461 -0.63043  0.02924  0.70622  2.78926

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.96345    0.03084  160.97   <2e-16 ***
x            2.02747    0.03048   66.51   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9748 on 998 degrees of freedom
Multiple R-squared: 0.8159,  Adjusted R-squared: 0.8157
F-statistic:  4424 on 1 and 998 DF,  p-value: < 2.2e-16

> summary(lm(y~x, dat)) #smaller StErrs

Call:
lm(formula = y ~ x, data = dat)

Residuals:
Min       1Q   Median       3Q      Max
-3.10461 -0.63043  0.02924  0.70622  2.78926

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.96345    0.01779   279.0   <2e-16 ***
x            2.02747    0.01759   115.3   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9742 on 2998 degrees of freedom
Multiple R-squared: 0.8159,	Adjusted R-squared: 0.8159
F-statistic: 1.329e+04 on 1 and 2998 DF,  p-value: < 2.2e-16

> robust.se(m1,dat$g)[[2]] #correct StErrs t test of coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 4.963447 0.030810 161.099 < 2.2e-16 *** x 2.027469 0.029347 69.086 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > clusbootreg(formula=y~x, data=dat, cluster=dat$g, reps=1000) #correct StErrs
Estimate Std. Error
(Intercept) 4.963447 0.03197737
x           2.027469 0.02920071


# Standard, Robust, and Clustered Standard Errors Computed in R

Where do these come from? Since most statistical packages calculate these estimates automatically, it is not unreasonable to think that many researchers using applied econometrics are unfamiliar with the exact details of their computation.

For the purposes of illustration, I am going to estimate different standard errors from a basic linear regression model: $\textbf{y}=\textbf{X} \mathbf{\beta}+\textbf{u}$, using the fertil2 dataset used in Christopher Baum’s book. Let’s load these data, and estimate a linear regression with the lm function (which estimates the parameters $\hat{\mathbf{\beta}}$ using the all too familiar: $( \textbf{X}'\textbf{X})^{-1}\textbf{X}'\textbf{y}$ least squares estimator.

rm(list=ls())
library(foreign)
# lm formula and data
form <- ceb ~ age + agefbrth + usemeth
data <- children
# run regression
r1 <- lm(form, data)
# get stand errs
> summary(r1)

Call:
lm(formula = form, data = data)

Residuals:
Min      1Q  Median      3Q     Max
-6.8900 -0.7213 -0.0017  0.6950  6.2657

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.358134   0.173783   7.815 7.39e-15 ***
age          0.223737   0.003448  64.888  < 2e-16 ***
agefbrth    -0.260663   0.008795 -29.637  < 2e-16 ***
usemeth      0.187370   0.055430   3.380 0.000733 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.463 on 3209 degrees of freedom
(1148 observations deleted due to missingness)
Multiple R-squared: 0.5726,	Adjusted R-squared: 0.5722
F-statistic:  1433 on 3 and 3209 DF,  p-value: < 2.2e-16


When the error terms are assumed homoskedastic IID, the calculation of standard errors comes from taking the square root of the diagonal elements of the variance-covariance matrix which is formulated:

$E[\textbf{uu}'|\textbf{X}] = \mathbf{\Sigma_{u}}$

$\mathbf{\Sigma_{u}} = \sigma^2 I_{N}$

$\textrm{Var}[\hat{\mathbf{\beta}}|\textbf{X}] = (\textbf{X}'\textbf{X})^{-1} (\textbf{X}' \mathbf{\Sigma_{u}} \textbf{X}) (\textbf{X}'\textbf{X})^{-1}$

$\textrm{Var}[\hat{\mathbf{\beta}}|\textbf{X}] = \sigma_{u}^{2}(\textbf{X}'\textbf{X})^{-1}$

In practice, and in R, this is easy to do. Estimate the variance by taking the average of the ‘squared’ residuals $\textbf{uu}'$, with the appropriate degrees of freedom adjustment. Code is below. As you can see, these standard errors correspond exactly to those reported using the lm function.

# get X matrix/predictors
X <- model.matrix(r1)
# number of obs
n <- dim(X)[1]
# n of predictors
k <- dim(X)[2]
# calculate stan errs as in the above
# sq root of diag elements in vcov
se <- sqrt(diag(solve(crossprod(X)) * as.numeric(crossprod(resid(r1))/(n-k))))
> se
(Intercept)         age    agefbrth     usemeth
0.173782844 0.003448024 0.008795350 0.055429804


In the presence of heteroskedasticity, the errors are not IID. Consequentially, it is inappropriate to use the average squared residuals. The robust approach, as advocated by White (1980) (and others too), captures heteroskedasticity by assuming that the variance of the residual, while non-constant, can be estimated as a diagonal matrix of each squared residual. In other words, the diagonal terms in $\mathbf{\Sigma_{u}}$ will, for the most part, be different , so the j-th row-column element will be $\hat{u}_{j}^{2}$. Once again, in R this is trivially implemented.

# residual vector
u <- matrix(resid(r1))
# meat part Sigma is a diagonal with u^2 as elements
meat1 <- t(X) %*% diag(diag(crossprod(t(u)))) %*% X
dfc <- n/(n-k)
# like before
se <- sqrt(dfc*diag(solve(crossprod(X)) %*% meat1 %*% solve(crossprod(X))))
> se
(Intercept)         age    agefbrth     usemeth
0.167562394 0.004661912 0.009561617 0.060644558


Adjusting standard errors for clustering can be important. For example, replicating a dataset 100 times should not increase the precision of parameter estimates. However, performing this procedure with the IID assumption will actually do this. Another example is in economics of education research, it is reasonable to expect that the error terms for children in the same class are not independent.

Clustering standard errors can correct for this. Assume m clusters. Like in the robust case, it is $\textbf{X}' \mathbf{\Sigma_{u}} \textbf{X}$ or ‘meat’ part, that needs to be adjusted for clustering. In practice, this involves multiplying the residuals by the predictors for each cluster separately, and obtaining $\tilde{\textbf{u}}_{j} = \sum^{N_{k}}_{i=1} \hat{u}_{i}\textbf{x}_{i}$, an m by k matrix (where k is the number of predictors). ‘Squaring’ $\tilde{\textbf{u}}_{j}$ results in a k by k matrix (the meat part). To get the standard errors, one performs the same steps as before, after adjusting the degrees of freedom for clusters.

# cluster name
cluster <- "children"
# matrix for loops
clus <- cbind(X,data[,cluster],resid(r1))
colnames(clus)[(dim(clus)[2]-1):dim(clus)[2]] <- c(cluster,"resid")
# number of clusters
m <- dim(table(clus[,cluster]))
dfc <- (m/(m-1))*((n-1)/(n-k))
# uj matrix
uclust <- matrix(NA, nrow = m, ncol = k)
gs <- names(table(data[,cluster]))
for(i in 1:m){
uclust[i,] <- t(matrix(clus[clus[,cluster]==gs[i],k+2])) %*% clus[clus[,cluster]==gs[i],1:k]
}
se <- sqrt(diag(solve(crossprod(X)) %*% (t(uclust) %*% uclust) %*% solve(crossprod(X)))*dfc
> se
(Intercept)         age    agefbrth     usemeth
0.42485889  0.03150865  0.03542962  0.09435531


For calculating robust standard errors in R, both with more goodies and in (probably) a more efficient way, look at the sandwich package. The same applies to clustering and this paper. However, here is a simple function called ols which carries out all of the calculations discussed in the above.

ols <- function(form, data, robust=FALSE, cluster=NULL,digits=3){
r1 <- lm(form, data)
if(length(cluster)!=0){
data <- na.omit(data[,c(colnames(r1\$model),cluster)])
r1 <- lm(form, data)
}
X <- model.matrix(r1)
n <- dim(X)[1]
k <- dim(X)[2]
if(robust==FALSE & length(cluster)==0){
se <- sqrt(diag(solve(crossprod(X)) * as.numeric(crossprod(resid(r1))/(n-k))))
res <- cbind(coef(r1),se)
}
if(robust==TRUE){
u <- matrix(resid(r1))
meat1 <- t(X) %*% diag(diag(crossprod(t(u)))) %*% X
dfc <- n/(n-k)
se <- sqrt(dfc*diag(solve(crossprod(X)) %*% meat1 %*% solve(crossprod(X))))
res <- cbind(coef(r1),se)
}
if(length(cluster)!=0){
clus <- cbind(X,data[,cluster],resid(r1))
colnames(clus)[(dim(clus)[2]-1):dim(clus)[2]] <- c(cluster,"resid")
m <- dim(table(clus[,cluster]))
dfc <- (m/(m-1))*((n-1)/(n-k))
uclust  <- apply(resid(r1)*X,2, function(x) tapply(x, clus[,cluster], sum))
se <- sqrt(diag(solve(crossprod(X)) %*% (t(uclust) %*% uclust) %*% solve(crossprod(X)))*dfc)
res <- cbind(coef(r1),se)
}
res <- cbind(res,res[,1]/res[,2],(1-pnorm(abs(res[,1]/res[,2])))*2)
res1 <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),res)),nrow=dim(res)[1])
rownames(res1) <- rownames(res)
colnames(res1) <- c("Estimate","Std. Error","t value","Pr(>|t|)")
return(res1)
}

# with data as before
> ols(ceb ~ age + agefbrth + usemeth,children)
Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.358      0.174   7.815    0.000
age            0.224      0.003  64.888    0.000
agefbrth      -0.261      0.009 -29.637    2.000
usemeth        0.187      0.055   3.380    0.001
> ols(ceb ~ age + agefbrth + usemeth,children,robust=T)
Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.358      0.168   8.105    0.000
age            0.224      0.005  47.993    0.000
agefbrth      -0.261      0.010 -27.261    2.000
usemeth        0.187      0.061   3.090    0.002
> ols(ceb ~ age + agefbrth + usemeth,children,cluster="children")
Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.358      0.425   3.197    0.001
age            0.224      0.032   7.101    0.000
agefbrth      -0.261      0.035  -7.357    2.000
usemeth        0.187      0.094   1.986    0.047