Speeding up the Cluster Bootstrap in R

Back in January 2013 I wrote a blog post showing how to implement a basic cluster/block bootstrap in R. One drawback of the cluster bootstap is the length of time it takes to sample with replacement and create the data samples. Thankfully some of the comments on my previous post illustrated simple ways to get speed gains. However, even with these gains this procedure is extremely time consuming.

I have been using the cluster bootstrap in some of my research and have found another way to speed things up—use parallel processing power. I appreciate I might be somewhat late to the multicore functions, but hopefully somebody has been having a similar issue as me can take solace from this post.

In the code below I demonstrate how the function “clusterApply” from the package “snow” can be used as a replacement for the regular “apply” function. Note the cluster in clusterApply refers to the mulitcore clusters rather than the clusters in the data frame. My code sets up a simple regression problem, wherein the standard error of the the regressor is 0.4. To demonstrate the clustering phenomenon I duplicate the data frame of  10,000 observations 20 times. As a result of this the standard error falls to 0.09 based on the naive estimate of the variance-covariance matrix.

The clustering problem can easily be corrected using the “felm” function from (what I consider the best R package) “lfe”. However, there are many occasions where researchers might want to use econometric techniques that do not lend themselves to a simple variance-covariance correction like the OLS or 2SLS estimators. These are the situations where you wan to use the cluster bootstrap.

The code below demonstrates how this can be done with and without using parallel processing. The only difference is that the parallel processing requires the user to set the number of clusters (again not clusters in the data!) and use clusterApply instead of apply. In this application, using parallel processing reduces the cluster bootstrap time down from 5 mins 42 seconds to 4 mins 6 seconds. This might seem reasonably trivial however in this simple application I am using a relatively small number of observations (10,000). The parallel processing method will get relatively quicker the larger the number of observations. Also, you can increase this by having a computer with more cores.

I appreciate any comments or criticism people might have on the code below. If anybody can think of a way that would help me to speed this up even more I would be delighted to hear it.

# cluster bootstrap with paralell processing

# packages for cluster standard errors
# use multicore functions

# set up simulation
n <- 10000 # number of observations
x <- rnorm(n)
y <- 5 + 2*x + rnorm(n, 0, 40)
# regression
m1 <- lm(y ~ x)
# standard error is 0.4

# duplicate data
dpt <- 20 # dpt times
dat <- data.frame(x = rep(x, dpt) , y = rep(y, dpt), g = rep(1:n, dpt))

# regressions with no clustering
m2 <- lm(y ~ x, data = dat) # smaller StErrs
# standard errors are like m1 = 0.09

# now cluster
summary(felm(y ~ x | 0 | 0 | g, data = dat))
# standard errors are like m1 = 0.4

# lets do this with a regular cluster bootstap
reps <- 50 # 50 reps in practice do more
clusters <- unique(dat$g)
boot.res1 <- matrix(NA, nrow = reps, ncol = 1)

# open time stamp
t1 <- Sys.time()
# set the seed 
# do in loop
for(i in 1:reps){
  # sample the clusters with replacement
  units <- sample(clusters, size = length(clusters), replace=T)
  # create bootstap sample with sapply
  df.bs <- sapply(units, function(x) which(dat[,"g"]==x))
  df.bs <- dat[unlist(df.bs),]
  boot.res1[i] <- coef(lm(y ~ x, data = df.bs))[2]
# close time stamp
t2 <- Sys.time()
t2 - t1 

sd(boot.res1) # good bootstrap standard errors are = 0.4

# now lets speed up the sapply function from the previous example

boot.res2 <- matrix(NA, nrow = reps, ncol = 1)

# set the seed 

cl <- makeCluster(10)
# open time stamp
t3 <- Sys.time()
# do in loop
for(i in 1:reps){
  # sample the clusters with replacement
  units <- sample(clusters, size = length(clusters), replace = T)
  # now use the 10 cores instead of 1!
  clusterExport(cl, c("dat", "units"))
  # cluster apply instead of regular apply
  df.bs = clusterApply(cl, units, function(x) which(dat$g == x))
  df.bs <- dat[unlist(df.bs),]
  boot.res2[i] <- coef(lm(y ~ x, data = df.bs))[2]
# close time stamp
t4 <- Sys.time()
t4 - t3 


sd(boot.res2) # good bootstrap standard errors are = 0.4


2 thoughts on “Speeding up the Cluster Bootstrap in R

  1. Pingback: Distilled News | Data Analytics & R

  2. As far as speeding up your code, I suspect that the for(i in 1:reps) loop could be replaced with parLapply or similar (I use the parallel package rather than snow, though they’re obviously similar) and the code rearranged a little to use the list it outputs.

    The multiwayvcov package (which, full disclosure, I maintain) provides a cluster.boot function that automates a lot of this and includes support for parallel execution, either through the boot package or by supplying your own cluster object. It has some limitations, but I think it’s a reasonable starting point. I’m also under the impression that the felm function from lfe (I totally agree that lfe is the most useful CRAN package in R, at least for econometrics) has bootstrapping as an option.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s