Endogenous Spatial Lags for the Linear Regression Model

Over the past number of years, I have noted that spatial econometric methods have been gaining popularity. This is a welcome trend in my opinion, as the spatial structure of data is something that should be explicitly included in the empirical modelling procedure. Omitting spatial effects assumes that the location co-ordinates for observations are unrelated to the observable characteristics that the researcher is trying to model. Not a good assumption, particularly in empirical macroeconomics where the unit of observation is typically countries or regions.

Starting out with the prototypical linear regression model: y = X \beta + \epsilon, we can modify this equation in a number of ways to account for the spatial structure of the data. In this blog post, I will concentrate on the spatial lag model. I intend to examine spatial error models in a future blog post.

The spatial lag model is of the form: y= \rho W y + X \beta + \epsilon, where the term \rho W y measures the potential spill-over effect that occurs in the outcome variable if this outcome is influenced by other unit’s outcomes, where the location or distance to other observations is a factor in for this spill-over. In other words, the neighbours for each observation have greater (or in some cases less) influence to what happens to that observation, independent of the other explanatory variables (X). The W matrix is a matrix of spatial weights, and the \rho parameter measures the degree of spatial correlation. The value of \rho is bounded between -1 and 1. When \rho is zero, the spatial lag model collapses to the prototypical linear regression model.

The spatial weights matrix should be specified by the researcher. For example, let us have a dataset that consists of 3 observations, spatially located on a 1-dimensional Euclidean space wherein the first observation is a neighbour of the second and the second is a neighbour of the third. The spatial weights matrix for this dataset should be a 3 \times 3 matrix, where the diagonal consists of 3 zeros (you are not a neighbour with yourself). Typically, this matrix will also be symmetric. It is also at the user’s discretion to choose the weights in W. Common schemes include nearest k neighbours (where k is again at the users discretion), inverse-distance, and other schemes based on spatial contiguities. Row-standardization is usually performed, such that all the row elements in W sum to one. In our simple example, the first row of a contiguity-based scheme would be: [0, 1, 0]. The second: [0.5, 0, 0.5]. And the third: [0, 1, 0].

While the spatial-lag model represents a modified version of the basic linear regression model, estimation via OLS is problematic because the spatially lagged variable (Wy) is endogenous. The endogeneity results from what Charles Manski calls the ‘reflection problem’: your neighbours influence you, but you also influence your neighbours. This feedback effect results in simultaneity which renders bias on the OLS estimate of the spatial lag term. A further problem presents itself when the independent variables (X) are themselves spatially correlated. In this case, completely omitting the spatial lag from the model specification will bias the \beta coefficient values due to omitted variable bias.

Fortunately, remedying these biases is relatively simple, as a number of estimators exist that will yield an unbiased estimate of the spatial lag, and consequently the coefficients for the other explanatory variables—assuming, of course, that these explanatory variables are themselves exogenous. Here, I will consider two: the Maximum Likelihood estimator (denoted ML) as described in Ord (1975), and a generalized two-stage least squares regression model (2SLS) wherein spatial lags, and spatial lags lags (i.e. W^{2} X) of the explanatory variables are used as instruments for Wy. Alongside these two models, I also estimate the misspecified OLS both without (OLS1) and with (OLS2) the spatially lagged dependent variable.

To examine the properties of these four estimators, I run a Monte Carlo experiment. First, let us assume that we have 225 observations equally spread over a 15 \times 15 lattice grid. Second, we assume that neighbours are defined by what is known as the Rook contiguity, so a neighbour exists if they are in the grid space either above or below or on either side. Once we create the spatial weight matrix we row-standardize.

Taking our spatial weights matrix as defined, we want to simulate the following linear model: y = \rho Wy + \beta_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \epsilon, where we set \rho=0.4 , \beta_{1}=0.5, \beta_{2}=-0.5, \beta_{3}=1.75. The explanatory variables are themselves spatially autocorrelated, so our simulation procedure first simulates a random normal variable for both x_{2} and x_{3} from: N(0, 1), then assuming a autocorrelation parameter of \rho_{x}=0.25, generates both variables such that: x_{j} = (1-\rho_{x}W)^{-1} N(0, 1) for j \in \{ 1,2 \}. In the next step we simulate the error term \epsilon. We introduce endogeneity into the spatial lag by assuming that the error term \epsilon is a function of a random normal v so \epsilon = \alpha v + N(0, 1) where v = N(0, 1) and \alpha=0.2, and that the spatial lag term includes v. We modify the regression model to incorporate this: y = \rho (Wy + v) + \beta_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \epsilon. From this we can calculate the reduced form model: y = (1 - \rho W)^{-1} (\rho v + \beta_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \epsilon), and simulate values for our dependent variable y.

Performing 1,000 repetitions of the above simulation permits us to examine the distributions of the coefficient estimates produced by the four models outlined in the above. The distributions of these coefficients are displayed in the graphic in the beginning of this post. The spatial autocorrelation parameter \rho is in the bottom-right quadrant. As we can see, the OLS model that includes the spatial effect but does not account for simultaneity (OLS2) over-estimates the importance of spatial spill-overs. Both the ML and 2SLS estimators correctly identify the \rho parameter. The remaining quadrants highlight what happens to the coefficients of the explanatory variables. Clearly, the OLS1 estimator provides the worst estimate of these coefficients. Thus, it appears preferable to use OLS2, with the biased autocorrelation parameter, than the simpler OLS1 estimator. However, the OLS2 estimator also yields biased parameter estimates for the \beta coefficients. Furthermore, since researchers may want to know the marginal effects in spatial equilibrium (i.e. taking into account the spatial spill-over effects) the overestimated \rho parameter creates an additional bias.

To perform these calculations I used the spdep package in R, with the graphic created via ggplot2. Please see the R code I used in the below.

library(spdep) ; library(ggplot2) ; library(reshape)

n = 225
data = data.frame(n1=1:n)
# coords
data$lat = rep(1:sqrt(n), sqrt(n))
data$long = sort(rep(1:sqrt(n), sqrt(n)))
# create W matrix
wt1 = as.matrix(dist(cbind(data$long, data$lat), method = "euclidean", upper=TRUE))
wt1 = ifelse(wt1==1, 1, 0)
diag(wt1) = 0
# row standardize
rs = rowSums(wt1)
wt1 = apply(wt1, 2, function(x) x/rs)
lw1 = mat2listw(wt1, style="W")

rx = 0.25
rho = 0.4
b1 = 0.5
b2 = -0.5
b3 = 1.75
alp = 0.2
inv1 = invIrW(lw1, rho=rx, method="solve", feasible=NULL)
inv2 = invIrW(lw1, rho=rho, method="solve", feasible=NULL)

sims = 1000
beta1results = matrix(NA, ncol=4, nrow=sims)
beta2results = matrix(NA, ncol=4, nrow=sims)
beta3results = matrix(NA, ncol=4, nrow=sims)
rhoresults = matrix(NA, ncol=3, nrow=sims)

for(i in 1:sims){
  u1 = rnorm(n)
  x2 = inv1 %*% u1
  u2 = rnorm(n)
  x3 = inv1 %*% u2
  v1 = rnorm(n)
  e1 = alp*v1 + rnorm(n)  
  data1 = data.frame(cbind(x2, x3),lag.listw(lw1, cbind(x2, x3)))
  names(data1) = c("x2","x3","wx2","wx3")
  data1$y1 = inv2 %*% (b1 + b2*x2 + b3*x3 + rho*v1 + e1)
  data1$wy1 = lag.listw(lw1, data1$y1)
  data1$w2x2 = lag.listw(lw1, data1$wx2)
  data1$w2x3 = lag.listw(lw1, data1$wx3)
  data1$w3x2 = lag.listw(lw1, data1$w2x2)
  data1$w3x3 = lag.listw(lw1, data1$w2x3)

  m1 = coef(lm(y1 ~ x2 + x3, data1))
  m2 = coef(lm(y1 ~ wy1 + x2 + x3, data1))
  m3 = coef(lagsarlm(y1 ~ x2 + x3, data1, lw1))
  m4 = coef(stsls(y1 ~ x2 + x3, data1, lw1))
  beta1results[i,] = c(m1[1], m2[1], m3[2], m4[2])
  beta2results[i,] = c(m1[2], m2[3], m3[3], m4[3])
  beta3results[i,] = c(m1[3], m2[4], m3[4], m4[4])
  rhoresults[i,] = c(m2[2],m3[1], m4[1])  

apply(rhoresults, 2, mean) ; apply(rhoresults, 2, sd)
apply(beta1results, 2, mean) ; apply(beta1results, 2, sd)
apply(beta2results, 2, mean) ; apply(beta2results, 2, sd)
apply(beta3results, 2, mean) ; apply(beta3results, 2, sd)

colnames(rhoresults) = c("OLS2","ML","2SLS")
colnames(beta1results) = c("OLS1","OLS2","ML","2SLS")
colnames(beta2results) = c("OLS1","OLS2","ML","2SLS")
colnames(beta3results) = c("OLS1","OLS2","ML","2SLS")

rhoresults = melt(rhoresults)
rhoresults$coef = "rho"
rhoresults$true = 0.4

beta1results = melt(beta1results)
beta1results$coef = "beta1"
beta1results$true = 0.5

beta2results = melt(beta2results)
beta2results$coef = "beta2"
beta2results$true = -0.5

beta3results = melt(beta3results)
beta3results$coef = "beta3"
beta3results$true = 1.75

data = rbind(rhoresults,beta1results,beta2results,beta3results)
data$Estimator = data$X2

ggplot(data, aes(x=value, colour=Estimator, fill=Estimator)) + 
  geom_density(alpha=.3) + 
  facet_wrap(~ coef, scales= "free") +
  geom_vline(aes(xintercept=true)) + 
  scale_y_continuous("Density") +
  scale_x_continuous("Effect Size") +
  opts(legend.position = 'bottom', legend.direction = 'horizontal')

How Much Should Bale Cost Real?

It looks increasingly likely that Gareth Bale will transfer from Tottenham to Real Madrid for a world record transfer fee. Negotiations are ongoing, with both parties keen to get the best deal possible deal with the transfer fee. Reports speculate that this transfer fee will be anywhere in the very wide range of £80m to £120m.

Given the topical nature of this transfer saga, I decided to explore the world record breaking transfer fee data, and see if these data can help predict what the Gareth Bale transfer fee should be. According to this Wikipedia article, there have been 41 record breaking transfers, from Willie Groves going from West Brom to Aston Villa in 1893 for £100, to Cristiano Ronaldo’s £80m 2009 transfer to Real Madrid from Manchester United.

When comparing any historical price data it is very important that we are comparing like with like. Clearly, a fee of £100 in 1893 is not the same as £100 in 2009. Therefore, the world record transfer fees need to be adjusted for inflation. To do this, I used the excellent measuringworth website, and converted all of the transfer fees into 2011 pounds sterling.


The plot above demonstrates a very strong linear relationship between logged real world record transfer fees and time. The R-squared indicates that the year of the transfer fee explains roughly 97% of the variation in price.

So, if Real Madrid are to pay a world transfer fee for Bale, how much does this model predict the fee will be? The above plot demonstrates what happens when the simple log-linear model is extrapolated to predict the world record transfer fee in 2013. The outcome here is 18.37, so around £96m, in 2011 prices. We can update this value to 2013 prices. Assuming a modest inflation rate of 2% we get £96m[exp(0.02*2)]=£99.4m. No small potatoes.


bale = read.csv("bale.csv")
# data from:
# http://en.wikipedia.org/wiki/World_football_transfer_record
# http://www.measuringworth.com/ukcompare/

ols1 = lm(log(real2011)~year, bale)

# price
# inflate lets say 2% inflation

# nice ggplot
bale$lnprice2011 = log(bale$real2011)
addon = data.frame(year=2013,nominal=0,real2011=0,name="Bale?",

ggplot(bale, aes(x=year, y=lnprice2011, label=name)) + 
  geom_text(hjust=0.4, vjust=0.4) +
  stat_smooth(method = "lm",fullrange = TRUE, level = 0.975) +
  theme_bw(base_size = 12, base_family = "") +
  xlim(1885, 2020) + ylim(8, 20) +
  xlab("Year") + ylab("ln(Price)") +
  ggtitle("World Transfer Records, Real 2011 Prices (£)")+
  geom_point(aes(col="red"),size=4,data=addon) + 
  geom_text(aes(col="red", fontface=3),hjust=-0.1, vjust=0,size=7,data=addon) + 

The Frisch–Waugh–Lovell Theorem for Both OLS and 2SLS

The Frisch–Waugh–Lovell (FWL) theorem is of great practical importance for econometrics. FWL establishes that it is possible to re-specify a linear regression model in terms of orthogonal complements. In other words, it permits econometricians to partial out right-hand-side, or control, variables. This is useful in a variety of settings. For example, there may be cases where a researcher would like to obtain the effect and cluster-robust standard error from a model that includes many regressors, and therefore a computationally infeasible variance-covariance matrix.

Here are a number of practical examples. The first just takes a simple linear regression model, with two regressors: x1 and x2. To partial out the coefficients on the constant term and x2, we first regress x2 on y1 and save the residuals. We then regress x2 on x1 and save the residuals. The final stage regresses the second residuals on the first. The following code illustrates how one can obtain an identical coefficient on x1 by applying the FWL theorem.

x1 = rnorm(100)
x2 = rnorm(100)
y1 = 1 + x1 - x2 + rnorm(100)

r1 = residuals(lm(y1 ~ x2))
r2 = residuals(lm(x1 ~ x2))
# ols
coef(lm(y1 ~ x1 + x2))
# fwl ols
coef(lm(r1 ~ -1 + r2))

FWL is also relevant for all linear instrumental variable (IV) estimators. Here, I will show how this extends to the 2SLS estimator, where slightly more work is required compared to the OLS example in the above. Here we have a matrix of instruments (Z), exogenous variables (X), and endogenous variables (Y1). Let us imagine we want the coefficient on one endogenous variable y1. In this case we can apply FWL as follows. Regress X on each IV in Z in separate regressions, saving the residuals. Then regress X on y1, and X on y2, saving the residuals for both. In the last stage, perform a two-stage-least-squares regression of the X on y2 residuals on the X on y2 residuals using the residuals from X on each Z as instruments. An example of this is shown in the below code.


ov = rnorm(100)
z1 = rnorm(100) 
z2 = rnorm(100)
y1 = rnorm(100) + z1 + z2 + 1.5*ov
x1 = rnorm(100) + 0.5*z1 - z2
x2 = rnorm(100) 
y2 = 1 + y1 - x1 + 0.3*x2 + ov + rnorm(100)

r1 = residuals(lm(z1 ~ x1 + x2))
r2 = residuals(lm(z2 ~ x1 + x2))
r3 = residuals(lm(y1 ~ x1 + x2))
r4 = residuals(lm(y2 ~ x2 + x2))

# biased coef on y1 as expected for ols
# 2sls
# fwl 2sls

The FWL can also be extended to cases where there are multiple endogenous variables. I have demonstrated this case by extending the above example to model x1 as an endogenous variable.

# 2 endogenous variables
r5 = residuals(lm(z1 ~ x2))
r6 = residuals(lm(z2 ~ x2))
r7 = residuals(lm(y1 ~ x2))
r8 = residuals(lm(x1 ~ x2))
r9 = residuals(lm(y2 ~ x2))

# 2sls coefficients
p1 = fitted.values(lm(y1~z1+z2+x2))
p2 = fitted.values(lm(x1~z1+z2+x2))
lm(y2 ~ p1 + p2 + x2)

# 2sls fwl coefficients
p3 = fitted.values(lm(r7~-1+r5+r6))
p4 = fitted.values(lm(r8~-1+r5+r6))
lm(r9 ~ p3 + p4)

Kalkalash! Pinpointing the Moments “The Simpsons” became less Cromulent

Whenever somebody mentions “The Simpsons” it always stirs up feelings of nostalgia in me. The characters, uproarious gags, zingy one-liners, and edgy animation all contributed towards making, arguably, the greatest TV ever. However, it’s easy to forget that as a TV show “The Simpsons” is still ongoing—in its twenty-fourth season no less.

For me, and most others, the latter episodes bear little resemblance to older ones. The current incarnation of the show is stale, and has been for a long time. I haven’t watched a new episode in over ten years, and don’t intend to any time soon. When did this decline begin? Was it part of a slow secular trend, or was there a sudden drop in the quality, from which there was no recovery?

To answer these questions I use the Global Episode Opinion Survey (GEOS) episode ratings data, which are published online. A simple web scrape of the “all episodes” page provides me with 423 episode ratings, spanning from the first episode of season 1, to the third episode of season 20. After S20E03, the ratings become too sparse, which is probably a function of how bad the show, in its current condition, is. To detect changepoints in show ratings, I have used the R package changepoint. An informative introduction of both the package and changepoint analyses can be found in this accompanying vignette.


The figure above provides a summary of my results. Five breakpoints were detected. The first occurring in the first episode of the ninth season: The City of New York Vs. Homer Simpson. Most will remember this; Homer goes to New York to collect his clamped car and ends up going berserk. Good episode, although this essentially marked the beginning of the end.

According to the changepoint results, the decline occurred in three stages. The first lasted from the New York episode up until episode 11 in season 10. The shows in this stage have an average rating of about 7, and the episode where the breakpoint is detected is: Wild Barts Can’t Be Broken. The next stage roughly marks my transition, as it is about this point that I stopped watching. This stage lasts as far as S15E09, whereupon the show suffers the further ignominy of another ratings downgrade. Things possibly couldn’t get any worse, and they don’t, as the show earns a minor reprieve after the twentieth episode of season 18.

So now you know. R code for the analysis can be found in the below.

# packages
library(Hmisc) ; library(changepoint)
# clear ws

# webscrape data
page1 = "http://www.geos.tv/index.php/list?sid=159&collection=all"
home1 = readLines(con<-url(page1)); close(con)

# pick out lines with ratings
means = '<td width="60px" align="right" nowrap>'
epis = home1[grep(means, home1)]
epis = epis[57:531]
epis = epis[49:474]

# prune data
loc = function(x) substring.location(x,"</span>")$first[1]
epis = data.frame(epis)
epis = cbind(epis,apply(epis, 1, loc))
epis$cut = NA
for(i in 1:dim(epis)[1]){
  epis[i,3] = substr(epis[i,1], epis[i,2]-4, epis[i,2]-1) 
#create data frame
ts1 = data.frame(rate=epis$cut, episode=50:475)
# remove out of season shows and movie
ts1 = ts1[!(ts1$episode %in% c(178,450,451)),]
# make numeric
ts1$rate = as.numeric(as.character(ts1$rate))

# changepoint function
mean2.pelt = cpt.mean(ts1$rate,method='PELT')

# plot results
     ylab='Average Rating',cpt.width=2, main="Changepoints in 'The Simpsons' Ratings")

# what episodes ?
# The City of New York vs. Homer Simpson
# Wild Barts Can't Be Broken
# I, (Annoyed Grunt)-Bot - 
# Stop Or My Dog Will Shoot!

texreg: A package for beautiful and easily customizable LaTeX regression tables from R

There was a very informative post last week showing how the R package stargazer is used to generate nice LaTeX tables from a number of R objects. This package looks very useful. However, I would like to extol the virtues of another R package that converts model objects in R into LaTeX code: texreg.

For me, the texreg package has one very important advantage compared to stargazer, which is (and please correct me if I am wrong on this point) that the regression model’s output is very easy to customize. There are a number of examples where this feature of texreg is important, such as the inclusion of robust/cluster-robust standard errors, or changing the coefficients of a generalized linear model to read as marginal effects.

In the example below, I have replicated (with some modifications) the code used for my last blog post showing the importance of clustering standard errors. Basically this code simulates a bunch of data, estimates a regression model, then duplicates the dataset three times, which in turn reduces the standard errors of the regression model. The last step shows how the clustered standard errors will not be reduced by duplicating the dataset. Once I perform this analysis, I use the output code to generate a texreg object that gives the code needed to create the regression table in LaTeX.

In this example, I have used the texreg override commands in the texreg function so that the standard errors and p-values (needed to calculate the stars in the regression table) reflect the clustering performed with the robust.se() function. The output from this, once the code is run through a LaTeX compiler (I use TeXworks), is shown in the image below (I have also adjusted the caption so that it is on the top of the table rather than the bottom). One can download the latest version of texreg from the packages R-forge project page. This page also features a very helpful and active forum.


library(lmtest) ; library(sandwich)

# data sim

x <- rnorm(1000)
y <- 5 + 2*x + rnorm(1000,0,40)
# regression
m1 <- lm(y~x)

# triple data
dat <- data.frame(x=c(x,x,x),y=c(y,y,y),g=c(1:1000,1:1000,1:1000))
# regressions
m2 <- lm(y~x, dat) # smaller StErrs

# cluster robust standard error function
robust.se <- function(model, cluster){
  M <- length(unique(cluster))
  N <- length(cluster)
  K <- model$rank
  dfc <- (M/(M - 1)) * ((N - 1)/(N - K))
  uj <- apply(estfun(model), 2, function(x) tapply(x, cluster, sum));
  rcse.cov <- dfc * sandwich(model, meat = crossprod(uj)/N)
  rcse.se <- coeftest(model, rcse.cov)
  return(list(rcse.cov, rcse.se))

m3 <- robust.se(m2,dat$g)[[2]] # StErrs now back to what they are

       caption="The Importance of Clustering Standard Errors",

The Cluster Bootstrap

Adjusting standard errors for clustering can be a very important part of any statistical analysis. For example, duplicating a data set will reduce the standard errors dramatically despite there being no new information.

I have previously dealt with this topic with reference to the linear regression model. However, in many cases one would like to obtain cluster-robust standard errors for more elaborate statistical analyses than a simple linear regression model. Typically, increases in model complexity are met with similar increases in the calculations required to estimate the model’s uncertainty via analytical means. Adding the required adjustment for group clustering furthers this often burdensome process.

Bootstrapping procedures offer a simple solution to this issue. Essentially, a nonparametric bootstrap procedure estimates a model for a specified number of repetitions using samples of the data frame. The simplest bootstrap draws a random sample from the data frame such that the sample data frame is the same dimension as the main data frame, although the observations are not the same because it allows for random sampling with replacement. For each repetition the main analysis is repeated on the sample data, and the estimate stored (so for a linear regression model this would be the model’s coefficients). Once all repetitions have been computed the standard errors can be calculated by taking the standard deviation of the stored model estimates.

Adapting this technique to tackle clustering is relatively straightforward. Instead of drawing the observation units with replacement, one draws the cluster units with replacement. For example, imagine we have a data frame consisting of 1,000 students from 100 schools. The non-cluster bootstrap draws 1,000 observations from the 1,000 students with replacement for each repetition. The cluster bootstrap will instead draw 100 schools with replacement.

In the below, I show how to formulate a simple cluster bootstrap procedure for a linear regression in R. In this analysis, I simulate some data and then falsely replicate the data frame three times which causes the standard errors to drop. I then show how a simple cluster-robust command solves this issue (i.e. gives standard errors like the original data frame that had not been replicated). Finally, I show how these cluster-robust standard errors can be obtained via bootstrapping. The code is anything but succinct (and therefore probably very inefficient), so I would be happy to take suggestions on how one could improve my clusbootreg() function. Nevertheless, it is clear from the output that this function is giving the right answers.

# cluster-boot
library(lmtest) ; library(sandwich)
# data sim
x <- rnorm(1000)
y <- 5 + 2*x + rnorm(1000)
# regression

# triple data
dat <- data.frame(x=c(x,x,x),y=c(y,y,y),g=c(1:1000,1:1000,1:1000))
# regressions
m1 <- lm(y~x, dat) # smaller StErrs

# cluster robust standard error function
robust.se <- function(model, cluster){
  M <- length(unique(cluster))
  N <- length(cluster)
  K <- model$rank
  dfc <- (M/(M - 1)) * ((N - 1)/(N - K))
  uj <- apply(estfun(model), 2, function(x) tapply(x, cluster, sum));
  rcse.cov <- dfc * sandwich(model, meat = crossprod(uj)/N)
  rcse.se <- coeftest(model, rcse.cov)
  return(list(rcse.cov, rcse.se))

robust.se(m1,dat$g)[[2]] # StErrs now back to what they are

# cluster bootstrap function
clusbootreg <- function(formula, data, cluster, reps=1000){
  reg1 <- lm(formula, data)
  clusters <- names(table(cluster))
  sterrs <- matrix(NA, nrow=reps, ncol=length(coef(reg1)))
  for(i in 1:reps){
    index <- sample(1:length(clusters), length(clusters), replace=TRUE)
    aa <- clusters[index]
    bb <- table(aa)
    bootdat <- NULL
    for(j in 1:max(bb)){
      cc <- data[cluster %in% names(bb[bb %in% j]),]
      for(k in 1:j){
        bootdat <- rbind(bootdat, cc)
    sterrs[i,] <- coef(lm(formula, bootdat))
  val <- cbind(coef(reg1),apply(sterrs,2,sd))
  colnames(val) <- c("Estimate","Std. Error")
  > summary(lm(y~x)) #correct StErrs

  lm(formula = y ~ x)

  Min       1Q   Median       3Q      Max 
-3.10461 -0.63043  0.02924  0.70622  2.78926 

  Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.96345    0.03084  160.97   <2e-16 ***
x            2.02747    0.03048   66.51   <2e-16 ***
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.9748 on 998 degrees of freedom
Multiple R-squared: 0.8159,  Adjusted R-squared: 0.8157 
F-statistic:  4424 on 1 and 998 DF,  p-value: < 2.2e-16 

> summary(lm(y~x, dat)) #smaller StErrs

  lm(formula = y ~ x, data = dat)

  Min       1Q   Median       3Q      Max 
-3.10461 -0.63043  0.02924  0.70622  2.78926 

  Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.96345    0.01779   279.0   <2e-16 ***
x            2.02747    0.01759   115.3   <2e-16 ***
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 0.9742 on 2998 degrees of freedom
Multiple R-squared: 0.8159,	Adjusted R-squared: 0.8159 
F-statistic: 1.329e+04 on 1 and 2998 DF,  p-value: < 2.2e-16 

> robust.se(m1,dat$g)[[2]] #correct StErrs

t test of coefficients:
  Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 4.963447   0.030810 161.099 < 2.2e-16 ***
x           2.027469   0.029347  69.086 < 2.2e-16 ***
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

> clusbootreg(formula=y~x, data=dat, cluster=dat$g, reps=1000) #correct StErrs
Estimate Std. Error
(Intercept) 4.963447 0.03197737
x           2.027469 0.02920071

Identical Champions League Draw: What Were the Odds?

cldraw1A number of news outlets have reported a peculiar quirk that arose during Friday’s Champions League draw. Apparently, the sport’s European governing body, UEFA, ran a trial run the day before the main event, and the schedule chosen during this event was identical to that of the actual draw on Friday.

Given this strange coincidence, a number of people have been expressing the various odds of this occurrence. For example, the author of this newspaper article claimed that ‘bookies’ calculated the odds at 5,000 to 1. In other words, the probability of this event was 0.0002.

The same article also says that the probability of this event (two random draws being identical) occurring is not as low as one might think. However, this article does not give the probability or odds of this event occurring. The oblivious reason for this is that such a calculation is difficult. Since teams from the same domestic league and teams from the same country cannot play each other, such a calculation involves using conditional probabilities over a variety of scenarios.

Despite my training in Mathematics and interest in quantitative pursuits, I have always struggled to calculate the probability of multiple conditional events. Given that there are many different ways in which two identical draws can be made, such a calculation is, unfortunately, beyond my admittedly limited ability.

Thankfully, there’s a cheats way to getting a rough answer: use Monte Carlo simulation. The code below shows how to write up a function in R that performs synthetic draws for the Champions League given the aforementioned conditions. With this function, I performed two draws 200,000 times, and calculated that the probability of the identical draws is: 0.00011, so the odds are around: 1 in 9,090. This probability is subject to some sampling error, however getting a more accurate measure via simulation would require more computing power like that enabled by Rcpp (which I really need to learn). Nevertheless, the answer is clearly lower than that proposed either by the ‘bookies’ or the newspaper article’s author.

# cl draw
dat <- read.csv("cldraw.csv")

> dat
team iso pos group
1       Galatasaray TUR  RU     H
2           Schalke GER  WI     B
3            Celtic SCO  RU     G
4          Juventus ITA  WI     E
5           Arsenal ENG  RU     B
6            Bayern GER  WI     F
7  Shakhtar Donetsk UKR  RU     E
8          Dortmund GER  WI     D
9             Milan ITA  RU     C
10        Barcelona ESP  WI     G
11      Real Madrid ESP  RU     D
12      Man. United ENG  WI     H
13         Valencia ESP  RU     F
14              PSG FRA  WI     A
15            Porto POR  RU     A
16           Malaga ESP  WI     C

draw <- function(x){
  fixtures <- matrix(NA,nrow=8,ncol=2)
  p <- 0
    for(j in 1:8){
      k <- 0
      n <- 0
        n <- n + 1
        aa <- x[x[,"pos"]=="RU",]
        t1 <- aa[sample(1:dim(aa)[1],1),]
        bb <- x[x[,"pos"]=="WI",]
        t2 <- bb[sample(1:dim(bb)[1],1),]
        k <- ifelse(t1[,"iso"]!=t2[,"iso"] & t1[,"group"]!=t2[,"group"],1,0)
      fixtures[j,1] <- as.character(t1[,"team"])
      fixtures[j,2] <- as.character(t2[,"team"])
      x <- x[!(x[,"team"] %in% c(as.character(t1[,"team"]),
    if(n>50){p <- 0}    
    p <- ifelse(sum(as.numeric(is.na(fixtures)))==0,1,0)

drawtwo <- function(x){ 
  f1 <- as.vector(unlist(x))
  joinup <- data.frame(team=f1[1:16], iso=f1[17:32],
                       pos=f1[33:48], group=f1[49:64])
  check1 <- data.frame(draw(joinup))
  check2 <- data.frame(draw(joinup))
  rightdraw <- ifelse(sum(na.omit(check1[order(check1),2])==
    na.omit(check2[order(check2),2]))==8, 1, 0)


dat2 <- rbind(as.vector(unlist(dat)),
dat3 <- dat2[rep(1,1000),]

vals <- 0
for(i in 1:200){
  yy <- apply(dat3, 1, drawtwo)
  vals <- sum(yy) + vals
# Probability 
> vals/200000
[1] 0.00011
# Odds
> 1/(vals/200000)-1
[1] 9089.909

Linear Models with Multiple Fixed Effects

Estimating a least squares linear regression model with fixed effects is a common task in applied econometrics, especially with panel data. For example, one might have a panel of countries and want to control for fixed country factors. In this case the researcher will effectively include this fixed identifier as a factor variable, and then proceed to estimate the model that includes as many dummy variables (minus one if an intercept is included in the modelling equation) as there are countries. Obviously, this approach is computationally problematic when there are many fixed factors. In our simple example, an extra country will add an extra column to the X matrix used in the least squares calculation.

Fortunately, there are a number of data transformations that can be used in this panel setting. These include demeaning each within unit observation, using first differences, or including the group means as additional explanatory variables (as suggested by (Mundlak 1978)). However, these approaches only work well when there is one factor that the researcher wants to include fixed effects to account for.

Simen Gaure offers a solution this problem that allows for multiple fixed effects without resorting to a computationally burdensome methodology. Essentially the solution involves an elaboration of the group demeaning transformation mentioned in the above. More technical details can be found here or by referring to Gaure’s forthcoming article in Computational Statistics & Data Analysis. Those interested in implementing this estimation strategy in R can use the lfe package available on CRAN.

In the below, I have included a simple example of how the package works. In this example, the model needs to be set up to calculate fixed effects for two factor variables. Obviously, adding 2,000 columns to the data frame is not a convenient way to estimate the model that includes fixed effects for both the x2 and x3 variables. However, the felm function tackles this problem with ease. Stata has a similar function to feml, areg, although the areg function only allows for absorbed fixed effects in one variable.

# clear workspace
# load lfe package

# create data frame
x1 <- rnorm(10000)
x2 <- rep(1:1000,10)
x3 <- rep(1:1000,10)
e1 <- sin(x2) + 0.02*x3^2 + rnorm(10000)
y <- 10 + 2.5*x1 + (e1-mean(e1))
dat <- data.frame(x1,x2,x3,y)

# simple lm
# lm with fixed effects
felm(dat$y ~ dat$x1 + G(dat$x2) + G(dat$x3))

# output
# simple lm
> lm(y~x1)
Call: lm(formula = y ~ x1)
  (Intercept)           x1  
        10.47       -36.95  
> # lm with fixed effects
> felm(dat$y ~ dat$x1 + G(dat$x2) + G(dat$x3))

Minimizing Bias in Observational Studies

Measuring the effect of a binary treatment on a measured outcome is one of the most common tasks in applied statistics. Examples of these applications abound, like the effect of smoking on health, or the effect of low birth weight on cognitive development. In an ideal world we would like to be able to assign one group of people to receive some form of treatment, and an identical group to not receive this treatment. In this world, the average treatment effect (ATE) is the difference in outcomes between the treatment and control groups.

However, the real world is far from ideal, and the problematic nature of measuring causal effects has (justifiably) spawned a wide literature. Many solutions have been proposed to this problem. The simplest one involves controlling for the pre-treatment characteristics defining both the treatment and control groups. For example, in the case of low birth weight, a researcher may want to adjust for variation in the parent’s socioeconomic status, as one would expect that the ‘treatment’ of low birth weight not to be randomly assigned amongst different socioeconomic strata. Once controls for various confounding variables are introduced, it is then feasible to measure the causal effect of the treatment. Methods that perform this adjustment include simple linear regression modelling, or various forms of matching estimators.

The key assumption in the above is that the researcher is able to separate out differences between the treatment and control groups based on observable characteristics (selection on observables). However, there are many cases, especially in the social sciences, where it is not unreasonable to suspect this assumption does not hold (selection based on the unobservable). In this scenario, the use of instrumental variable estimators represents a viable solution.

Unfortunately, suitable instrumental variables are commonly not available to researchers. In this instance is there anything a researcher can do? Yes, according to a recently published paper by Daniel L. Millimet and Rusty Tchernis. Millimet and Tchernis examine this problem from the point of view of a researcher trying to minimize the bias associated with selection on unobservables. Their paper demonstrates how the bias of ATE can be derived from a regression model of the probability of treatment on observable characteristics. Using this regression model, it is possible to find a bias minimizing propensity score (P*). Once this score is calculated, the researcher is able to estimate a bias reduced ATE by trimming observations outside a pre-specified neighborhood around P*.

Millimet and Tchernis propose a number of estimators that one can use to produce bias minimized ATEs. Those interested in potentially using this bias minimization strategy should refer to their paper for a more detailed examination of these estimators, and their potential uses/misuses. This paper also features a neat empirical application that suggests why their approach might be better than the more conventional methods.

In the below, I have supplied an image summarizing the output produced from a Monte Carlo exercise to highlight the efficacy of the bias corrected approach. Those interested in design of this experiment should look at the Millimet and Tchernis paper, since I have simply replicated their MC design (250 datasets with 5,000 observations in each), with the assumption that the correlation between the treatment equation error term and outcome equation is −0.6 (so a really strong correlation). Additionally, I have set the trimming parameter (theta) to 0.05, so at least 5% of the treatment and 5% of the control group are contained in the trimmed sample.

Note that full descriptions of the acronyms in this image can be found in the paper, although MB.BC and IPW.BC refer to the bias corrected measures of the ATE, and from the image it is clear to see that these estimators are much closer to the true average treatment effect of 1, albeit with a higher variance. This simulation was conducted with R using a self-written function. I have benchmarked this function against Daniel Millimet’s STATA function, and the results are identical. I hope to possibly release this function as an R package in the future, although I would be happy to supply my function’s code to anyone who is interested.

Heteroskedastic GLM in R

A commenter on my previous blog entry has drawn my attention to an R function called hetglm() that estimates heteroskedastic probit models. This function is contained in the glmx package. The glmx package is not available on CRAN yet, but thankfully can be downloaded here.

The hetglm() function has a number of computational advantages compared with the crude method outlined in my previous post. The following example replicates the previous analysis showing the speed advantage associated with using the hetglm() function.

rm(list=ls()) # clear ws
n <- 1000 # no. obs
x1 <- runif(n,-1,1) # predictor 1
x2 <- runif(n,-1,1) # " 2
e1 <- rnorm(n,0,1) # normal error
e2 <- (1 + 0.45*(x1+x2))*e1 # hetero error
y <- ifelse(0.5 + 0.5*x1 -0.5*x2 - e2 >0, 1, 0) # outcome
# estimate normal probit
r1 <- glm(y~x1+x2,family=binomial(link="probit")) 
system.time(ml1 <- maxLik(hll,start=c(0,0,0,0,0))) # maximize
system.time(h1 <- hetglm(y ~ x1 + x2))
# output
> system.time(ml1 <- maxLik(hll,start=c(0,0,0,0,0))) # maximize
   user  system elapsed 
   4.43    0.00    4.59 
> system.time(h1 <- hetglm(y ~ x1 + x2))
   user  system elapsed 
   0.11    0.00    0.11