Endogenous Spatial Lags for the Linear Regression Model

Over the past number of years, I have noted that spatial econometric methods have been gaining popularity. This is a welcome trend in my opinion, as the spatial structure of data is something that should be explicitly included in the empirical modelling procedure. Omitting spatial effects assumes that the location co-ordinates for observations are unrelated to the observable characteristics that the researcher is trying to model. Not a good assumption, particularly in empirical macroeconomics where the unit of observation is typically countries or regions.

Starting out with the prototypical linear regression model: y = X \beta + \epsilon, we can modify this equation in a number of ways to account for the spatial structure of the data. In this blog post, I will concentrate on the spatial lag model. I intend to examine spatial error models in a future blog post.

The spatial lag model is of the form: y= \rho W y + X \beta + \epsilon, where the term \rho W y measures the potential spill-over effect that occurs in the outcome variable if this outcome is influenced by other unit’s outcomes, where the location or distance to other observations is a factor in for this spill-over. In other words, the neighbours for each observation have greater (or in some cases less) influence to what happens to that observation, independent of the other explanatory variables (X). The W matrix is a matrix of spatial weights, and the \rho parameter measures the degree of spatial correlation. The value of \rho is bounded between -1 and 1. When \rho is zero, the spatial lag model collapses to the prototypical linear regression model.

The spatial weights matrix should be specified by the researcher. For example, let us have a dataset that consists of 3 observations, spatially located on a 1-dimensional Euclidean space wherein the first observation is a neighbour of the second and the second is a neighbour of the third. The spatial weights matrix for this dataset should be a 3 \times 3 matrix, where the diagonal consists of 3 zeros (you are not a neighbour with yourself). Typically, this matrix will also be symmetric. It is also at the user’s discretion to choose the weights in W. Common schemes include nearest k neighbours (where k is again at the users discretion), inverse-distance, and other schemes based on spatial contiguities. Row-standardization is usually performed, such that all the row elements in W sum to one. In our simple example, the first row of a contiguity-based scheme would be: [0, 1, 0]. The second: [0.5, 0, 0.5]. And the third: [0, 1, 0].

While the spatial-lag model represents a modified version of the basic linear regression model, estimation via OLS is problematic because the spatially lagged variable (Wy) is endogenous. The endogeneity results from what Charles Manski calls the ‘reflection problem’: your neighbours influence you, but you also influence your neighbours. This feedback effect results in simultaneity which renders bias on the OLS estimate of the spatial lag term. A further problem presents itself when the independent variables (X) are themselves spatially correlated. In this case, completely omitting the spatial lag from the model specification will bias the \beta coefficient values due to omitted variable bias.

Fortunately, remedying these biases is relatively simple, as a number of estimators exist that will yield an unbiased estimate of the spatial lag, and consequently the coefficients for the other explanatory variables—assuming, of course, that these explanatory variables are themselves exogenous. Here, I will consider two: the Maximum Likelihood estimator (denoted ML) as described in Ord (1975), and a generalized two-stage least squares regression model (2SLS) wherein spatial lags, and spatial lags lags (i.e. W^{2} X) of the explanatory variables are used as instruments for Wy. Alongside these two models, I also estimate the misspecified OLS both without (OLS1) and with (OLS2) the spatially lagged dependent variable.

To examine the properties of these four estimators, I run a Monte Carlo experiment. First, let us assume that we have 225 observations equally spread over a 15 \times 15 lattice grid. Second, we assume that neighbours are defined by what is known as the Rook contiguity, so a neighbour exists if they are in the grid space either above or below or on either side. Once we create the spatial weight matrix we row-standardize.

Taking our spatial weights matrix as defined, we want to simulate the following linear model: y = \rho Wy + \beta_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \epsilon, where we set \rho=0.4 , \beta_{1}=0.5, \beta_{2}=-0.5, \beta_{3}=1.75. The explanatory variables are themselves spatially autocorrelated, so our simulation procedure first simulates a random normal variable for both x_{2} and x_{3} from: N(0, 1), then assuming a autocorrelation parameter of \rho_{x}=0.25, generates both variables such that: x_{j} = (1-\rho_{x}W)^{-1} N(0, 1) for j \in \{ 1,2 \}. In the next step we simulate the error term \epsilon. We introduce endogeneity into the spatial lag by assuming that the error term \epsilon is a function of a random normal v so \epsilon = \alpha v + N(0, 1) where v = N(0, 1) and \alpha=0.2, and that the spatial lag term includes v. We modify the regression model to incorporate this: y = \rho (Wy + v) + \beta_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \epsilon. From this we can calculate the reduced form model: y = (1 - \rho W)^{-1} (\rho v + \beta_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \epsilon), and simulate values for our dependent variable y.

Performing 1,000 repetitions of the above simulation permits us to examine the distributions of the coefficient estimates produced by the four models outlined in the above. The distributions of these coefficients are displayed in the graphic in the beginning of this post. The spatial autocorrelation parameter \rho is in the bottom-right quadrant. As we can see, the OLS model that includes the spatial effect but does not account for simultaneity (OLS2) over-estimates the importance of spatial spill-overs. Both the ML and 2SLS estimators correctly identify the \rho parameter. The remaining quadrants highlight what happens to the coefficients of the explanatory variables. Clearly, the OLS1 estimator provides the worst estimate of these coefficients. Thus, it appears preferable to use OLS2, with the biased autocorrelation parameter, than the simpler OLS1 estimator. However, the OLS2 estimator also yields biased parameter estimates for the \beta coefficients. Furthermore, since researchers may want to know the marginal effects in spatial equilibrium (i.e. taking into account the spatial spill-over effects) the overestimated \rho parameter creates an additional bias.

To perform these calculations I used the spdep package in R, with the graphic created via ggplot2. Please see the R code I used in the below.

library(spdep) ; library(ggplot2) ; library(reshape)

n = 225
data = data.frame(n1=1:n)
# coords
data$lat = rep(1:sqrt(n), sqrt(n))
data$long = sort(rep(1:sqrt(n), sqrt(n)))
# create W matrix
wt1 = as.matrix(dist(cbind(data$long, data$lat), method = "euclidean", upper=TRUE))
wt1 = ifelse(wt1==1, 1, 0)
diag(wt1) = 0
# row standardize
rs = rowSums(wt1)
wt1 = apply(wt1, 2, function(x) x/rs)
lw1 = mat2listw(wt1, style="W")

rx = 0.25
rho = 0.4
b1 = 0.5
b2 = -0.5
b3 = 1.75
alp = 0.2
inv1 = invIrW(lw1, rho=rx, method="solve", feasible=NULL)
inv2 = invIrW(lw1, rho=rho, method="solve", feasible=NULL)

sims = 1000
beta1results = matrix(NA, ncol=4, nrow=sims)
beta2results = matrix(NA, ncol=4, nrow=sims)
beta3results = matrix(NA, ncol=4, nrow=sims)
rhoresults = matrix(NA, ncol=3, nrow=sims)

for(i in 1:sims){
  u1 = rnorm(n)
  x2 = inv1 %*% u1
  u2 = rnorm(n)
  x3 = inv1 %*% u2
  v1 = rnorm(n)
  e1 = alp*v1 + rnorm(n)  
  data1 = data.frame(cbind(x2, x3),lag.listw(lw1, cbind(x2, x3)))
  names(data1) = c("x2","x3","wx2","wx3")
  data1$y1 = inv2 %*% (b1 + b2*x2 + b3*x3 + rho*v1 + e1)
  data1$wy1 = lag.listw(lw1, data1$y1)
  data1$w2x2 = lag.listw(lw1, data1$wx2)
  data1$w2x3 = lag.listw(lw1, data1$wx3)
  data1$w3x2 = lag.listw(lw1, data1$w2x2)
  data1$w3x3 = lag.listw(lw1, data1$w2x3)

  m1 = coef(lm(y1 ~ x2 + x3, data1))
  m2 = coef(lm(y1 ~ wy1 + x2 + x3, data1))
  m3 = coef(lagsarlm(y1 ~ x2 + x3, data1, lw1))
  m4 = coef(stsls(y1 ~ x2 + x3, data1, lw1))
  beta1results[i,] = c(m1[1], m2[1], m3[2], m4[2])
  beta2results[i,] = c(m1[2], m2[3], m3[3], m4[3])
  beta3results[i,] = c(m1[3], m2[4], m3[4], m4[4])
  rhoresults[i,] = c(m2[2],m3[1], m4[1])  

apply(rhoresults, 2, mean) ; apply(rhoresults, 2, sd)
apply(beta1results, 2, mean) ; apply(beta1results, 2, sd)
apply(beta2results, 2, mean) ; apply(beta2results, 2, sd)
apply(beta3results, 2, mean) ; apply(beta3results, 2, sd)

colnames(rhoresults) = c("OLS2","ML","2SLS")
colnames(beta1results) = c("OLS1","OLS2","ML","2SLS")
colnames(beta2results) = c("OLS1","OLS2","ML","2SLS")
colnames(beta3results) = c("OLS1","OLS2","ML","2SLS")

rhoresults = melt(rhoresults)
rhoresults$coef = "rho"
rhoresults$true = 0.4

beta1results = melt(beta1results)
beta1results$coef = "beta1"
beta1results$true = 0.5

beta2results = melt(beta2results)
beta2results$coef = "beta2"
beta2results$true = -0.5

beta3results = melt(beta3results)
beta3results$coef = "beta3"
beta3results$true = 1.75

data = rbind(rhoresults,beta1results,beta2results,beta3results)
data$Estimator = data$X2

ggplot(data, aes(x=value, colour=Estimator, fill=Estimator)) + 
  geom_density(alpha=.3) + 
  facet_wrap(~ coef, scales= "free") +
  geom_vline(aes(xintercept=true)) + 
  scale_y_continuous("Density") +
  scale_x_continuous("Effect Size") +
  opts(legend.position = 'bottom', legend.direction = 'horizontal')

Identical Champions League Draw: What Were the Odds?

cldraw1A number of news outlets have reported a peculiar quirk that arose during Friday’s Champions League draw. Apparently, the sport’s European governing body, UEFA, ran a trial run the day before the main event, and the schedule chosen during this event was identical to that of the actual draw on Friday.

Given this strange coincidence, a number of people have been expressing the various odds of this occurrence. For example, the author of this newspaper article claimed that ‘bookies’ calculated the odds at 5,000 to 1. In other words, the probability of this event was 0.0002.

The same article also says that the probability of this event (two random draws being identical) occurring is not as low as one might think. However, this article does not give the probability or odds of this event occurring. The oblivious reason for this is that such a calculation is difficult. Since teams from the same domestic league and teams from the same country cannot play each other, such a calculation involves using conditional probabilities over a variety of scenarios.

Despite my training in Mathematics and interest in quantitative pursuits, I have always struggled to calculate the probability of multiple conditional events. Given that there are many different ways in which two identical draws can be made, such a calculation is, unfortunately, beyond my admittedly limited ability.

Thankfully, there’s a cheats way to getting a rough answer: use Monte Carlo simulation. The code below shows how to write up a function in R that performs synthetic draws for the Champions League given the aforementioned conditions. With this function, I performed two draws 200,000 times, and calculated that the probability of the identical draws is: 0.00011, so the odds are around: 1 in 9,090. This probability is subject to some sampling error, however getting a more accurate measure via simulation would require more computing power like that enabled by Rcpp (which I really need to learn). Nevertheless, the answer is clearly lower than that proposed either by the ‘bookies’ or the newspaper article’s author.

# cl draw
dat <- read.csv("cldraw.csv")

> dat
team iso pos group
1       Galatasaray TUR  RU     H
2           Schalke GER  WI     B
3            Celtic SCO  RU     G
4          Juventus ITA  WI     E
5           Arsenal ENG  RU     B
6            Bayern GER  WI     F
7  Shakhtar Donetsk UKR  RU     E
8          Dortmund GER  WI     D
9             Milan ITA  RU     C
10        Barcelona ESP  WI     G
11      Real Madrid ESP  RU     D
12      Man. United ENG  WI     H
13         Valencia ESP  RU     F
14              PSG FRA  WI     A
15            Porto POR  RU     A
16           Malaga ESP  WI     C

draw <- function(x){
  fixtures <- matrix(NA,nrow=8,ncol=2)
  p <- 0
    for(j in 1:8){
      k <- 0
      n <- 0
        n <- n + 1
        aa <- x[x[,"pos"]=="RU",]
        t1 <- aa[sample(1:dim(aa)[1],1),]
        bb <- x[x[,"pos"]=="WI",]
        t2 <- bb[sample(1:dim(bb)[1],1),]
        k <- ifelse(t1[,"iso"]!=t2[,"iso"] & t1[,"group"]!=t2[,"group"],1,0)
      fixtures[j,1] <- as.character(t1[,"team"])
      fixtures[j,2] <- as.character(t2[,"team"])
      x <- x[!(x[,"team"] %in% c(as.character(t1[,"team"]),
    if(n>50){p <- 0}    
    p <- ifelse(sum(as.numeric(is.na(fixtures)))==0,1,0)

drawtwo <- function(x){ 
  f1 <- as.vector(unlist(x))
  joinup <- data.frame(team=f1[1:16], iso=f1[17:32],
                       pos=f1[33:48], group=f1[49:64])
  check1 <- data.frame(draw(joinup))
  check2 <- data.frame(draw(joinup))
  rightdraw <- ifelse(sum(na.omit(check1[order(check1),2])==
    na.omit(check2[order(check2),2]))==8, 1, 0)


dat2 <- rbind(as.vector(unlist(dat)),
dat3 <- dat2[rep(1,1000),]

vals <- 0
for(i in 1:200){
  yy <- apply(dat3, 1, drawtwo)
  vals <- sum(yy) + vals
# Probability 
> vals/200000
[1] 0.00011
# Odds
> 1/(vals/200000)-1
[1] 9089.909

Instrumental Variables without Traditional Instruments

Typically, regression models in empirical economic research suffer from at least one form of endogeneity bias.

The classic example is economic returns to schooling, where researchers want to know how much increased levels of education affect income. Estimation using a simple linear model, regressing income on schooling, alongside a bunch of control variables, will typically not yield education’s true effect on income. The problem here is one of omitted variables – notably unobserved ability. People who are more educated may be more motivated or have other unobserved characteristics which simultaneously affect schooling and future lifetime earnings.

Endogeneity bias plagues empirical research. However, there are solutions, the most common being instrumental variables (IVs). Unfortunately, the exclusion restrictions needed to justify the use of traditional IV methodology may be impossible to find.

So, what if you have an interesting research question, some data, but endogeneity with no IVs. You should give up, right? Wrong. According to Lewbel (forthcoming in Journal of Business and Economic Statistics), it is possible to overcome the endogeneity problem without the use of a traditional IV approach.

Lewbel’s paper demonstrates how higher order moment restrictions can be used to tackle endogeneity in triangular systems. Without going into too much detail (interested readers can consult Lewbel’s paper), this method is like the traditional two-stage instrumental variable approach, except the first-stage exclusion restriction is generated by the control, or exogenous, variables which we know are heteroskedastic (interested practitioners can test for this in the usual way, i.e. a White test).

In the code below, I demonstrate how one could employ this approach in R using the GMM framework outlined by Lewbel. My code only relates to a simple example with one endogenous variable and two exogenous variables. However, it would be easy to modify this code depending on the model.

# gmm function for 1 endog variable with 2 hetero exogenous variable
# outcome in the first column of 'datmat', endog variable in second
# constant and exog variables in the next three
# hetero exog in the last two (i.e no constant)
g1 <- function(theta, datmat) {
  #set up data
  y1 <- matrix(datmat[,1],ncol=1)
  y2 <- matrix(datmat[,2],ncol=1)
  x1 <- matrix(datmat[,3:5],ncol=3)
  z1 <- matrix(datmat[,4:5],ncol=2)
  # if the variable in the 4th col was not hetero
  # this could be modified so:
  # z1 <- matrix(datmat[,5],ncol=1)

  #set up moment conditions
  in1 <- (y1 -theta[1]*x1[,1]-theta[2]*x1[,2]-theta[3]*x1[,3])
  M <- NULL
  for(i in 1:dim(z1)[2]){
    M <- cbind(M,(z1[,i]-mean(z1[,i])))
  in2 <- (y2 -theta[4]*x1[,1]-theta[5]*x1[,2]-theta[6]*x1[,3]-theta[7]*y1)
  for(i in 1:dim(x1)[2]){M <- cbind(M,in1*x1[,i])}
  for(i in 1:dim(x1)[2]){M <- cbind(M,in2*x1[,i])}
  for(i in 1:dim(z1)[2]){M <- cbind(M,in2*((z1[,i]-mean(z1[,i]))*in1))}
# so estimation is easy
# gmm(function(...), data matrix, initial values vector)
# e.g : gmm(g1, x =as.matrix(dat),c(1,1,1,1,1,1,1))

I also tested the performance of Lewbel’s GMM estimator in comparison a mis-specified OLS estimator. In the code below, I perform 500 simulations of a triangular system containing an omitted variable. For the GMM estimator, it is useful to have good initial starting values. In this simple example, I use the OLS coefficients. In more complicated settings, it is advisable to use the estimates from the 2SLS procedure outlined in Lewbel’s paper. The distributions of the coefficient estimates are shown in the plot below. The true value, indicated by the vertical line, is one. It is pretty evident that the Lewbel approach works very well. I think this method could be very useful in a number of research disciplines.

beta1 <- beta2 <- NULL
for(k in 1:500){
  #generate data (including intercept)
  x1 <- rnorm(1000,0,1)
  x2 <- rnorm(1000,0,1)
  u <- rnorm(1000,0,1)
  s1 <- rnorm(1000,0,1)
  s2 <- rnorm(1000,0,1)
  ov <- rnorm(1000,0,1)
  e1 <- u + exp(x1)*s1 + exp(x2)*s1
  e2 <- u + exp(-x1)*s2 + exp(-x2)*s2
  y1 <- 1 + x1 + x2 + ov + e2 
  y2 <- 1 + x1 + x2 + y1 + 2*ov + e1
  x3 <- rep(1,1000)
  dat <- cbind(y1,y2,x3,x1,x2)
  #record ols estimate
  beta1 <- c(beta1,coef(lm(y2~x1+x2+y1))[4])
  #init values for iv-gmm
  init <- c(coef(lm(y2~x1+x2+y1)),coef(lm(y1~x1+x2)))
  #record gmm estimate
  beta2 <- c(beta2,coef(gmm(g1, x =as.matrix(dat),init))[7])

d <- data.frame(rbind(cbind(beta1,"OLS"),cbind(beta2,"IV-GMM")))
d$beta1 <- as.numeric(as.character(d$beta1))
sm.density.compare(d$beta1, d$V2,xlab=("Endogenous Coefficient"))
 title("Lewbel and OLS Estimates")
 legend("topright", levels(d$V2),lty=c(1,2,3),col=c(2,3,4),bty="n")

Monty Hall Meets Monte Carlo

In 1990, Parade magazine columnist Marilyn vos Savant caused a giant kerfuffle over an answer she gave to a question posed by a reader. The problem, as stated by the omnipotent Wikipedia, goes like this:

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1 [but the door is not opened], and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

Vos Savant’s answer was you should always switch. The probability of winning the star prize (a brand new Rover Vitesse Fastback) is 2/3 if you switch, and only 1/3 if you stick. However, this answer outraged many people, and 10,000 furious souls dabbed their quills in some ink and wrote strongly worded, and possibly impolite, letters to Vos Savant refuting her analysis. It was Vos Savant who would have the last laugh though, as her answer was indeed correct (LOL).

That people would get this puzzle wrong is understandable. I know I did the first time I was faced with it. The reason is deeply rooted in behavioural psychology. Humans are not good at dealing with conditional probabilities. If only Vos Savant’s irate readers knew how to use Bayes theorem, they could have saved themselves the ink and the emotional turmoil by deriving the analytic solution to the problem (although, on reflection, doing this would have also led to some ink usage).

What if you’re too lazy or unsure how to derive an analytic solution to the problem? Is there a way for you, you feckless badger? Why yes Kent, there is – use simulation. In the below, I have included some simple R code which simulates the problem. As you can see you get close enough to the Vos Savant’s correct answer.

The code below is an example of how it is possible to use simulation techniques to derive solutions when you do not want to, or are too stupid to do the mathematics. The beauty of computer simulation is that it can be used in much more complex settings where analytic solutions are not feasible. Extracting results via computer simulation is crucial in Bayesian statistical analysis, and also in frequentist calculations of model uncertainty like bootstrapping.

# clean
# outcome matrix
cars <- matrix(NA,nrow=10000,ncol=2)
for(i in 1:10000){
   # initial choices
   choices <- 1:3
   cardoor <- sample(choices,1)
   goatdoor <- choices[-cardoor]
   contest <- sample(choices,1)
   choice2 <- c(contest,NA)
   choice2[2] <- ifelse(contest==cardoor,sample(goatdoor,1),cardoor)
   # the guy who sticks
   cars[i,1] <- ifelse(contest==cardoor,1,0)
   # the guy who changes
   cars[i,2] <- ifelse(choice2[choice2!=contest]==cardoor,1,0)