Endogenous Spatial Lags for the Linear Regression Model

Over the past number of years, I have noted that spatial econometric methods have been gaining popularity. This is a welcome trend in my opinion, as the spatial structure of data is something that should be explicitly included in the empirical modelling procedure. Omitting spatial effects assumes that the location co-ordinates for observations are unrelated to the observable characteristics that the researcher is trying to model. Not a good assumption, particularly in empirical macroeconomics where the unit of observation is typically countries or regions.

Starting out with the prototypical linear regression model: y = X \beta + \epsilon, we can modify this equation in a number of ways to account for the spatial structure of the data. In this blog post, I will concentrate on the spatial lag model. I intend to examine spatial error models in a future blog post.

The spatial lag model is of the form: y= \rho W y + X \beta + \epsilon, where the term \rho W y measures the potential spill-over effect that occurs in the outcome variable if this outcome is influenced by other unit’s outcomes, where the location or distance to other observations is a factor in for this spill-over. In other words, the neighbours for each observation have greater (or in some cases less) influence to what happens to that observation, independent of the other explanatory variables (X). The W matrix is a matrix of spatial weights, and the \rho parameter measures the degree of spatial correlation. The value of \rho is bounded between -1 and 1. When \rho is zero, the spatial lag model collapses to the prototypical linear regression model.

The spatial weights matrix should be specified by the researcher. For example, let us have a dataset that consists of 3 observations, spatially located on a 1-dimensional Euclidean space wherein the first observation is a neighbour of the second and the second is a neighbour of the third. The spatial weights matrix for this dataset should be a 3 \times 3 matrix, where the diagonal consists of 3 zeros (you are not a neighbour with yourself). Typically, this matrix will also be symmetric. It is also at the user’s discretion to choose the weights in W. Common schemes include nearest k neighbours (where k is again at the users discretion), inverse-distance, and other schemes based on spatial contiguities. Row-standardization is usually performed, such that all the row elements in W sum to one. In our simple example, the first row of a contiguity-based scheme would be: [0, 1, 0]. The second: [0.5, 0, 0.5]. And the third: [0, 1, 0].

While the spatial-lag model represents a modified version of the basic linear regression model, estimation via OLS is problematic because the spatially lagged variable (Wy) is endogenous. The endogeneity results from what Charles Manski calls the ‘reflection problem’: your neighbours influence you, but you also influence your neighbours. This feedback effect results in simultaneity which renders bias on the OLS estimate of the spatial lag term. A further problem presents itself when the independent variables (X) are themselves spatially correlated. In this case, completely omitting the spatial lag from the model specification will bias the \beta coefficient values due to omitted variable bias.

Fortunately, remedying these biases is relatively simple, as a number of estimators exist that will yield an unbiased estimate of the spatial lag, and consequently the coefficients for the other explanatory variables—assuming, of course, that these explanatory variables are themselves exogenous. Here, I will consider two: the Maximum Likelihood estimator (denoted ML) as described in Ord (1975), and a generalized two-stage least squares regression model (2SLS) wherein spatial lags, and spatial lags lags (i.e. W^{2} X) of the explanatory variables are used as instruments for Wy. Alongside these two models, I also estimate the misspecified OLS both without (OLS1) and with (OLS2) the spatially lagged dependent variable.

To examine the properties of these four estimators, I run a Monte Carlo experiment. First, let us assume that we have 225 observations equally spread over a 15 \times 15 lattice grid. Second, we assume that neighbours are defined by what is known as the Rook contiguity, so a neighbour exists if they are in the grid space either above or below or on either side. Once we create the spatial weight matrix we row-standardize.

Taking our spatial weights matrix as defined, we want to simulate the following linear model: y = \rho Wy + \beta_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \epsilon, where we set \rho=0.4 , \beta_{1}=0.5, \beta_{2}=-0.5, \beta_{3}=1.75. The explanatory variables are themselves spatially autocorrelated, so our simulation procedure first simulates a random normal variable for both x_{2} and x_{3} from: N(0, 1), then assuming a autocorrelation parameter of \rho_{x}=0.25, generates both variables such that: x_{j} = (1-\rho_{x}W)^{-1} N(0, 1) for j \in \{ 1,2 \}. In the next step we simulate the error term \epsilon. We introduce endogeneity into the spatial lag by assuming that the error term \epsilon is a function of a random normal v so \epsilon = \alpha v + N(0, 1) where v = N(0, 1) and \alpha=0.2, and that the spatial lag term includes v. We modify the regression model to incorporate this: y = \rho (Wy + v) + \beta_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \epsilon. From this we can calculate the reduced form model: y = (1 - \rho W)^{-1} (\rho v + \beta_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \epsilon), and simulate values for our dependent variable y.

Performing 1,000 repetitions of the above simulation permits us to examine the distributions of the coefficient estimates produced by the four models outlined in the above. The distributions of these coefficients are displayed in the graphic in the beginning of this post. The spatial autocorrelation parameter \rho is in the bottom-right quadrant. As we can see, the OLS model that includes the spatial effect but does not account for simultaneity (OLS2) over-estimates the importance of spatial spill-overs. Both the ML and 2SLS estimators correctly identify the \rho parameter. The remaining quadrants highlight what happens to the coefficients of the explanatory variables. Clearly, the OLS1 estimator provides the worst estimate of these coefficients. Thus, it appears preferable to use OLS2, with the biased autocorrelation parameter, than the simpler OLS1 estimator. However, the OLS2 estimator also yields biased parameter estimates for the \beta coefficients. Furthermore, since researchers may want to know the marginal effects in spatial equilibrium (i.e. taking into account the spatial spill-over effects) the overestimated \rho parameter creates an additional bias.

To perform these calculations I used the spdep package in R, with the graphic created via ggplot2. Please see the R code I used in the below.

library(spdep) ; library(ggplot2) ; library(reshape)

n = 225
data = data.frame(n1=1:n)
# coords
data$lat = rep(1:sqrt(n), sqrt(n))
data$long = sort(rep(1:sqrt(n), sqrt(n)))
# create W matrix
wt1 = as.matrix(dist(cbind(data$long, data$lat), method = "euclidean", upper=TRUE))
wt1 = ifelse(wt1==1, 1, 0)
diag(wt1) = 0
# row standardize
rs = rowSums(wt1)
wt1 = apply(wt1, 2, function(x) x/rs)
lw1 = mat2listw(wt1, style="W")

rx = 0.25
rho = 0.4
b1 = 0.5
b2 = -0.5
b3 = 1.75
alp = 0.2
inv1 = invIrW(lw1, rho=rx, method="solve", feasible=NULL)
inv2 = invIrW(lw1, rho=rho, method="solve", feasible=NULL)

sims = 1000
beta1results = matrix(NA, ncol=4, nrow=sims)
beta2results = matrix(NA, ncol=4, nrow=sims)
beta3results = matrix(NA, ncol=4, nrow=sims)
rhoresults = matrix(NA, ncol=3, nrow=sims)

for(i in 1:sims){
  u1 = rnorm(n)
  x2 = inv1 %*% u1
  u2 = rnorm(n)
  x3 = inv1 %*% u2
  v1 = rnorm(n)
  e1 = alp*v1 + rnorm(n)  
  data1 = data.frame(cbind(x2, x3),lag.listw(lw1, cbind(x2, x3)))
  names(data1) = c("x2","x3","wx2","wx3")
  data1$y1 = inv2 %*% (b1 + b2*x2 + b3*x3 + rho*v1 + e1)
  data1$wy1 = lag.listw(lw1, data1$y1)
  data1$w2x2 = lag.listw(lw1, data1$wx2)
  data1$w2x3 = lag.listw(lw1, data1$wx3)
  data1$w3x2 = lag.listw(lw1, data1$w2x2)
  data1$w3x3 = lag.listw(lw1, data1$w2x3)

  m1 = coef(lm(y1 ~ x2 + x3, data1))
  m2 = coef(lm(y1 ~ wy1 + x2 + x3, data1))
  m3 = coef(lagsarlm(y1 ~ x2 + x3, data1, lw1))
  m4 = coef(stsls(y1 ~ x2 + x3, data1, lw1))
  beta1results[i,] = c(m1[1], m2[1], m3[2], m4[2])
  beta2results[i,] = c(m1[2], m2[3], m3[3], m4[3])
  beta3results[i,] = c(m1[3], m2[4], m3[4], m4[4])
  rhoresults[i,] = c(m2[2],m3[1], m4[1])  

apply(rhoresults, 2, mean) ; apply(rhoresults, 2, sd)
apply(beta1results, 2, mean) ; apply(beta1results, 2, sd)
apply(beta2results, 2, mean) ; apply(beta2results, 2, sd)
apply(beta3results, 2, mean) ; apply(beta3results, 2, sd)

colnames(rhoresults) = c("OLS2","ML","2SLS")
colnames(beta1results) = c("OLS1","OLS2","ML","2SLS")
colnames(beta2results) = c("OLS1","OLS2","ML","2SLS")
colnames(beta3results) = c("OLS1","OLS2","ML","2SLS")

rhoresults = melt(rhoresults)
rhoresults$coef = "rho"
rhoresults$true = 0.4

beta1results = melt(beta1results)
beta1results$coef = "beta1"
beta1results$true = 0.5

beta2results = melt(beta2results)
beta2results$coef = "beta2"
beta2results$true = -0.5

beta3results = melt(beta3results)
beta3results$coef = "beta3"
beta3results$true = 1.75

data = rbind(rhoresults,beta1results,beta2results,beta3results)
data$Estimator = data$X2

ggplot(data, aes(x=value, colour=Estimator, fill=Estimator)) + 
  geom_density(alpha=.3) + 
  facet_wrap(~ coef, scales= "free") +
  geom_vline(aes(xintercept=true)) + 
  scale_y_continuous("Density") +
  scale_x_continuous("Effect Size") +
  opts(legend.position = 'bottom', legend.direction = 'horizontal')

Simple Spatial Correlograms for Cross-Country Analysis in R

Accounting for temporal dependence in econometric analysis is important, as the presence of temporal dependence violates the assumption that observations are independent units. Historically, much less attention has been paid to correcting for spatial dependence, which, if present, also violates this independence assumption.

The comparability of temporal and spatial dependence is useful for illustrating why spatial dependence may matter. For example, a time-series analysis of a county’s GDP would examine how related this year’s figure was to preceding years, whereas a spatial analysis would examine how related one country’s figure was to that of their neighbours.

However, these temporal and spatial issues are not completely analogous. This difference is highlighted with the creation of lagged variables. In time-series, the lag structure is intuitively easy to comprehend. A variable y_{t} has lags of y_{t-1}, …, y_{t-k}.

Lagging a spatial variable introduces more ambiguity. A spatially lagged variable of y_{i} can be calculated as \rho(y_{i}) where \rho is a spatial weighting matrix. Interested readers should consult further sources on this, but needless to say there are a number of ways (nearest neighbour, distance etc.) in which we could define \rho.

The difference between spatial and temporal processes in the creation of lagged variables is somewhat compounded when we consider that it is not possible to create ‘lags of lags’ in the same manner as in time-series (so \rho^2 is not the valid). Your neighbour’s neighbour might be your neighbour.

Given the difficulty in creating spatial lags of spatial lags, is there an easy way of recreating a simple time-series correlogram for a spatial variable? Yes, indeed it is. The solution I present here is not an exclusive way to analyze a spatially dependent variable.

For this analysis I use different countries spatial coordinates and create a correlogram of population density in 1900 (these data are taken from this paper). The first step involves loading these data into the R workspace such that the data frame contains the latitude and longitude coordinates alongside the potentially spatially autocorrelated variable.

> pop <- read.csv("dat2.csv")
> head(pop)
        lon       lat              country        dens
1  66.02554  33.83319          Afghanistan  79.2543748
2  17.55091 -12.29862               Angola  24.2375278
3  20.07006  41.14267              Albania 282.7854408
4  54.33089  23.91317 United Arab Emirates   5.0438742
5  44.93819  40.29368              Armenia   0.2136146
6 134.48679 -25.73251            Australia   4.8896695

With the data frame, I calculated a distance matrix with the ‘rdist.earth’ function from the package ‘fields’. This function measures the distance between each set of data points based on their coordinates. This function recognizes that the world is not flat, and as such calculates what are known as great-circle distances.

# calculate great cirlce distances with fields package
# popdists matrix
popdists <- as.matrix(rdist.earth(cbind(pop$lon, pop$lat), miles = F, R = NULL))
diag(popdists) <- 0

With this distance based matrix it is possible to calculate a correlogram. This is performed through the function ‘autocorr’. There are three arguments. The first is w is the distance-based weights matrix. The second, x is the variable of interest (note: the order of x and w should be the same as in the original data frame), and the third are the distance bands.

# autocorr function
autocorr <- function(w,x,dist=500){
  aa <- ceiling(max(w)/dist)
  dists <- seq(0,aa*dist,dist)
  cors <- NULL
  for(i in 1:aa){
    w1 <- ifelse(w > dists[i] & w <= dists[i+1], 1, 0) 
    w2 <- w1
    for(j in 1:dim(w1)[1]){
      nu <- sum(w1[j,])
        w2[j,] <- w1[j,]/nu
    lag <- w2 %*% x
    cors <- c(cors,cor(x,lag))

The function works in the following manner. For each step there is a distance band. Here, I have specified that the distance bands are 500km. For the first-step, the function calculates the correlation of the variable of interest (x) and its spatially lagged neighbours – who are defined as having a centre-point within a 500km radius. The next step calculates this correlation for neighbours falling into the 500km to 1,000 km distance band, so there will be a new spatial weighting matrix \rho. This is an iterative procedure, and continues up until the bands exceed the furthest neighbours in the data frame.

# at 500km
ac1 <- autocorr(w=popdists,x=pop$dens,dist=500)
# 1000 km
ac2 <- autocorr(w=popdists,x=pop$dens,dist=1000)

# MC analysis
it <- 1000
mc <- matrix(NA,nrow=it,ncol=length(ac1))
for(i in 1:it){
  pop$rand <- sample(pop$dens,length(pop$dens),replace=F)
  mc[i,] <- autocorr(w=popdists,x=pop$rand,dist=500)

ac1 <- data.frame(cbind(ac1,seq(500,20000,500)))
ac1 <- cbind(ac1,t(apply(mc,2,quantile, probs = c(0.025,0.975))))
names(ac1) <- c("ac","dist","lci","uci")

ggplot(ac1, aes(dist, ac)) +
  geom_point(colour = "darkblue", size = 3) +
  geom_line(colour = "red") +
  scale_x_continuous('Great Circle Distance (KM)',limits=c(0,20000)) + 
  scale_y_continuous('Autocorrelation',limits=c(-0.4,0.7)) +
  theme_bw() + 
  geom_hline(yintercept=0) +   
  geom_smooth(aes(ymin = lci, ymax = uci), stat="identity",fill="blue",colour="darkblue") 

I also calculated 95% confidence intervals using a simple Monte Carlo technique. This code, and that to generate the nice ggplot2 figure is shown below.

The analysis in the above provides a simple example of how R can be used to generate descriptive spatial statistics. Those interested in implementing spatial methodologies should examine the wide variety of packages on CRAN which can be used for spatial analysis. The ‘spdep‘ package provides most of the essentials. The recent release of the ‘splm‘ package, containing functions for panel regression analysis, is also noteworthy, and potentially very important, contribution.