# Endogenous Spatial Lags for the Linear Regression Model

Over the past number of years, I have noted that spatial econometric methods have been gaining popularity. This is a welcome trend in my opinion, as the spatial structure of data is something that should be explicitly included in the empirical modelling procedure. Omitting spatial effects assumes that the location co-ordinates for observations are unrelated to the observable characteristics that the researcher is trying to model. Not a good assumption, particularly in empirical macroeconomics where the unit of observation is typically countries or regions.

Starting out with the prototypical linear regression model: $y = X \beta + \epsilon$, we can modify this equation in a number of ways to account for the spatial structure of the data. In this blog post, I will concentrate on the spatial lag model. I intend to examine spatial error models in a future blog post.

The spatial lag model is of the form: $y= \rho W y + X \beta + \epsilon$, where the term $\rho W y$ measures the potential spill-over effect that occurs in the outcome variable if this outcome is influenced by other unit’s outcomes, where the location or distance to other observations is a factor in for this spill-over. In other words, the neighbours for each observation have greater (or in some cases less) influence to what happens to that observation, independent of the other explanatory variables $(X)$. The $W$ matrix is a matrix of spatial weights, and the $\rho$ parameter measures the degree of spatial correlation. The value of $\rho$ is bounded between -1 and 1. When $\rho$ is zero, the spatial lag model collapses to the prototypical linear regression model.

The spatial weights matrix should be specified by the researcher. For example, let us have a dataset that consists of 3 observations, spatially located on a 1-dimensional Euclidean space wherein the first observation is a neighbour of the second and the second is a neighbour of the third. The spatial weights matrix for this dataset should be a $3 \times 3$ matrix, where the diagonal consists of 3 zeros (you are not a neighbour with yourself). Typically, this matrix will also be symmetric. It is also at the user’s discretion to choose the weights in $W$. Common schemes include nearest k neighbours (where k is again at the users discretion), inverse-distance, and other schemes based on spatial contiguities. Row-standardization is usually performed, such that all the row elements in $W$ sum to one. In our simple example, the first row of a contiguity-based scheme would be: [0, 1, 0]. The second: [0.5, 0, 0.5]. And the third: [0, 1, 0].

While the spatial-lag model represents a modified version of the basic linear regression model, estimation via OLS is problematic because the spatially lagged variable $(Wy)$ is endogenous. The endogeneity results from what Charles Manski calls the ‘reflection problem’: your neighbours influence you, but you also influence your neighbours. This feedback effect results in simultaneity which renders bias on the OLS estimate of the spatial lag term. A further problem presents itself when the independent variables $(X)$ are themselves spatially correlated. In this case, completely omitting the spatial lag from the model specification will bias the $\beta$ coefficient values due to omitted variable bias.

Fortunately, remedying these biases is relatively simple, as a number of estimators exist that will yield an unbiased estimate of the spatial lag, and consequently the coefficients for the other explanatory variables—assuming, of course, that these explanatory variables are themselves exogenous. Here, I will consider two: the Maximum Likelihood estimator (denoted ML) as described in Ord (1975), and a generalized two-stage least squares regression model (2SLS) wherein spatial lags, and spatial lags lags (i.e. $W^{2} X$) of the explanatory variables are used as instruments for $Wy$. Alongside these two models, I also estimate the misspecified OLS both without (OLS1) and with (OLS2) the spatially lagged dependent variable.

To examine the properties of these four estimators, I run a Monte Carlo experiment. First, let us assume that we have 225 observations equally spread over a $15 \times 15$ lattice grid. Second, we assume that neighbours are defined by what is known as the Rook contiguity, so a neighbour exists if they are in the grid space either above or below or on either side. Once we create the spatial weight matrix we row-standardize.

Taking our spatial weights matrix as defined, we want to simulate the following linear model: $y = \rho Wy + \beta_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \epsilon$, where we set $\rho=0.4$ , $\beta_{1}=0.5$, $\beta_{2}=-0.5$, $\beta_{3}=1.75$. The explanatory variables are themselves spatially autocorrelated, so our simulation procedure first simulates a random normal variable for both $x_{2}$ and $x_{3}$ from: $N(0, 1)$, then assuming a autocorrelation parameter of $\rho_{x}=0.25$, generates both variables such that: $x_{j} = (1-\rho_{x}W)^{-1} N(0, 1)$ for $j \in \{ 1,2 \}$. In the next step we simulate the error term $\epsilon$. We introduce endogeneity into the spatial lag by assuming that the error term $\epsilon$ is a function of a random normal $v$ so $\epsilon = \alpha v + N(0, 1)$ where $v = N(0, 1)$ and $\alpha=0.2$, and that the spatial lag term includes $v$. We modify the regression model to incorporate this: $y = \rho (Wy + v) + \beta_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \epsilon$. From this we can calculate the reduced form model: $y = (1 - \rho W)^{-1} (\rho v + \beta_{1} + \beta_{2}x_{2} + \beta_{3}x_{3} + \epsilon)$, and simulate values for our dependent variable $y$.

Performing 1,000 repetitions of the above simulation permits us to examine the distributions of the coefficient estimates produced by the four models outlined in the above. The distributions of these coefficients are displayed in the graphic in the beginning of this post. The spatial autocorrelation parameter $\rho$ is in the bottom-right quadrant. As we can see, the OLS model that includes the spatial effect but does not account for simultaneity (OLS2) over-estimates the importance of spatial spill-overs. Both the ML and 2SLS estimators correctly identify the $\rho$ parameter. The remaining quadrants highlight what happens to the coefficients of the explanatory variables. Clearly, the OLS1 estimator provides the worst estimate of these coefficients. Thus, it appears preferable to use OLS2, with the biased autocorrelation parameter, than the simpler OLS1 estimator. However, the OLS2 estimator also yields biased parameter estimates for the $\beta$ coefficients. Furthermore, since researchers may want to know the marginal effects in spatial equilibrium (i.e. taking into account the spatial spill-over effects) the overestimated $\rho$ parameter creates an additional bias.

To perform these calculations I used the spdep package in R, with the graphic created via ggplot2. Please see the R code I used in the below.

library(spdep) ; library(ggplot2) ; library(reshape)

rm(list=ls())
n = 225
data = data.frame(n1=1:n)
# coords
data$lat = rep(1:sqrt(n), sqrt(n)) data$long = sort(rep(1:sqrt(n), sqrt(n)))
# create W matrix
wt1 = as.matrix(dist(cbind(data$long, data$lat), method = "euclidean", upper=TRUE))
wt1 = ifelse(wt1==1, 1, 0)
diag(wt1) = 0
# row standardize
rs = rowSums(wt1)
wt1 = apply(wt1, 2, function(x) x/rs)
lw1 = mat2listw(wt1, style="W")

rx = 0.25
rho = 0.4
b1 = 0.5
b2 = -0.5
b3 = 1.75
alp = 0.2

inv1 = invIrW(lw1, rho=rx, method="solve", feasible=NULL)
inv2 = invIrW(lw1, rho=rho, method="solve", feasible=NULL)

sims = 1000
beta1results = matrix(NA, ncol=4, nrow=sims)
beta2results = matrix(NA, ncol=4, nrow=sims)
beta3results = matrix(NA, ncol=4, nrow=sims)
rhoresults = matrix(NA, ncol=3, nrow=sims)

for(i in 1:sims){
u1 = rnorm(n)
x2 = inv1 %*% u1
u2 = rnorm(n)
x3 = inv1 %*% u2
v1 = rnorm(n)
e1 = alp*v1 + rnorm(n)
data1 = data.frame(cbind(x2, x3),lag.listw(lw1, cbind(x2, x3)))
names(data1) = c("x2","x3","wx2","wx3")

data1$y1 = inv2 %*% (b1 + b2*x2 + b3*x3 + rho*v1 + e1) data1$wy1 = lag.listw(lw1, data1$y1) data1$w2x2 = lag.listw(lw1, data1$wx2) data1$w2x3 = lag.listw(lw1, data1$wx3) data1$w3x2 = lag.listw(lw1, data1$w2x2) data1$w3x3 = lag.listw(lw1, data1$w2x3) m1 = coef(lm(y1 ~ x2 + x3, data1)) m2 = coef(lm(y1 ~ wy1 + x2 + x3, data1)) m3 = coef(lagsarlm(y1 ~ x2 + x3, data1, lw1)) m4 = coef(stsls(y1 ~ x2 + x3, data1, lw1)) beta1results[i,] = c(m1[1], m2[1], m3[2], m4[2]) beta2results[i,] = c(m1[2], m2[3], m3[3], m4[3]) beta3results[i,] = c(m1[3], m2[4], m3[4], m4[4]) rhoresults[i,] = c(m2[2],m3[1], m4[1]) } apply(rhoresults, 2, mean) ; apply(rhoresults, 2, sd) apply(beta1results, 2, mean) ; apply(beta1results, 2, sd) apply(beta2results, 2, mean) ; apply(beta2results, 2, sd) apply(beta3results, 2, mean) ; apply(beta3results, 2, sd) colnames(rhoresults) = c("OLS2","ML","2SLS") colnames(beta1results) = c("OLS1","OLS2","ML","2SLS") colnames(beta2results) = c("OLS1","OLS2","ML","2SLS") colnames(beta3results) = c("OLS1","OLS2","ML","2SLS") rhoresults = melt(rhoresults) rhoresults$coef = "rho"
rhoresults$true = 0.4 beta1results = melt(beta1results) beta1results$coef = "beta1"
beta1results$true = 0.5 beta2results = melt(beta2results) beta2results$coef = "beta2"
beta2results$true = -0.5 beta3results = melt(beta3results) beta3results$coef = "beta3"
beta3results$true = 1.75 data = rbind(rhoresults,beta1results,beta2results,beta3results) data$Estimator = data$X2 ggplot(data, aes(x=value, colour=Estimator, fill=Estimator)) + geom_density(alpha=.3) + facet_wrap(~ coef, scales= "free") + geom_vline(aes(xintercept=true)) + scale_y_continuous("Density") + scale_x_continuous("Effect Size") + opts(legend.position = 'bottom', legend.direction = 'horizontal')  # Heteroskedastic GLM in R A commenter on my previous blog entry has drawn my attention to an R function called hetglm() that estimates heteroskedastic probit models. This function is contained in the glmx package. The glmx package is not available on CRAN yet, but thankfully can be downloaded here. The hetglm() function has a number of computational advantages compared with the crude method outlined in my previous post. The following example replicates the previous analysis showing the speed advantage associated with using the hetglm() function. rm(list=ls()) # clear ws library(maxLik) library(glmx) n <- 1000 # no. obs x1 <- runif(n,-1,1) # predictor 1 x2 <- runif(n,-1,1) # " 2 e1 <- rnorm(n,0,1) # normal error e2 <- (1 + 0.45*(x1+x2))*e1 # hetero error y <- ifelse(0.5 + 0.5*x1 -0.5*x2 - e2 >0, 1, 0) # outcome # estimate normal probit r1 <- glm(y~x1+x2,family=binomial(link="probit")) system.time(ml1 <- maxLik(hll,start=c(0,0,0,0,0))) # maximize system.time(h1 <- hetglm(y ~ x1 + x2))  # output > system.time(ml1 <- maxLik(hll,start=c(0,0,0,0,0))) # maximize user system elapsed 4.43 0.00 4.59 > system.time(h1 <- hetglm(y ~ x1 + x2)) user system elapsed 0.11 0.00 0.11  # The Heteroskedastic Probit Model Specification testing is an important part of econometric practice. However, from what I can see, few researchers perform heteroskedasticity tests after estimating probit/logit models. This is not a trivial point. Heteroskedasticity in these models can represent a major violation of the probit/logit specification, both of which assume homoskedastic errors. Thankfully, tests for heteroskedasticity in these models exist, and it is also possible to estimate modified binary choice models that are robust to heteroskedastic errors. In this blog post I present an example of how to estimate a heteroskedastic probit in R, and also test for heteroskedasticity. The standard probit model assumes that the error distribution of the latent model has a unit variance. The heteroskedastic probit model relaxes this assumption, and allows the error variance to depend on some of the predictors in the regression model. Those interested in further details of this model, and the potential implications of this form of model misspecification, should consult these notes. In the code below, I simulate some data, specify the log-likelihood function for the heteroskedastic probit model, estimate this model via maximum likelihood, and then perform a simple LR test of homoskedasticity. Note the log-likelihood function can be simplified from: $\ln L (\beta, \gamma | X_{i}, Z_{i}) = \sum^{N}_{i=1} \{ Y_{i} \ln \Phi [X_{i}\beta \exp(-Z_{i}\gamma)] + (1-Y_{i}) \ln [1-\Phi (X_{i}\beta \exp(-Z_{i}\gamma))] \}$ to: $\ln L (\beta, \gamma | X_{i}, Z_{i}) = \sum^{N}_{i=1} \{ \ln \Phi [q_{i}(X_{i}\beta \exp(-Z_{i}\gamma))]\}$ where $q_{i}=2y_{i}-1$ uses the fact that the PDF of the normal distribution is symmetric to cut out some algebra (and possibly lead to an easier log-likelihood function to maximize). The log-liklihood function in the above may prove difficult to maximize, although there is an LM test for homoskedasticity, that only requires the restricted (standard probit model) to be estimated. An interesting application would be to see if Bayesian methods estimated using a Gibbs sampling (via STAN, Bugs, or JAGS) can help deal with computationally difficult log likelihoods in this regard. A topic, I will hopefully turn to in a future blog post. # hetprob rm(list=ls()) # clear ws library(maxLik) # use max lik package # dgp n <- 1000 # no. obs x1 <- runif(n,-1,1) # predictor 1 x2 <- runif(n,-1,1) # " 2 e1 <- rnorm(n,0,1) # normal error e2 <- (1 + 0.45*(x1+x2))*e1 # hetero error y <- ifelse(0.5 + 0.5*x1 -0.5*x2 - e2 >0, 1, 0) # outcome # estimate normal probit r1 <- glm(y~x1+x2,family=binomial(link="probit")) # hetprob llik hll <- function(beta){ beta1 <- beta[1:dim(X)[2]] beta2 <- beta[(1+dim(X)[2]):((dim(X)[2]+dim(X)[2])-1)] q <- 2*y-1 # sym tranform xbeta <- pnorm(q*((X %*% beta1)/exp(X[,-1] %*% beta2))) sum(log(xbeta)) } X <- cbind(1,x1,x2) ml1 <- maxLik(hll,start=c(0,0,0,0,0)) # maximize sqrt(diag(-solve(ml1$hessian))) # get standard errors

# LR test of homosked
# 2*(LR_{unrestriced}-LR_{restriced})
LRtest <- 2*(ml1$maximum+1/2*r1$deviance)
# LS is chi^2 distributed with 2 dof
1-pchisq(LRtest,df=2)
# REJECT (as expected)


# Optim, you’re doing it wrong?

Call me uncouth, but I like my TV loud, my beer cold and my optimization functions as simple as possible. Therefore, what I write in this blog post is very much from a layman’s perspective, and I am happy to be corrected on any fundamental errors.

I have recently become interested in writing my own maximum likelihood estimators. However, before examining any complex optimization problems, I wanted to perform a very simple analysis, and estimate a standard probit regression model.

Obviously, this is a simple model, which can be estimated using the glm function. In the following, I am assuming that the glm coefficients and standard errors are the (most) correct ones (again, I am happy to be corrected on this). The data and model I use is one of female labor force participation taken from William H. Greene’s well-known econometrics textbook.

lfp <- read.csv("lfp.csv")
lfp$kids <- ifelse(lfp$k6==1 | lfp$k18==1, 1, 0) formula=lfp ~ age+I(age^2)+log(inc)+ed+kids data=lfp aa <- lm(formula,data) X <- model.matrix(aa) y <- data.frame(lfp$lfp)
pll <- function(beta){
sum(y*log(pnorm(X %*% beta)) + (1-y)*log(1-pnorm(X %*% beta)))
}


The above code loads these data and specifies the basic probit log-likelihood (pll). The idea here is: given the design matrix X and outcome vector y, what beta vector maximizes the log-likelihood function pll? To perform this optimization problem, I use the following two functions: optim, which is part of the stats package, and maxLik, a function from the package of the same name.

> system.time(ml1 <- optim(coef(aa)*2.5, pll, method="BFGS",
+              control=list(maxit=5000, fnscale=-1), hessian=T))
user  system elapsed
2.59    0.00    2.66
> system.time(ml2 <- maxLik(pll,start=coef(aa)*2.5))
user  system elapsed
4.36    0.00    4.46
user  system elapsed
0.03    0.00    0.04
>
> cbind(ml1$par,ml2$estimate,coef(mod1))
[,1]         [,2]        [,3]
(Intercept) -3.3929077241 -5.326569347 -5.32655206
age         -0.0565559741  0.128949421  0.12894921
I(age^2)     0.0004543657 -0.001675693 -0.00167569
log(inc)     0.4287758432  0.223211383  0.22320995
ed           0.0812164389  0.086559458  0.08655950
kids        -0.3168540384 -0.323633523 -0.32363339
> cbind(sqrt(diag(-solve(ml1$hessian))), + sqrt(diag(-solve(ml2$hessian))),
+       sqrt(diag(vcov(mod1))))
[,1]         [,2]         [,3]
(Intercept) 9.387307e-01 1.6182304697 1.5569666788
age         8.835891e-03 0.0659315872 0.0642575252
I(age^2)    8.978703e-05 0.0007601266 0.0007413629
log(inc)    9.897385e-02 0.0998697296 0.0999188711
ed          2.302763e-02 0.0230260173 0.0229592562
kids        1.020938e-01 0.1017291904 0.1015104540


In the above, I contrast the output obtained by the two optimization functions, alongside the results produced using the standard glm function. Firstly, we can see that optim is almost twice the speed of maxLik. However, the coefficients produced by the maxLik function (column 2) are much closer to the glm coefficients (column 3) than those obtained via optim (column 1). A comparison of the standard errors, taken from the Hessian matrix, show that the maxLik function once again ‘outperforms’ the optim function, which appears to yield completely incorrect standard errors.

> pgr <- function(beta){
+   t(y*(dnorm(X %*% beta)/pnorm(X %*% beta)) -
+     (1-y)*(dnorm(X %*% beta)/(1-pnorm(X %*% beta)))) %*% X
+ }
> system.time(ml1 <- optim(coef(aa)*2.5, pll, pgr, method="BFGS",
+                          control=list(maxit=5000, fnscale=-1), hessian=T))
user  system elapsed
0.40    0.00    0.42
> system.time(ml2 <- maxLik(pll, pgr, start=coef(aa)*2.5))
user  system elapsed
0.57    0.00    0.58
> cbind(ml1$par,ml2$estimate,coef(mod1))
[,1]        [,2]        [,3]
(Intercept) -5.326570797 -5.32656034 -5.32655206
age          0.128949476  0.12894917  0.12894921
I(age^2)    -0.001675693 -0.00167569 -0.00167569
log(inc)     0.223211464  0.22321101  0.22320995
ed           0.086559408  0.08655946  0.08655950
kids        -0.323633519 -0.32363352 -0.32363339
> cbind(sqrt(diag(-solve(ml1$hessian))), + sqrt(diag(-solve(ml2$hessian))),
+       sqrt(diag(vcov(mod1))))
[,1]         [,2]         [,3]
(Intercept) 0.50590822 1.5510444988 1.5569666788
age                NaN 0.0640955648 0.0642575252
I(age^2)           NaN 0.0007394648 0.0007413629
log(inc)    0.09833250 0.0989695518 0.0999188711
ed          0.02293623 0.0230289024 0.0229592562
kids        0.10153358 0.1017304094 0.1015104540
Warning message:
In sqrt(diag(-solve(ml1$hessian))) : NaNs produced  The slow performance of both optim and maxLik is because I did not specify the gradient function (i.e. first derivatives). Without a gradient function both optimizers are forced to approximate. In the above, I add a gradient function pgr. This speeds up both optimization procedures considerably. Furthermore, the similarity of the estimated coefficients is evident. Nevertheless, problems are encountered in the calculation of the Hessian matrix in the optim case. Once again, fewer problems are encountered with the maxLik function. Based on my experience thus far, I would recommend the use of maxLik compared to optim. # Time-Series Policy Evaluation in R Quantifying the success of government policies is clearly important. Randomized control trials, like those conducted by drug companies, are often described as the ‘gold-standard’ for policy evaluation. Under these, a policy is implemented in/to one area/group (treatment), but not in/to another (control). The difference in outcomes between the two areas or groups represents the effectiveness of the policy. Unfortunately, there many circumstances in which governments introduce policies and are not able to measure how are well, if at all, they work. A good example of this is childhood vaccinations against disease, since it is unethical to deny some children a potentially life-saving treatment. This makes evaluating these policies difficult, because there is no counterfactual event, i.e. everyone gets the treatment (in this case vaccination). However, it is possible to evaluate the effectiveness of certain policies with time-series econometrics. For this example, I am going to use a series of monthly road death statistics in Ireland over the period 2003 to present. In November 2011, a new drink driving law was introduced which lowered the blood-alcohol limits. How has the introduction of this law changed road fatalities? To evaluate this policy, I estimated a simple state space model using the R package sspir. The code I used for this example is almost exactly like the van drivers example in this paper. The following generalized linear model is assumed: Po(y_{t}) = a_{t} + b(LAW) + S_{t}, where the number of deaths, y_{t}, is a Poisson process that depends on a random walk, a_{t}, time-varying seasonal patterns, S_{t}, and the introduction of the law. The b parameter is the measure of the policy’s effect. The usefulness of this model for policy evaluation, compared to an OLS linear regression for example, stems from the time-varying intercept a_{t}. In effect, this parameter substitutes the counterfactual or control group that we would need in a randomized control trial because it takes into account the underlying trend in the time-series before the policy change. The law dummy variable adds a shift to the underlying trend, acting as the relevant treatment effect. A full overview of the state space system, and estimation using the Kalman filter is beyond the scope of this blog post. Interested readers should consult this paper and the references therein. The code to generate these results is below. > head(road) deaths drinkdrive ind 1 20 0 1 2 21 0 2 3 33 0 3 4 23 0 4 5 28 0 5 6 37 0 6 mle <- function(psi) { m1 <- ssm(deaths ~ tvar(1) + drinkdrive + tvar(sumseason(time,12)), family=poisson(link="log"), data=road, phi=c(1,exp(psi)),C0=diag(13)*1000) m1fit <- getFit(m1) return(extended(m1fit)$likelihood)
}

res1 <- optim(par=c(0,0),fn=mle,method=c("L-BFGS-B"),
control=list(fnscale=-1),hessian=FALSE)


The first part of the code sets up the state space system inside a function. The reason for this is to calculate the signal-to-noise ratio, which I call psi, using maximum likelihood. The signal-to-noise ratio dictates the ‘wiggliness’ of the random walk a_{t}. Too much signal will result in a_{t} matching the series exactly. Too little, and the random walk will collapse to a constant term, as in OLS. A similar argument can be made for the time-varying seasonal patterns.

After calculating the signal-to-noise ratio, I estimated the model with the maximum likelihood parameters. The function ssm automatically runs the iterated Kalman smoother, which provides the relevant information.

# estimate with mle pars
m2 <- ssm(deaths ~ tvar(1) + drinkdrive + tvar(sumseason(time,12)),
phi=c(1,exp(res1$par)),C0=diag(13)*1000) m2fit <- getFit(m2) 100 * (exp(m2fit$m[1,2])-1) # effect of drink driving law

# calculate sd
m2sd <- NULL
for (i in 1:length(road$deaths)) { thisone <- m2fit$C[[i]][1:2,1:2]
if (road$drinkdrive[i]==0) { m2sd <- c(m2sd,sqrt(thisone[1,1])) } else m2sd <- c(m2sd,sqrt(sum(thisone))) } # plot road$ind <- 1:112
plot(road$ind,road$deaths,ylim=c(0,50),col="red",type="l",axes=F,
lines(exp(road$drinkdrive*m2fit$m[,2] + m2fit$m[,1]),col="blue") lines(exp(road$drinkdrive*m2fit$m[,2] + m2fit$m[,1]+2*m2sd),lty=2,col="blue")
lines(exp(road$drinkdrive*m2fit$m[,2] + m2fit\$m[,1]-2*m2sd),lty=2,col="blue")