Simple Spatial Correlograms for Cross-Country Analysis in R

Accounting for temporal dependence in econometric analysis is important, as the presence of temporal dependence violates the assumption that observations are independent units. Historically, much less attention has been paid to correcting for spatial dependence, which, if present, also violates this independence assumption.

The comparability of temporal and spatial dependence is useful for illustrating why spatial dependence may matter. For example, a time-series analysis of a county’s GDP would examine how related this year’s figure was to preceding years, whereas a spatial analysis would examine how related one country’s figure was to that of their neighbours.

However, these temporal and spatial issues are not completely analogous. This difference is highlighted with the creation of lagged variables. In time-series, the lag structure is intuitively easy to comprehend. A variable y_{t} has lags of y_{t-1}, …, y_{t-k}.

Lagging a spatial variable introduces more ambiguity. A spatially lagged variable of y_{i} can be calculated as \rho(y_{i}) where \rho is a spatial weighting matrix. Interested readers should consult further sources on this, but needless to say there are a number of ways (nearest neighbour, distance etc.) in which we could define \rho.

The difference between spatial and temporal processes in the creation of lagged variables is somewhat compounded when we consider that it is not possible to create ‘lags of lags’ in the same manner as in time-series (so \rho^2 is not the valid). Your neighbour’s neighbour might be your neighbour.

Given the difficulty in creating spatial lags of spatial lags, is there an easy way of recreating a simple time-series correlogram for a spatial variable? Yes, indeed it is. The solution I present here is not an exclusive way to analyze a spatially dependent variable.

For this analysis I use different countries spatial coordinates and create a correlogram of population density in 1900 (these data are taken from this paper). The first step involves loading these data into the R workspace such that the data frame contains the latitude and longitude coordinates alongside the potentially spatially autocorrelated variable.

> pop <- read.csv("dat2.csv")
> head(pop)
        lon       lat              country        dens
1  66.02554  33.83319          Afghanistan  79.2543748
2  17.55091 -12.29862               Angola  24.2375278
3  20.07006  41.14267              Albania 282.7854408
4  54.33089  23.91317 United Arab Emirates   5.0438742
5  44.93819  40.29368              Armenia   0.2136146
6 134.48679 -25.73251            Australia   4.8896695

With the data frame, I calculated a distance matrix with the ‘rdist.earth’ function from the package ‘fields’. This function measures the distance between each set of data points based on their coordinates. This function recognizes that the world is not flat, and as such calculates what are known as great-circle distances.

# calculate great cirlce distances with fields package
library(fields)
# popdists matrix
popdists <- as.matrix(rdist.earth(cbind(pop$lon, pop$lat), miles = F, R = NULL))
diag(popdists) <- 0

With this distance based matrix it is possible to calculate a correlogram. This is performed through the function ‘autocorr’. There are three arguments. The first is w is the distance-based weights matrix. The second, x is the variable of interest (note: the order of x and w should be the same as in the original data frame), and the third are the distance bands.

# autocorr function
autocorr <- function(w,x,dist=500){
  aa <- ceiling(max(w)/dist)
  dists <- seq(0,aa*dist,dist)
  cors <- NULL
  for(i in 1:aa){
    w1 <- ifelse(w > dists[i] & w <= dists[i+1], 1, 0) 
    w2 <- w1
    for(j in 1:dim(w1)[1]){
      nu <- sum(w1[j,])
      if(nu>0){
        w2[j,] <- w1[j,]/nu
      }  
    }
    lag <- w2 %*% x
    cors <- c(cors,cor(x,lag))
  }
  return(cors)
}

The function works in the following manner. For each step there is a distance band. Here, I have specified that the distance bands are 500km. For the first-step, the function calculates the correlation of the variable of interest (x) and its spatially lagged neighbours – who are defined as having a centre-point within a 500km radius. The next step calculates this correlation for neighbours falling into the 500km to 1,000 km distance band, so there will be a new spatial weighting matrix \rho. This is an iterative procedure, and continues up until the bands exceed the furthest neighbours in the data frame.

# at 500km
ac1 <- autocorr(w=popdists,x=pop$dens,dist=500)
# 1000 km
ac2 <- autocorr(w=popdists,x=pop$dens,dist=1000)

# MC analysis
it <- 1000
mc <- matrix(NA,nrow=it,ncol=length(ac1))
for(i in 1:it){
  pop$rand <- sample(pop$dens,length(pop$dens),replace=F)
  mc[i,] <- autocorr(w=popdists,x=pop$rand,dist=500)
}

library(ggplot2)
ac1 <- data.frame(cbind(ac1,seq(500,20000,500)))
ac1 <- cbind(ac1,t(apply(mc,2,quantile, probs = c(0.025,0.975))))
names(ac1) <- c("ac","dist","lci","uci")

ggplot(ac1, aes(dist, ac)) +
  geom_point(colour = "darkblue", size = 3) +
  geom_line(colour = "red") +
  scale_x_continuous('Great Circle Distance (KM)',limits=c(0,20000)) + 
  scale_y_continuous('Autocorrelation',limits=c(-0.4,0.7)) +
  theme_bw() + 
  geom_hline(yintercept=0) +   
  geom_smooth(aes(ymin = lci, ymax = uci), stat="identity",fill="blue",colour="darkblue") 

I also calculated 95% confidence intervals using a simple Monte Carlo technique. This code, and that to generate the nice ggplot2 figure is shown below.

The analysis in the above provides a simple example of how R can be used to generate descriptive spatial statistics. Those interested in implementing spatial methodologies should examine the wide variety of packages on CRAN which can be used for spatial analysis. The ‘spdep‘ package provides most of the essentials. The recent release of the ‘splm‘ package, containing functions for panel regression analysis, is also noteworthy, and potentially very important, contribution.

About these ads

12 thoughts on “Simple Spatial Correlograms for Cross-Country Analysis in R

  1. Thanks. This is a nice post. Unfortunatly, running your code i got the following warnings and error:

    Warning:
    “In cor(x, lag) : the standard deviation is zero”

    Error:
    Error in t(apply(mc, 2, quantile, probs = c(0.025, 0.975))) :
    error in evaluating the argument ‘x’ in selecting a method for function ‘t’: Error in quantile.default(newX[, i], …) :
    missing values and NaN’s not allowed if ‘na.rm’ is FALSE

      • Thanks for the reply. I just used your data:

        id,lon,lat,country,dens
        1,66.02554,33.83319,”Afghanistan”,79.2543748
        2,17.55091,-12.29862,”Angola”,24.2375278
        3,20.07006,41.14267,”Albania”,282.7854408
        4,54.33089,23.91317,”UAE”,5.0438742
        5,44.93819,40.29368,”Armenia”,0.2136146
        6,134.48679,-25.73251,”Australia”,4.8896695

  2. Very nice post! I would like to replicate your analysis, how did you extract the data set from the paper you referenced to obtain the data2.csv.

    Thanks,
    John

    • They are similar but my function only calculates the raw correlation, whereas Moran’s I or Geary’s C are slightly more elaborate (I think). Moran’s I is more commonly used, and it should be easy to change my code to use Moran’s measure.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s