Why I use Panel/Multilevel Methods

I don’t understand why any researcher would choose not to use panel/multilevel methods on panel/hierarchical data. Let’s take the following linear regression as an example:

y_{it} = \beta_{0} + \beta_{1}x_{it} + a_{i} + \epsilon_{it},

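where a_{i} is a random effect for the i-th group. Provided a_{i} is uncorrelated with x_{it}, a pooled OLS regression of the above is unbiased and consistent. However, it will be inefficient unless a_{i}=0 for all i, and its conventional standard errors ignore the within-group correlation of the errors.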

Let’s have a look at the consequences of this inefficiency using a simulation. I will simulate the following model:

y_{it} = 1 + 5 x_{it} + a_{i} + \epsilon_{it},

with a_{i} \sim N(0, 3) and \epsilon_{it} \sim N(0, 1), where the second argument is the standard deviation (as in the rnorm() calls in the code). I will run this simulation and compare the following 4 estimators: pooled OLS; random effects (RE), AKA a multilevel model with a random intercept; a correlated random effects (CRE) model, which includes the group mean of x_{it} as a regressor as in Mundlak (1978); and finally the regular fixed effects (FE) model. I am doing this in R, so for the first model I will use the simple lm() function, for the second and third lmer() from the lme4 package, and for the last the excellent felm() function from the lfe package. These models will be tested under two conditions. First, we will assume that the random effects assumption holds, i.e. the regressor is uncorrelated with the random effect. After looking at this, we will then allow the random effect to correlate with the regressor x_{it}.
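To make the Mundlak device concrete, here is a minimal base-R sketch on simulated data. It uses lm() rather than lmer() so that it needs no packages, and every object name in it (g, xbar, b_mundlak, and so on) is illustrative rather than taken from the simulation code. Under these assumptions, adding the group mean of x to a pooled regression gives the same slope on x as the within (fixed effects) estimator.

```r
# Sketch only: Mundlak (CRE) device with pooled OLS on simulated data.
set.seed(1)
n_groups  <- 50
n_periods <- 5
g <- rep(1:n_groups, each = n_periods)      # group identifier
a <- rnorm(n_groups, 0, 3)[g]               # group effect, repeated within group
x <- runif(length(g)) + 0.2 * a             # regressor correlated with a
y <- 1 + 5 * x + a + rnorm(length(g))       # outcome

xbar <- ave(x, g)                            # group means of x
ybar <- ave(y, g)                            # group means of y

# pooled OLS with the group mean of x added as a regressor
b_mundlak <- coef(lm(y ~ x + xbar))["x"]
# within estimator: regress demeaned y on demeaned x, no intercept
b_within  <- coef(lm(I(y - ybar) ~ 0 + I(x - xbar)))[1]

c(mundlak = unname(b_mundlak), within = unname(b_within))
```

The two numbers should agree up to floating point error, which is why the CRE and FE boxplots tend to look so similar.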

The graph below shows the importance of using panel methods over pooled OLS. It shows boxplots of the 100 simulated estimates. Even when the random effects assumption is violated, the random effects estimator (RE) is far superior to simple pooled OLS. Both the CRE and FE estimators perform well. Both have the lowest root mean square errors, even when the model satisfies the random effects assumption. Please see my R code below.


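# clear workspace
rm(list = ls())

# load packages
library(plyr)     # join()
library(lme4)     # lmer()
library(lfe)      # felm()
library(reshape)  # melt(); assumed from the X2 column name used further down
library(ggplot2)  # plotting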
# from this:

### set number of individuals
n = 200
# time periods
t = 5

### model is: y=beta0_{i} +beta1*x_{it} + e_{it}
### average intercept and slope
beta0 = 1.0
beta1 = 5.0

### set loop reps
loop = 100
### results to be entered
results1 = matrix(NA, nrow=loop, ncol=4)
results2 = matrix(NA, nrow=loop, ncol=4)

for(i in 1:loop){
  # basic data structure
  data = data.frame(t = rep(1:t,n),
                    n = sort(rep(1:n,t)))
  # random effect/intercept to add to each group
  rand = data.frame(n = 1:n,
                    a = rnorm(n,0,3))
  data = join(data, rand, match="first")
  # random error
  data$u = rnorm(nrow(data), 0, 1)
  # regressor x
  data$x = runif(nrow(data), 0, 1)
  # outcome y
  data$y = beta0 + beta1*data$x + data$a + data$u  
  # store the group id as character so it is treated as categorical
  data$n = as.character(data$n)
  # group means of x for correlated random effects
  data$xn = ave(data$x, data$n, FUN=mean)
  # pooled OLS
  a1 = lm(y ~ x, data)
  # random effects
  a2 = lmer(y ~ x + (1|n), data)
  # correlated random effects
  a3 = lmer(y ~ x + xn + (1|n), data)
  # fixed effects
  a4 = felm(y ~ x | n, data)
  # gather results
  results1[i,] = c(coef(a1)[2],
                   fixef(a2)[2],
                   fixef(a3)[2],
                   coef(a4)[1])
  ### now let random effects assumption be false
  ### ie E[xa]!=0
  data$x = runif(nrow(data), 0, 1) + 0.2*data$a
  # the below is like above
  data$y = beta0 + beta1*data$x + data$a + data$u  
  data$n = as.character(data$n)
  data$xn = ave(data$x, data$n, FUN=mean)
  a1 = lm(y ~ x, data)
  a2 = lmer(y ~ x + (1|n), data)
  a3 = lmer(y ~ x + xn + (1|n), data)
  a4 = felm(y ~ x | n, data)
  results2[i,] = c(coef(a1)[2],
                   fixef(a2)[2],
                   fixef(a3)[2],
                   coef(a4)[1])
}
# calculate rmse
apply(results1, 2, function(x) sqrt(mean((x-5)^2)))
apply(results2, 2, function(x) sqrt(mean((x-5)^2)))

# shape data and do ggplot
model.names = data.frame(X2 = c("1","2","3","4"),
                         estimator = c("OLS","RE","CRE","FE"))
res1 = melt(results1)
res1 = join(res1, model.names, match="first")
res2 = melt(results2)
res2 = join(res2, model.names, match="first")

res1$split = "RE Valid"
res2$split = "RE Invalid"
res1 = rbind(res1, res2)

res1$split = factor(res1$split, levels =  c("RE Valid", "RE Invalid"))
res1$estimator = factor(res1$estimator, levels =  c("OLS","RE","CRE","FE"))

number_ticks = function(n) {function(limits) pretty(limits, n)}

ggplot(res1, aes(estimator, value)) + 
  geom_boxplot(fill="lightblue") +
  #coord_flip() +
  facet_wrap( ~ split, nrow = 2, scales = "free_y") +
  geom_hline(yintercept = 5) +
  scale_x_discrete('') + 
  scale_y_continuous(bquote(beta==5), breaks=number_ticks(3)) + 
  theme_bw() + 
  theme(legend.title = element_blank(),
        legend.text = element_text(size=16),
        strip.text.x = element_text(size = 16),
        axis.text.x = element_text(angle = 45, hjust = 1))
ggsave("remc.pdf", width=8, height=6)

How Predictable is the English Premier League?


The reason why football is so exciting is uncertainty. The outcome of any match or league is unknown, and you get to watch the action unfold without knowing what’s going to happen. Watching matches where you know the score is never exciting.

This weekend the English Premier League season will conclude with little fanfare. Bar one relegation place, the league positions have already been determined. In fact, these positions were, for the most part, decided weeks ago. The element of uncertainty seems to have been reduced this season.

With this in mind, I wanted to look at uncertainty over the long run in English football. To do this, I used the data provided by http://www.football-data.co.uk/ and analyzed these with R. These data consist of 34,740 matches played in the top 5 divisions of English football between 2000 and 2015, containing information about both the result and the odds offered by bookies on this result.

To measure the uncertainty of any given match I used the following strategy. First, I averaged across all bookies’ odds for the three possible events: home win, draw, and away win. Next, I mapped these aggregated odds into probabilities by inverting each of the odds and then dividing by the sum of the inverted odds. This takes care of the overround that helps bookies to make a profit. For example, if the decimal odds were 2.1 that an event happens and 2.1 that it doesn’t, then the probability of the event occurring is:

(1/2.1) / (1/2.1 + 1/2.1) = 0.4761905 / (0.4761905 + 0.4761905) = 0.5.

Finally, to measure the uncertainty of each match, I subtract the probability of the event that occurred from 1, to calculate a “residual” score. Imagine a home win occurs. The “residual” in this case will be 1-P(home win). If P(home win)=1, then there is no uncertainty, and this uncertainty score will be zero. Since there are 3 outcomes, we would expect an uncertainty measure to be bounded between 0 (no uncertainty) and 0.67 (pure uncertainty), where we get 1 out of 3 right by just guessing.
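As a small worked sketch of the two steps above, the snippet below takes one match with made-up decimal odds (the numbers are hypothetical, not drawn from the football-data files), strips the overround, and computes the “residual” for a home win.

```r
# Hypothetical decimal odds for a single match (illustrative values only)
odds <- c(home = 2.0, draw = 3.5, away = 4.0)

inv       <- 1 / odds           # inverted odds
overround <- sum(inv) - 1       # bookmaker margin above a fair book
probs     <- inv / sum(inv)     # implied probabilities, rescaled to sum to 1

# "residual" uncertainty score if the match ends in a home win
uncertainty <- 1 - probs["home"]

round(probs, 3)
uncertainty
```

The more heavily the bookies favoured the outcome that actually happened, the smaller this score becomes.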

After importing these data into R and calculating the uncertainty measure, I looked at this uncertainty measure over time. The plot above shows fitted smoothed trend lines of uncertainty, stratified by division. These trends are striking. Going by this graph, the Premier League has gotten more predictable over the analysis period. In 2000, the uncertainty measure was around 0.605. Given that we expect this measure to be bounded between 0 (complete certainty) and 0.67 (completely random), this tells us that the average league game was very unpredictable. Over time, however, this measure has decreased by about 5%, which does not seem like much. Despite the somewhat unexciting end to the 2014/15 season, the outcome of the average game is still not very predictable.

Noticeably, in lower league games there is even greater uncertainty. In fact, the average uncertainty measure of League 2 games approached a value of 0.65 in 2014. This indicates that the average League 2 game is about as unpredictable as playing rock-paper-scissors. Interestingly, and unlike the Premier League, there does not appear to be any discernible change over time. The games are just as unpredictable now as they were in 2000. Please see my R code below.

# clear workspace
rm(list = ls())

# libraries
library(ggplot2)

# build the urls of the season/division files

years = c(rep("0001",4), rep("0102",4), rep("0203",4), rep("0405",4),
          rep("0506",5), rep("0607",5), rep("0708",5), rep("0809",5),
          rep("0910",5), rep("1011",5), rep("1112",5), rep("1213",5),
          rep("1314",5), rep("1415",5))
divis = c(rep(c("E0","E1","E2","E3"),4), rep(c("E0","E1","E2","E3","EC"),10))

urls = paste(years, divis, sep="/")
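urls = paste("http://www.football-data.co.uk/mmz4281", urls, sep="/")
# assumption: each season/division file is served with a .csv extension
urls = paste(urls, "csv", sep=".")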

# bookmaker odds columns in (home, draw, away) triples;
# further bookmakers can be appended as triples in the same order
odds = c("B365H","B365D","B365A")
home = odds[seq(1,length(odds),3)]
draw = odds[seq(2,length(odds),3)]
away = odds[seq(3,length(odds),3)]

# load all data in a loop
full.data = NULL
for(i in 1:length(urls)){
  temp = read.csv(urls[i])
  # calculate average odds (drop=FALSE keeps a data.frame even with one bookmaker)
  temp$homeodds = apply(temp[,names(temp) %in% home, drop=FALSE], 1, function(x) mean(x,na.rm=T))
  temp$drawodds = apply(temp[,names(temp) %in% draw, drop=FALSE], 1, function(x) mean(x,na.rm=T))
  temp$awayodds = apply(temp[,names(temp) %in% away, drop=FALSE], 1, function(x) mean(x,na.rm=T))
  temp = temp[,c("Div","Date","FTHG","FTAG","FTR","homeodds","drawodds","awayodds")]
  full.data = rbind(full.data, temp)
}

full.data$homewin = ifelse(full.data$FTR=="H", 1, 0)
full.data$draw = ifelse(full.data$FTR=="D", 1, 0)
full.data$awaywin = ifelse(full.data$FTR=="A", 1, 0)

# convert odds to probabilities, removing the overround
full.data$homeprob = (1/full.data$homeodds)/(1/full.data$homeodds+1/full.data$drawodds+1/full.data$awayodds)
full.data$drawprob = (1/full.data$drawodds)/(1/full.data$homeodds+1/full.data$drawodds+1/full.data$awayodds)
full.data$awayprob = (1/full.data$awayodds)/(1/full.data$homeodds+1/full.data$drawodds+1/full.data$awayodds)

# bookie residual
full.data$bookieres = 1-full.data$homeprob
full.data$bookieres[full.data$FTR=="D"] = 1-full.data$drawprob[full.data$FTR=="D"]
full.data$bookieres[full.data$FTR=="A"] = 1-full.data$awayprob[full.data$FTR=="A"]

# now plot over time
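# dates appear as dd/mm/yy (8 characters) or dd/mm/yyyy; ifelse() returns day counts
full.data$time = ifelse(nchar(as.character(full.data$Date))==8, 
                        as.Date(as.character(full.data$Date), format="%d/%m/%y"),
                        as.Date(as.character(full.data$Date), format="%d/%m/%Y"))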
full.data$date = as.Date(full.data$time, origin = "1970-01-01") 

full.data$Division = "Premier League" 
full.data$Division[full.data$Div=="E1"] = "Championship" 
full.data$Division[full.data$Div=="E2"] = "League 1" 
full.data$Division[full.data$Div=="E3"] = "League 2" 
full.data$Division[full.data$Div=="EC"] = "Conference" 

full.data$Division = factor(full.data$Division, levels = c("Premier League", "Championship", "League 1",
                                                           "League 2","Conference"))

ggplot(full.data, aes(date, bookieres, colour=Division)) +
  stat_smooth(size = 1.25, alpha = 0.2) +
  labs(x = "Year", y = "Uncertainty") + 
  theme_bw() +
  theme(legend.position="bottom",
        legend.title = element_text(size=20),
        legend.text = element_text(size=20))

Coal and the Conservatives

Interesting election results in the UK over the weekend, where the Conservatives romped to victory. This was despite a widespread consensus that neither the Conservative nor the Labour party would get a majority. This was a triumph for uncertainty and random error over the deterministic, as none of the statistical forecasts appeared to deem such a decisive victory probable. The UK election is a lot harder to model, for numerous reasons, when compared to the US.

This means that a lot of pollsters and political forecasters will have to go back to the drawing board and re-evaluate their methods. Obviously, the models used to forecast the 2015 election could not handle the dynamics of the British electorate. However, there is a high degree of persistence within electoral constituencies. Let’s explore this persistence by looking at the relationship between coal and % Conservative (Tory) votes.

Following a tweet by Vaughan Roderick and using the methodology of Fernihough and O’Rourke (2014), I matched each of the constituencies to Britain’s coalfields, creating a “proximity to coal” measure. What the plot below shows is striking. Being located on or in close proximity to a coal field is associated with a Tory vote share that is about 20 percentage points lower (the regressions below put the coefficient at about 25 points without controls). When we control (linearly) for latitude and longitude coordinates, this association weakens to about 15 points but remains large and statistically significant. For me, this plot highlights a long-standing relationship between Britain’s industrial revolution, the urban working class, and the labour/union movement. What I find interesting is that this relationship has persisted despite de-industrialization and the movement away from large-scale manufacturing industry.


> summary(lm(tory~coal,city))

Call:
lm(formula = tory ~ coal, data = city)

Residuals:
    Min      1Q  Median      3Q     Max 
-42.507 -10.494   2.242  10.781  29.074 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  42.9492     0.7459   57.58   <2e-16 ***
coal        -24.9704     1.8887  -13.22   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.36 on 630 degrees of freedom
Multiple R-squared:  0.2172,	Adjusted R-squared:  0.216 
F-statistic: 174.8 on 1 and 630 DF,  p-value: < 2.2e-16

# robust to lat-long?
> summary(lm(tory~coal+longitude+latitude,city))

Call:
lm(formula = tory ~ coal + longitude + latitude, data = city)

Residuals:
    Min      1Q  Median      3Q     Max 
-44.495  -8.269   1.485   9.316  28.911 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 246.4355    18.9430  13.009  < 2e-16 ***
coal        -15.1616     1.8697  -8.109 2.68e-15 ***
longitude     1.4023     0.4015   3.493 0.000512 ***
latitude     -3.8621     0.3651 -10.578  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.76 on 628 degrees of freedom
Multiple R-squared:  0.3838,	Adjusted R-squared:  0.3809 
F-statistic: 130.4 on 3 and 628 DF,  p-value: < 2.2e-16

Brazil’s Host Advantage


If history can tell us anything about the World Cup, it’s that the host nation has an advantage over all other teams. Evidence of this was presented last night as the referee in the Brazil-Croatia match unjustly ruled in Brazil’s favour on several occasions. But what is the statistical evidence of a host advantage?

To look at this, I downloaded these data from the Guardian’s website. With these, I ran a very simple probit model that regressed an indicator for winning the World Cup on whether the country was the host and on whether the country was not the host but located in the same continent (I merge North and South America for this exercise). Obviously, this is quite a basic analysis, so I hope to build on these data as the tournament progresses, maybe add the 2010 data, and look at more sophisticated models.

> probitmfx(formula=winners ~ continent + hosts, data=wc)
Call:
probitmfx(formula = winners ~ continent + hosts, data = wc)

Marginal Effects:
             dF/dx Std. Err.      z   P>|z|   
continent 0.064425  0.027018 2.3845 0.01710 * 
hosts     0.315378  0.121175 2.6027 0.00925 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

dF/dx is for discrete change for the following variables:

[1] "continent" "hosts"   

The results are as we would expect. I am using the excellent mfx package to interpret the probit coefficients. Being the host nation increases the probability of being victorious by nearly 32 percentage points. So, going by historical trends, Brazil have a huge advantage for this World Cup. If we look at countries in the same continent (think Argentina for this World Cup) we see that there is a small advantage here, just over 6 percentage points.
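For readers wondering what a “discrete change” marginal effect for a dummy variable means, here is a minimal base-R sketch. It uses simulated data and glm() rather than the World Cup data and probitmfx(), and the variable names (host, win) and coefficient values are made up for illustration. The effect is simply the difference in predicted probabilities when the dummy switches from 0 to 1.

```r
# Sketch only: discrete-change marginal effect of a dummy in a probit model.
set.seed(42)
n    <- 2000
host <- rbinom(n, 1, 0.1)                        # hypothetical host dummy
win  <- rbinom(n, 1, pnorm(-1.5 + 1.0 * host))   # simulated binary outcome

fit <- glm(win ~ host, family = binomial(link = "probit"))
b   <- coef(fit)

# P(win | host = 1) - P(win | host = 0), evaluated at the fitted coefficients
me_host <- pnorm(b[1] + b[2]) - pnorm(b[1])
unname(me_host)
```

With a single dummy regressor this equals the raw difference in win rates between the two groups; with more covariates the result depends on where the other variables are evaluated.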

Whether these results are robust to additional control variables and to the inclusion of fixed effects alongside heterogeneous time-varying effects is something I hope to probe.

The ivlewbel Package. A new way to Tackle Endogenous Regressor Models.

In April 2012, I wrote this blog post demonstrating an approach proposed in Lewbel (2012) that identifies endogenous regressor coefficients in a linear triangular system. Now I am happy to announce the release of the ivlewbel package, which contains a function through which Lewbel’s method can be applied in R. This package is now available to download from CRAN.

Please see the example from the previous blog post replicated below. Additionally, it would be very helpful if people could comment on bugs and additional features they would like to see added to the package. My contact details are in the about section of the blog.



library(ivlewbel)  # lewbel()
library(sm)        # sm.density.compare()

beta1 <- beta2 <- NULL
for(k in 1:500){
  #generate data (including intercept)
  x1 <- rnorm(1000,0,1)
  x2 <- rnorm(1000,0,1)
  u <- rnorm(1000,0,1)
  s1 <- rnorm(1000,0,1)
  s2 <- rnorm(1000,0,1)
  ov <- rnorm(1000,0,1)
  e1 <- u + exp(x1)*s1 + exp(x2)*s1
  e2 <- u + exp(-x1)*s2 + exp(-x2)*s2
  y1 <- 1 + x1 + x2 + ov + e2 
  y2 <- 1 + x1 + x2 + y1 + 2*ov + e1
  x3 <- rep(1,1000)
  dat <- data.frame(y1,y2,x3,x1,x2)
  #record ols estimate
  beta1 <- c(beta1,coef(lm(y2~x1+x2+y1))[4])
  #record lewbel iv-gmm estimate
  beta2 <- c(beta2,lewbel(formula = y2 ~ y1 | x1 + x2 | x1 + x2, data = dat)$coef.est[1,1])
}

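# stringsAsFactors = TRUE so that V2 is a factor with levels for the legend
d <- data.frame(rbind(cbind(beta1,"OLS"),cbind(beta2,"IV-GMM")), stringsAsFactors = TRUE)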
d$beta1 <- as.numeric(as.character(d$beta1))
sm.density.compare(d$beta1, d$V2,xlab=("Endogenous Coefficient"))
title("Lewbel and OLS Estimates")
legend("topright", levels(d$V2),lty=c(1,2,3),col=c(2,3,4),bty="n")

IV Estimates via GMM with Clustering in R

In econometrics, generalized method of moments (GMM) is one estimation methodology that can be used to calculate instrumental variable (IV) estimates. Performing this calculation in R, for a linear IV model, is trivial. One simply uses the gmm() function in the excellent gmm package like an lm() or ivreg() function. The gmm() function will estimate the regression and return model coefficients and their standard errors. An interesting feature of this function, and GMM estimators in general, is that they contain a test of over-identification, often dubbed Hansen’s J-test, as an inherent feature. Therefore, in cases where the researcher is lucky enough to have more instruments than endogenous regressors, they should examine this over-identification test post-estimation.

While the gmm() function in R is very flexible, it does not (yet) allow the user to estimate a GMM model that produces standard errors and an over-identification test that are corrected for clustering. Thankfully, the gmm() function is flexible enough to allow for a simple hack that works around this small shortcoming. For this, I have created a function called gmmcl(), and you can find the code below. This is a function for a basic linear IV model. It uses the gmm() function to estimate both steps in a two-step feasible GMM procedure. The key to allowing for clustering is to adjust the weights matrix after the first step, so that the second step uses a weighting matrix built from moments summed within clusters. Interested readers can find more technical details regarding this approach here. After defining the function, I show a simple application.

gmmcl = function(formula1, formula2, data, cluster){
  library(plyr) ; library(gmm)
  # create data.frame
  data$id1 = 1:dim(data)[1]
  formula3 = paste(as.character(formula1)[3],"id1", sep=" + ")
  formula4 = paste(as.character(formula1)[2], formula3, sep=" ~ ")
  formula4 = as.formula(formula4)
  formula5 = paste(as.character(formula2)[2],"id1", sep=" + ")
  formula6 = paste(" ~ ", formula5, sep=" ")
  formula6 = as.formula(formula6)
  frame1 = model.frame(formula4, data)
  frame2 = model.frame(formula6, data)
  dat1 = join(data, frame1, type="inner", match="first")
  dat2 = join(dat1, frame2, type="inner", match="first")
  # matrix of instruments
  Z1 = model.matrix(formula2, dat2)
  # step 1
  gmm1 = gmm(formula1, formula2, data = dat2, 
             vcov="TrueFixed", weightsMatrix = diag(dim(Z1)[2]))
  # clustering weight matrix
  cluster = factor(dat2[,cluster])
  u = residuals(gmm1)
  estfun = sweep(Z1, MARGIN=1, u,'*')
  u = apply(estfun, 2, function(x) tapply(x, cluster, sum))  
  S = 1/(length(residuals(gmm1)))*crossprod(u)
  # step 2
  gmm2 = gmm(formula1, formula2, data=dat2, 
             vcov="TrueFixed", weightsMatrix = solve(S))
  return(gmm2)
}

# generate data.frame
n = 100
z1 = rnorm(n)
z2 = rnorm(n)
x1 = z1 + z2 + rnorm(n)
y1 = x1 + rnorm(n)
id = 1:n

data = data.frame(z1 = c(z1, z1), z2 = c(z2, z2), x1 = c(x1, x1),
                  y1 = c(y1, y1), id = c(id, id))

summary(gmmcl(y1 ~ x1, ~ z1 + z2, data = data, cluster = "id"))

Within Group Index in R

There are many occasions in my research when I want to create a within group index for a data frame. For example, with demographic data for siblings one might want to create a birth order index.

The code below illustrates a simple example of how one can create such an index in R.

# two families/groups 1 and 2
# with random ages
data = data.frame(group = c(rep(1,5),rep(2,5)), age = rpois(10,10))

# birth order
# use rank function with negative age for descending order
# (relies on the rows already being sorted by group)
data$bo = unlist(by(data, data$group, 
                    function(x) rank(-x$age, ties.method = "first")))
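As an alternative sketch, base R's ave() applies a function within groups and returns the result in the original row order, so it does not depend on the data being sorted by group. The small data frame here is made up for illustration.

```r
# Hypothetical data with the two groups interleaved (not sorted by group)
data <- data.frame(group = c(2, 1, 2, 1, 1),
                   age   = c(10, 12, 7, 12, 9))

# rank negative age within each group: the oldest gets 1, ties broken by row order
data$bo <- ave(-data$age, data$group,
               FUN = function(z) rank(z, ties.method = "first"))
data
```

Here the two 12-year-olds in group 1 receive birth orders 1 and 2 in the order they appear, and the 9-year-old receives 3.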