How Predictable is the English Premier League?


The reason why football is so exciting is uncertainty. The outcome of any match or league is unknown, and you get to watch the action unfold without knowing what’s going to happen. Watching matches where you know the score is never exciting.

This weekend the English Premier League season will conclude with little fanfare. Bar one relegation place, the league positions have already been determined. In fact, these positions were, for the most part, decided weeks ago. The element of uncertainty seems to have been reduced this season.

With this in mind, I wanted to look at uncertainty over the long run in English football. To do this used the data provided by and analyzed these with R. These data consist of 34,740 matches played in the top 5 divisions of English football between 2000 and 2015, containing information about both the result and the odds offered by bookies on this result.

To measure the uncertainty of any given match I used the following strategy. First, I averaged across all bookies’ odds for the three possible events: home win, draw, and away win. Next I mapped these aggregated odds into probabilities by inverting each of the odds and then dividing by the summed inverted odds. This takes care of the over round that helps bookies to make a profit. For example, if the odds were 2.1/1 that an event happens and 2.1/1 that it doesn’t then the probability of the event occurring is:

(1/2.1)/ (1/2.1 + (1/2.1)) = 0.4761905/(0.4761905+0.4761905) = 0.5.

Finally, to measure the uncertainty of each match, I subtract the probability that the event occurred from 1, to calculate a “residual” score. Imagine a home win occurs. The “residual” in this case will be 1-P(home win). If P(home win)=1, then there is no uncertainty, and this uncertainty score will be zero. Since there are 3 outcomes, we would expect an uncertainty measure to be bounded between 0 (no uncertainty) and 0.67 (pure uncertainty) where we get 1 out of 3right by just guessing.

After importing these data into R and calculating the uncertainty measure, I looked at this uncertainty measure over time. The plot in the above shows fitted smoothed trend lines of uncertainty, stratified by division. These trends are striking. Going by this graph, the Premier League has gotten more predictable over the analysis period. In 2000, the uncertainty measure was around 0.605. Given that we expect this measure to be bound between 0 (complete certainty) and 0.67 (completely random), this tell us that the average league game was very unpredictable. Over time, however, this measure has decreased by about 5%, which does not seem like much. Despite, the somewhat unexciting end to the 2014/15 season, the outcome of the average game is still not very predictable.

Noticeably, in lower league games there is even greater uncertainty. In fact, the average uncertainty measure of League 2 games approached a value of 0.65 in 2014. This indicates that the average League 2 game is about as unpredictable as playing rock-paper-scissors. Interestingly, and unlike the Premier League, there does not appear to be any discernible change over time. The games are just as unpredictable now as they were in 2000. Please see my R code below.

# clear

# libraries

# what are urls

years = c(rep("0001",4), rep("0102",4), rep("0203",4), rep("0405",4),
          rep("0506",5), rep("0607",5), rep("0708",5), rep("0809",5),
          rep("0910",5), rep("1011",5), rep("1112",5), rep("1213",5),
          rep("1314",5), rep("1415",5))
divis = c(rep(c("E0","E1","E2","E3"),4), rep(c("E0","E1","E2","E3","EC"),10))

urls = paste(years, divis, sep="/")
urls = paste("", urls, sep="/")

odds = c("B365H","B365D","B365A",
home = odds[seq(1,length(odds),3)]
draw = odds[seq(2,length(odds),3)]
away = odds[seq(3,length(odds),3)]

# load all data in a loop = NULL
for(i in 1:length(urls)){
  temp = read.csv(urls[i])
  # calculate average odds
  temp$homeodds = apply(temp[,names(temp) %in% home], 1, function(x) mean(x,na.rm=T))
  temp$drawodds = apply(temp[,names(temp) %in% draw], 1, function(x) mean(x,na.rm=T))
  temp$awayodds = apply(temp[,names(temp) %in% away], 1, function(x) mean(x,na.rm=T))
  temp = temp[,c("Div","Date","FTHG","FTAG","FTR","homeodds","drawodds","awayodds")] = rbind(, temp)
}$homewin = ifelse($FTR=="H", 1, 0)$draw = ifelse($FTR=="D", 1, 0)$awaywin = ifelse($FTR=="A", 1, 0)

# convert to probs with overrind$homeprob = (1/$homeodds)/(1/$homeodds+1/$drawodds+1/$awayodds)$drawprob = (1/$drawodds)/(1/$homeodds+1/$drawodds+1/$awayodds)$awayprob = (1/$awayodds)/(1/$homeodds+1/$drawodds+1/$awayodds)

# bookie residual$bookieres =$homeprob$bookieres[$FTR=="D"] =$drawprob[$FTR=="D"]$bookieres[$FTR=="A"] =$awayprob[$FTR=="A"]

# now plot over time$time = ifelse(nchar(as.character($Date))==8, 
                         as.Date($Date,format='%d/%m/%Y'))$date = as.Date($time, origin = "1970-01-01")$Division = "Premier League"$Division[$Div=="E1"] = "Championship"$Division[$Div=="E2"] = "League 1"$Division[$Div=="E3"] = "League 2"$Division[$Div=="EC"] = "Conference"$Division = factor($Division, levels = c("Premier League", "Championship", "League 1",
                                                           "League 2","Conference"))

ggplot(, aes(date, bookieres, colour=Division)) +
  stat_smooth(size = 1.25, alpha = 0.2) +
  labs(x = "Year", y = "Uncertainty") + 
  theme_bw() +
  theme(legend.position="bottom") +
        legend.title = element_text(size=20),
        legend.text = element_text(size=20))

Coal and the Conservatives

Interesting election results in the UK over the weekend, where the Conservatives romped to victory. This was despite a widespread consensus that neither the Conservative or Labour party would get a majority. This was a triumph for uncertainty and random error over the deterministic, as none of the statistical forecasts appeared to deem such a decisive victory probable. The UK election is a lot harder to model, for numerous reasons, when compared to the US.

This means that a lot of pollsters and political forecasters will have to go back to the drawing board and re-evaluate their methods. Obviously, the models used to forecast the 2015 election could not handle the dynamics of the British electorate. However, there is a high degree of persistence within electuary constituencies. Let’s explore this persistence by looking at the relationship between coal and % Conservative (Tory) votes.

Following a tweet by Vaughan Roderick and using the methodology of Fernihough and O’Rourke (2014), I matched each of the constituencies to Britain’s coalfields creating a “proximity to coal” measure. What the plot below shows is striking. Being located on or in close proximity to a coal field reduces the tory vote share by about 20%. When we control (linearly) for latitude and longitude coordinates, this association decreases in strength, but not by much. For me, this plot highlights a long-standing relationship between Britain’s industrial revolution, the urban working class, and labour/union movement. What I find interesting is that this relationship has persisted despite de-industrialization and the movement away from large-scale manufacturing industry.


> summary(lm(tory~coal,city))

lm(formula = tory ~ coal, data = city)

    Min      1Q  Median      3Q     Max 
-42.507 -10.494   2.242  10.781  29.074 

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  42.9492     0.7459   57.58   <2e-16 ***
coal        -24.9704     1.8887  -13.22   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.36 on 630 degrees of freedom
Multiple R-squared:  0.2172,	Adjusted R-squared:  0.216 
F-statistic: 174.8 on 1 and 630 DF,  p-value: < 2.2e-16

# robust to lat-long?
> summary(lm(tory~coal+longitude+latitude,city))

lm(formula = tory ~ coal + longitude + latitude, data = city)

    Min      1Q  Median      3Q     Max 
-44.495  -8.269   1.485   9.316  28.911 

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 246.4355    18.9430  13.009  < 2e-16 ***
coal        -15.1616     1.8697  -8.109 2.68e-15 ***
longitude     1.4023     0.4015   3.493 0.000512 ***
latitude     -3.8621     0.3651 -10.578  < 2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.76 on 628 degrees of freedom
Multiple R-squared:  0.3838,	Adjusted R-squared:  0.3809 
F-statistic: 130.4 on 3 and 628 DF,  p-value: < 2.2e-16

Linear Models with Multiple Fixed Effects

Estimating a least squares linear regression model with fixed effects is a common task in applied econometrics, especially with panel data. For example, one might have a panel of countries and want to control for fixed country factors. In this case the researcher will effectively include this fixed identifier as a factor variable, and then proceed to estimate the model that includes as many dummy variables (minus one if an intercept is included in the modelling equation) as there are countries. Obviously, this approach is computationally problematic when there are many fixed factors. In our simple example, an extra country will add an extra column to the X matrix used in the least squares calculation.

Fortunately, there are a number of data transformations that can be used in this panel setting. These include demeaning each within unit observation, using first differences, or including the group means as additional explanatory variables (as suggested by (Mundlak 1978)). However, these approaches only work well when there is one factor that the researcher wants to include fixed effects to account for.

Simen Gaure offers a solution this problem that allows for multiple fixed effects without resorting to a computationally burdensome methodology. Essentially the solution involves an elaboration of the group demeaning transformation mentioned in the above. More technical details can be found here or by referring to Gaure’s forthcoming article in Computational Statistics & Data Analysis. Those interested in implementing this estimation strategy in R can use the lfe package available on CRAN.

In the below, I have included a simple example of how the package works. In this example, the model needs to be set up to calculate fixed effects for two factor variables. Obviously, adding 2,000 columns to the data frame is not a convenient way to estimate the model that includes fixed effects for both the x2 and x3 variables. However, the felm function tackles this problem with ease. Stata has a similar function to feml, areg, although the areg function only allows for absorbed fixed effects in one variable.

# clear workspace
# load lfe package

# create data frame
x1 <- rnorm(10000)
x2 <- rep(1:1000,10)
x3 <- rep(1:1000,10)
e1 <- sin(x2) + 0.02*x3^2 + rnorm(10000)
y <- 10 + 2.5*x1 + (e1-mean(e1))
dat <- data.frame(x1,x2,x3,y)

# simple lm
# lm with fixed effects
felm(dat$y ~ dat$x1 + G(dat$x2) + G(dat$x3))

# output
# simple lm
> lm(y~x1)
Call: lm(formula = y ~ x1)
  (Intercept)           x1  
        10.47       -36.95  
> # lm with fixed effects
> felm(dat$y ~ dat$x1 + G(dat$x2) + G(dat$x3))

Dummies for Dummies

Most R functions used in econometrics convert factor variables into a set of dummy/binary variables automatically. This is useful when estimating a linear model, saving the user from the laborious activity of manually including the dummy variables as regressors. However, what if you want to reshape your dataframe so that it contains such dummy variables?

The following function, datdum(.), is a simple workaround. The first argument is the factor variable (which can also be a character), the second is the dataframe and the third is the name you want to call these dummy variables.

datdum <- function(x, data, name){
  data$rv <- rnorm(dim(data)[1],1,1)
  mm <- data.frame(model.matrix(lm(data$rv~-1+factor(data[,x]))))
  names(mm) <- paste(name,1:dim(mm)[2],sep=".")
  data$rv <- NULL
  data <- cbind(data,mm)

# simple example
dat <- c("A","B","C")
dat <- data.frame(dat)
# output
> dat
1   A
2   B
3   C
> datdum(x="dat",data=dat,name="category")
  dat category.1 category.2 category.3
1   A          1          0          0
2   B          0          1          0
3   C          0          0          1

Instrumental Variables without Traditional Instruments

Typically, regression models in empirical economic research suffer from at least one form of endogeneity bias.

The classic example is economic returns to schooling, where researchers want to know how much increased levels of education affect income. Estimation using a simple linear model, regressing income on schooling, alongside a bunch of control variables, will typically not yield education’s true effect on income. The problem here is one of omitted variables – notably unobserved ability. People who are more educated may be more motivated or have other unobserved characteristics which simultaneously affect schooling and future lifetime earnings.

Endogeneity bias plagues empirical research. However, there are solutions, the most common being instrumental variables (IVs). Unfortunately, the exclusion restrictions needed to justify the use of traditional IV methodology may be impossible to find.

So, what if you have an interesting research question, some data, but endogeneity with no IVs. You should give up, right? Wrong. According to Lewbel (forthcoming in Journal of Business and Economic Statistics), it is possible to overcome the endogeneity problem without the use of a traditional IV approach.

Lewbel’s paper demonstrates how higher order moment restrictions can be used to tackle endogeneity in triangular systems. Without going into too much detail (interested readers can consult Lewbel’s paper), this method is like the traditional two-stage instrumental variable approach, except the first-stage exclusion restriction is generated by the control, or exogenous, variables which we know are heteroskedastic (interested practitioners can test for this in the usual way, i.e. a White test).

In the code below, I demonstrate how one could employ this approach in R using the GMM framework outlined by Lewbel. My code only relates to a simple example with one endogenous variable and two exogenous variables. However, it would be easy to modify this code depending on the model.

# gmm function for 1 endog variable with 2 hetero exogenous variable
# outcome in the first column of 'datmat', endog variable in second
# constant and exog variables in the next three
# hetero exog in the last two (i.e no constant)
g1 <- function(theta, datmat) {
  #set up data
  y1 <- matrix(datmat[,1],ncol=1)
  y2 <- matrix(datmat[,2],ncol=1)
  x1 <- matrix(datmat[,3:5],ncol=3)
  z1 <- matrix(datmat[,4:5],ncol=2)
  # if the variable in the 4th col was not hetero
  # this could be modified so:
  # z1 <- matrix(datmat[,5],ncol=1)

  #set up moment conditions
  in1 <- (y1 -theta[1]*x1[,1]-theta[2]*x1[,2]-theta[3]*x1[,3])
  M <- NULL
  for(i in 1:dim(z1)[2]){
    M <- cbind(M,(z1[,i]-mean(z1[,i])))
  in2 <- (y2 -theta[4]*x1[,1]-theta[5]*x1[,2]-theta[6]*x1[,3]-theta[7]*y1)
  for(i in 1:dim(x1)[2]){M <- cbind(M,in1*x1[,i])}
  for(i in 1:dim(x1)[2]){M <- cbind(M,in2*x1[,i])}
  for(i in 1:dim(z1)[2]){M <- cbind(M,in2*((z1[,i]-mean(z1[,i]))*in1))}
# so estimation is easy
# gmm(function(...), data matrix, initial values vector)
# e.g : gmm(g1, x =as.matrix(dat),c(1,1,1,1,1,1,1))

I also tested the performance of Lewbel’s GMM estimator in comparison a mis-specified OLS estimator. In the code below, I perform 500 simulations of a triangular system containing an omitted variable. For the GMM estimator, it is useful to have good initial starting values. In this simple example, I use the OLS coefficients. In more complicated settings, it is advisable to use the estimates from the 2SLS procedure outlined in Lewbel’s paper. The distributions of the coefficient estimates are shown in the plot below. The true value, indicated by the vertical line, is one. It is pretty evident that the Lewbel approach works very well. I think this method could be very useful in a number of research disciplines.

beta1 <- beta2 <- NULL
for(k in 1:500){
  #generate data (including intercept)
  x1 <- rnorm(1000,0,1)
  x2 <- rnorm(1000,0,1)
  u <- rnorm(1000,0,1)
  s1 <- rnorm(1000,0,1)
  s2 <- rnorm(1000,0,1)
  ov <- rnorm(1000,0,1)
  e1 <- u + exp(x1)*s1 + exp(x2)*s1
  e2 <- u + exp(-x1)*s2 + exp(-x2)*s2
  y1 <- 1 + x1 + x2 + ov + e2 
  y2 <- 1 + x1 + x2 + y1 + 2*ov + e1
  x3 <- rep(1,1000)
  dat <- cbind(y1,y2,x3,x1,x2)
  #record ols estimate
  beta1 <- c(beta1,coef(lm(y2~x1+x2+y1))[4])
  #init values for iv-gmm
  init <- c(coef(lm(y2~x1+x2+y1)),coef(lm(y1~x1+x2)))
  #record gmm estimate
  beta2 <- c(beta2,coef(gmm(g1, x =as.matrix(dat),init))[7])

d <- data.frame(rbind(cbind(beta1,"OLS"),cbind(beta2,"IV-GMM")))
d$beta1 <- as.numeric(as.character(d$beta1))$beta1, d$V2,xlab=("Endogenous Coefficient"))
 title("Lewbel and OLS Estimates")
 legend("topright", levels(d$V2),lty=c(1,2,3),col=c(2,3,4),bty="n")