How Predictable is the English Premier League?


The reason why football is so exciting is uncertainty. The outcome of any match or league is unknown, and you get to watch the action unfold without knowing what’s going to happen. Watching matches where you know the score is never exciting.

This weekend the English Premier League season will conclude with little fanfare. Bar one relegation place, the league positions have already been determined. In fact, these positions were, for the most part, decided weeks ago. The element of uncertainty seems to have been reduced this season.

With this in mind, I wanted to look at uncertainty over the long run in English football. To do this used the data provided by and analyzed these with R. These data consist of 34,740 matches played in the top 5 divisions of English football between 2000 and 2015, containing information about both the result and the odds offered by bookies on this result.

To measure the uncertainty of any given match I used the following strategy. First, I averaged across all bookies’ odds for the three possible events: home win, draw, and away win. Next I mapped these aggregated odds into probabilities by inverting each of the odds and then dividing by the summed inverted odds. This takes care of the over round that helps bookies to make a profit. For example, if the odds were 2.1/1 that an event happens and 2.1/1 that it doesn’t then the probability of the event occurring is:

(1/2.1)/ (1/2.1 + (1/2.1)) = 0.4761905/(0.4761905+0.4761905) = 0.5.

Finally, to measure the uncertainty of each match, I subtract the probability that the event occurred from 1, to calculate a “residual” score. Imagine a home win occurs. The “residual” in this case will be 1-P(home win). If P(home win)=1, then there is no uncertainty, and this uncertainty score will be zero. Since there are 3 outcomes, we would expect an uncertainty measure to be bounded between 0 (no uncertainty) and 0.67 (pure uncertainty) where we get 1 out of 3right by just guessing.

After importing these data into R and calculating the uncertainty measure, I looked at this uncertainty measure over time. The plot in the above shows fitted smoothed trend lines of uncertainty, stratified by division. These trends are striking. Going by this graph, the Premier League has gotten more predictable over the analysis period. In 2000, the uncertainty measure was around 0.605. Given that we expect this measure to be bound between 0 (complete certainty) and 0.67 (completely random), this tell us that the average league game was very unpredictable. Over time, however, this measure has decreased by about 5%, which does not seem like much. Despite, the somewhat unexciting end to the 2014/15 season, the outcome of the average game is still not very predictable.

Noticeably, in lower league games there is even greater uncertainty. In fact, the average uncertainty measure of League 2 games approached a value of 0.65 in 2014. This indicates that the average League 2 game is about as unpredictable as playing rock-paper-scissors. Interestingly, and unlike the Premier League, there does not appear to be any discernible change over time. The games are just as unpredictable now as they were in 2000. Please see my R code below.

# clear

# libraries

# what are urls

years = c(rep("0001",4), rep("0102",4), rep("0203",4), rep("0405",4),
          rep("0506",5), rep("0607",5), rep("0708",5), rep("0809",5),
          rep("0910",5), rep("1011",5), rep("1112",5), rep("1213",5),
          rep("1314",5), rep("1415",5))
divis = c(rep(c("E0","E1","E2","E3"),4), rep(c("E0","E1","E2","E3","EC"),10))

urls = paste(years, divis, sep="/")
urls = paste("", urls, sep="/")

odds = c("B365H","B365D","B365A",
home = odds[seq(1,length(odds),3)]
draw = odds[seq(2,length(odds),3)]
away = odds[seq(3,length(odds),3)]

# load all data in a loop = NULL
for(i in 1:length(urls)){
  temp = read.csv(urls[i])
  # calculate average odds
  temp$homeodds = apply(temp[,names(temp) %in% home], 1, function(x) mean(x,na.rm=T))
  temp$drawodds = apply(temp[,names(temp) %in% draw], 1, function(x) mean(x,na.rm=T))
  temp$awayodds = apply(temp[,names(temp) %in% away], 1, function(x) mean(x,na.rm=T))
  temp = temp[,c("Div","Date","FTHG","FTAG","FTR","homeodds","drawodds","awayodds")] = rbind(, temp)
}$homewin = ifelse($FTR=="H", 1, 0)$draw = ifelse($FTR=="D", 1, 0)$awaywin = ifelse($FTR=="A", 1, 0)

# convert to probs with overrind$homeprob = (1/$homeodds)/(1/$homeodds+1/$drawodds+1/$awayodds)$drawprob = (1/$drawodds)/(1/$homeodds+1/$drawodds+1/$awayodds)$awayprob = (1/$awayodds)/(1/$homeodds+1/$drawodds+1/$awayodds)

# bookie residual$bookieres =$homeprob$bookieres[$FTR=="D"] =$drawprob[$FTR=="D"]$bookieres[$FTR=="A"] =$awayprob[$FTR=="A"]

# now plot over time$time = ifelse(nchar(as.character($Date))==8, 
                         as.Date($Date,format='%d/%m/%Y'))$date = as.Date($time, origin = "1970-01-01")$Division = "Premier League"$Division[$Div=="E1"] = "Championship"$Division[$Div=="E2"] = "League 1"$Division[$Div=="E3"] = "League 2"$Division[$Div=="EC"] = "Conference"$Division = factor($Division, levels = c("Premier League", "Championship", "League 1",
                                                           "League 2","Conference"))

ggplot(, aes(date, bookieres, colour=Division)) +
  stat_smooth(size = 1.25, alpha = 0.2) +
  labs(x = "Year", y = "Uncertainty") + 
  theme_bw() +
  theme(legend.position="bottom") +
        legend.title = element_text(size=20),
        legend.text = element_text(size=20))

Within Group Index in R

There are many occasions in my research when I want to create a within group index for a data frame. For example, with demographic data for siblings one might want to create a birth order index.

The below illustrates a simple example of how one can create such an index in R.

# two families/groups 1 and 2
# with random ages
data = data.frame(group = c(rep(1,5),rep(2,5)), age = rpois(10,10))

# birth order
# use rank function with negative age for descending order
data$bo = unlist(by(data, data$group, 
                    function(x) rank(-x$age, ties.method = "first")))

Combining ggplot Images

The ggplot2 package provides an excellent platform for data visualization. One (minor) drawback of this package is that combining ggplot images into one plot, like the par() function does for regular plots, is not a straightforward procedure. Fortunately, R user Stephen Turner has kindly provided a function called “arrange” that does exactly this. The function, taken from his blog, and an example of how it can be used is provided below.

vp.layout <- function(x, y) viewport(layout.pos.row=x, layout.pos.col=y)
arrange <- function(..., nrow=NULL, ncol=NULL, as.table=FALSE) {
  dots <- list(...)
  n <- length(dots)
  if(is.null(nrow) & is.null(ncol)) { nrow = floor(n/2) ; ncol = ceiling(n/nrow)}
  if(is.null(nrow)) { nrow = ceiling(n/ncol)}
  if(is.null(ncol)) { ncol = ceiling(n/nrow)}
  ## NOTE see n2mfrow in grDevices for possible alternative
  pushViewport(viewport(layout=grid.layout(nrow,ncol) ) )
  ii.p <- 1
  for(ii.row in seq(1, nrow)){
    ii.table.row <- ii.row
    if(as.table) {ii.table.row <- nrow - ii.table.row + 1}
    for(ii.col in seq(1, ncol)){
      ii.table <- ii.p
      if(ii.p > n) break
      print(dots[[ii.table]], vp=vp.layout(ii.table.row, ii.col))
      ii.p <- ii.p + 1
library(ggplot2) ; library(grid)

p1 <- qplot(wt, mpg, data=mtcars) 
p2 <- ggplot(diamonds, aes(price, colour = cut)) + 


Let’s Party!

Exploring whether regression coefficients differ between groups is an important part of applied econometric research, and particularly for research with a policy based objective.

For example, a government in a developing country may decide to introduce free school lunches in an effort to improve childhood health. However, if this treatment is known to only improve the health of boys from the lowest socioeconomic strata, it makes sense that this group should be targeted to receive the treatment, while the additional public resources, which would have been unnecessarily used on the other groups, could be efficiently allocated elsewhere.

There are two conventional approaches to estimating these potentially differing effects. The first involves manually partitioning one’s data and performing separate analysis (i.e. one regression for boys the other for girls). The second involves including interaction terms in the regression model. The inclusion of interaction terms allows for different groups to have different slopes.

Problems with these aforementioned strategies arise when researchers would like to stratify the analysis across many groups. Splitting the analysis into different groups can be both a confusing (triple interaction terms anyone?) and inefficient way to conduct research. Furthermore, the results of stratification across a large number of groups can be somewhat difficult to present in a research paper (think of a table with one hundred result columns).

Thankfully, the party package on Cran offers a neat solution to the above concerns, as the functions in this package offer procedures for model based stratification. Following a model-based approach has the obvious advantage that it avoids unnecessary splitting of data, and can therefore be seen as a more efficient way of analyzing group differences.

The model based approach takes the regression model of interest and partitions the results into groups based on parameter instabilities indicated by structural break tests. More info on such tests is given in Zeileis (2005).

In the below, I provide a simple example of the party package at work. Obviously, I encourage interested users to read both the package vignettes, and associated literature before performing more complicated analysis on real data. Let there be three groups (z). In group 0, the effect of x on y is -0.5, in groups 1 and 2 this effect is +0.5. Based on the below plot, we can see that the model-based recursive partitioning approach both predicts the splits, and also the correct parameter estimates.


# set up simulated data
z <- sample(c(0,1,2),2000,replace=T)
z1 <- ifelse(z==1,1,0)
z2 <- ifelse(z==2,1,0)
x <- rnorm(2000,0,1)
y <- 1 + 2*z1 + 2*z2 - 0.5*x + x*z1 + x*z2 + rnorm(2000,0,1) 

# model based partitioning of regression of y~x 
# over groups indicated by z
mod1 <- mob(y ~ x | factor(z))
# nice plot of results


Dummies for Dummies

Most R functions used in econometrics convert factor variables into a set of dummy/binary variables automatically. This is useful when estimating a linear model, saving the user from the laborious activity of manually including the dummy variables as regressors. However, what if you want to reshape your dataframe so that it contains such dummy variables?

The following function, datdum(.), is a simple workaround. The first argument is the factor variable (which can also be a character), the second is the dataframe and the third is the name you want to call these dummy variables.

datdum <- function(x, data, name){
  data$rv <- rnorm(dim(data)[1],1,1)
  mm <- data.frame(model.matrix(lm(data$rv~-1+factor(data[,x]))))
  names(mm) <- paste(name,1:dim(mm)[2],sep=".")
  data$rv <- NULL
  data <- cbind(data,mm)

# simple example
dat <- c("A","B","C")
dat <- data.frame(dat)
# output
> dat
1   A
2   B
3   C
> datdum(x="dat",data=dat,name="category")
  dat category.1 category.2 category.3
1   A          1          0          0
2   B          0          1          0
3   C          0          0          1