How Predictable is the English Premier League?


The reason why football is so exciting is uncertainty. The outcome of any match or league is unknown, and you get to watch the action unfold without knowing what’s going to happen. Watching matches where you know the score is never exciting.

This weekend the English Premier League season will conclude with little fanfare. Bar one relegation place, the league positions have already been determined. In fact, these positions were, for the most part, decided weeks ago. The element of uncertainty seems to have been reduced this season.

With this in mind, I wanted to look at uncertainty over the long run in English football. To do this used the data provided by and analyzed these with R. These data consist of 34,740 matches played in the top 5 divisions of English football between 2000 and 2015, containing information about both the result and the odds offered by bookies on this result.

To measure the uncertainty of any given match I used the following strategy. First, I averaged across all bookies’ odds for the three possible events: home win, draw, and away win. Next I mapped these aggregated odds into probabilities by inverting each of the odds and then dividing by the summed inverted odds. This takes care of the over round that helps bookies to make a profit. For example, if the odds were 2.1/1 that an event happens and 2.1/1 that it doesn’t then the probability of the event occurring is:

(1/2.1)/ (1/2.1 + (1/2.1)) = 0.4761905/(0.4761905+0.4761905) = 0.5.

Finally, to measure the uncertainty of each match, I subtract the probability that the event occurred from 1, to calculate a “residual” score. Imagine a home win occurs. The “residual” in this case will be 1-P(home win). If P(home win)=1, then there is no uncertainty, and this uncertainty score will be zero. Since there are 3 outcomes, we would expect an uncertainty measure to be bounded between 0 (no uncertainty) and 0.67 (pure uncertainty) where we get 1 out of 3right by just guessing.

After importing these data into R and calculating the uncertainty measure, I looked at this uncertainty measure over time. The plot in the above shows fitted smoothed trend lines of uncertainty, stratified by division. These trends are striking. Going by this graph, the Premier League has gotten more predictable over the analysis period. In 2000, the uncertainty measure was around 0.605. Given that we expect this measure to be bound between 0 (complete certainty) and 0.67 (completely random), this tell us that the average league game was very unpredictable. Over time, however, this measure has decreased by about 5%, which does not seem like much. Despite, the somewhat unexciting end to the 2014/15 season, the outcome of the average game is still not very predictable.

Noticeably, in lower league games there is even greater uncertainty. In fact, the average uncertainty measure of League 2 games approached a value of 0.65 in 2014. This indicates that the average League 2 game is about as unpredictable as playing rock-paper-scissors. Interestingly, and unlike the Premier League, there does not appear to be any discernible change over time. The games are just as unpredictable now as they were in 2000. Please see my R code below.

# clear

# libraries

# what are urls

years = c(rep("0001",4), rep("0102",4), rep("0203",4), rep("0405",4),
          rep("0506",5), rep("0607",5), rep("0708",5), rep("0809",5),
          rep("0910",5), rep("1011",5), rep("1112",5), rep("1213",5),
          rep("1314",5), rep("1415",5))
divis = c(rep(c("E0","E1","E2","E3"),4), rep(c("E0","E1","E2","E3","EC"),10))

urls = paste(years, divis, sep="/")
urls = paste("", urls, sep="/")

odds = c("B365H","B365D","B365A",
home = odds[seq(1,length(odds),3)]
draw = odds[seq(2,length(odds),3)]
away = odds[seq(3,length(odds),3)]

# load all data in a loop = NULL
for(i in 1:length(urls)){
  temp = read.csv(urls[i])
  # calculate average odds
  temp$homeodds = apply(temp[,names(temp) %in% home], 1, function(x) mean(x,na.rm=T))
  temp$drawodds = apply(temp[,names(temp) %in% draw], 1, function(x) mean(x,na.rm=T))
  temp$awayodds = apply(temp[,names(temp) %in% away], 1, function(x) mean(x,na.rm=T))
  temp = temp[,c("Div","Date","FTHG","FTAG","FTR","homeodds","drawodds","awayodds")] = rbind(, temp)
}$homewin = ifelse($FTR=="H", 1, 0)$draw = ifelse($FTR=="D", 1, 0)$awaywin = ifelse($FTR=="A", 1, 0)

# convert to probs with overrind$homeprob = (1/$homeodds)/(1/$homeodds+1/$drawodds+1/$awayodds)$drawprob = (1/$drawodds)/(1/$homeodds+1/$drawodds+1/$awayodds)$awayprob = (1/$awayodds)/(1/$homeodds+1/$drawodds+1/$awayodds)

# bookie residual$bookieres =$homeprob$bookieres[$FTR=="D"] =$drawprob[$FTR=="D"]$bookieres[$FTR=="A"] =$awayprob[$FTR=="A"]

# now plot over time$time = ifelse(nchar(as.character($Date))==8, 
                         as.Date($Date,format='%d/%m/%Y'))$date = as.Date($time, origin = "1970-01-01")$Division = "Premier League"$Division[$Div=="E1"] = "Championship"$Division[$Div=="E2"] = "League 1"$Division[$Div=="E3"] = "League 2"$Division[$Div=="EC"] = "Conference"$Division = factor($Division, levels = c("Premier League", "Championship", "League 1",
                                                           "League 2","Conference"))

ggplot(, aes(date, bookieres, colour=Division)) +
  stat_smooth(size = 1.25, alpha = 0.2) +
  labs(x = "Year", y = "Uncertainty") + 
  theme_bw() +
  theme(legend.position="bottom") +
        legend.title = element_text(size=20),
        legend.text = element_text(size=20))

How Much Should Bale Cost Real?

It looks increasingly likely that Gareth Bale will transfer from Tottenham to Real Madrid for a world record transfer fee. Negotiations are ongoing, with both parties keen to get the best deal possible deal with the transfer fee. Reports speculate that this transfer fee will be anywhere in the very wide range of £80m to £120m.

Given the topical nature of this transfer saga, I decided to explore the world record breaking transfer fee data, and see if these data can help predict what the Gareth Bale transfer fee should be. According to this Wikipedia article, there have been 41 record breaking transfers, from Willie Groves going from West Brom to Aston Villa in 1893 for £100, to Cristiano Ronaldo’s £80m 2009 transfer to Real Madrid from Manchester United.

When comparing any historical price data it is very important that we are comparing like with like. Clearly, a fee of £100 in 1893 is not the same as £100 in 2009. Therefore, the world record transfer fees need to be adjusted for inflation. To do this, I used the excellent measuringworth website, and converted all of the transfer fees into 2011 pounds sterling.


The plot above demonstrates a very strong linear relationship between logged real world record transfer fees and time. The R-squared indicates that the year of the transfer fee explains roughly 97% of the variation in price.

So, if Real Madrid are to pay a world transfer fee for Bale, how much does this model predict the fee will be? The above plot demonstrates what happens when the simple log-linear model is extrapolated to predict the world record transfer fee in 2013. The outcome here is 18.37, so around £96m, in 2011 prices. We can update this value to 2013 prices. Assuming a modest inflation rate of 2% we get £96m[exp(0.02*2)]=£99.4m. No small potatoes.


bale = read.csv("bale.csv")
# data from:

ols1 = lm(log(real2011)~year, bale)

# price
# inflate lets say 2% inflation

# nice ggplot
bale$lnprice2011 = log(bale$real2011)
addon = data.frame(year=2013,nominal=0,real2011=0,name="Bale?",

ggplot(bale, aes(x=year, y=lnprice2011, label=name)) + 
  geom_text(hjust=0.4, vjust=0.4) +
  stat_smooth(method = "lm",fullrange = TRUE, level = 0.975) +
  theme_bw(base_size = 12, base_family = "") +
  xlim(1885, 2020) + ylim(8, 20) +
  xlab("Year") + ylab("ln(Price)") +
  ggtitle("World Transfer Records, Real 2011 Prices (£)")+
  geom_point(aes(col="red"),size=4,data=addon) + 
  geom_text(aes(col="red", fontface=3),hjust=-0.1, vjust=0,size=7,data=addon) + 

Combining ggplot Images

The ggplot2 package provides an excellent platform for data visualization. One (minor) drawback of this package is that combining ggplot images into one plot, like the par() function does for regular plots, is not a straightforward procedure. Fortunately, R user Stephen Turner has kindly provided a function called “arrange” that does exactly this. The function, taken from his blog, and an example of how it can be used is provided below.

vp.layout <- function(x, y) viewport(layout.pos.row=x, layout.pos.col=y)
arrange <- function(..., nrow=NULL, ncol=NULL, as.table=FALSE) {
  dots <- list(...)
  n <- length(dots)
  if(is.null(nrow) & is.null(ncol)) { nrow = floor(n/2) ; ncol = ceiling(n/nrow)}
  if(is.null(nrow)) { nrow = ceiling(n/ncol)}
  if(is.null(ncol)) { ncol = ceiling(n/nrow)}
  ## NOTE see n2mfrow in grDevices for possible alternative
  pushViewport(viewport(layout=grid.layout(nrow,ncol) ) )
  ii.p <- 1
  for(ii.row in seq(1, nrow)){
    ii.table.row <- ii.row
    if(as.table) {ii.table.row <- nrow - ii.table.row + 1}
    for(ii.col in seq(1, ncol)){
      ii.table <- ii.p
      if(ii.p > n) break
      print(dots[[ii.table]], vp=vp.layout(ii.table.row, ii.col))
      ii.p <- ii.p + 1
library(ggplot2) ; library(grid)

p1 <- qplot(wt, mpg, data=mtcars) 
p2 <- ggplot(diamonds, aes(price, colour = cut)) + 


Visualizing Euro 2012: First Group Games

Now that every team has played a match it will be interesting to see how this has affected the (inverse) odds of victory. Since the plot in my last post was a bit ‘busy’, I have decided to use the facet_wrap function in gglplot2 to stratify by group.

Also, re-producing the ‘busy’ plot from the last post yields the following.

Germany, despite not playing well, has gained, while the Netherlands, despite playing quite well, have declined. These two countries will play each other in the next round, so it will be interesting to see how a victory for the Netherlands would change these graphics.

Data and code:

# after loading data as object called eur
n <- dim(eur)[1]
eur <- t(eur[1:n,])
dat <- NULL
for(i in 1:n){dat <- data.frame(rbind(dat,cbind(eur[-1,i],names(eur[-1,i]),i)))}

dat$V1 <- 1/as.numeric(as.character(dat$V1))
dat$V3 <- as.character(dat$V2)
dat$V3[dat$i!=n] <- c("")
dat$group <- ifelse(dat$V2 %in% c("RUS","GRE","POL","CZE"),"Group.A","Group.D")
dat$group <- ifelse(dat$V2 %in% c("GER","NED","POR","DEN"),"Group.B",dat$group)
dat$group <- ifelse(dat$V2 %in% c("IRL","CRO","ITA","ESP"),"Group.C",dat$group)
dat$i <- as.numeric(as.character(dat$i))

ggplot(dat, aes(x=i, y=V1, colour = V2, group=V2, label=V3)) + 
  geom_line(size=0.8) + geom_point(size=4, shape=21, fill="white") + #theme_bw() +
  geom_text(hjust=-0.3, vjust=0) +
  scale_x_continuous('Day',limits=c(1,(n+0.4)),breaks=1:n) +
  scale_y_continuous('1/Odds') +
  theme_bw() +
  opts(title = expression("Euro 2012, Inverse Odds of Victory"),

ggplot(dat, aes(x=i, y=V1, colour = V2, group=V2, label=V3)) + 
  geom_line(size=0.8) + geom_point(size=4, shape=21, fill="white") + #theme_bw() +
  geom_text(hjust=-0.3, vjust=0.4) +
  scale_x_continuous('Day',limits=c(1,(n+0.8)),breaks=1:n) +
  scale_y_continuous('1/Odds') +
  facet_wrap( ~ group, ncol = 2, scales="free_y") +
  theme_bw() +
  opts(title = expression("Euro 2012, Inverse Odds of Victory"),

Visualizing Euro 2012 with ggplot2

After scanning this paper by Zeileis, Leitner & Hornik, I thought it would be interesting to see how the victory odds for each team changes as Euro 2012 progresses. To do this, I am going to collect the daily inverse odds of a tournament victory offered by a popular betting site for each team.

Here is the first plot. Day one corresponds to the pretournament odds as given in the aforementioned paper for the popular betting site. These odds were obtained on the 9th of May, while day two’s odds were collected this morning.

I’ll update this in a week.