How Predictable is the English Premier League?

[Figure: smoothed uncertainty trends in English football match outcomes, by division]

Uncertainty is the reason football is so exciting. The outcome of any match or league is unknown, and you get to watch the action unfold without knowing what’s going to happen. Watching a match when you already know the score is never exciting.

This weekend the English Premier League season will conclude with little fanfare. Bar one relegation place, the league positions have already been determined. In fact, these positions were, for the most part, decided weeks ago. The element of uncertainty seems to have been reduced this season.

With this in mind, I wanted to look at uncertainty over the long run in English football. To do this, I used the data provided by http://www.football-data.co.uk/ and analyzed them with R. These data consist of 34,740 matches played in the top 5 divisions of English football between 2000 and 2015, containing information about both the result of each match and the odds offered by bookies on that result.

To measure the uncertainty of any given match I used the following strategy. First, I averaged across all bookies’ odds for the three possible outcomes: home win, draw, and away win. Next, I mapped these aggregated odds into probabilities by inverting each of the odds and then dividing by the sum of the inverted odds. This takes care of the overround that helps bookies make a profit. For example, if the decimal odds were 2.1 that an event happens and 2.1 that it doesn’t, then the probability of the event occurring is:

(1/2.1) / (1/2.1 + 1/2.1) = 0.4761905 / (0.4761905 + 0.4761905) = 0.5.

Finally, to measure the uncertainty of each match, I subtract the probability of the outcome that actually occurred from 1, giving a “residual” score. Imagine a home win occurs. The “residual” in this case is 1-P(home win). If P(home win)=1, there is no uncertainty, and this uncertainty score is zero. Since there are 3 outcomes, we would expect the measure to be bounded between 0 (no uncertainty) and 0.67 (pure uncertainty), the point at which we would get 1 out of 3 right just by guessing.
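As a quick illustration, here is the same calculation in R for a single hypothetical match; this is a minimal sketch, and the odds below are made up purely for illustration.

# made-up decimal odds for one match: home win, draw, away win
odds  = c(home = 2.1, draw = 3.4, away = 3.9)
probs = (1/odds) / sum(1/odds)   # invert and normalise, removing the overround
probs                            # implied probabilities, summing to 1

# uncertainty ("residual") score if, say, the home team wins
1 - probs["home"]                # 0 = fully predictable, ~0.67 = a pure guess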

After importing these data into R and calculating the uncertainty measure, I looked at how it has changed over time. The plot above shows fitted smoothed trend lines of uncertainty, stratified by division. These trends are striking. Going by this graph, the Premier League has become more predictable over the analysis period. In 2000, the uncertainty measure was around 0.605. Given that we expect this measure to be bounded between 0 (complete certainty) and 0.67 (completely random), this tells us that the average league game was very unpredictable. Over time, however, the measure has decreased by about 5%, which does not seem like much. Despite the somewhat unexciting end to the 2014/15 season, the outcome of the average game is still not very predictable.

Noticeably, there is even greater uncertainty in lower-league games. In fact, the average uncertainty measure for League 2 games approached 0.65 in 2014, indicating that the average League 2 game is about as unpredictable as a game of rock-paper-scissors. Interestingly, and unlike the Premier League, there does not appear to be any discernible change over time: these games are just as unpredictable now as they were in 2000. My R code is below.

# clear
rm(list=ls())

# libraries
library(ggplot2)

# build the data file URLs for each season/division
# (four divisions, E0-E3, for the earliest seasons; five, including the Conference "EC", thereafter)

years = c(rep("0001",4), rep("0102",4), rep("0203",4), rep("0304",4), rep("0405",4),
          rep("0506",5), rep("0607",5), rep("0708",5), rep("0809",5),
          rep("0910",5), rep("1011",5), rep("1112",5), rep("1213",5),
          rep("1314",5), rep("1415",5))
divis = c(rep(c("E0","E1","E2","E3"),5), rep(c("E0","E1","E2","E3","EC"),10))

urls = paste(years, divis, sep="/")
urls = paste("http://www.football-data.co.uk/mmz4281/", urls, ".csv", sep="")


# bookmaker odds column names (home/draw/away triplet for each bookmaker)
odds = c("B365H","B365D","B365A",
         "BSH","BSD","BSA",
         "BWH","BWD","BWA",
         "GBH","GBD","GBA",
         "IWH","IWD","IWA",
         "LBH","LBD","LBA",
         "PSH","PSD","PSA",
         "SOH","SOD","SOA",
         "SBH","SBD","SBA",
         "SJH","SJD","SJA",
         "SYH","SYD","SYA",
         "VCH","VCD","VCA",
         "WHH","WHD","WHA")
home = odds[seq(1,length(odds),3)]
draw = odds[seq(2,length(odds),3)]
away = odds[seq(3,length(odds),3)]

# load all data in a loop
full.data = NULL
for(i in 1:length(urls)){
  temp = read.csv(urls[i])
  # calculate average odds
  temp$homeodds = apply(temp[,names(temp) %in% home], 1, function(x) mean(x,na.rm=T))
  temp$drawodds = apply(temp[,names(temp) %in% draw], 1, function(x) mean(x,na.rm=T))
  temp$awayodds = apply(temp[,names(temp) %in% away], 1, function(x) mean(x,na.rm=T))
  temp = temp[,c("Div","Date","FTHG","FTAG","FTR","homeodds","drawodds","awayodds")]
  full.data = rbind(full.data, temp)
}

full.data$homewin = ifelse(full.data$FTR=="H", 1, 0)
full.data$draw = ifelse(full.data$FTR=="D", 1, 0)
full.data$awaywin = ifelse(full.data$FTR=="A", 1, 0)

# convert odds to probabilities, stripping out the overround
full.data$homeprob = (1/full.data$homeodds)/(1/full.data$homeodds+1/full.data$drawodds+1/full.data$awayodds)
full.data$drawprob = (1/full.data$drawodds)/(1/full.data$homeodds+1/full.data$drawodds+1/full.data$awayodds)
full.data$awayprob = (1/full.data$awayodds)/(1/full.data$homeodds+1/full.data$drawodds+1/full.data$awayodds)

# bookie residual
full.data$bookieres = 1-full.data$homeprob
full.data$bookieres[full.data$FTR=="D"] = 1-full.data$drawprob[full.data$FTR=="D"]
full.data$bookieres[full.data$FTR=="A"] = 1-full.data$awayprob[full.data$FTR=="A"]

# parse match dates (the files mix two- and four-digit years) so we can plot over time
full.data$time = ifelse(nchar(as.character(full.data$Date))==8, 
                         as.Date(full.data$Date,format='%d/%m/%y'),
                         as.Date(full.data$Date,format='%d/%m/%Y'))
full.data$date = as.Date(full.data$time, origin = "1970-01-01") 

full.data$Division = "Premier League" 
full.data$Division[full.data$Div=="E1"] = "Championship" 
full.data$Division[full.data$Div=="E2"] = "League 1" 
full.data$Division[full.data$Div=="E3"] = "League 2" 
full.data$Division[full.data$Div=="EC"] = "Conference" 

full.data$Division = factor(full.data$Division, levels = c("Premier League", "Championship", "League 1",
                                                           "League 2","Conference"))

ggplot(full.data, aes(date, bookieres, colour=Division)) +
  stat_smooth(size = 1.25, alpha = 0.2) +
  labs(x = "Year", y = "Uncertainty") + 
  theme_bw() +
  theme(legend.position="bottom") +
  theme(axis.text=element_text(size=20),
        axis.title=element_text(size=20),
        legend.title = element_text(size=20),
        legend.text = element_text(size=20))
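As a rough numeric check on the smoothed curves (for instance, the Premier League value of about 0.605 in 2000, or League 2 approaching 0.65), the uncertainty score can also be averaged by division and calendar year. A minimal sketch, reusing the full.data object built above; the year and yearly names are just illustrative helpers:

# mean uncertainty by division and calendar year
full.data$year = as.numeric(format(full.data$date, "%Y"))
yearly = aggregate(bookieres ~ Division + year, data = full.data, FUN = mean)
head(yearly[order(yearly$year, yearly$Division), ])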

Kalkalash! Pinpointing the Moments “The Simpsons” became less Cromulent

Whenever somebody mentions “The Simpsons” it always stirs up feelings of nostalgia in me. The characters, uproarious gags, zingy one-liners, and edgy animation all contributed towards making, arguably, the greatest TV show ever. However, it’s easy to forget that “The Simpsons” is still an ongoing TV show, now in its twenty-fourth season no less.

For me, and most others, the later episodes bear little resemblance to the older ones. The current incarnation of the show is stale, and has been for a long time. I haven’t watched a new episode in over ten years, and don’t intend to any time soon. When did this decline begin? Was it part of a slow secular trend, or was there a sudden drop in quality from which there was no recovery?

To answer these questions I use the Global Episode Opinion Survey (GEOS) episode ratings data, which are published online. A simple web scrape of the “all episodes” page provides me with 423 episode ratings, spanning from the first episode of season 1 to the third episode of season 20. After S20E03 the ratings become too sparse, which is probably a function of how bad the show has become. To detect changepoints in the ratings, I used the R package changepoint. An informative introduction to both the package and changepoint analysis can be found in its accompanying vignette.

[Figure: changepoints in "The Simpsons" episode ratings]

The figure above provides a summary of my results. Five breakpoints were detected, the first occurring at the first episode of the ninth season: The City of New York vs. Homer Simpson. Most will remember this one; Homer goes to New York to collect his clamped car and ends up going berserk. A good episode, although it essentially marked the beginning of the end.

According to the changepoint results, the decline occurred in three stages. The first lasted from the New York episode up until episode 11 of season 10; the shows in this stage have an average rating of about 7, and the episode at which the breakpoint is detected is Wild Barts Can’t Be Broken. The next stage roughly marks my own transition, as it was around this point that I stopped watching. This stage lasts as far as S15E09, whereupon the show suffers the further ignominy of another ratings downgrade. Things could hardly get any worse, and they don’t: the show earns a minor reprieve after the twentieth episode of season 18.

So now you know. R code for the analysis is below.

# packages
library(Hmisc) ; library(changepoint)
# clear ws
rm(list=ls())

# webscrape data
page1 = "http://www.geos.tv/index.php/list?sid=159&collection=all"
home1 = readLines(con<-url(page1)); close(con)

# pick out lines with ratings
means = '<td width="60px" align="right" nowrap>'
epis = home1[grep(means, home1)]
epis = epis[57:531]
epis = epis[49:474]

# prune data: extract the rating value that immediately precedes each closing </span> tag
loc = function(x) substring.location(x,"</span>")$first[1]
epis = data.frame(epis)
epis = cbind(epis,apply(epis, 1, loc))
epis$cut = NA
for(i in 1:dim(epis)[1]){
  epis[i,3] = substr(epis[i,1], epis[i,2]-4, epis[i,2]-1) 
}
#create data frame
ts1 = data.frame(rate=epis$cut, episode=50:475)
# remove out of season shows and movie
ts1 = ts1[!(ts1$episode %in% c(178,450,451)),]
# make numeric
ts1$rate = as.numeric(as.character(ts1$rate))

# changepoint function
mean2.pelt = cpt.mean(ts1$rate,method='PELT')

# plot results
plot(mean2.pelt,type='l',cpt.col='red',xlab='Episode',
     ylab='Average Rating',cpt.width=2, main="Changepoints in 'The Simpsons' Ratings")

# what episodes ?
# The City of New York vs. Homer Simpson
# Wild Barts Can't Be Broken
# I, (Annoyed Grunt)-Bot
# Stop Or My Dog Will Shoot!
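To read the detected changepoints and segment means directly off the fitted object (rather than the plot), the changepoint package’s accessor functions should do the job; a minimal sketch:

# episode numbers at which changepoints are detected
ts1$episode[cpts(mean2.pelt)]

# average rating within each segment
param.est(mean2.pelt)$mean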

BMR: Bayesian Macroeconometrics in R

The recently released BMR package, short for Bayesian Macroeconometrics in R, provides a comprehensive set of powerful routines for estimating Bayesian Vector Autoregression (VAR) and Dynamic Stochastic General Equilibrium (DSGE) models in R.

Estimating both Bayesian VAR and DSGE models can represent a great computational burden. However, BMR removes much of this burden by performing the most computationally demanding procedures in C++, which is ported into R with the Rcpp package, in a manner similar to the recently released Stan package.

Despite the complexity of these models, the package itself is very easy to use. Furthermore, the package’s author has provided an awesome vignette that explains both the theory underlying these models and gives examples of their use.

Time-Series Policy Evaluation in R

Quantifying the success of government policies is clearly important. Randomized controlled trials, like those conducted by drug companies, are often described as the ‘gold standard’ of policy evaluation. Under these, a policy is applied to one area or group (the treatment) but not to another (the control), and the difference in outcomes between the two represents the effectiveness of the policy.

Unfortunately, there are many circumstances in which governments introduce policies but are not able to measure how well, if at all, they work. A good example of this is childhood vaccination against disease, since it is unethical to deny some children a potentially life-saving treatment. This makes evaluating such policies difficult, because there is no counterfactual: everyone gets the treatment (in this case vaccination).

However, it is possible to evaluate the effectiveness of certain policies with time-series econometrics. For this example, I am going to use a series of monthly road death statistics in Ireland over the period 2003 to present. In November 2011, a new drink driving law was introduced which lowered the blood-alcohol limits. How has the introduction of this law changed road fatalities?

To evaluate this policy, I estimated a simple state space model using the R package sspir. The code I used for this example is almost exactly like the van drivers example in this paper. The following generalized linear model is assumed: y_{t} ~ Po(mu_{t}) with log(mu_{t}) = a_{t} + b*LAW_{t} + S_{t}, where the number of deaths, y_{t}, is Poisson distributed with a (log) mean that depends on a random walk, a_{t}, time-varying seasonal patterns, S_{t}, and the introduction of the law. The b parameter is the measure of the policy’s effect.

The usefulness of this model for policy evaluation, compared to an OLS linear regression for example, stems from the time-varying intercept a_{t}. In effect, this parameter substitutes for the counterfactual or control group that we would need in a randomized controlled trial, because it takes into account the underlying trend in the time series before the policy change. The law dummy variable adds a shift to this underlying trend, acting as the relevant treatment effect.

A full overview of the state space system, and estimation using the Kalman filter is beyond the scope of this blog post. Interested readers should consult this paper and the references therein. The code to generate these results is below.

> head(road)
  deaths drinkdrive ind
1     20          0   1
2     21          0   2
3     33          0   3
4     23          0   4
5     28          0   5
6     37          0   6

# likelihood of the state space model as a function of the log signal-to-noise parameters psi
mle <- function(psi) {
  m1 <- ssm(deaths ~ tvar(1) + drinkdrive + tvar(sumseason(time,12)),
            family=poisson(link="log"), data=road,
            phi=c(1,exp(psi)),C0=diag(13)*1000)
  m1fit <- getFit(m1)
  return(extended(m1fit)$likelihood)
}

# maximise the likelihood over psi (fnscale = -1 makes optim maximise)
res1 <- optim(par=c(0,0),fn=mle,method=c("L-BFGS-B"),
              control=list(fnscale=-1),hessian=FALSE)

The first part of the code sets up the state space system inside a function. The reason for this is to calculate the signal-to-noise ratio, which I call psi, by maximum likelihood. The signal-to-noise ratio dictates the ‘wiggliness’ of the random walk a_{t}. Too much signal and a_{t} will match the series exactly; too little and the random walk collapses to a constant term, as in OLS. A similar argument applies to the time-varying seasonal patterns.

After calculating the signal-to-noise ratio, I estimated the model with the maximum likelihood parameters. The function ssm automatically runs the iterated Kalman smoother, which provides the smoothed state estimates and covariances used below.

# estimate with mle pars
m2 <- ssm(deaths ~ tvar(1) + drinkdrive + tvar(sumseason(time,12)),
          family=poisson(link="log"), data=road,
          phi=c(1,exp(res1$par)),C0=diag(13)*1000)
m2fit <- getFit(m2)
          
100 * (exp(m2fit$m[1,2])-1) # effect of drink driving law

# calculate sd
m2sd <- NULL
for (i in 1:length(road$deaths)) {
  thisone <- m2fit$C[[i]][1:2,1:2]  
  if (road$drinkdrive[i]==0) { 
    m2sd <- c(m2sd,sqrt(thisone[1,1])) 
  }
  else
    m2sd <- c(m2sd,sqrt(sum(thisone)))
}

# plot
road$ind <- 1:112
plot(road$ind,road$deaths,ylim=c(0,50),col="red",type="l",axes=F,
     xlab="Year",ylab="Death Count",main="Monthly Road Deaths, Ireland")
lines(exp(road$drinkdrive*m2fit$m[,2] + m2fit$m[,1]),col="blue")
lines(exp(road$drinkdrive*m2fit$m[,2] + m2fit$m[,1]+2*m2sd),lty=2,col="blue")
lines(exp(road$drinkdrive*m2fit$m[,2] + m2fit$m[,1]-2*m2sd),lty=2,col="blue")
abline(v=107)   # November 2011: introduction of the new drink driving law
axis(1, at=c(1,25,49,73,97), lab=c("2003","2005","2007","2009","2011"))
axis(2)
box()

The resulting plot shows the underlying series (red), the policy-relevant effect (the shift in the blue line; the black vertical line marks the policy change), and the uncertainty around it (dotted blue lines). Perhaps it is too early to suggest this policy was ineffective. However, as is evident from the graph, the policy change did not result in fewer road deaths. In fact, the point estimate corresponds to a 5.6% increase in road deaths, although a large amount of uncertainty surrounds this estimate, as shown by the approximate 95% confidence intervals.
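For the record, an approximate 95% interval for the percentage effect itself can be backed out of the smoothed state covariance; a rough sketch, assuming (as in the plotting code above) that the law coefficient is the second state element:

# approximate 95% interval for the drink driving effect, in percent
# (assumes the law coefficient is the second state element, as above)
b    <- m2fit$m[1,2]                # smoothed law coefficient
b.sd <- sqrt(m2fit$C[[1]][2,2])     # its approximate standard deviation
100 * (exp(b + c(-2, 2) * b.sd) - 1)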

While I use this as an example, I should be clear that I do not condone driving under the influence of alcohol, nor do I think this was a bad policy. These data may not be a perfect measure of the policy’s effectiveness. Perhaps there are other time series, like head-on collisions, for which the policy did make a discernible difference. Additionally, there may be previous policy changes in the series which I am unaware of.

Temperature Change in Ireland

Has Ireland gotten any warmer? Ask any punter on the street and they will happily inform you of wild swings, trends and dips. “Back when I was a child”, “when I was younger”, or “years ago” are the usual refrains.

What’s the evidence? To answer this, I will use the temperature data from my previous post alongside the R package bcp. The bcp package stands for Bayesian Change Point, and it implements the change point methodology of Barry & Hartigan (1993). A good overview of how to use the bcp R package is offered by Erdman & Emerson (2007).

These series are monthly and run from 1866 to 2011. There are two measures for each month: the average daily maximum and the average daily minimum temperature. I use both here, performing the multivariate bcp procedure separately for each calendar month, with that month’s average maximum and minimum as the two series.

The figure above shows the output produced when I look at the January temperatures. Clearly, there are no major changes: the January temperature in Ireland is the same now as it was in 1866. The results for the other months are very similar. However, there are some changes in the October series, with the posterior means of both the maximum and minimum temperatures increasing by around 1 degree Celsius (see below).

Has Ireland gotten warmer? Yes, but only in October, for some reason unknown to me.

# with web-scraped data from the previous post

library(bcp)

cold <- NULL                 # minimum temperatures, one column per calendar month
for (i in 1:12){
  mins <- armagh[armagh[,2]==i,4]
  cold <- cbind(cold,mins)
}
cold <- cold[14:159,]        # keep the years 1866-2011

warm <- NULL                 # maximum temperatures, one column per calendar month
for (i in 1:12){
  maxs <- armagh[armagh[,2]==i,3]
  warm <- cbind(warm,maxs)
}
warm <- warm[14:159,]

# January (column 1): minima and maxima as a bivariate series
bcp.0 <- bcp(cbind(cold[,1],warm[,1]), burnin = 1000, mcmc = 5000)
plot(bcp.0)

# October (column 10)
bcp.0 <- bcp(cbind(cold[,10],warm[,10]), burnin = 1000, mcmc = 5000)
plot(bcp.0)
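To see the numbers behind the October claim rather than just the plots, the fitted bcp object exposes its posterior summaries; a minimal sketch (bcp.0 here holds the October fit, column 10 above):

# posterior means for the two series (min, max) and the
# posterior probability of a change point at each year
head(bcp.0$posterior.mean)
head(bcp.0$posterior.prob)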