# How Predictable is the English Premier League?

The reason why football is so exciting is uncertainty. The outcome of any match or league is unknown, and you get to watch the action unfold without knowing what’s going to happen. Watching matches where you know the score is never exciting.

This weekend the English Premier League season will conclude with little fanfare. Bar one relegation place, the league positions have already been determined. In fact, these positions were, for the most part, decided weeks ago. The element of uncertainty seems to have been reduced this season.

With this in mind, I wanted to look at uncertainty over the long run in English football. To do this used the data provided by http://www.football-data.co.uk/ and analyzed these with R. These data consist of 34,740 matches played in the top 5 divisions of English football between 2000 and 2015, containing information about both the result and the odds offered by bookies on this result.

To measure the uncertainty of any given match I used the following strategy. First, I averaged across all bookies’ odds for the three possible events: home win, draw, and away win. Next I mapped these aggregated odds into probabilities by inverting each of the odds and then dividing by the summed inverted odds. This takes care of the over round that helps bookies to make a profit. For example, if the odds were 2.1/1 that an event happens and 2.1/1 that it doesn’t then the probability of the event occurring is:

(1/2.1)/ (1/2.1 + (1/2.1)) = 0.4761905/(0.4761905+0.4761905) = 0.5.

Finally, to measure the uncertainty of each match, I subtract the probability that the event occurred from 1, to calculate a “residual” score. Imagine a home win occurs. The “residual” in this case will be 1-P(home win). If P(home win)=1, then there is no uncertainty, and this uncertainty score will be zero. Since there are 3 outcomes, we would expect an uncertainty measure to be bounded between 0 (no uncertainty) and 0.67 (pure uncertainty) where we get 1 out of 3right by just guessing.

After importing these data into R and calculating the uncertainty measure, I looked at this uncertainty measure over time. The plot in the above shows fitted smoothed trend lines of uncertainty, stratified by division. These trends are striking. Going by this graph, the Premier League has gotten more predictable over the analysis period. In 2000, the uncertainty measure was around 0.605. Given that we expect this measure to be bound between 0 (complete certainty) and 0.67 (completely random), this tell us that the average league game was very unpredictable. Over time, however, this measure has decreased by about 5%, which does not seem like much. Despite, the somewhat unexciting end to the 2014/15 season, the outcome of the average game is still not very predictable.

Noticeably, in lower league games there is even greater uncertainty. In fact, the average uncertainty measure of League 2 games approached a value of 0.65 in 2014. This indicates that the average League 2 game is about as unpredictable as playing rock-paper-scissors. Interestingly, and unlike the Premier League, there does not appear to be any discernible change over time. The games are just as unpredictable now as they were in 2000. Please see my R code below.

```# clear
rm(list=ls())

# libraries
library(ggplot2)

# what are urls

years = c(rep("0001",4), rep("0102",4), rep("0203",4), rep("0405",4),
rep("0506",5), rep("0607",5), rep("0708",5), rep("0809",5),
rep("0910",5), rep("1011",5), rep("1112",5), rep("1213",5),
rep("1314",5), rep("1415",5))
divis = c(rep(c("E0","E1","E2","E3"),4), rep(c("E0","E1","E2","E3","EC"),10))

urls = paste(years, divis, sep="/")
urls = paste("http://www.football-data.co.uk/mmz4281", urls, sep="/")

odds = c("B365H","B365D","B365A",
"BSH","BSD","BSA",
"BWH","BWD","BWA",
"GBH","GBD","GBA",
"IWH","IWD","IWA",
"LBH","LBD","LBA",
"PSH","PSD","PSA",
"SOH","SOD","SOA",
"SBH","SBD","SBA",
"SJH","SJD","SJA",
"SYH","SYD","SYA",
"VCH","VCD","VCA",
"WHH","WHD","WHA")
home = odds[seq(1,length(odds),3)]
draw = odds[seq(2,length(odds),3)]
away = odds[seq(3,length(odds),3)]

# load all data in a loop
full.data = NULL
for(i in 1:length(urls)){
# calculate average odds
temp\$homeodds = apply(temp[,names(temp) %in% home], 1, function(x) mean(x,na.rm=T))
temp\$drawodds = apply(temp[,names(temp) %in% draw], 1, function(x) mean(x,na.rm=T))
temp\$awayodds = apply(temp[,names(temp) %in% away], 1, function(x) mean(x,na.rm=T))
temp = temp[,c("Div","Date","FTHG","FTAG","FTR","homeodds","drawodds","awayodds")]
full.data = rbind(full.data, temp)
}

full.data\$homewin = ifelse(full.data\$FTR=="H", 1, 0)
full.data\$draw = ifelse(full.data\$FTR=="D", 1, 0)
full.data\$awaywin = ifelse(full.data\$FTR=="A", 1, 0)

# convert to probs with overrind
full.data\$homeprob = (1/full.data\$homeodds)/(1/full.data\$homeodds+1/full.data\$drawodds+1/full.data\$awayodds)
full.data\$drawprob = (1/full.data\$drawodds)/(1/full.data\$homeodds+1/full.data\$drawodds+1/full.data\$awayodds)
full.data\$awayprob = (1/full.data\$awayodds)/(1/full.data\$homeodds+1/full.data\$drawodds+1/full.data\$awayodds)

# bookie residual
full.data\$bookieres = 1-full.data\$homeprob
full.data\$bookieres[full.data\$FTR=="D"] = 1-full.data\$drawprob[full.data\$FTR=="D"]
full.data\$bookieres[full.data\$FTR=="A"] = 1-full.data\$awayprob[full.data\$FTR=="A"]

# now plot over time
full.data\$time = ifelse(nchar(as.character(full.data\$Date))==8,
as.Date(full.data\$Date,format='%d/%m/%y'),
as.Date(full.data\$Date,format='%d/%m/%Y'))
full.data\$date = as.Date(full.data\$time, origin = "1970-01-01")

full.data\$Division = "Premier League"
full.data\$Division[full.data\$Div=="E1"] = "Championship"
full.data\$Division[full.data\$Div=="E2"] = "League 1"
full.data\$Division[full.data\$Div=="E3"] = "League 2"
full.data\$Division[full.data\$Div=="EC"] = "Conference"

full.data\$Division = factor(full.data\$Division, levels = c("Premier League", "Championship", "League 1",
"League 2","Conference"))

ggplot(full.data, aes(date, bookieres, colour=Division)) +
stat_smooth(size = 1.25, alpha = 0.2) +
labs(x = "Year", y = "Uncertainty") +
theme_bw() +
theme(legend.position="bottom") +
theme(axis.text=element_text(size=20),
axis.title=element_text(size=20),
legend.title = element_text(size=20),
legend.text = element_text(size=20))
```

# Kalkalash! Pinpointing the Moments “The Simpsons” became less Cromulent

Whenever somebody mentions “The Simpsons” it always stirs up feelings of nostalgia in me. The characters, uproarious gags, zingy one-liners, and edgy animation all contributed towards making, arguably, the greatest TV ever. However, it’s easy to forget that as a TV show “The Simpsons” is still ongoing—in its twenty-fourth season no less.

For me, and most others, the latter episodes bear little resemblance to older ones. The current incarnation of the show is stale, and has been for a long time. I haven’t watched a new episode in over ten years, and don’t intend to any time soon. When did this decline begin? Was it part of a slow secular trend, or was there a sudden drop in the quality, from which there was no recovery?

To answer these questions I use the Global Episode Opinion Survey (GEOS) episode ratings data, which are published online. A simple web scrape of the “all episodes” page provides me with 423 episode ratings, spanning from the first episode of season 1, to the third episode of season 20. After S20E03, the ratings become too sparse, which is probably a function of how bad the show, in its current condition, is. To detect changepoints in show ratings, I have used the R package changepoint. An informative introduction of both the package and changepoint analyses can be found in this accompanying vignette.

The figure above provides a summary of my results. Five breakpoints were detected. The first occurring in the first episode of the ninth season: The City of New York Vs. Homer Simpson. Most will remember this; Homer goes to New York to collect his clamped car and ends up going berserk. Good episode, although this essentially marked the beginning of the end.

According to the changepoint results, the decline occurred in three stages. The first lasted from the New York episode up until episode 11 in season 10. The shows in this stage have an average rating of about 7, and the episode where the breakpoint is detected is: Wild Barts Can’t Be Broken. The next stage roughly marks my transition, as it is about this point that I stopped watching. This stage lasts as far as S15E09, whereupon the show suffers the further ignominy of another ratings downgrade. Things possibly couldn’t get any worse, and they don’t, as the show earns a minor reprieve after the twentieth episode of season 18.

So now you know. R code for the analysis can be found in the below.

```# packages
library(Hmisc) ; library(changepoint)
# clear ws
rm(list=ls())

# webscrape data
page1 = "http://www.geos.tv/index.php/list?sid=159&collection=all"

# pick out lines with ratings
means = '<td width="60px" align="right" nowrap>'
epis = home1[grep(means, home1)]
epis = epis[57:531]
epis = epis[49:474]

# prune data
loc = function(x) substring.location(x,"</span>")\$first[1]
epis = data.frame(epis)
epis = cbind(epis,apply(epis, 1, loc))
epis\$cut = NA
for(i in 1:dim(epis)[1]){
epis[i,3] = substr(epis[i,1], epis[i,2]-4, epis[i,2]-1)
}
#create data frame
ts1 = data.frame(rate=epis\$cut, episode=50:475)
# remove out of season shows and movie
ts1 = ts1[!(ts1\$episode %in% c(178,450,451)),]
# make numeric
ts1\$rate = as.numeric(as.character(ts1\$rate))

# changepoint function
mean2.pelt = cpt.mean(ts1\$rate,method='PELT')

# plot results
plot(mean2.pelt,type='l',cpt.col='red',xlab='Episode',
ylab='Average Rating',cpt.width=2, main="Changepoints in 'The Simpsons' Ratings")

# what episodes ?
# The City of New York vs. Homer Simpson
# Wild Barts Can't Be Broken
# I, (Annoyed Grunt)-Bot -
# Stop Or My Dog Will Shoot!
```

# Web-Scraping in R

Web-scraping, or web-crawling, sounds like a seedy activity worthy of an Interpol investigative department. The reality, however, is far less nefarious. Web-scraping is any procedure by which someone extracts data from the internet. Given that it’s possible to get the internet on computers these days; web-scrapping opens an array of interesting possibilities to social-science researchers as it is possible to harvest massive datasets in short periods of times.

In the following code, I illustrate a very simple web-scraping routine. The object of this exercise is to scrape some Irish weather time-series, which I will look at in a future post. The main function of interest here is:

```readLines(...)
```

which reads these data into the R workspace. The url object in my example is a .txt file, however if the url address is written in html the readLines command will read all of the lines of the html. In effect, the readLines object will be a character vector with a length equal to the number of lines in the html code. In this case, extracting the data you need from the crap you don’t requires a bit of jiggery-pokery formally referred to as parsing. Thankfully, there are a whole bunch of functions in the Hmisc and other packages which can be used to do this in a systematic way. I have found the following particularly useful for parsing:

```substring.location(...) ; grep(...) ; substr(...) ; strsplit(...)
```
```# clean data
rm(list=ls())

# web url
site <- "http://www.metoffice.gov.uk/climate/uk/stationdata/armaghdata.txt"

# call in data with try command in while loop
i <- 1
while (i < 2){
if (class(aa) == "try-error") {
next
} else {
i <- i + 1
}
}

# grand! now inspect and trim off crap
aa <- aa[6:dim(aa)[1],]

# data is melted together so some tidying required
bb <- cc <- dd <- c()
for (i in (1:length(aa))){
bb <- unlist(strsplit(as.character(aa[i]), " "))
cc <- bb[nchar(bb)>0] ; cc <- cc[1:7]
dd <- rbind(dd,cc)
}

row.names(dd) <- dd[,1]

colnm <- c(dd[1,1],dd[1,2],paste(dd[1,3],dd[2,1],sep=" "), paste(dd[1,4],
dd[2,2],sep=" "), paste(dd[1,5],dd[2,3],sep=" "),
paste(dd[1,6],dd[2,4],sep=" "), paste(dd[1,7],dd[2,5],sep=" "))

colnames(dd) <- colnm

armagh <- data.frame(dd[-c(1,2),])

for (i in (1:dim(armagh)[2])){
armagh[,i] <- as.numeric(as.character(armagh[,i]))
}

decmin <- armagh[armagh[,2]==12,4]
year <- armagh[armagh[,2]==12,1]
wh1 <- data.frame(cbind(armagh\$tmin.degC[armagh\$mm==12],armagh\$yyyy[armagh\$mm==12]))
wh1 <- na.omit(wh1)

# nice plot
library(ggplot2)
ggplot(wh1, aes(X2,X1)) +
geom_line(colour="red") +
theme_bw() +
scale_x_continuous('Year') +
scale_y_continuous('Minimum Temperature - Degree Celsius') +
opts(title = expression("December Average Daily Minimum Temperature - Armagh 1865-2011"))
```

In the script above, I call in these data, tidy them up and then do a pretty graph with the excellent ggplot2 package. I use a while loop with the try function to call in the data. This can be very important for those interested in scraping data systematically. Sometimes the readLines function will not be able to establish a connection to the url address of interest. The try function in the while loop here ensures that in the event that R is not able to make the connection, it will try again until a connection is established. The equivalent to this is pressing refresh in your internet browser. When scraping data iteratively from a large number of url addresses, connection difficulties are inevitable, and therefore using the try function in while loop can save a lot of hassle.

Happy scraping, ya filthy animal.