Brazil’s Host Advantage

brazil

If history can tell us anything about the World Cup, it’s that the host nation has an advantage of all other teams. Evidence of this was presented last night as the referee in the Brazil-Croatia match unjustly ruled in Brazil’s favour on several occasions. But what it is the statistical evidence of a host advantage?

To look at this, I downloaded these data from the Guardian’s website. With these, I ran a very simple probit model that regressed the probability of winning the world cup on whether the country was the host and also if the county was not the host but located in the same continent (I merge North and South America for this exercise). Obviously, this is quite a basic analysis, so I hope to build on these data as the tournament progresses and maybe and the 2010 data, and look at more sophisticated models.

> probitmfx(formula=winners ~ continent + hosts, data=wc)
Call:
probitmfx(formula = winners ~ continent + hosts, data = wc)

Marginal Effects:
             dF/dx Std. Err.      z   P>|z|   
continent 0.064425  0.027018 2.3845 0.01710 * 
hosts     0.315378  0.121175 2.6027 0.00925 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

dF/dx is for discrete change for the following variables:

[1] "continent" "hosts"   

The results are as we would expect. I am using the excellent mfx package to interpret the probit coefficients. Being the host nation increases the probability of being victorious by nearly 32%. So, going by historical trends, Brazil have a huge advantage for this world cup. If we look at countries in the same continent (think Argentina for this world cup) we see that there is a small advantage here, just over 6%.

Whether these results are robust to additional control variables and in the inclusion of fixed effects alongside heterogeneous time-varying effects is something I hope to probe.

Advertisements

How Much Should Bale Cost Real?

It looks increasingly likely that Gareth Bale will transfer from Tottenham to Real Madrid for a world record transfer fee. Negotiations are ongoing, with both parties keen to get the best deal possible deal with the transfer fee. Reports speculate that this transfer fee will be anywhere in the very wide range of £80m to £120m.

Given the topical nature of this transfer saga, I decided to explore the world record breaking transfer fee data, and see if these data can help predict what the Gareth Bale transfer fee should be. According to this Wikipedia article, there have been 41 record breaking transfers, from Willie Groves going from West Brom to Aston Villa in 1893 for £100, to Cristiano Ronaldo’s £80m 2009 transfer to Real Madrid from Manchester United.

When comparing any historical price data it is very important that we are comparing like with like. Clearly, a fee of £100 in 1893 is not the same as £100 in 2009. Therefore, the world record transfer fees need to be adjusted for inflation. To do this, I used the excellent measuringworth website, and converted all of the transfer fees into 2011 pounds sterling.

bale

The plot above demonstrates a very strong linear relationship between logged real world record transfer fees and time. The R-squared indicates that the year of the transfer fee explains roughly 97% of the variation in price.

So, if Real Madrid are to pay a world transfer fee for Bale, how much does this model predict the fee will be? The above plot demonstrates what happens when the simple log-linear model is extrapolated to predict the world record transfer fee in 2013. The outcome here is 18.37, so around £96m, in 2011 prices. We can update this value to 2013 prices. Assuming a modest inflation rate of 2% we get £96m[exp(0.02*2)]=£99.4m. No small potatoes.

rm(list=ls())

bale = read.csv("bale.csv")
# data from:
# http://en.wikipedia.org/wiki/World_football_transfer_record
# http://www.measuringworth.com/ukcompare/

ols1 = lm(log(real2011)~year, bale)

# price
exp(predict(ols1,data.frame(year=2013)))
# inflate lets say 2% inflation
exp(predict(ols1,data.frame(year=2013)))*exp(0.02*2)

# nice ggplot
library(ggplot2)
bale$lnprice2011 = log(bale$real2011)
addon = data.frame(year=2013,nominal=0,real2011=0,name="Bale?",
                   lnprice2011=predict(ols1,data.frame(year=2013)))

ggplot(bale, aes(x=year, y=lnprice2011, label=name)) + 
  geom_text(hjust=0.4, vjust=0.4) +
  stat_smooth(method = "lm",fullrange = TRUE, level = 0.975) +
  theme_bw(base_size = 12, base_family = "") +
  xlim(1885, 2020) + ylim(8, 20) +
  xlab("Year") + ylab("ln(Price)") +
  ggtitle("World Transfer Records, Real 2011 Prices (£)")+
  geom_point(aes(col="red"),size=4,data=addon) + 
  geom_text(aes(col="red", fontface=3),hjust=-0.1, vjust=0,size=7,data=addon) + 
  theme(legend.position="none")

Identical Champions League Draw: What Were the Odds?

cldraw1A number of news outlets have reported a peculiar quirk that arose during Friday’s Champions League draw. Apparently, the sport’s European governing body, UEFA, ran a trial run the day before the main event, and the schedule chosen during this event was identical to that of the actual draw on Friday.

Given this strange coincidence, a number of people have been expressing the various odds of this occurrence. For example, the author of this newspaper article claimed that ‘bookies’ calculated the odds at 5,000 to 1. In other words, the probability of this event was 0.0002.

The same article also says that the probability of this event (two random draws being identical) occurring is not as low as one might think. However, this article does not give the probability or odds of this event occurring. The oblivious reason for this is that such a calculation is difficult. Since teams from the same domestic league and teams from the same country cannot play each other, such a calculation involves using conditional probabilities over a variety of scenarios.

Despite my training in Mathematics and interest in quantitative pursuits, I have always struggled to calculate the probability of multiple conditional events. Given that there are many different ways in which two identical draws can be made, such a calculation is, unfortunately, beyond my admittedly limited ability.

Thankfully, there’s a cheats way to getting a rough answer: use Monte Carlo simulation. The code below shows how to write up a function in R that performs synthetic draws for the Champions League given the aforementioned conditions. With this function, I performed two draws 200,000 times, and calculated that the probability of the identical draws is: 0.00011, so the odds are around: 1 in 9,090. This probability is subject to some sampling error, however getting a more accurate measure via simulation would require more computing power like that enabled by Rcpp (which I really need to learn). Nevertheless, the answer is clearly lower than that proposed either by the ‘bookies’ or the newspaper article’s author.

# cl draw
rm(list=ls())
setwd("C:/Users/Alan/Desktop")
dat <- read.csv("cldraw.csv")

#=============================
> dat
team iso pos group
1       Galatasaray TUR  RU     H
2           Schalke GER  WI     B
3            Celtic SCO  RU     G
4          Juventus ITA  WI     E
5           Arsenal ENG  RU     B
6            Bayern GER  WI     F
7  Shakhtar Donetsk UKR  RU     E
8          Dortmund GER  WI     D
9             Milan ITA  RU     C
10        Barcelona ESP  WI     G
11      Real Madrid ESP  RU     D
12      Man. United ENG  WI     H
13         Valencia ESP  RU     F
14              PSG FRA  WI     A
15            Porto POR  RU     A
16           Malaga ESP  WI     C
#=============================

draw <- function(x){
  fixtures <- matrix(NA,nrow=8,ncol=2)
  p <- 0
  while(p==0){
    for(j in 1:8){
      k <- 0
      n <- 0
      while(k==0){
        n <- n + 1
        if(n>50){break}
        aa <- x[x[,"pos"]=="RU",]
        t1 <- aa[sample(1:dim(aa)[1],1),]
        bb <- x[x[,"pos"]=="WI",]
        t2 <- bb[sample(1:dim(bb)[1],1),]
        k <- ifelse(t1[,"iso"]!=t2[,"iso"] & t1[,"group"]!=t2[,"group"],1,0)
      }
      fixtures[j,1] <- as.character(t1[,"team"])
      fixtures[j,2] <- as.character(t2[,"team"])
      x <- x[!(x[,"team"] %in% c(as.character(t1[,"team"]),
                                 as.character(t2[,"team"]))),]
    }
    if(n>50){p <- 0}    
    p <- ifelse(sum(as.numeric(is.na(fixtures)))==0,1,0)
  }
  return(fixtures)
}

drawtwo <- function(x){ 
  f1 <- as.vector(unlist(x))
  joinup <- data.frame(team=f1[1:16], iso=f1[17:32],
                       pos=f1[33:48], group=f1[49:64])
  check1 <- data.frame(draw(joinup))
  check2 <- data.frame(draw(joinup))
  rightdraw <- ifelse(sum(na.omit(check1[order(check1),2])==
    na.omit(check2[order(check2),2]))==8, 1, 0)
  return(rightdraw)
}

drawtwo(dat)

dat2 <- rbind(as.vector(unlist(dat)),
              as.vector(unlist(dat)))
dat3 <- dat2[rep(1,1000),]

vals <- 0
for(i in 1:200){
  yy <- apply(dat3, 1, drawtwo)
  vals <- sum(yy) + vals
}
#=============================
# Probability 
> vals/200000
[1] 0.00011
# Odds
> 1/(vals/200000)-1
[1] 9089.909
#=============================

How to Convert Rugby into Football/Soccer Scores

Following the Irish rugby team’s humiliating 60-0 defeat to New Zealand, an interesting question was posed on Twitter: what does a 60-0 result convert to in football/soccer?

Intrigued, I decided to gather some data from both the English premier league (this season, more data collected and future blog posts to come!) and the equivalent English league in rugby union (this season too). My solution to this question involved the following steps. Firstly, I assumed that the scoring process in both games follow parametric probability distributions. I then fitted these data to these distributions, and calculated both the distribution and quantile functions. This allowed me to see the probability of a team scoring 60 in rugby, and then convert that probability into football goals.

The scores in both games will take the form of some kind of count distribution. However, Rugby is a much higher scoring game, and it is unlikely that both of the score count processes are being generated from the same parametric distribution. To fit scores from both games to their respective distributions, I have chosen to use the gamlss package on CRAN. The advantage of the gamlss package is that it has the capability to fit a huge range of distributions.

The code below shows how I loaded these data and fit the scores for both football and rugby to a number of count related distributions. My final choice of distribution was based on a comparison of AIC values for each model. Based on these, football and rugby scores follow the Poisson-inverse Gaussian, and zero-adjusted and zero-inflated negative binomial distributions respectively. The pZANBI and qPIG functions calculate the location of a rugby score on the football score distribution.

To answer the question: a 60-0 score in rugby translates into a 7-0 score in football. Oh dear.

#### score analysis
rm(list=ls())
p1 <- read.csv("premgames.csv")
sc <- c(p1$hgoal,p1$agoal)
# sc is premier league goals

library(gamlss.dist)
library(gamlss)

# fit dists
m1a <- gamlss(sc ~ 1, family = PO)
m2a <- gamlss(sc ~ 1, family = NBI)
m3a <- gamlss(sc ~ 1, family = NBII)
m4a <- gamlss(sc ~ 1, family = PIG)
m5a <- gamlss(sc ~ 1, family = ZANBI)
m6a <- gamlss(sc ~ 1, family = ZIPIG)
m7a <- gamlss(sc ~ 1, family = SI)

# compare dists
AIC(m1a,m2a,m3a,m4a,m5a,m6a,m7a)
# m4a is the best

#load rugby data
p2 <- as.character(unlist(read.csv("rugscore.csv")))
nms <- names(table(p2))[2:47]
p3 <- p2[p2 %in% nms]
p4 <- as.numeric(as.character(p3))

#fit
m1b <- gamlss(p4 ~ 1, family = PO)
m2b <- gamlss(p4 ~ 1, family = NBI)
m3b <- gamlss(p4 ~ 1, family = NBII)
m4b <- gamlss(p4 ~ 1, family = PIG)
m5b <- gamlss(p4 ~ 1, family = ZANBI)
m6b <- gamlss(p4 ~ 1, family = ZIPIG)
m7b <- gamlss(p4 ~ 1, family = SI)

#compare
AIC(m1b,m2b,m3b,m4b,m5b,m6b,m7b)
#m5b is best

# p of 60 in rugby
s1 <- pZANBI(60, mu = exp(m5b$mu.coefficients), sigma = exp(m5b$sigma.coefficients),
             nu = exp(m5b$nu.coefficients))
# convert p in rugby to soccer distribution
qPIG(s1, mu = exp(m4a$mu.coefficients), sigma = exp(m4a$sigma.coefficients))

# the same again for zero
s2 <- pZANBI(0, mu = exp(m5b$mu.coefficients), sigma = exp(m5b$sigma.coefficients),
             nu = exp(m5b$nu.coefficients))
qPIG(s2, mu = exp(m4a$mu.coefficients), sigma = exp(m4a$sigma.coefficients))

############################################################# 
########## output

> rm(list=ls())
> p1 <- read.csv("premgames.csv")
> sc <- c(p1$hgoal,p1$agoal)
> # sc is premier league goals
> 
> library(gamlss.dist)
> library(gamlss)
> 
> # fit dists
> m1a <- gamlss(sc ~ 1, family = PO)
> m2a <- gamlss(sc ~ 1, family = NBI)
> m3a <- gamlss(sc ~ 1, family = NBII)
> m4a <- gamlss(sc ~ 1, family = PIG)
> m5a <- gamlss(sc ~ 1, family = ZANBI)
> m6a <- gamlss(sc ~ 1, family = ZIPIG)
> m7a <- gamlss(sc ~ 1, family = SI)
> 
> # compare dists
> AIC(m1a,m2a,m3a,m4a,m5a,m6a,m7a)
    df      AIC
m4a  2 2334.244
m2a  2 2334.412
m3a  2 2334.412
m6a  3 2336.244
m7a  3 2336.244
m5a  3 2336.328
m1a  1 2341.862
> # m4a is the best
> 
> #load rugby data
> p2 <- as.character(unlist(read.csv("rugscore.csv")))
> nms <- names(table(p2))[2:47]
> p3 <- p2[p2 %in% nms]
> p4 <- as.numeric(as.character(p3))
> 
> #fit
> m1b <- gamlss(p4 ~ 1, family = PO)
> m2b <- gamlss(p4 ~ 1, family = NBI)
> m3b <- gamlss(p4 ~ 1, family = NBII)
> m4b <- gamlss(p4 ~ 1, family = PIG)
> m5b <- gamlss(p4 ~ 1, family = ZANBI)
> m6b <- gamlss(p4 ~ 1, family = ZIPIG)
> m7b <- gamlss(p4 ~ 1, family = SI)
> 
> #compare
> AIC(m1b,m2b,m3b,m4b,m5b,m6b,m7b)
    df      AIC
m5b  3 1721.183
m2b  2 1722.700
m3b  2 1722.700
m6b  3 1727.345
m4b  2 1732.172
m7b  3 1749.975
m1b  1 2265.146
> #m5b is best
> 
> # p of 60 in rugby
> s1 <- pZANBI(60, mu = exp(m5b$mu.coefficients), sigma = exp(m5b$sigma.coefficients),
+              nu = exp(m5b$nu.coefficients))
> # convert p in rugby to soccer distribution
> qPIG(s1, mu = exp(m4a$mu.coefficients), sigma = exp(m4a$sigma.coefficients))
[1] 7
> 
> # the same again for zero
> s2 <- pZANBI(0, mu = exp(m5b$mu.coefficients), sigma = exp(m5b$sigma.coefficients),
+              nu = exp(m5b$nu.coefficients))
> qPIG(s2, mu = exp(m4a$mu.coefficients), sigma = exp(m4a$sigma.coefficients))
[1] 0