How to Convert Rugby into Football/Soccer Scores

Following the Irish rugby team’s humiliating 60-0 defeat to New Zealand, an interesting question was posed on Twitter: what does a 60-0 result convert to in football/soccer?

Intrigued, I gathered this season's data from both the English Premier League (more data has been collected, with future blog posts to come!) and the equivalent English rugby union league. My solution involved the following steps. First, I assumed that the scoring process in each game follows a parametric probability distribution. I then fitted a distribution to each set of scores and calculated the distribution function for rugby and the quantile function for football. This let me find the cumulative probability of a team scoring 60 or fewer in rugby, and then convert that probability into football goals.
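
The conversion step amounts to quantile matching: take the cumulative probability of a rugby score under the rugby distribution, then look up the football score at the same cumulative probability. Here is a minimal sketch in base R. The negative binomial and Poisson distributions and all parameter values are illustrative assumptions, not the distributions or estimates fitted later in this post, and `convert_score` is a hypothetical helper name.

```r
# Illustrative parameters only; these are NOT estimates from the post's data.
rugby_mean    <- 20   # hypothetical mean of a negative binomial for rugby points
rugby_size    <- 3    # hypothetical dispersion (size) of that negative binomial
football_mean <- 1.4  # hypothetical mean of a Poisson for football goals

# Hypothetical helper: map a rugby score to the football score that sits
# at the same cumulative probability.
convert_score <- function(rugby_score) {
  # P(X <= rugby_score) under the assumed rugby distribution
  p <- pnbinom(rugby_score, size = rugby_size, mu = rugby_mean)
  # smallest football score whose cumulative probability is at least p
  qpois(p, lambda = football_mean)
}

convert_score(c(0, 20, 60))
```

Because both functions are non-decreasing, a higher rugby score never maps to a lower football score.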

The scores in both games will follow some kind of count distribution. However, rugby is a much higher-scoring game, so it is unlikely that both sets of scores are generated by the same parametric distribution. To fit the scores from each game to a suitable distribution, I have chosen to use the gamlss package on CRAN. The advantage of the gamlss package is that it can fit a huge range of distributions.

The code below shows how I loaded the data and fitted the football and rugby scores to a number of count distributions. My final choice of distribution was based on a comparison of AIC values across the models. On that basis, football scores follow the Poisson-inverse Gaussian (PIG) distribution and rugby scores follow the zero-adjusted negative binomial (ZANBI) distribution. The pZANBI function gives the cumulative probability of a rugby score, and the qPIG function returns the football score at that same cumulative probability.
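
For readers unfamiliar with AIC, it is defined as 2k - 2 log L, where k is the number of estimated parameters and L is the maximised likelihood; lower values indicate a better trade-off between fit and complexity. The following sketch computes it by hand for an intercept-only Poisson model and checks the result against base R's glm(). The data are simulated purely for illustration and the variable names are my own; this is not the gamlss fit used in this post.

```r
set.seed(1)                        # simulated counts, not the post's data
goals <- rpois(200, lambda = 1.4)  # hypothetical goal counts

# Intercept-only Poisson: the maximum likelihood estimate of the mean
# is simply the sample mean.
loglik <- sum(dpois(goals, lambda = mean(goals), log = TRUE))
k <- 1                             # one estimated parameter (the mean)
aic_by_hand <- 2 * k - 2 * loglik

# The same model fitted with glm() for comparison
fit <- glm(goals ~ 1, family = poisson)
c(by_hand = aic_by_hand, glm = AIC(fit))
```

The two values should agree up to numerical tolerance. Adding parameters, as the two- and three-parameter distributions do, only pays off if the log-likelihood improves by more than the penalty of 2 per parameter.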

To answer the question: a 60-0 score in rugby translates into a 7-0 score in football. Oh dear.

#### score analysis
rm(list=ls())
p1 <- read.csv("premgames.csv")
sc <- c(p1$hgoal,p1$agoal)
# sc is premier league goals

library(gamlss.dist)
library(gamlss)

# fit dists
m1a <- gamlss(sc ~ 1, family = PO)
m2a <- gamlss(sc ~ 1, family = NBI)
m3a <- gamlss(sc ~ 1, family = NBII)
m4a <- gamlss(sc ~ 1, family = PIG)
m5a <- gamlss(sc ~ 1, family = ZANBI)
m6a <- gamlss(sc ~ 1, family = ZIPIG)
m7a <- gamlss(sc ~ 1, family = SI)

# compare dists
AIC(m1a,m2a,m3a,m4a,m5a,m6a,m7a)
# m4a is the best

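# load rugby data: flatten every column of the file into one character vector
p2 <- as.character(unlist(read.csv("rugscore.csv")))
# keep only the 2nd to 47th distinct values (sorted as strings),
# i.e. drop whatever the first distinct entry is
nms <- names(table(p2))[2:47]
p3 <- p2[p2 %in% nms]
# convert the remaining score strings to numbers
p4 <- as.numeric(as.character(p3))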

#fit
m1b <- gamlss(p4 ~ 1, family = PO)
m2b <- gamlss(p4 ~ 1, family = NBI)
m3b <- gamlss(p4 ~ 1, family = NBII)
m4b <- gamlss(p4 ~ 1, family = PIG)
m5b <- gamlss(p4 ~ 1, family = ZANBI)
m6b <- gamlss(p4 ~ 1, family = ZIPIG)
m7b <- gamlss(p4 ~ 1, family = SI)

#compare
AIC(m1b,m2b,m3b,m4b,m5b,m6b,m7b)
#m5b is best

# cumulative probability of 60 or fewer points in rugby
# (mu and sigma use a log link; nu, the probability of a zero, uses a logit link)
s1 <- pZANBI(60, mu = exp(m5b$mu.coefficients), sigma = exp(m5b$sigma.coefficients),
             nu = plogis(m5b$nu.coefficients))
# convert p in rugby to soccer distribution
qPIG(s1, mu = exp(m4a$mu.coefficients), sigma = exp(m4a$sigma.coefficients))

# the same again for zero
s2 <- pZANBI(0, mu = exp(m5b$mu.coefficients), sigma = exp(m5b$sigma.coefficients),
             nu = plogis(m5b$nu.coefficients))
qPIG(s2, mu = exp(m4a$mu.coefficients), sigma = exp(m4a$sigma.coefficients))

############################################################# 
########## output

> rm(list=ls())
> p1 <- read.csv("premgames.csv")
> sc <- c(p1$hgoal,p1$agoal)
> # sc is premier league goals
> 
> library(gamlss.dist)
> library(gamlss)
> 
> # fit dists
> m1a <- gamlss(sc ~ 1, family = PO)
> m2a <- gamlss(sc ~ 1, family = NBI)
> m3a <- gamlss(sc ~ 1, family = NBII)
> m4a <- gamlss(sc ~ 1, family = PIG)
> m5a <- gamlss(sc ~ 1, family = ZANBI)
> m6a <- gamlss(sc ~ 1, family = ZIPIG)
> m7a <- gamlss(sc ~ 1, family = SI)
> 
> # compare dists
> AIC(m1a,m2a,m3a,m4a,m5a,m6a,m7a)
    df      AIC
m4a  2 2334.244
m2a  2 2334.412
m3a  2 2334.412
m6a  3 2336.244
m7a  3 2336.244
m5a  3 2336.328
m1a  1 2341.862
> # m4a is the best
> 
> #load rugby data
> p2 <- as.character(unlist(read.csv("rugscore.csv")))
> nms <- names(table(p2))[2:47]
> p3 <- p2[p2 %in% nms]
> p4 <- as.numeric(as.character(p3))
> 
> #fit
> m1b <- gamlss(p4 ~ 1, family = PO)
> m2b <- gamlss(p4 ~ 1, family = NBI)
> m3b <- gamlss(p4 ~ 1, family = NBII)
> m4b <- gamlss(p4 ~ 1, family = PIG)
> m5b <- gamlss(p4 ~ 1, family = ZANBI)
> m6b <- gamlss(p4 ~ 1, family = ZIPIG)
> m7b <- gamlss(p4 ~ 1, family = SI)
> 
> #compare
> AIC(m1b,m2b,m3b,m4b,m5b,m6b,m7b)
    df      AIC
m5b  3 1721.183
m2b  2 1722.700
m3b  2 1722.700
m6b  3 1727.345
m4b  2 1732.172
m7b  3 1749.975
m1b  1 2265.146
> #m5b is best
> 
> # p of 60 in rugby
> s1 <- pZANBI(60, mu = exp(m5b$mu.coefficients), sigma = exp(m5b$sigma.coefficients),
+              nu = exp(m5b$nu.coefficients))
> # convert p in rugby to soccer distribution
> qPIG(s1, mu = exp(m4a$mu.coefficients), sigma = exp(m4a$sigma.coefficients))
[1] 7
> 
> # the same again for zero
> s2 <- pZANBI(0, mu = exp(m5b$mu.coefficients), sigma = exp(m5b$sigma.coefficients),
+              nu = exp(m5b$nu.coefficients))
> qPIG(s2, mu = exp(m4a$mu.coefficients), sigma = exp(m4a$sigma.coefficients))
[1] 0


10 thoughts on “How to Convert Rugby into Football/Soccer Scores”

  1. Hi, gamlss is a reasonably good idea here, but I really think that you should plot the estimated densities on top of the observed densities, and also use QQ plots on the residuals to make sure you’re estimating in an unbiased manner. AIC alone doesn’t tell you if the PIG and ZANBI distributions are capturing the densities accurately. Visual evaluation of the fit is really important, I think.

    A nonparametric alternative would be just to use the empirical CDFs to find scores with equivalent percentiles directly.

    • Hi Harlan,

      I briefly looked at some density histograms and tables, and it seems like both distributions do a reasonable job, although I am willing to be corrected on this.

      Also, I considered using a non-parametric bootstrap, as you suggested. However, I wanted to (briefly) show how the gamlss package works.

      Thanks for the comment.

  2. This is perhaps a stupid question, but should not the dual scores of a match (rugby or soccer) have a joint distribution? When one team is attacking, the other team must defend. When defending, you can not make an attack of your own, at the same time.

    Therefore, 7 in 7-5 is not the same 7 as in 7-0…

    • Nope, not stupid, a very useful comment.

      I have thought of this very point, but decided to keep the analysis simple. Also, I think that more data would be needed to fit such distributions reliably.

      Thanks for the comment.

  3. This may be (erroneously) simplistic given my limited stats knowledge, but wouldn’t it be easier to plot all combinations of historical scores in sport no. 1 and 2, find a specific score’s relative frequency in sport no. 1, then look up the % on sport no. 2’s historical score plot to obtain the equivalent score?

    • Hi Frank,

      That’s not an erroneous method. In fact, it’s a more direct way of saying that the scores for both games should be modeled as a joint distribution as per the comments below.

