How to Convert Rugby into Football/Soccer Scores

Following the Irish rugby team’s humiliating 60-0 defeat to New Zealand, an interesting question was posed on Twitter: what does a 60-0 result convert to in football/soccer?

Intrigued, I decided to gather some data from both the English premier league (this season, more data collected and future blog posts to come!) and the equivalent English league in rugby union (this season too). My solution to this question involved the following steps. Firstly, I assumed that the scoring process in both games follow parametric probability distributions. I then fitted these data to these distributions, and calculated both the distribution and quantile functions. This allowed me to see the probability of a team scoring 60 in rugby, and then convert that probability into football goals.

The scores in both games will take the form of some kind of count distribution. However, Rugby is a much higher scoring game, and it is unlikely that both of the score count processes are being generated from the same parametric distribution. To fit scores from both games to their respective distributions, I have chosen to use the gamlss package on CRAN. The advantage of the gamlss package is that it has the capability to fit a huge range of distributions.

The code below shows how I loaded these data and fit the scores for both football and rugby to a number of count related distributions. My final choice of distribution was based on a comparison of AIC values for each model. Based on these, football and rugby scores follow the Poisson-inverse Gaussian, and zero-adjusted and zero-inflated negative binomial distributions respectively. The pZANBI and qPIG functions calculate the location of a rugby score on the football score distribution.

To answer the question: a 60-0 score in rugby translates into a 7-0 score in football. Oh dear.

#### score analysis
rm(list=ls())
p1 <- read.csv("premgames.csv")
sc <- c(p1$hgoal,p1$agoal)
# sc is premier league goals

library(gamlss.dist)
library(gamlss)

# fit dists
m1a <- gamlss(sc ~ 1, family = PO)
m2a <- gamlss(sc ~ 1, family = NBI)
m3a <- gamlss(sc ~ 1, family = NBII)
m4a <- gamlss(sc ~ 1, family = PIG)
m5a <- gamlss(sc ~ 1, family = ZANBI)
m6a <- gamlss(sc ~ 1, family = ZIPIG)
m7a <- gamlss(sc ~ 1, family = SI)

# compare dists
AIC(m1a,m2a,m3a,m4a,m5a,m6a,m7a)
# m4a is the best

#load rugby data
p2 <- as.character(unlist(read.csv("rugscore.csv")))
nms <- names(table(p2))[2:47]
p3 <- p2[p2 %in% nms]
p4 <- as.numeric(as.character(p3))

#fit
m1b <- gamlss(p4 ~ 1, family = PO)
m2b <- gamlss(p4 ~ 1, family = NBI)
m3b <- gamlss(p4 ~ 1, family = NBII)
m4b <- gamlss(p4 ~ 1, family = PIG)
m5b <- gamlss(p4 ~ 1, family = ZANBI)
m6b <- gamlss(p4 ~ 1, family = ZIPIG)
m7b <- gamlss(p4 ~ 1, family = SI)

#compare
AIC(m1b,m2b,m3b,m4b,m5b,m6b,m7b)
#m5b is best

# p of 60 in rugby
s1 <- pZANBI(60, mu = exp(m5b$mu.coefficients), sigma = exp(m5b$sigma.coefficients),
             nu = exp(m5b$nu.coefficients))
# convert p in rugby to soccer distribution
qPIG(s1, mu = exp(m4a$mu.coefficients), sigma = exp(m4a$sigma.coefficients))

# the same again for zero
s2 <- pZANBI(0, mu = exp(m5b$mu.coefficients), sigma = exp(m5b$sigma.coefficients),
             nu = exp(m5b$nu.coefficients))
qPIG(s2, mu = exp(m4a$mu.coefficients), sigma = exp(m4a$sigma.coefficients))

############################################################# 
########## output

> rm(list=ls())
> p1 <- read.csv("premgames.csv")
> sc <- c(p1$hgoal,p1$agoal)
> # sc is premier league goals
> 
> library(gamlss.dist)
> library(gamlss)
> 
> # fit dists
> m1a <- gamlss(sc ~ 1, family = PO)
> m2a <- gamlss(sc ~ 1, family = NBI)
> m3a <- gamlss(sc ~ 1, family = NBII)
> m4a <- gamlss(sc ~ 1, family = PIG)
> m5a <- gamlss(sc ~ 1, family = ZANBI)
> m6a <- gamlss(sc ~ 1, family = ZIPIG)
> m7a <- gamlss(sc ~ 1, family = SI)
> 
> # compare dists
> AIC(m1a,m2a,m3a,m4a,m5a,m6a,m7a)
    df      AIC
m4a  2 2334.244
m2a  2 2334.412
m3a  2 2334.412
m6a  3 2336.244
m7a  3 2336.244
m5a  3 2336.328
m1a  1 2341.862
> # m4a is the best
> 
> #load rugby data
> p2 <- as.character(unlist(read.csv("rugscore.csv")))
> nms <- names(table(p2))[2:47]
> p3 <- p2[p2 %in% nms]
> p4 <- as.numeric(as.character(p3))
> 
> #fit
> m1b <- gamlss(p4 ~ 1, family = PO)
> m2b <- gamlss(p4 ~ 1, family = NBI)
> m3b <- gamlss(p4 ~ 1, family = NBII)
> m4b <- gamlss(p4 ~ 1, family = PIG)
> m5b <- gamlss(p4 ~ 1, family = ZANBI)
> m6b <- gamlss(p4 ~ 1, family = ZIPIG)
> m7b <- gamlss(p4 ~ 1, family = SI)
> 
> #compare
> AIC(m1b,m2b,m3b,m4b,m5b,m6b,m7b)
    df      AIC
m5b  3 1721.183
m2b  2 1722.700
m3b  2 1722.700
m6b  3 1727.345
m4b  2 1732.172
m7b  3 1749.975
m1b  1 2265.146
> #m5b is best
> 
> # p of 60 in rugby
> s1 <- pZANBI(60, mu = exp(m5b$mu.coefficients), sigma = exp(m5b$sigma.coefficients),
+              nu = exp(m5b$nu.coefficients))
> # convert p in rugby to soccer distribution
> qPIG(s1, mu = exp(m4a$mu.coefficients), sigma = exp(m4a$sigma.coefficients))
[1] 7
> 
> # the same again for zero
> s2 <- pZANBI(0, mu = exp(m5b$mu.coefficients), sigma = exp(m5b$sigma.coefficients),
+              nu = exp(m5b$nu.coefficients))
> qPIG(s2, mu = exp(m4a$mu.coefficients), sigma = exp(m4a$sigma.coefficients))
[1] 0

Standard, Robust, and Clustered Standard Errors Computed in R

Where do these come from? Since most statistical packages calculate these estimates automatically, it is not unreasonable to think that many researchers using applied econometrics are unfamiliar with the exact details of their computation.

For the purposes of illustration, I am going to estimate different standard errors from a basic linear regression model: \textbf{y}=\textbf{X} \mathbf{\beta}+\textbf{u}, using the fertil2 dataset used in Christopher Baum’s book. Let’s load these data, and estimate a linear regression with the lm function (which estimates the parameters \hat{\mathbf{\beta}} using the all too familiar: ( \textbf{X}'\textbf{X})^{-1}\textbf{X}'\textbf{y} least squares estimator.

rm(list=ls())
library(foreign)
#load data
children <- read.dta("children.dta")
# lm formula and data
form <- ceb ~ age + agefbrth + usemeth
data <- children
# run regression
r1 <- lm(form, data)
# get stand errs
> summary(r1)

Call:
lm(formula = form, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.8900 -0.7213 -0.0017  0.6950  6.2657 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.358134   0.173783   7.815 7.39e-15 ***
age          0.223737   0.003448  64.888  < 2e-16 ***
agefbrth    -0.260663   0.008795 -29.637  < 2e-16 ***
usemeth      0.187370   0.055430   3.380 0.000733 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1.463 on 3209 degrees of freedom
  (1148 observations deleted due to missingness)
Multiple R-squared: 0.5726,	Adjusted R-squared: 0.5722 
F-statistic:  1433 on 3 and 3209 DF,  p-value: < 2.2e-16 

When the error terms are assumed homoskedastic IID, the calculation of standard errors comes from taking the square root of the diagonal elements of the variance-covariance matrix which is formulated:

E[\textbf{uu}'|\textbf{X}] = \mathbf{\Sigma_{u}}

\mathbf{\Sigma_{u}} = \sigma^2 I_{N}

\textrm{Var}[\hat{\mathbf{\beta}}|\textbf{X}] = (\textbf{X}'\textbf{X})^{-1} (\textbf{X}' \mathbf{\Sigma_{u}} \textbf{X}) (\textbf{X}'\textbf{X})^{-1}

\textrm{Var}[\hat{\mathbf{\beta}}|\textbf{X}] = \sigma_{u}^{2}(\textbf{X}'\textbf{X})^{-1}

In practice, and in R, this is easy to do. Estimate the variance by taking the average of the ‘squared’ residuals \textbf{uu}', with the appropriate degrees of freedom adjustment. Code is below. As you can see, these standard errors correspond exactly to those reported using the lm function.

# get X matrix/predictors
X <- model.matrix(r1)
# number of obs
n <- dim(X)[1]
# n of predictors
k <- dim(X)[2]
# calculate stan errs as in the above
# sq root of diag elements in vcov
se <- sqrt(diag(solve(crossprod(X)) * as.numeric(crossprod(resid(r1))/(n-k))))
> se
(Intercept)         age    agefbrth     usemeth 
0.173782844 0.003448024 0.008795350 0.055429804 

In the presence of heteroskedasticity, the errors are not IID. Consequentially, it is inappropriate to use the average squared residuals. The robust approach, as advocated by White (1980) (and others too), captures heteroskedasticity by assuming that the variance of the residual, while non-constant, can be estimated as a diagonal matrix of each squared residual. In other words, the diagonal terms in \mathbf{\Sigma_{u}} will, for the most part, be different , so the j-th row-column element will be \hat{u}_{j}^{2}. Once again, in R this is trivially implemented.

# residual vector
u <- matrix(resid(r1))
# meat part Sigma is a diagonal with u^2 as elements
meat1 <- t(X) %*% diag(diag(crossprod(t(u)))) %*% X
# degrees of freedom adjust
dfc <- n/(n-k)    
# like before
se <- sqrt(dfc*diag(solve(crossprod(X)) %*% meat1 %*% solve(crossprod(X))))
> se
(Intercept)         age    agefbrth     usemeth 
0.167562394 0.004661912 0.009561617 0.060644558 

Adjusting standard errors for clustering can be important. For example, replicating a dataset 100 times should not increase the precision of parameter estimates. However, performing this procedure with the IID assumption will actually do this. Another example is in economics of education research, it is reasonable to expect that the error terms for children in the same class are not independent.

Clustering standard errors can correct for this. Assume m clusters. Like in the robust case, it is \textbf{X}' \mathbf{\Sigma_{u}} \textbf{X} or ‘meat’ part, that needs to be adjusted for clustering. In practice, this involves multiplying the residuals by the predictors for each cluster separately, and obtaining \tilde{\textbf{u}}_{j} = \sum^{N_{k}}_{i=1} \hat{u}_{i}\textbf{x}_{i}, an m by k matrix (where k is the number of predictors). ‘Squaring’ \tilde{\textbf{u}}_{j} results in a k by k matrix (the meat part). To get the standard errors, one performs the same steps as before, after adjusting the degrees of freedom for clusters.

# cluster name
cluster <- "children"
# matrix for loops
clus <- cbind(X,data[,cluster],resid(r1))
colnames(clus)[(dim(clus)[2]-1):dim(clus)[2]] <- c(cluster,"resid")
# number of clusters
m <- dim(table(clus[,cluster]))
# dof adjustment
dfc <- (m/(m-1))*((n-1)/(n-k))
# uj matrix
uclust <- matrix(NA, nrow = m, ncol = k)
gs <- names(table(data[,cluster]))
for(i in 1:m){
   uclust[i,] <- t(matrix(clus[clus[,cluster]==gs[i],k+2])) %*% clus[clus[,cluster]==gs[i],1:k] 
   }
# square root of diagonal on bread meat bread like before
se <- sqrt(diag(solve(crossprod(X)) %*% (t(uclust) %*% uclust) %*% solve(crossprod(X)))*dfc
> se
(Intercept)         age    agefbrth     usemeth 
 0.42485889  0.03150865  0.03542962  0.09435531 

For calculating robust standard errors in R, both with more goodies and in (probably) a more efficient way, look at the sandwich package. The same applies to clustering and this paper. However, here is a simple function called ols which carries out all of the calculations discussed in the above.

ols <- function(form, data, robust=FALSE, cluster=NULL,digits=3){
  r1 <- lm(form, data)
  if(length(cluster)!=0){
    data <- na.omit(data[,c(colnames(r1$model),cluster)])
    r1 <- lm(form, data)
  }
  X <- model.matrix(r1)
  n <- dim(X)[1]
  k <- dim(X)[2]
  if(robust==FALSE & length(cluster)==0){
    se <- sqrt(diag(solve(crossprod(X)) * as.numeric(crossprod(resid(r1))/(n-k))))
    res <- cbind(coef(r1),se)
  }
  if(robust==TRUE){
    u <- matrix(resid(r1))
    meat1 <- t(X) %*% diag(diag(crossprod(t(u)))) %*% X
    dfc <- n/(n-k)    
    se <- sqrt(dfc*diag(solve(crossprod(X)) %*% meat1 %*% solve(crossprod(X))))
    res <- cbind(coef(r1),se)
    }
  if(length(cluster)!=0){
    clus <- cbind(X,data[,cluster],resid(r1))
    colnames(clus)[(dim(clus)[2]-1):dim(clus)[2]] <- c(cluster,"resid")
    m <- dim(table(clus[,cluster]))
    dfc <- (m/(m-1))*((n-1)/(n-k))
    uclust  <- apply(resid(r1)*X,2, function(x) tapply(x, clus[,cluster], sum))
    se <- sqrt(diag(solve(crossprod(X)) %*% (t(uclust) %*% uclust) %*% solve(crossprod(X)))*dfc)   
    res <- cbind(coef(r1),se)
  }
  res <- cbind(res,res[,1]/res[,2],(1-pnorm(abs(res[,1]/res[,2])))*2)
  res1 <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),res)),nrow=dim(res)[1])
  rownames(res1) <- rownames(res)
  colnames(res1) <- c("Estimate","Std. Error","t value","Pr(>|t|)")
  return(res1)
}

# with data as before
> ols(ceb ~ age + agefbrth + usemeth,children)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.358      0.174   7.815    0.000
age            0.224      0.003  64.888    0.000
agefbrth      -0.261      0.009 -29.637    2.000
usemeth        0.187      0.055   3.380    0.001
> ols(ceb ~ age + agefbrth + usemeth,children,robust=T)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.358      0.168   8.105    0.000
age            0.224      0.005  47.993    0.000
agefbrth      -0.261      0.010 -27.261    2.000
usemeth        0.187      0.061   3.090    0.002
> ols(ceb ~ age + agefbrth + usemeth,children,cluster="children")
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    1.358      0.425   3.197    0.001
age            0.224      0.032   7.101    0.000
agefbrth      -0.261      0.035  -7.357    2.000
usemeth        0.187      0.094   1.986    0.047

Visualizing Euro 2012: First Group Games

Now that every team has played a match it will be interesting to see how this has affected the (inverse) odds of victory. Since the plot in my last post was a bit ‘busy’, I have decided to use the facet_wrap function in gglplot2 to stratify by group.

Also, re-producing the ‘busy’ plot from the last post yields the following.

Germany, despite not playing well, has gained, while the Netherlands, despite playing quite well, have declined. These two countries will play each other in the next round, so it will be interesting to see how a victory for the Netherlands would change these graphics.

Data and code:

# after loading data as object called eur
n <- dim(eur)[1]
eur <- t(eur[1:n,])
dat <- NULL
for(i in 1:n){dat <- data.frame(rbind(dat,cbind(eur[-1,i],names(eur[-1,i]),i)))}

dat$V1 <- 1/as.numeric(as.character(dat$V1))
dat$V3 <- as.character(dat$V2)
dat$V3[dat$i!=n] <- c("")
dat$group <- ifelse(dat$V2 %in% c("RUS","GRE","POL","CZE"),"Group.A","Group.D")
dat$group <- ifelse(dat$V2 %in% c("GER","NED","POR","DEN"),"Group.B",dat$group)
dat$group <- ifelse(dat$V2 %in% c("IRL","CRO","ITA","ESP"),"Group.C",dat$group)
dat$i <- as.numeric(as.character(dat$i))

ggplot(dat, aes(x=i, y=V1, colour = V2, group=V2, label=V3)) + 
  geom_line(size=0.8) + geom_point(size=4, shape=21, fill="white") + #theme_bw() +
  geom_text(hjust=-0.3, vjust=0) +
  scale_x_continuous('Day',limits=c(1,(n+0.4)),breaks=1:n) +
  scale_y_continuous('1/Odds') +
  theme_bw() +
  opts(title = expression("Euro 2012, Inverse Odds of Victory"),
       legend.position=c(80,80))

ggplot(dat, aes(x=i, y=V1, colour = V2, group=V2, label=V3)) + 
  geom_line(size=0.8) + geom_point(size=4, shape=21, fill="white") + #theme_bw() +
  geom_text(hjust=-0.3, vjust=0.4) +
  scale_x_continuous('Day',limits=c(1,(n+0.8)),breaks=1:n) +
  scale_y_continuous('1/Odds') +
  facet_wrap( ~ group, ncol = 2, scales="free_y") +
  theme_bw() +
  opts(title = expression("Euro 2012, Inverse Odds of Victory"),
       legend.position=c(80,80))

Visualizing Euro 2012 with ggplot2

After scanning this paper by Zeileis, Leitner & Hornik, I thought it would be interesting to see how the victory odds for each team changes as Euro 2012 progresses. To do this, I am going to collect the daily inverse odds of a tournament victory offered by a popular betting site for each team.

Here is the first plot. Day one corresponds to the pretournament odds as given in the aforementioned paper for the popular betting site. These odds were obtained on the 9th of May, while day two’s odds were collected this morning.

I’ll update this in a week.

Let’s Party!

Exploring whether regression coefficients differ between groups is an important part of applied econometric research, and particularly for research with a policy based objective.

For example, a government in a developing country may decide to introduce free school lunches in an effort to improve childhood health. However, if this treatment is known to only improve the health of boys from the lowest socioeconomic strata, it makes sense that this group should be targeted to receive the treatment, while the additional public resources, which would have been unnecessarily used on the other groups, could be efficiently allocated elsewhere.

There are two conventional approaches to estimating these potentially differing effects. The first involves manually partitioning one’s data and performing separate analysis (i.e. one regression for boys the other for girls). The second involves including interaction terms in the regression model. The inclusion of interaction terms allows for different groups to have different slopes.

Problems with these aforementioned strategies arise when researchers would like to stratify the analysis across many groups. Splitting the analysis into different groups can be both a confusing (triple interaction terms anyone?) and inefficient way to conduct research. Furthermore, the results of stratification across a large number of groups can be somewhat difficult to present in a research paper (think of a table with one hundred result columns).

Thankfully, the party package on Cran offers a neat solution to the above concerns, as the functions in this package offer procedures for model based stratification. Following a model-based approach has the obvious advantage that it avoids unnecessary splitting of data, and can therefore be seen as a more efficient way of analyzing group differences.

The model based approach takes the regression model of interest and partitions the results into groups based on parameter instabilities indicated by structural break tests. More info on such tests is given in Zeileis (2005).

In the below, I provide a simple example of the party package at work. Obviously, I encourage interested users to read both the package vignettes, and associated literature before performing more complicated analysis on real data. Let there be three groups (z). In group 0, the effect of x on y is -0.5, in groups 1 and 2 this effect is +0.5. Based on the below plot, we can see that the model-based recursive partitioning approach both predicts the splits, and also the correct parameter estimates.

rm(list=ls())
library(party)
set.seed(1988)

# set up simulated data
z <- sample(c(0,1,2),2000,replace=T)
z1 <- ifelse(z==1,1,0)
z2 <- ifelse(z==2,1,0)
x <- rnorm(2000,0,1)
y <- 1 + 2*z1 + 2*z2 - 0.5*x + x*z1 + x*z2 + rnorm(2000,0,1) 

# model based partitioning of regression of y~x 
# over groups indicated by z
mod1 <- mob(y ~ x | factor(z))
# nice plot of results
plot(mod1)