How Much Should Bale Cost Real?

It looks increasingly likely that Gareth Bale will transfer from Tottenham to Real Madrid for a world record transfer fee. Negotiations are ongoing, with both parties keen to get the best deal possible deal with the transfer fee. Reports speculate that this transfer fee will be anywhere in the very wide range of £80m to £120m.

Given the topical nature of this transfer saga, I decided to explore the world record breaking transfer fee data, and see if these data can help predict what the Gareth Bale transfer fee should be. According to this Wikipedia article, there have been 41 record breaking transfers, from Willie Groves going from West Brom to Aston Villa in 1893 for £100, to Cristiano Ronaldo’s £80m 2009 transfer to Real Madrid from Manchester United.

When comparing any historical price data it is very important that we are comparing like with like. Clearly, a fee of £100 in 1893 is not the same as £100 in 2009. Therefore, the world record transfer fees need to be adjusted for inflation. To do this, I used the excellent measuringworth website, and converted all of the transfer fees into 2011 pounds sterling.


The plot above demonstrates a very strong linear relationship between logged real world record transfer fees and time. The R-squared indicates that the year of the transfer fee explains roughly 97% of the variation in price.

So, if Real Madrid are to pay a world transfer fee for Bale, how much does this model predict the fee will be? The above plot demonstrates what happens when the simple log-linear model is extrapolated to predict the world record transfer fee in 2013. The outcome here is 18.37, so around £96m, in 2011 prices. We can update this value to 2013 prices. Assuming a modest inflation rate of 2% we get £96m[exp(0.02*2)]=£99.4m. No small potatoes.


bale = read.csv("bale.csv")
# data from:

ols1 = lm(log(real2011)~year, bale)

# price
# inflate lets say 2% inflation

# nice ggplot
bale$lnprice2011 = log(bale$real2011)
addon = data.frame(year=2013,nominal=0,real2011=0,name="Bale?",

ggplot(bale, aes(x=year, y=lnprice2011, label=name)) + 
  geom_text(hjust=0.4, vjust=0.4) +
  stat_smooth(method = "lm",fullrange = TRUE, level = 0.975) +
  theme_bw(base_size = 12, base_family = "") +
  xlim(1885, 2020) + ylim(8, 20) +
  xlab("Year") + ylab("ln(Price)") +
  ggtitle("World Transfer Records, Real 2011 Prices (£)")+
  geom_point(aes(col="red"),size=4,data=addon) + 
  geom_text(aes(col="red", fontface=3),hjust=-0.1, vjust=0,size=7,data=addon) + 

The Frisch–Waugh–Lovell Theorem for Both OLS and 2SLS

The Frisch–Waugh–Lovell (FWL) theorem is of great practical importance for econometrics. FWL establishes that it is possible to re-specify a linear regression model in terms of orthogonal complements. In other words, it permits econometricians to partial out right-hand-side, or control, variables. This is useful in a variety of settings. For example, there may be cases where a researcher would like to obtain the effect and cluster-robust standard error from a model that includes many regressors, and therefore a computationally infeasible variance-covariance matrix.

Here are a number of practical examples. The first just takes a simple linear regression model, with two regressors: x1 and x2. To partial out the coefficients on the constant term and x2, we first regress x2 on y1 and save the residuals. We then regress x2 on x1 and save the residuals. The final stage regresses the second residuals on the first. The following code illustrates how one can obtain an identical coefficient on x1 by applying the FWL theorem.

x1 = rnorm(100)
x2 = rnorm(100)
y1 = 1 + x1 - x2 + rnorm(100)

r1 = residuals(lm(y1 ~ x2))
r2 = residuals(lm(x1 ~ x2))
# ols
coef(lm(y1 ~ x1 + x2))
# fwl ols
coef(lm(r1 ~ -1 + r2))

FWL is also relevant for all linear instrumental variable (IV) estimators. Here, I will show how this extends to the 2SLS estimator, where slightly more work is required compared to the OLS example in the above. Here we have a matrix of instruments (Z), exogenous variables (X), and endogenous variables (Y1). Let us imagine we want the coefficient on one endogenous variable y1. In this case we can apply FWL as follows. Regress X on each IV in Z in separate regressions, saving the residuals. Then regress X on y1, and X on y2, saving the residuals for both. In the last stage, perform a two-stage-least-squares regression of the X on y2 residuals on the X on y2 residuals using the residuals from X on each Z as instruments. An example of this is shown in the below code.


ov = rnorm(100)
z1 = rnorm(100) 
z2 = rnorm(100)
y1 = rnorm(100) + z1 + z2 + 1.5*ov
x1 = rnorm(100) + 0.5*z1 - z2
x2 = rnorm(100) 
y2 = 1 + y1 - x1 + 0.3*x2 + ov + rnorm(100)

r1 = residuals(lm(z1 ~ x1 + x2))
r2 = residuals(lm(z2 ~ x1 + x2))
r3 = residuals(lm(y1 ~ x1 + x2))
r4 = residuals(lm(y2 ~ x2 + x2))

# biased coef on y1 as expected for ols
# 2sls
# fwl 2sls

The FWL can also be extended to cases where there are multiple endogenous variables. I have demonstrated this case by extending the above example to model x1 as an endogenous variable.

# 2 endogenous variables
r5 = residuals(lm(z1 ~ x2))
r6 = residuals(lm(z2 ~ x2))
r7 = residuals(lm(y1 ~ x2))
r8 = residuals(lm(x1 ~ x2))
r9 = residuals(lm(y2 ~ x2))

# 2sls coefficients
p1 = fitted.values(lm(y1~z1+z2+x2))
p2 = fitted.values(lm(x1~z1+z2+x2))
lm(y2 ~ p1 + p2 + x2)

# 2sls fwl coefficients
p3 = fitted.values(lm(r7~-1+r5+r6))
p4 = fitted.values(lm(r8~-1+r5+r6))
lm(r9 ~ p3 + p4)

texreg: A package for beautiful and easily customizable LaTeX regression tables from R

There was a very informative post last week showing how the R package stargazer is used to generate nice LaTeX tables from a number of R objects. This package looks very useful. However, I would like to extol the virtues of another R package that converts model objects in R into LaTeX code: texreg.

For me, the texreg package has one very important advantage compared to stargazer, which is (and please correct me if I am wrong on this point) that the regression model’s output is very easy to customize. There are a number of examples where this feature of texreg is important, such as the inclusion of robust/cluster-robust standard errors, or changing the coefficients of a generalized linear model to read as marginal effects.

In the example below, I have replicated (with some modifications) the code used for my last blog post showing the importance of clustering standard errors. Basically this code simulates a bunch of data, estimates a regression model, then duplicates the dataset three times, which in turn reduces the standard errors of the regression model. The last step shows how the clustered standard errors will not be reduced by duplicating the dataset. Once I perform this analysis, I use the output code to generate a texreg object that gives the code needed to create the regression table in LaTeX.

In this example, I have used the texreg override commands in the texreg function so that the standard errors and p-values (needed to calculate the stars in the regression table) reflect the clustering performed with the function. The output from this, once the code is run through a LaTeX compiler (I use TeXworks), is shown in the image below (I have also adjusted the caption so that it is on the top of the table rather than the bottom). One can download the latest version of texreg from the packages R-forge project page. This page also features a very helpful and active forum.


library(lmtest) ; library(sandwich)

# data sim

x <- rnorm(1000)
y <- 5 + 2*x + rnorm(1000,0,40)
# regression
m1 <- lm(y~x)

# triple data
dat <- data.frame(x=c(x,x,x),y=c(y,y,y),g=c(1:1000,1:1000,1:1000))
# regressions
m2 <- lm(y~x, dat) # smaller StErrs

# cluster robust standard error function <- function(model, cluster){
  M <- length(unique(cluster))
  N <- length(cluster)
  K <- model$rank
  dfc <- (M/(M - 1)) * ((N - 1)/(N - K))
  uj <- apply(estfun(model), 2, function(x) tapply(x, cluster, sum));
  rcse.cov <- dfc * sandwich(model, meat = crossprod(uj)/N) <- coeftest(model, rcse.cov)

m3 <-,dat$g)[[2]] # StErrs now back to what they are

       caption="The Importance of Clustering Standard Errors",

Linear Models with Multiple Fixed Effects

Estimating a least squares linear regression model with fixed effects is a common task in applied econometrics, especially with panel data. For example, one might have a panel of countries and want to control for fixed country factors. In this case the researcher will effectively include this fixed identifier as a factor variable, and then proceed to estimate the model that includes as many dummy variables (minus one if an intercept is included in the modelling equation) as there are countries. Obviously, this approach is computationally problematic when there are many fixed factors. In our simple example, an extra country will add an extra column to the X matrix used in the least squares calculation.

Fortunately, there are a number of data transformations that can be used in this panel setting. These include demeaning each within unit observation, using first differences, or including the group means as additional explanatory variables (as suggested by (Mundlak 1978)). However, these approaches only work well when there is one factor that the researcher wants to include fixed effects to account for.

Simen Gaure offers a solution this problem that allows for multiple fixed effects without resorting to a computationally burdensome methodology. Essentially the solution involves an elaboration of the group demeaning transformation mentioned in the above. More technical details can be found here or by referring to Gaure’s forthcoming article in Computational Statistics & Data Analysis. Those interested in implementing this estimation strategy in R can use the lfe package available on CRAN.

In the below, I have included a simple example of how the package works. In this example, the model needs to be set up to calculate fixed effects for two factor variables. Obviously, adding 2,000 columns to the data frame is not a convenient way to estimate the model that includes fixed effects for both the x2 and x3 variables. However, the felm function tackles this problem with ease. Stata has a similar function to feml, areg, although the areg function only allows for absorbed fixed effects in one variable.

# clear workspace
# load lfe package

# create data frame
x1 <- rnorm(10000)
x2 <- rep(1:1000,10)
x3 <- rep(1:1000,10)
e1 <- sin(x2) + 0.02*x3^2 + rnorm(10000)
y <- 10 + 2.5*x1 + (e1-mean(e1))
dat <- data.frame(x1,x2,x3,y)

# simple lm
# lm with fixed effects
felm(dat$y ~ dat$x1 + G(dat$x2) + G(dat$x3))

# output
# simple lm
> lm(y~x1)
Call: lm(formula = y ~ x1)
  (Intercept)           x1  
        10.47       -36.95  
> # lm with fixed effects
> felm(dat$y ~ dat$x1 + G(dat$x2) + G(dat$x3))

Heteroskedastic GLM in R

A commenter on my previous blog entry has drawn my attention to an R function called hetglm() that estimates heteroskedastic probit models. This function is contained in the glmx package. The glmx package is not available on CRAN yet, but thankfully can be downloaded here.

The hetglm() function has a number of computational advantages compared with the crude method outlined in my previous post. The following example replicates the previous analysis showing the speed advantage associated with using the hetglm() function.

rm(list=ls()) # clear ws
n <- 1000 # no. obs
x1 <- runif(n,-1,1) # predictor 1
x2 <- runif(n,-1,1) # " 2
e1 <- rnorm(n,0,1) # normal error
e2 <- (1 + 0.45*(x1+x2))*e1 # hetero error
y <- ifelse(0.5 + 0.5*x1 -0.5*x2 - e2 >0, 1, 0) # outcome
# estimate normal probit
r1 <- glm(y~x1+x2,family=binomial(link="probit")) 
system.time(ml1 <- maxLik(hll,start=c(0,0,0,0,0))) # maximize
system.time(h1 <- hetglm(y ~ x1 + x2))
# output
> system.time(ml1 <- maxLik(hll,start=c(0,0,0,0,0))) # maximize
   user  system elapsed 
   4.43    0.00    4.59 
> system.time(h1 <- hetglm(y ~ x1 + x2))
   user  system elapsed 
   0.11    0.00    0.11 

The Heteroskedastic Probit Model

Specification testing is an important part of econometric practice. However, from what I can see, few researchers perform heteroskedasticity tests after estimating probit/logit models. This is not a trivial point. Heteroskedasticity in these models can represent a major violation of the probit/logit specification, both of which assume homoskedastic errors.

Thankfully, tests for heteroskedasticity in these models exist, and it is also possible to estimate modified binary choice models that are robust to heteroskedastic errors. In this blog post I present an example of how to estimate a heteroskedastic probit in R, and also test for heteroskedasticity.

The standard probit model assumes that the error distribution of the latent model has a unit variance. The heteroskedastic probit model relaxes this assumption, and allows the error variance to depend on some of the predictors in the regression model. Those interested in further details of this model, and the potential implications of this form of model misspecification, should consult these notes.

In the code below, I simulate some data, specify the log-likelihood function for the heteroskedastic probit model, estimate this model via maximum likelihood, and then perform a simple LR test of homoskedasticity. Note the log-likelihood function can be simplified from:

\ln L (\beta, \gamma | X_{i}, Z_{i}) = \sum^{N}_{i=1} \{ Y_{i} \ln \Phi [X_{i}\beta \exp(-Z_{i}\gamma)] + (1-Y_{i}) \ln [1-\Phi (X_{i}\beta \exp(-Z_{i}\gamma))] \}


\ln L (\beta, \gamma | X_{i}, Z_{i}) = \sum^{N}_{i=1} \{ \ln \Phi [q_{i}(X_{i}\beta \exp(-Z_{i}\gamma))]\}

 where q_{i}=2y_{i}-1 uses the fact that the PDF of the normal distribution is symmetric to cut out some algebra (and possibly lead to an easier log-likelihood function to maximize). The log-liklihood function in the above may prove difficult to maximize, although there is an LM test for homoskedasticity, that only requires the restricted (standard probit model) to be estimated. An interesting application would be to see if Bayesian methods estimated using a Gibbs sampling (via STAN, Bugs, or JAGS) can help deal with computationally difficult log likelihoods in this regard. A topic, I will hopefully turn to in a future blog post.

# hetprob
rm(list=ls()) # clear ws
library(maxLik) # use max lik package

# dgp
n <- 1000 # no. obs
x1 <- runif(n,-1,1) # predictor 1
x2 <- runif(n,-1,1) # " 2
e1 <- rnorm(n,0,1) # normal error
e2 <- (1 + 0.45*(x1+x2))*e1 # hetero error
y <- ifelse(0.5 + 0.5*x1 -0.5*x2 - e2 >0, 1, 0) # outcome

# estimate normal probit
r1 <- glm(y~x1+x2,family=binomial(link="probit")) 

# hetprob llik
hll <- function(beta){
  beta1 <- beta[1:dim(X)[2]]
  beta2 <- beta[(1+dim(X)[2]):((dim(X)[2]+dim(X)[2])-1)]
  q <- 2*y-1 # sym tranform
  xbeta <- pnorm(q*((X %*% beta1)/exp(X[,-1] %*% beta2)))

X <- cbind(1,x1,x2)
ml1 <- maxLik(hll,start=c(0,0,0,0,0)) # maximize
sqrt(diag(-solve(ml1$hessian))) # get standard errors

# LR test of homosked
# 2*(LR_{unrestriced}-LR_{restriced})
LRtest <- 2*(ml1$maximum+1/2*r1$deviance)
# LS is chi^2 distributed with 2 dof
# REJECT (as expected)

Probit Models with Endogeneity

Dealing with endogeneity in a binary dependent variable model requires more consideration than the simpler continuous dependent variable case. For some, the best approach to this problem is to use the same methodology used in the continuous case, i.e. 2 stage least squares. Thus, the equation of interest becomes a linear probability model (LPM). The advantage of this approach is its simplicity; both in terms of estimation and interpretation (a 1-unit change in x causes a \beta change in the probability of y).

The disadvantage of this approach is that the LPM may imply probabilities outside the unit interval. It is typically for this reason that generalized linear models, like probit or logit, are used to model binary dependent variables in applied research, and an approach that extends the probit model to account for endogeneity was proposed by Rivers & Vuong (1988).

The Rivers & Vuong estimator assumes the following usual triangular system:

\textbf{y}_{2i}=1(\textbf{X}_{i}\boldsymbol{\beta}+\textbf{y}_{1i}\alpha+\boldsymbol{\epsilon}_{i}>0) (1),

\textbf{y}_{1i}=\textbf{X}_{i}\boldsymbol{\gamma}+\textbf{Z}_{i}\boldsymbol{\theta}+\textbf{v}_{i}) (2),

wherein the jointly normally distributed error terms correlate, the control variables are in \textbf{X}_{i} and the instrumental variables are in \textbf{Z}_{i}. A simple procedure that estimates this model is the following two-step approach. First regress \textbf{y}_{1i} on \textbf{X}_{i} and \textbf{Z}_{i}. Collect the residuals \hat{\textbf{v}}_{i}. The second step involves estimating the probit model of interest (2), and including the first stage residuals as an additional regressor. This method has also been termed the control function approach, as the inclusion of \hat{\textbf{v}}_{i} controls for the correlation between \textbf{v}_{i} and \boldsymbol{\epsilon}_{i}.

The second step coefficient estimates are scaled versions of their real values for reasons outlined here. It is possible to get the un-scaled values using the appropriate transformation. However, for analysis it is often more useful to know the Average Structural Function (ASF), as discussed in Blundell & Powell, 2004). The ASF can be seen as the policy response measure as it illustrates the probability of the dependent variable being a one given the values of the regressors, in the absence of endogeneity. Therefore, with the ASF it is possible to trace out the effects of changes in \textbf{y}_{1i} on the probability of \textbf{y}_{2i}. Additionally, it is possible to compute average partial/marginal effects with the ASF, as described here.

Estimating the ASF involves taking the average of the normal CDF transformed predictions where the scaled coefficients are multiplied by all of the variables at some fixed value (commonly the mean) and all of the first-stage residuals are multiplied by the relevant scaled coefficient. This calculation is formalized on page 7 of this document.

To simulate the effect of changes in the endogenous variable, the ASF can be recursively estimated with different values of \textbf{y}_{1i}. I have attached an example of how this calculation can be performed for a simple simulation in R. It would also be possible to construct confidence intervals for this ASF using bootstrapping methods. As is evident from the plot, the control function estimator correctly identifies the ASF.

n <- 10000
Sigma <- matrix(c(1,0.75,0.75,1),2,2)
u <- rmvnorm(n,rep(0,2),Sigma)

x1 <- rnorm(n)
x2 <- rnorm(n)
y1 <- 1.5 + 2*x1 - 2*x2 + u[,1] 
y2 <- ifelse(-0.25 - 1.25*x1 - 0.5*y1 + u[,2] > 0, 1, 0)

#true asf
eq1 <- function(x1,y1){pnorm(-0.25 - 1.25*x1 - 0.5*y1)}
data <- data.frame(cbind(mean(x1),seq(ceiling(min(y1)),floor(max(y1)),0.2)))
names(data) <- c("x1","y1")
data$asf <- eq1(data$x1,data$y1)

# naive probit: biased regression 
r1 <- glm(y2~x1+y1,binomial(link="probit"))
dat1 <- data.frame(cbind(mean(x1),seq(ceiling(min(y1)),floor(max(y1)),0.2)))
names(dat1) <- c("x1","y1")
asf1 <- cbind(dat1$y1,pnorm(predict(r1,dat1)))

# 2 step control function approach
v1 <- (residuals(lm(y1~x1+x2)))/sd(residuals(lm(y1~x1+x2)))
r2 <- glm(y2~x1+y1+v1,binomial(link="probit"))
# proceedure to get asf
asf2 <- cbind(seq(ceiling(min(y1)),floor(max(y1)),0.2),NA)
for(i in 1:dim(asf2)[1]){
  dat2 <- data.frame(cbind(mean(x1),asf2[i,1],v1))
  names(dat2) <- c("x1","y1","v1")
  asf2[i,2] <- mean(pnorm(predict(r2,dat2)))

# get respective asfs and plot
plotdat <- data.frame(rbind(cbind(data$y1,data$asf,"TRUE ASF"),
                            cbind(data$y1,asf2[,2],"2 STEP PROBIT")))
names(plotdat) <- c("Y1","Y2","Estimator")
plotdat$Y1 <- as.numeric(as.character(plotdat$Y1))
plotdat$Y2 <- as.numeric(as.character(plotdat$Y2))

ggplot(plotdat, aes(x=Y1, y=Y2, colour = Estimator, group=Estimator)) + 
  geom_line(size=0.8) + geom_point()+
  scale_x_continuous('Y1') +
  scale_y_continuous('P(Y2)') +
  theme_bw() +
  opts(title = expression("Average Structural Function Comparison"),legend.position=c(0.8,0.8))