If history can tell us anything about the World Cup, it’s that the host nation has an advantage of all other teams. Evidence of this was presented last night as the referee in the Brazil-Croatia match unjustly ruled in Brazil’s favour on several occasions. But what it is the statistical evidence of a host advantage?

To look at this, I downloaded these data from the Guardian’s website. With these, I ran a very simple probit model that regressed the probability of winning the world cup on whether the country was the host and also if the county was not the host but located in the same continent (I merge North and South America for this exercise). Obviously, this is quite a basic analysis, so I hope to build on these data as the tournament progresses and maybe and the 2010 data, and look at more sophisticated models.

> probitmfx(formula=winners ~ continent + hosts, data=wc)
Call:
probitmfx(formula = winners ~ continent + hosts, data = wc)

Marginal Effects:
dF/dx Std. Err.      z   P>|z|
continent 0.064425  0.027018 2.3845 0.01710 *
hosts     0.315378  0.121175 2.6027 0.00925 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

dF/dx is for discrete change for the following variables:

[1] "continent" "hosts"


The results are as we would expect. I am using the excellent mfx package to interpret the probit coefficients. Being the host nation increases the probability of being victorious by nearly 32%. So, going by historical trends, Brazil have a huge advantage for this world cup. If we look at countries in the same continent (think Argentina for this world cup) we see that there is a small advantage here, just over 6%.

Whether these results are robust to additional control variables and in the inclusion of fixed effects alongside heterogeneous time-varying effects is something I hope to probe.

# Heteroskedastic GLM in R

A commenter on my previous blog entry has drawn my attention to an R function called hetglm() that estimates heteroskedastic probit models. This function is contained in the glmx package. The glmx package is not available on CRAN yet, but thankfully can be downloaded here.

The hetglm() function has a number of computational advantages compared with the crude method outlined in my previous post. The following example replicates the previous analysis showing the speed advantage associated with using the hetglm() function.

rm(list=ls()) # clear ws
library(maxLik)
library(glmx)
n <- 1000 # no. obs
x1 <- runif(n,-1,1) # predictor 1
x2 <- runif(n,-1,1) # " 2
e1 <- rnorm(n,0,1) # normal error
e2 <- (1 + 0.45*(x1+x2))*e1 # hetero error
y <- ifelse(0.5 + 0.5*x1 -0.5*x2 - e2 >0, 1, 0) # outcome
# estimate normal probit
system.time(ml1 <- maxLik(hll,start=c(0,0,0,0,0))) # maximize
system.time(h1 <- hetglm(y ~ x1 + x2))

# output
> system.time(ml1 <- maxLik(hll,start=c(0,0,0,0,0))) # maximize
user  system elapsed
4.43    0.00    4.59
> system.time(h1 <- hetglm(y ~ x1 + x2))
user  system elapsed
0.11    0.00    0.11


# The Heteroskedastic Probit Model

Specification testing is an important part of econometric practice. However, from what I can see, few researchers perform heteroskedasticity tests after estimating probit/logit models. This is not a trivial point. Heteroskedasticity in these models can represent a major violation of the probit/logit specification, both of which assume homoskedastic errors.

Thankfully, tests for heteroskedasticity in these models exist, and it is also possible to estimate modified binary choice models that are robust to heteroskedastic errors. In this blog post I present an example of how to estimate a heteroskedastic probit in R, and also test for heteroskedasticity.

The standard probit model assumes that the error distribution of the latent model has a unit variance. The heteroskedastic probit model relaxes this assumption, and allows the error variance to depend on some of the predictors in the regression model. Those interested in further details of this model, and the potential implications of this form of model misspecification, should consult these notes.

In the code below, I simulate some data, specify the log-likelihood function for the heteroskedastic probit model, estimate this model via maximum likelihood, and then perform a simple LR test of homoskedasticity. Note the log-likelihood function can be simplified from:

$\ln L (\beta, \gamma | X_{i}, Z_{i}) = \sum^{N}_{i=1} \{ Y_{i} \ln \Phi [X_{i}\beta \exp(-Z_{i}\gamma)] + (1-Y_{i}) \ln [1-\Phi (X_{i}\beta \exp(-Z_{i}\gamma))] \}$

to:

$\ln L (\beta, \gamma | X_{i}, Z_{i}) = \sum^{N}_{i=1} \{ \ln \Phi [q_{i}(X_{i}\beta \exp(-Z_{i}\gamma))]\}$

where $q_{i}=2y_{i}-1$ uses the fact that the PDF of the normal distribution is symmetric to cut out some algebra (and possibly lead to an easier log-likelihood function to maximize). The log-liklihood function in the above may prove difficult to maximize, although there is an LM test for homoskedasticity, that only requires the restricted (standard probit model) to be estimated. An interesting application would be to see if Bayesian methods estimated using a Gibbs sampling (via STAN, Bugs, or JAGS) can help deal with computationally difficult log likelihoods in this regard. A topic, I will hopefully turn to in a future blog post.

# hetprob
rm(list=ls()) # clear ws
library(maxLik) # use max lik package

# dgp
n <- 1000 # no. obs
x1 <- runif(n,-1,1) # predictor 1
x2 <- runif(n,-1,1) # " 2
e1 <- rnorm(n,0,1) # normal error
e2 <- (1 + 0.45*(x1+x2))*e1 # hetero error
y <- ifelse(0.5 + 0.5*x1 -0.5*x2 - e2 >0, 1, 0) # outcome

# estimate normal probit

# hetprob llik
hll <- function(beta){
beta1 <- beta[1:dim(X)[2]]
beta2 <- beta[(1+dim(X)[2]):((dim(X)[2]+dim(X)[2])-1)]
q <- 2*y-1 # sym tranform
xbeta <- pnorm(q*((X %*% beta1)/exp(X[,-1] %*% beta2)))
sum(log(xbeta))
}

X <- cbind(1,x1,x2)
ml1 <- maxLik(hll,start=c(0,0,0,0,0)) # maximize
sqrt(diag(-solve(ml1$hessian))) # get standard errors # LR test of homosked # 2*(LR_{unrestriced}-LR_{restriced}) LRtest <- 2*(ml1$maximum+1/2*r1$deviance) # LS is chi^2 distributed with 2 dof 1-pchisq(LRtest,df=2) # REJECT (as expected)  # Probit Models with Endogeneity Dealing with endogeneity in a binary dependent variable model requires more consideration than the simpler continuous dependent variable case. For some, the best approach to this problem is to use the same methodology used in the continuous case, i.e. 2 stage least squares. Thus, the equation of interest becomes a linear probability model (LPM). The advantage of this approach is its simplicity; both in terms of estimation and interpretation (a 1-unit change in x causes a $\beta$ change in the probability of y). The disadvantage of this approach is that the LPM may imply probabilities outside the unit interval. It is typically for this reason that generalized linear models, like probit or logit, are used to model binary dependent variables in applied research, and an approach that extends the probit model to account for endogeneity was proposed by Rivers & Vuong (1988). The Rivers & Vuong estimator assumes the following usual triangular system: $\textbf{y}_{2i}=1(\textbf{X}_{i}\boldsymbol{\beta}+\textbf{y}_{1i}\alpha+\boldsymbol{\epsilon}_{i}>0)$ (1), $\textbf{y}_{1i}=\textbf{X}_{i}\boldsymbol{\gamma}+\textbf{Z}_{i}\boldsymbol{\theta}+\textbf{v}_{i})$ (2), wherein the jointly normally distributed error terms correlate, the control variables are in $\textbf{X}_{i}$ and the instrumental variables are in $\textbf{Z}_{i}$. A simple procedure that estimates this model is the following two-step approach. First regress $\textbf{y}_{1i}$ on $\textbf{X}_{i}$ and $\textbf{Z}_{i}$. Collect the residuals $\hat{\textbf{v}}_{i}$. The second step involves estimating the probit model of interest (2), and including the first stage residuals as an additional regressor. This method has also been termed the control function approach, as the inclusion of $\hat{\textbf{v}}_{i}$ controls for the correlation between $\textbf{v}_{i}$ and $\boldsymbol{\epsilon}_{i}$. The second step coefficient estimates are scaled versions of their real values for reasons outlined here. It is possible to get the un-scaled values using the appropriate transformation. However, for analysis it is often more useful to know the Average Structural Function (ASF), as discussed in Blundell & Powell, 2004). The ASF can be seen as the policy response measure as it illustrates the probability of the dependent variable being a one given the values of the regressors, in the absence of endogeneity. Therefore, with the ASF it is possible to trace out the effects of changes in $\textbf{y}_{1i}$ on the probability of $\textbf{y}_{2i}$. Additionally, it is possible to compute average partial/marginal effects with the ASF, as described here. Estimating the ASF involves taking the average of the normal CDF transformed predictions where the scaled coefficients are multiplied by all of the variables at some fixed value (commonly the mean) and all of the first-stage residuals are multiplied by the relevant scaled coefficient. This calculation is formalized on page 7 of this document. To simulate the effect of changes in the endogenous variable, the ASF can be recursively estimated with different values of $\textbf{y}_{1i}$. I have attached an example of how this calculation can be performed for a simple simulation in R. It would also be possible to construct confidence intervals for this ASF using bootstrapping methods. As is evident from the plot, the control function estimator correctly identifies the ASF.  rm(list=ls()) library(mvtnorm) n <- 10000 Sigma <- matrix(c(1,0.75,0.75,1),2,2) u <- rmvnorm(n,rep(0,2),Sigma) x1 <- rnorm(n) x2 <- rnorm(n) y1 <- 1.5 + 2*x1 - 2*x2 + u[,1] y2 <- ifelse(-0.25 - 1.25*x1 - 0.5*y1 + u[,2] > 0, 1, 0) #true asf eq1 <- function(x1,y1){pnorm(-0.25 - 1.25*x1 - 0.5*y1)} data <- data.frame(cbind(mean(x1),seq(ceiling(min(y1)),floor(max(y1)),0.2))) names(data) <- c("x1","y1") data$asf <- eq1(data$x1,data$y1)

# naive probit: biased regression
dat1 <- data.frame(cbind(mean(x1),seq(ceiling(min(y1)),floor(max(y1)),0.2)))
names(dat1) <- c("x1","y1")
asf1 <- cbind(dat1$y1,pnorm(predict(r1,dat1))) # 2 step control function approach v1 <- (residuals(lm(y1~x1+x2)))/sd(residuals(lm(y1~x1+x2))) r2 <- glm(y2~x1+y1+v1,binomial(link="probit")) # proceedure to get asf asf2 <- cbind(seq(ceiling(min(y1)),floor(max(y1)),0.2),NA) for(i in 1:dim(asf2)[1]){ dat2 <- data.frame(cbind(mean(x1),asf2[i,1],v1)) names(dat2) <- c("x1","y1","v1") asf2[i,2] <- mean(pnorm(predict(r2,dat2))) } # get respective asfs and plot plotdat <- data.frame(rbind(cbind(data$y1,data$asf,"TRUE ASF"), cbind(data$y1,asf1[,2],"PROBIT"),
cbind(data$y1,asf2[,2],"2 STEP PROBIT"))) names(plotdat) <- c("Y1","Y2","Estimator") plotdat$Y1 <- as.numeric(as.character(plotdat$Y1)) plotdat$Y2 <- as.numeric(as.character(plotdat$Y2)) library(ggplot2) ggplot(plotdat, aes(x=Y1, y=Y2, colour = Estimator, group=Estimator)) + geom_line(size=0.8) + geom_point()+ scale_x_continuous('Y1') + scale_y_continuous('P(Y2)') + theme_bw() + opts(title = expression("Average Structural Function Comparison"),legend.position=c(0.8,0.8))  # Optim, you’re doing it wrong? Call me uncouth, but I like my TV loud, my beer cold and my optimization functions as simple as possible. Therefore, what I write in this blog post is very much from a layman’s perspective, and I am happy to be corrected on any fundamental errors. I have recently become interested in writing my own maximum likelihood estimators. However, before examining any complex optimization problems, I wanted to perform a very simple analysis, and estimate a standard probit regression model. Obviously, this is a simple model, which can be estimated using the glm function. In the following, I am assuming that the glm coefficients and standard errors are the (most) correct ones (again, I am happy to be corrected on this). The data and model I use is one of female labor force participation taken from William H. Greene’s well-known econometrics textbook. lfp <- read.csv("lfp.csv") lfp$kids <- ifelse(lfp$k6==1 | lfp$k18==1, 1, 0)
formula=lfp ~ age+I(age^2)+log(inc)+ed+kids
data=lfp
aa <- lm(formula,data)
X <- model.matrix(aa)
y <- data.frame(lfp$lfp) pll <- function(beta){ sum(y*log(pnorm(X %*% beta)) + (1-y)*log(1-pnorm(X %*% beta))) }  The above code loads these data and specifies the basic probit log-likelihood (pll). The idea here is: given the design matrix X and outcome vector y, what beta vector maximizes the log-likelihood function pll? To perform this optimization problem, I use the following two functions: optim, which is part of the stats package, and maxLik, a function from the package of the same name. > system.time(ml1 <- optim(coef(aa)*2.5, pll, method="BFGS", + control=list(maxit=5000, fnscale=-1), hessian=T)) user system elapsed 2.59 0.00 2.66 > system.time(ml2 <- maxLik(pll,start=coef(aa)*2.5)) user system elapsed 4.36 0.00 4.46 > system.time(mod1 <- glm(formula,family=binomial(link="probit"),data)) user system elapsed 0.03 0.00 0.04 > > cbind(ml1$par,ml2$estimate,coef(mod1)) [,1] [,2] [,3] (Intercept) -3.3929077241 -5.326569347 -5.32655206 age -0.0565559741 0.128949421 0.12894921 I(age^2) 0.0004543657 -0.001675693 -0.00167569 log(inc) 0.4287758432 0.223211383 0.22320995 ed 0.0812164389 0.086559458 0.08655950 kids -0.3168540384 -0.323633523 -0.32363339 > cbind(sqrt(diag(-solve(ml1$hessian))),
+       sqrt(diag(-solve(ml2$hessian))), + sqrt(diag(vcov(mod1)))) [,1] [,2] [,3] (Intercept) 9.387307e-01 1.6182304697 1.5569666788 age 8.835891e-03 0.0659315872 0.0642575252 I(age^2) 8.978703e-05 0.0007601266 0.0007413629 log(inc) 9.897385e-02 0.0998697296 0.0999188711 ed 2.302763e-02 0.0230260173 0.0229592562 kids 1.020938e-01 0.1017291904 0.1015104540  In the above, I contrast the output obtained by the two optimization functions, alongside the results produced using the standard glm function. Firstly, we can see that optim is almost twice the speed of maxLik. However, the coefficients produced by the maxLik function (column 2) are much closer to the glm coefficients (column 3) than those obtained via optim (column 1). A comparison of the standard errors, taken from the Hessian matrix, show that the maxLik function once again ‘outperforms’ the optim function, which appears to yield completely incorrect standard errors. > pgr <- function(beta){ + t(y*(dnorm(X %*% beta)/pnorm(X %*% beta)) - + (1-y)*(dnorm(X %*% beta)/(1-pnorm(X %*% beta)))) %*% X + } > system.time(ml1 <- optim(coef(aa)*2.5, pll, pgr, method="BFGS", + control=list(maxit=5000, fnscale=-1), hessian=T)) user system elapsed 0.40 0.00 0.42 > system.time(ml2 <- maxLik(pll, pgr, start=coef(aa)*2.5)) user system elapsed 0.57 0.00 0.58 > cbind(ml1$par,ml2$estimate,coef(mod1)) [,1] [,2] [,3] (Intercept) -5.326570797 -5.32656034 -5.32655206 age 0.128949476 0.12894917 0.12894921 I(age^2) -0.001675693 -0.00167569 -0.00167569 log(inc) 0.223211464 0.22321101 0.22320995 ed 0.086559408 0.08655946 0.08655950 kids -0.323633519 -0.32363352 -0.32363339 > cbind(sqrt(diag(-solve(ml1$hessian))),
+       sqrt(diag(-solve(ml2$hessian))), + sqrt(diag(vcov(mod1)))) [,1] [,2] [,3] (Intercept) 0.50590822 1.5510444988 1.5569666788 age NaN 0.0640955648 0.0642575252 I(age^2) NaN 0.0007394648 0.0007413629 log(inc) 0.09833250 0.0989695518 0.0999188711 ed 0.02293623 0.0230289024 0.0229592562 kids 0.10153358 0.1017304094 0.1015104540 Warning message: In sqrt(diag(-solve(ml1$hessian))) : NaNs produced


The slow performance of both optim and maxLik is because I did not specify the gradient function (i.e. first derivatives). Without a gradient function both optimizers are forced to approximate. In the above, I add a gradient function pgr. This speeds up both optimization procedures considerably. Furthermore, the similarity of the estimated coefficients is evident. Nevertheless, problems are encountered in the calculation of the Hessian matrix in the optim case. Once again, fewer problems are encountered with the maxLik function.

Based on my experience thus far, I would recommend the use of maxLik compared to optim.

# Probit/Logit Marginal Effects in R

The common approach to estimating a binary dependent variable regression model is to use either the logit or probit model. Both are forms of generalized linear models (GLMs), which can be seen as modified linear regressions that allow the dependent variable to originate from non-normal distributions.

The coefficients in a linear regression model are marginal effects, meaning that they can be treated as partial derivatives. This makes the linear regression model very easy to interpret. For example, the fitted linear regression model y=x*b tells us that a one unit increase in x increases y by b units.

Unfortunately, this is not the case in GLMs, because fitted GLMs take the form, y=G (x*b), where G(.) is the known link function (i.e. inverse logistic for logit). Since we want to calculate the slope of x which is inside the function G(.), calculating marginal effects that are comparable to their linear model counterparts involves using the chain rule.

Empirical economic research typically cites the marginal effects since they are intuitive and easy to digest. Therefore, it is important for researchers to be able to compute these results. A more formal treatment of this problem can be found in the following paper, where we see that the solution is to multiply the estimated GLM coefficients by the probability density function of the linked distribution (which is the derivative of the cumulative density function). Interestingly, the linked paper also supplies some R code which calculates marginal effects for both the probit or logit models. In the code below, I demonstrate a similar function that calculates ‘the average of the sample marginal effects’.

mfxboot <- function(modform,dist,data,boot=1000,digits=3){
# get marginal effects
pdf <- ifelse(dist=="probit",
marginal.effects <- pdf*coef(x)
# start bootstrap
bootvals <- matrix(rep(NA,boot*length(coef(x))), nrow=boot)
set.seed(1111)
for(i in 1:boot){
samp1 <- data[sample(1:dim(data)[1],replace=T,dim(data)[1]),]
pdf1 <- ifelse(dist=="probit",
bootvals[i,] <- pdf1*coef(x1)
}
res <- cbind(marginal.effects,apply(bootvals,2,sd),marginal.effects/apply(bootvals,2,sd))
if(names(x$coefficients[1])=="(Intercept)"){ res1 <- res[2:nrow(res),] res2 <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),res1)),nrow=dim(res1)[1]) rownames(res2) <- rownames(res1) } else { res2 <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep="")),nrow=dim(res)[1])) rownames(res2) <- rownames(res) } colnames(res2) <- c("marginal.effect","standard.error","z.ratio") return(res2) }  This command also provides bootstrapped standard errors, which account for both the uncertainty in the predicted values and the estimated coefficients. The R package ‘erer’ also has a function that calculates these marginal effects. Illustrations of the above function, alongside code for a nice ggplot2 figure are displayed below. library(AER) data(SwissLabor) mfx1 <- mfxboot(participation ~ . + I(age^2),"probit",SwissLabor) mfx2 <- mfxboot(participation ~ . + I(age^2),"logit",SwissLabor) mfx3 <- mfxboot(participation ~ . + I(age^2),"probit",SwissLabor,boot=100,digits=4) mfxdat <- data.frame(cbind(rownames(mfx1),mfx1)) mfxdat$me <- as.numeric(as.character(mfxdat$marginal.effect)) mfxdat$se <- as.numeric(as.character(mfxdat\$standard.error))

# coefplot
library(ggplot2)
ggplot(mfxdat, aes(V1, marginal.effect,ymin = me - 2*se,ymax= me + 2*se)) +
scale_x_discrete('Variable') +
scale_y_continuous('Marginal Effect',limits=c(-0.5,1)) +
theme_bw() +
geom_errorbar(aes(x = V1, y = me),size=.3,width=.2) +
geom_point(aes(x = V1, y = me)) +
geom_hline(yintercept=0) +
coord_flip() +
opts(title="Marginal Effects with 95% Confidence Intervals")