Call me uncouth, but I like my TV loud, my beer cold and my optimization functions as simple as possible. Therefore, what I write in this blog post is very much from a layman’s perspective, and I am happy to be corrected on any fundamental errors.

I have recently become interested in writing my own maximum likelihood estimators. However, before examining any complex optimization problems, I wanted to perform a very simple analysis, and estimate a standard probit regression model.

Obviously, this is a simple model, which can be estimated using the glm function. In the following, I am assuming that the glm coefficients and standard errors are the (most) correct ones (again, I am happy to be corrected on this). The data and model I use is one of female labor force participation taken from William H. Greene’s well-known econometrics textbook.

lfp <- read.csv("lfp.csv") lfp$kids <- ifelse(lfp$k6==1 | lfp$k18==1, 1, 0) formula=lfp ~ age+I(age^2)+log(inc)+ed+kids data=lfp aa <- lm(formula,data) X <- model.matrix(aa) y <- data.frame(lfp$lfp) pll <- function(beta){ sum(y*log(pnorm(X %*% beta)) + (1-y)*log(1-pnorm(X %*% beta))) }

The above code loads these data and specifies the basic probit log-likelihood (pll). The idea here is: given the design matrix X and outcome vector y, what beta vector maximizes the log-likelihood function pll? To perform this optimization problem, I use the following two functions: optim, which is part of the stats package, and maxLik, a function from the package of the same name.

> system.time(ml1 <- optim(coef(aa)*2.5, pll, method="BFGS", + control=list(maxit=5000, fnscale=-1), hessian=T)) user system elapsed 2.59 0.00 2.66 > system.time(ml2 <- maxLik(pll,start=coef(aa)*2.5)) user system elapsed 4.36 0.00 4.46 > system.time(mod1 <- glm(formula,family=binomial(link="probit"),data)) user system elapsed 0.03 0.00 0.04 > > cbind(ml1$par,ml2$estimate,coef(mod1)) [,1] [,2] [,3] (Intercept) -3.3929077241 -5.326569347 -5.32655206 age -0.0565559741 0.128949421 0.12894921 I(age^2) 0.0004543657 -0.001675693 -0.00167569 log(inc) 0.4287758432 0.223211383 0.22320995 ed 0.0812164389 0.086559458 0.08655950 kids -0.3168540384 -0.323633523 -0.32363339 > cbind(sqrt(diag(-solve(ml1$hessian))), + sqrt(diag(-solve(ml2$hessian))), + sqrt(diag(vcov(mod1)))) [,1] [,2] [,3] (Intercept) 9.387307e-01 1.6182304697 1.5569666788 age 8.835891e-03 0.0659315872 0.0642575252 I(age^2) 8.978703e-05 0.0007601266 0.0007413629 log(inc) 9.897385e-02 0.0998697296 0.0999188711 ed 2.302763e-02 0.0230260173 0.0229592562 kids 1.020938e-01 0.1017291904 0.1015104540

In the above, I contrast the output obtained by the two optimization functions, alongside the results produced using the standard glm function. Firstly, we can see that optim is almost twice the speed of maxLik. However, the coefficients produced by the maxLik function (column 2) are much closer to the glm coefficients (column 3) than those obtained via optim (column 1). A comparison of the standard errors, taken from the Hessian matrix, show that the maxLik function once again ‘outperforms’ the optim function, which appears to yield completely incorrect standard errors.

> pgr <- function(beta){ + t(y*(dnorm(X %*% beta)/pnorm(X %*% beta)) - + (1-y)*(dnorm(X %*% beta)/(1-pnorm(X %*% beta)))) %*% X + } > system.time(ml1 <- optim(coef(aa)*2.5, pll, pgr, method="BFGS", + control=list(maxit=5000, fnscale=-1), hessian=T)) user system elapsed 0.40 0.00 0.42 > system.time(ml2 <- maxLik(pll, pgr, start=coef(aa)*2.5)) user system elapsed 0.57 0.00 0.58 > cbind(ml1$par,ml2$estimate,coef(mod1)) [,1] [,2] [,3] (Intercept) -5.326570797 -5.32656034 -5.32655206 age 0.128949476 0.12894917 0.12894921 I(age^2) -0.001675693 -0.00167569 -0.00167569 log(inc) 0.223211464 0.22321101 0.22320995 ed 0.086559408 0.08655946 0.08655950 kids -0.323633519 -0.32363352 -0.32363339 > cbind(sqrt(diag(-solve(ml1$hessian))), + sqrt(diag(-solve(ml2$hessian))), + sqrt(diag(vcov(mod1)))) [,1] [,2] [,3] (Intercept) 0.50590822 1.5510444988 1.5569666788 age NaN 0.0640955648 0.0642575252 I(age^2) NaN 0.0007394648 0.0007413629 log(inc) 0.09833250 0.0989695518 0.0999188711 ed 0.02293623 0.0230289024 0.0229592562 kids 0.10153358 0.1017304094 0.1015104540 Warning message: In sqrt(diag(-solve(ml1$hessian))) : NaNs produced

The slow performance of both optim and maxLik is because I did not specify the gradient function (i.e. first derivatives). Without a gradient function both optimizers are forced to approximate. In the above, I add a gradient function pgr. This speeds up both optimization procedures considerably. Furthermore, the similarity of the estimated coefficients is evident. Nevertheless, problems are encountered in the calculation of the Hessian matrix in the optim case. Once again, fewer problems are encountered with the maxLik function.

Based on my experience thus far, I would recommend the use of maxLik compared to optim.