Optim, you’re doing it wrong?

Call me uncouth, but I like my TV loud, my beer cold and my optimization functions as simple as possible. Therefore, what I write in this blog post is very much from a layman’s perspective, and I am happy to be corrected on any fundamental errors.

I have recently become interested in writing my own maximum likelihood estimators. However, before examining any complex optimization problems, I wanted to perform a very simple analysis, and estimate a standard probit regression model.

Obviously, this is a simple model, which can be estimated using the glm function. In the following, I am assuming that the glm coefficients and standard errors are the (most) correct ones (again, I am happy to be corrected on this). The data and model I use is one of female labor force participation taken from William H. Greene’s well-known econometrics textbook.

lfp <- read.csv("lfp.csv")
lfp$kids <- ifelse(lfp$k6==1 | lfp$k18==1, 1, 0)
formula=lfp ~ age+I(age^2)+log(inc)+ed+kids
aa <- lm(formula,data)
X <- model.matrix(aa)
y <- data.frame(lfp$lfp)
pll <- function(beta){
  sum(y*log(pnorm(X %*% beta)) + (1-y)*log(1-pnorm(X %*% beta)))

The above code loads these data and specifies the basic probit log-likelihood (pll). The idea here is: given the design matrix X and outcome vector y, what beta vector maximizes the log-likelihood function pll? To perform this optimization problem, I use the following two functions: optim, which is part of the stats package, and maxLik, a function from the package of the same name.

> system.time(ml1 <- optim(coef(aa)*2.5, pll, method="BFGS",
+              control=list(maxit=5000, fnscale=-1), hessian=T))
   user  system elapsed 
   2.59    0.00    2.66 
> system.time(ml2 <- maxLik(pll,start=coef(aa)*2.5))
   user  system elapsed 
   4.36    0.00    4.46 
> system.time(mod1 <- glm(formula,family=binomial(link="probit"),data))
   user  system elapsed 
   0.03    0.00    0.04 
> cbind(ml1$par,ml2$estimate,coef(mod1))
                     [,1]         [,2]        [,3]
(Intercept) -3.3929077241 -5.326569347 -5.32655206
age         -0.0565559741  0.128949421  0.12894921
I(age^2)     0.0004543657 -0.001675693 -0.00167569
log(inc)     0.4287758432  0.223211383  0.22320995
ed           0.0812164389  0.086559458  0.08655950
kids        -0.3168540384 -0.323633523 -0.32363339
> cbind(sqrt(diag(-solve(ml1$hessian))),
+       sqrt(diag(-solve(ml2$hessian))),
+       sqrt(diag(vcov(mod1))))
                    [,1]         [,2]         [,3]
(Intercept) 9.387307e-01 1.6182304697 1.5569666788
age         8.835891e-03 0.0659315872 0.0642575252
I(age^2)    8.978703e-05 0.0007601266 0.0007413629
log(inc)    9.897385e-02 0.0998697296 0.0999188711
ed          2.302763e-02 0.0230260173 0.0229592562
kids        1.020938e-01 0.1017291904 0.1015104540

In the above, I contrast the output obtained by the two optimization functions, alongside the results produced using the standard glm function. Firstly, we can see that optim is almost twice the speed of maxLik. However, the coefficients produced by the maxLik function (column 2) are much closer to the glm coefficients (column 3) than those obtained via optim (column 1). A comparison of the standard errors, taken from the Hessian matrix, show that the maxLik function once again ‘outperforms’ the optim function, which appears to yield completely incorrect standard errors.

> pgr <- function(beta){
+   t(y*(dnorm(X %*% beta)/pnorm(X %*% beta)) - 
+     (1-y)*(dnorm(X %*% beta)/(1-pnorm(X %*% beta)))) %*% X
+ }
> system.time(ml1 <- optim(coef(aa)*2.5, pll, pgr, method="BFGS",
+                          control=list(maxit=5000, fnscale=-1), hessian=T))
   user  system elapsed 
   0.40    0.00    0.42 
> system.time(ml2 <- maxLik(pll, pgr, start=coef(aa)*2.5))
   user  system elapsed 
   0.57    0.00    0.58 
> cbind(ml1$par,ml2$estimate,coef(mod1))
                    [,1]        [,2]        [,3]
(Intercept) -5.326570797 -5.32656034 -5.32655206
age          0.128949476  0.12894917  0.12894921
I(age^2)    -0.001675693 -0.00167569 -0.00167569
log(inc)     0.223211464  0.22321101  0.22320995
ed           0.086559408  0.08655946  0.08655950
kids        -0.323633519 -0.32363352 -0.32363339
> cbind(sqrt(diag(-solve(ml1$hessian))),
+       sqrt(diag(-solve(ml2$hessian))),
+       sqrt(diag(vcov(mod1))))
                  [,1]         [,2]         [,3]
(Intercept) 0.50590822 1.5510444988 1.5569666788
age                NaN 0.0640955648 0.0642575252
I(age^2)           NaN 0.0007394648 0.0007413629
log(inc)    0.09833250 0.0989695518 0.0999188711
ed          0.02293623 0.0230289024 0.0229592562
kids        0.10153358 0.1017304094 0.1015104540
Warning message:
In sqrt(diag(-solve(ml1$hessian))) : NaNs produced

The slow performance of both optim and maxLik is because I did not specify the gradient function (i.e. first derivatives). Without a gradient function both optimizers are forced to approximate. In the above, I add a gradient function pgr. This speeds up both optimization procedures considerably. Furthermore, the similarity of the estimated coefficients is evident. Nevertheless, problems are encountered in the calculation of the Hessian matrix in the optim case. Once again, fewer problems are encountered with the maxLik function.

Based on my experience thus far, I would recommend the use of maxLik compared to optim.


19 thoughts on “Optim, you’re doing it wrong?

    • but mle (which is in the stats4 package) and mle2/bbmle are just wrappers around optim (although mle2 also allows you to use optimizers from the optimx package). I could also provide a wrapper for the optimizers in maxLik if there were sufficient demand … could you post the data somewhere?

  1. Thanks for this, never used maxLik before…

    BTW, just be sure to check the relative tolerance used by optim. Making this smaller might increase the accuracy of the standard errors… It will, however, make it run slower.

    • Thanks Stefan, increasing accuracy of the tolerance may be a potential solution. However, the intention of my post was to show that the optim does not perform well under the default settings. I thought optim would be able to handle a simple problem like this one quite easily with the default settings. Looks like I was wrong though!

  2. Basing the comparison between two (or more) optimization solvers on one example is actually called bad habit. Someone else will come up with another example where maxLik is slow or incorrect and optim is as accurate as you ever wished. Optimization as a general task is not that easy, unfortunately, even if your test function might look easy on first view.
    You could, of course, send this example to the maintainer of a solver function and ask for help or explanation. But, please, refrain from these kinds of statements.

  3. I agree that optim has some issues; sometimes it fails to find max/min or finds a wrong one. In this case, if you look at the Hessian, you’ll see the age and age^2 have very tiny numbers, that may cause some numeric issues. If you rescale age by dividing it by 10, then optim works fine, returning the same results as other procedures.

  4. Of course optim isn’t performing as well. It’s not a gauss-newton solver. Why use a quasi newton solver if you can stably calculate second derivatives without too much trouble (which is what glm does, and what maxLik does by default). I’ve often wished that optim knew what to do with second derivatives. Even with the Hessian=TRUE option, it doesn’t explicitly calculate the hessian until the end. optim builds up the Hessian using only the first derivative, and this isn’t always accurate or stable. But in very large, non-likelihood problems, the second derivative isn’t always feasible, and optim is then the only choice.

    You can actually see where optim is having trouble… it’s in the strong correlation between Age and Age^2. Being true gauss-newton solvers, maxLik and glm are better able to figure out this correlation. Even in the first case, without the derivatives, optim does pretty good with the ed and kids coefficients and their standard errors. If you reparameterized the model, I bet optim would do better, despite not using hessian information.

  5. To put at least one of the responses in context, Hans Werner and I received a note from Rob Hyndman about this blog post right after I’d given a seminar on R’s optimization and nonlinear estimation tools at KU Leuven. I’d mentioned the serious issue that giving novice optimization folk the tools we do is akin to letting someone with L plates (student driver) participate in last weekend’s F1 race at Monte Carlo. They are powerful tools, but rarely do what you want without some care, like the scaling suggested in one of the comments.

    As the author of 3 of the codes Brian Ripley incorporated some time ago into optim(), I’m painfully aware of this, and have been striving with some other folk to produce wrappers that will safeguard and guide users. It’s a big, awkward task.

    Be aware that any comment like “You should use Y because X doesn’t work well” is unscientific and dangerous unless there is a serious level of analysis (about 2-3 Ph D theses of work in my experience) to back it up. To that end, I’ll recommend you NOT use CG in optim(), one of my least successful efforts in the late 1970s. Rcgmin is much better, but the reasons are too long to list here.

    The R tools we have are pretty good, but they are in the form of a naked circular saw blade. You need the table and the guards and the work guides to render them useful. Mostly these are still lacking.

    A side issue is that of Gauss-Newton solvers. They work extremely well when they work, but are prone to fail miserably when the unstated assumptions behind them are not fulfilled. Thus nls() will be very fast when you give it a reasonable start, and has lots of really useful tools, but it won’t give the answer to an interpolation problem that is well-posed, because it is explicitly NOT intended for small-residual problems. For that, a Marquardt code is better, but will be much “slower” due to the stabilizations inherent in the code. (See the nlmrt experimental package on R-forge.r-project.org in the optimizer project).

    John Nash

    • Great comment. Thanks for contributing John.

      The point of this blogpost was not: “You should use Y because X doesn’t work well”. My point was: “For novices, X does not always work well off the default settings, and novices should not assume so, even though convergence has been achieved.”

  6. being a newbie myself, I appreciate learning from your scripts, but would prefer not having to figure out what columns in the dataset your variable names correspond to. 🙂

    • Hi Vincent,

      Thanks for the comment. I am on holidays at the moment, but I will update the script with some comments when I get back. Hope you still found it useful 🙂

  7. May I suggest that you learn how to use nonlinear solvers. There is much more than just writing a function and giving it to the solver, though this is the (mis)impression given by every econometrics/statistical text. Why do you rely on the default options? To do so is a recipe for disaster, as shown repeatedly in the literature. May I recommend:

    B. D. McCullough and Charles G. Renfro
    “Some Numerical Aspects of Nonlinear Estimation,”
    Journal of Economic and Social Measurement, 26(1), 63-77, 2000

    B. D. McCullough
    “Some Details of Nonlinear Estimation,” Chapter Eight in
    Numerical Methods in Statistical Computing for the Social Sciences,
    Micah Altman, Jeff Gill and Michael P. McDonald, editors
    New York: Wiley, 2004

    B. D. McCullough and H. D. Vinod
    “Verifying the Solution from a Nonlinear Solver: A Case Study,”
    American Economic Review 93(3), 873-892, 2003
    with comments and replies at American Economic Review 94(1), 2004

    with an important correction at

    B. D. McCullough and H. D. Vinod
    “Verifying the Solution from a Nonlinear Solver: Reply,”
    American Economic Review 94(1), 400-403, 2004 [reply to Drukker and Wiggins]

  8. Dear Contributor,

    we found your comment about optim and maxLik very interesting, because it was reproducing some of the problems we had. We tried to rerun your example in order to understand what was going on. We could reproduce exactly your results. Then, we had the idea of rescaling the age data, dividing it by 10. This is a usual practice, essential when doing numerical integration in a Bayesian framework. Having scaled the age variable, optim, maxLik and glm give the same results up to the third decimal and the NaNs desappear.
    What we can conclude from this example is first that optim is more sentive to numerical problems than maxLik and that the problem of scaling is located when computing a numerical gradient. A second conclusion is that scaling which is essential in numerical integration can be also of importance for numerical optimisation.

    Best regards
    Michel Lubrano and Zhou Xun

    • what actually we found is that the default numeric gradient measured in optim is problematic. If we provide our own symmetric numeric gradient then optim would return the same thing as maxLik in both the example shown here and in our more complex case (dynamic two tiered tobit model with serial correlation).

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s