Optim, you’re doing it wrong?

Call me uncouth, but I like my TV loud, my beer cold and my optimization functions as simple as possible. Therefore, what I write in this blog post is very much from a layman’s perspective, and I am happy to be corrected on any fundamental errors.

I have recently become interested in writing my own maximum likelihood estimators. However, before examining any complex optimization problems, I wanted to perform a very simple analysis, and estimate a standard probit regression model.

Obviously, this is a simple model that can be estimated with the glm function. In what follows, I assume that the glm coefficients and standard errors are the (most) correct ones (again, I am happy to be corrected on this). The data and model concern female labor force participation and are taken from William H. Greene's well-known econometrics textbook.

library(maxLik)  # provides the maxLik() optimizer used below

lfp <- read.csv("lfp.csv")
lfp$kids <- ifelse(lfp$k6==1 | lfp$k18==1, 1, 0)  # children indicator
formula <- lfp ~ age + I(age^2) + log(inc) + ed + kids
data <- lfp

# lm() is used only to build the design matrix and to supply starting values
aa <- lm(formula, data)
X <- model.matrix(aa)
y <- lfp$lfp

# probit log-likelihood
pll <- function(beta){
  sum(y*log(pnorm(X %*% beta)) + (1-y)*log(1-pnorm(X %*% beta)))
}

The above code loads the data and specifies the basic probit log-likelihood (pll). The idea is: given the design matrix X and outcome vector y, which beta vector maximizes the log-likelihood function pll? To solve this optimization problem, I use two functions: optim, which is part of the stats package, and maxLik, a function from the package of the same name.
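Before comparing the optimizers, it is worth a quick sanity check that pll is coded correctly. A minimal sketch, reusing the same probit glm call that appears below: the log-likelihood evaluated at the glm estimates should match the value glm itself reports.

# Sanity check: pll at the glm probit estimates should (almost) equal
# the log-likelihood reported by glm
chk <- glm(formula, family=binomial(link="probit"), data)
pll(coef(chk))
logLik(chk)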

> system.time(ml1 <- optim(coef(aa)*2.5, pll, method="BFGS",
+              control=list(maxit=5000, fnscale=-1), hessian=T))
   user  system elapsed 
   2.59    0.00    2.66 
> system.time(ml2 <- maxLik(pll,start=coef(aa)*2.5))
   user  system elapsed 
   4.36    0.00    4.46 
> system.time(mod1 <- glm(formula,family=binomial(link="probit"),data))
   user  system elapsed 
   0.03    0.00    0.04 
> 
> cbind(ml1$par,ml2$estimate,coef(mod1))
                     [,1]         [,2]        [,3]
(Intercept) -3.3929077241 -5.326569347 -5.32655206
age         -0.0565559741  0.128949421  0.12894921
I(age^2)     0.0004543657 -0.001675693 -0.00167569
log(inc)     0.4287758432  0.223211383  0.22320995
ed           0.0812164389  0.086559458  0.08655950
kids        -0.3168540384 -0.323633523 -0.32363339
> cbind(sqrt(diag(-solve(ml1$hessian))),
+       sqrt(diag(-solve(ml2$hessian))),
+       sqrt(diag(vcov(mod1))))
                    [,1]         [,2]         [,3]
(Intercept) 9.387307e-01 1.6182304697 1.5569666788
age         8.835891e-03 0.0659315872 0.0642575252
I(age^2)    8.978703e-05 0.0007601266 0.0007413629
log(inc)    9.897385e-02 0.0998697296 0.0999188711
ed          2.302763e-02 0.0230260173 0.0229592562
kids        1.020938e-01 0.1017291904 0.1015104540

In the above, I contrast the output from the two optimization functions with the results produced by the standard glm function. Firstly, we can see that optim is almost twice as fast as maxLik. However, the coefficients produced by maxLik (column 2) are much closer to the glm coefficients (column 3) than those obtained via optim (column 1). A comparison of the standard errors, taken from the inverse of the negative Hessian matrix, shows that maxLik once again 'outperforms' optim, which appears to yield completely incorrect standard errors.
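Why does optim stop so far from the glm solution? One way to probe this (a rough sketch, assuming the numDeriv package is installed) is to check optim's convergence code and evaluate the gradient of pll at each stopping point; at a genuine maximum the gradient should be essentially zero in every coordinate.

# optim's convergence flag: 0 means its own stopping criterion was met,
# which is not the same as having reached the maximum likelihood estimate
ml1$convergence

# numerically approximate the gradient of pll at each solution
library(numDeriv)
grad(pll, ml1$par)       # optim (BFGS, numerical derivatives)
grad(pll, ml2$estimate)  # maxLik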

> pgr <- function(beta){
+   t(y*(dnorm(X %*% beta)/pnorm(X %*% beta)) - 
+     (1-y)*(dnorm(X %*% beta)/(1-pnorm(X %*% beta)))) %*% X
+ }
> system.time(ml1 <- optim(coef(aa)*2.5, pll, pgr, method="BFGS",
+                          control=list(maxit=5000, fnscale=-1), hessian=T))
   user  system elapsed 
   0.40    0.00    0.42 
> system.time(ml2 <- maxLik(pll, pgr, start=coef(aa)*2.5))
   user  system elapsed 
   0.57    0.00    0.58 
> cbind(ml1$par,ml2$estimate,coef(mod1))
                    [,1]        [,2]        [,3]
(Intercept) -5.326570797 -5.32656034 -5.32655206
age          0.128949476  0.12894917  0.12894921
I(age^2)    -0.001675693 -0.00167569 -0.00167569
log(inc)     0.223211464  0.22321101  0.22320995
ed           0.086559408  0.08655946  0.08655950
kids        -0.323633519 -0.32363352 -0.32363339
> cbind(sqrt(diag(-solve(ml1$hessian))),
+       sqrt(diag(-solve(ml2$hessian))),
+       sqrt(diag(vcov(mod1))))
                  [,1]         [,2]         [,3]
(Intercept) 0.50590822 1.5510444988 1.5569666788
age                NaN 0.0640955648 0.0642575252
I(age^2)           NaN 0.0007394648 0.0007413629
log(inc)    0.09833250 0.0989695518 0.0999188711
ed          0.02293623 0.0230289024 0.0229592562
kids        0.10153358 0.1017304094 0.1015104540
Warning message:
In sqrt(diag(-solve(ml1$hessian))) : NaNs produced

The slow performance of both optim and maxLik above was because I did not supply a gradient function (i.e. the first derivatives), so both optimizers had to approximate it numerically. In the code above, I add the analytical gradient pgr. This speeds up both optimization procedures considerably, and the estimated coefficients are now almost identical across all three methods. Nevertheless, problems remain in the calculation of the Hessian matrix in the optim case, where the standard errors for two parameters come back as NaN. Once again, fewer problems are encountered with maxLik.
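If the Hessian returned by optim misbehaves, one workaround (again a sketch that assumes the numDeriv package) is to differentiate pll numerically at the converged parameters and invert that matrix instead of the one optim reports.

# Recompute standard errors from a numerically differentiated Hessian of pll,
# evaluated at the parameter vector optim converged to
library(numDeriv)
H <- hessian(pll, ml1$par)
sqrt(diag(-solve(H)))   # compare with sqrt(diag(vcov(mod1)))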

Based on my experience thus far, I would recommend maxLik over optim.


Time-Series Policy Evaluation in R

Quantifying the success of government policies is clearly important. Randomized controlled trials, like those conducted by drug companies, are often described as the 'gold standard' for policy evaluation. Under these, a policy is applied to one group or area (the treatment) but not to another (the control). The difference in outcomes between the two groups or areas measures the effectiveness of the policy.

Unfortunately, there are many circumstances in which governments introduce policies but cannot measure how well, if at all, they work. A good example is childhood vaccination against disease, since it is unethical to deny some children a potentially life-saving treatment. This makes evaluating such policies difficult, because there is no counterfactual: everyone gets the treatment (in this case, vaccination).

However, it is possible to evaluate the effectiveness of certain policies with time-series econometrics. For this example, I am going to use a series of monthly road death statistics for Ireland from 2003 to the present. In November 2011, a new drink-driving law was introduced which lowered the blood-alcohol limits. How has the introduction of this law changed road fatalities?

To evaluate this policy, I estimated a simple state space model using the R package sspir. The code I used for this example is almost exactly like the van drivers example in this paper. The following generalized linear model is assumed: y_{t} ~ Po(mu_{t}) with log(mu_{t}) = a_{t} + b LAW_{t} + S_{t}, where the monthly death count y_{t} is Poisson distributed with a mean that depends on a random-walk level a_{t}, time-varying seasonal components S_{t}, and a dummy variable LAW_{t} indicating when the new law is in force. The parameter b measures the policy's effect.

The usefulness of this model for policy evaluation, compared with an OLS linear regression for example, stems from the time-varying intercept a_{t}. In effect, this component substitutes for the counterfactual or control group that we would need in a randomized controlled trial, because it accounts for the underlying trend in the time series. The law dummy then shifts this underlying trend, and the size of the shift is the relevant treatment effect.

A full overview of the state space system, and estimation using the Kalman filter is beyond the scope of this blog post. Interested readers should consult this paper and the references therein. The code to generate these results is below.

> head(road)
  deaths drinkdrive ind
1     20          0   1
2     21          0   2
3     33          0   3
4     23          0   4
5     28          0   5
6     37          0   6

library(sspir)  # provides ssm(), getFit(), etc. for state space models

# log-likelihood of the state space model as a function of the
# (log) signal-to-noise parameters psi
mle <- function(psi) {
  m1 <- ssm(deaths ~ tvar(1) + drinkdrive + tvar(sumseason(time,12)),
            family=poisson(link="log"), data=road,
            phi=c(1, exp(psi)), C0=diag(13)*1000)
  m1fit <- getFit(m1)
  return(extended(m1fit)$likelihood)
}

# maximize the likelihood over psi
res1 <- optim(par=c(0,0), fn=mle, method="L-BFGS-B",
              control=list(fnscale=-1), hessian=FALSE)

The first part of the code wraps the state space system in a function so that the signal-to-noise ratio, which I call psi, can be estimated by maximum likelihood. The signal-to-noise ratio dictates the 'wiggliness' of the random walk a_{t}. Too much signal and a_{t} will match the series exactly; too little and the random walk will collapse to a constant term, as in OLS. A similar argument applies to the time-varying seasonal pattern.
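To see this trade-off in action, the same model can be refitted with psi held fixed at a few values and the smoothed level compared. A minimal sketch reusing the calls above, where fit_level is a small helper defined only for this comparison and the particular psi values are arbitrary illustrations:

# Refit at a fixed signal-to-noise ratio and pull out the smoothed
# random-walk level a_t (the psi values below are arbitrary)
fit_level <- function(psi) {
  m <- ssm(deaths ~ tvar(1) + drinkdrive + tvar(sumseason(time,12)),
           family=poisson(link="log"), data=road,
           phi=c(1, exp(psi)), C0=diag(13)*1000)
  getFit(m)$m[,1]
}
plot(fit_level(c(2,0)), type="l", ylab="a_t")  # high signal: wiggly level
lines(fit_level(c(-8,0)), lty=2)               # low signal: nearly constant level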

After calculating the signal-to-noise ratio, I estimated the model at the maximum likelihood parameters. The function ssm automatically runs the iterated Kalman smoother, which provides the smoothed state estimates and their covariances used below.

# estimate with mle pars
m2 <- ssm(deaths ~ tvar(1) + drinkdrive + tvar(sumseason(time,12)),
          family=poisson(link="log"), data=road,
          phi=c(1,exp(res1$par)),C0=diag(13)*1000)
m2fit <- getFit(m2)
          
100 * (exp(m2fit$m[1,2])-1) # effect of drink driving law

# calculate the standard deviation of the fitted log-level at each month:
# before the law the fitted curve is exp(a_t), afterwards exp(a_t + b), so the
# relevant variance is var(a_t) or var(a_t) + var(b) + 2*cov(a_t, b)
m2sd <- NULL
for (i in 1:length(road$deaths)) {
  thisone <- m2fit$C[[i]][1:2,1:2]
  if (road$drinkdrive[i]==0) {
    m2sd <- c(m2sd, sqrt(thisone[1,1]))
  } else {
    m2sd <- c(m2sd, sqrt(sum(thisone)))
  }
}

# plot the observed series against the smoothed fit
road$ind <- 1:112
plot(road$ind, road$deaths, ylim=c(0,50), col="red", type="l", axes=FALSE,
     xlab="Year", ylab="Death Count", main="Monthly Road Deaths, Ireland")
lines(exp(road$drinkdrive*m2fit$m[,2] + m2fit$m[,1]), col="blue")
lines(exp(road$drinkdrive*m2fit$m[,2] + m2fit$m[,1] + 2*m2sd), lty=2, col="blue")
lines(exp(road$drinkdrive*m2fit$m[,2] + m2fit$m[,1] - 2*m2sd), lty=2, col="blue")
abline(v=107)  # month in which the new law came into force
axis(1, at=c(1,25,49,73,97), lab=c("2003","2005","2007","2009","2011"))
axis(2)
box()

The following plot shows the underlying series (red), the policy-relevant effect (the shift in the blue line; the black vertical line marks the policy change), and the uncertainty around the fit (dashed blue lines). Perhaps it is too early to suggest this policy was ineffective. However, as is evident in the graph, the policy change did not result in fewer road deaths. In fact, the point estimate implies a 5.6% increase in road deaths, although a large amount of uncertainty surrounds this estimate, as the approximate 95% confidence intervals show.
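For completeness, a rough interval for that percentage figure can be read off the smoothed estimates. A sketch that, mirroring the point estimate taken from m2fit$m[1,2] above, treats the [2,2] element of m2fit$C[[1]] as the smoothed variance of the law coefficient:

# approximate 95% interval for the policy effect on the percentage scale
b    <- m2fit$m[1,2]              # smoothed drink-driving coefficient
b_se <- sqrt(m2fit$C[[1]][2,2])   # its smoothed standard error
100 * (exp(b + c(-2, 2) * b_se) - 1)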

While I use this as an example, I should be clear that I do not condone driving under the influence of alcohol, nor do I think this was a bad policy. These data may not be a perfect measure of the policy's effectiveness. Perhaps there are other time series, like head-on collisions, for which the policy did make a discernible difference. Additionally, there may be earlier policy changes in the series of which I am unaware.