Optim, you’re doing it wrong?

Call me uncouth, but I like my TV loud, my beer cold and my optimization functions as simple as possible. Therefore, what I write in this blog post is very much from a layman’s perspective, and I am happy to be corrected on any fundamental errors.

I have recently become interested in writing my own maximum likelihood estimators. However, before examining any complex optimization problems, I wanted to perform a very simple analysis, and estimate a standard probit regression model.

Obviously, this is a simple model, which can be estimated using the glm function. In the following, I am assuming that the glm coefficients and standard errors are the (most) correct ones (again, I am happy to be corrected on this). The data and model I use is one of female labor force participation taken from William H. Greene’s well-known econometrics textbook.

lfp <- read.csv("lfp.csv")
lfp$kids <- ifelse(lfp$k6==1 | lfp$k18==1, 1, 0)
formula=lfp ~ age+I(age^2)+log(inc)+ed+kids
aa <- lm(formula,data)
X <- model.matrix(aa)
y <- data.frame(lfp$lfp)
pll <- function(beta){
  sum(y*log(pnorm(X %*% beta)) + (1-y)*log(1-pnorm(X %*% beta)))

The above code loads these data and specifies the basic probit log-likelihood (pll). The idea here is: given the design matrix X and outcome vector y, what beta vector maximizes the log-likelihood function pll? To perform this optimization problem, I use the following two functions: optim, which is part of the stats package, and maxLik, a function from the package of the same name.

> system.time(ml1 <- optim(coef(aa)*2.5, pll, method="BFGS",
+              control=list(maxit=5000, fnscale=-1), hessian=T))
   user  system elapsed 
   2.59    0.00    2.66 
> system.time(ml2 <- maxLik(pll,start=coef(aa)*2.5))
   user  system elapsed 
   4.36    0.00    4.46 
> system.time(mod1 <- glm(formula,family=binomial(link="probit"),data))
   user  system elapsed 
   0.03    0.00    0.04 
> cbind(ml1$par,ml2$estimate,coef(mod1))
                     [,1]         [,2]        [,3]
(Intercept) -3.3929077241 -5.326569347 -5.32655206
age         -0.0565559741  0.128949421  0.12894921
I(age^2)     0.0004543657 -0.001675693 -0.00167569
log(inc)     0.4287758432  0.223211383  0.22320995
ed           0.0812164389  0.086559458  0.08655950
kids        -0.3168540384 -0.323633523 -0.32363339
> cbind(sqrt(diag(-solve(ml1$hessian))),
+       sqrt(diag(-solve(ml2$hessian))),
+       sqrt(diag(vcov(mod1))))
                    [,1]         [,2]         [,3]
(Intercept) 9.387307e-01 1.6182304697 1.5569666788
age         8.835891e-03 0.0659315872 0.0642575252
I(age^2)    8.978703e-05 0.0007601266 0.0007413629
log(inc)    9.897385e-02 0.0998697296 0.0999188711
ed          2.302763e-02 0.0230260173 0.0229592562
kids        1.020938e-01 0.1017291904 0.1015104540

In the above, I contrast the output obtained by the two optimization functions, alongside the results produced using the standard glm function. Firstly, we can see that optim is almost twice the speed of maxLik. However, the coefficients produced by the maxLik function (column 2) are much closer to the glm coefficients (column 3) than those obtained via optim (column 1). A comparison of the standard errors, taken from the Hessian matrix, show that the maxLik function once again ‘outperforms’ the optim function, which appears to yield completely incorrect standard errors.

> pgr <- function(beta){
+   t(y*(dnorm(X %*% beta)/pnorm(X %*% beta)) - 
+     (1-y)*(dnorm(X %*% beta)/(1-pnorm(X %*% beta)))) %*% X
+ }
> system.time(ml1 <- optim(coef(aa)*2.5, pll, pgr, method="BFGS",
+                          control=list(maxit=5000, fnscale=-1), hessian=T))
   user  system elapsed 
   0.40    0.00    0.42 
> system.time(ml2 <- maxLik(pll, pgr, start=coef(aa)*2.5))
   user  system elapsed 
   0.57    0.00    0.58 
> cbind(ml1$par,ml2$estimate,coef(mod1))
                    [,1]        [,2]        [,3]
(Intercept) -5.326570797 -5.32656034 -5.32655206
age          0.128949476  0.12894917  0.12894921
I(age^2)    -0.001675693 -0.00167569 -0.00167569
log(inc)     0.223211464  0.22321101  0.22320995
ed           0.086559408  0.08655946  0.08655950
kids        -0.323633519 -0.32363352 -0.32363339
> cbind(sqrt(diag(-solve(ml1$hessian))),
+       sqrt(diag(-solve(ml2$hessian))),
+       sqrt(diag(vcov(mod1))))
                  [,1]         [,2]         [,3]
(Intercept) 0.50590822 1.5510444988 1.5569666788
age                NaN 0.0640955648 0.0642575252
I(age^2)           NaN 0.0007394648 0.0007413629
log(inc)    0.09833250 0.0989695518 0.0999188711
ed          0.02293623 0.0230289024 0.0229592562
kids        0.10153358 0.1017304094 0.1015104540
Warning message:
In sqrt(diag(-solve(ml1$hessian))) : NaNs produced

The slow performance of both optim and maxLik is because I did not specify the gradient function (i.e. first derivatives). Without a gradient function both optimizers are forced to approximate. In the above, I add a gradient function pgr. This speeds up both optimization procedures considerably. Furthermore, the similarity of the estimated coefficients is evident. Nevertheless, problems are encountered in the calculation of the Hessian matrix in the optim case. Once again, fewer problems are encountered with the maxLik function.

Based on my experience thus far, I would recommend the use of maxLik compared to optim.

Time-Series Policy Evaluation in R

Quantifying the success of government policies is clearly important. Randomized control trials, like those conducted by drug companies, are often described as the ‘gold-standard’ for policy evaluation. Under these, a policy is implemented in/to one area/group (treatment), but not in/to another (control). The difference in outcomes between the two areas or groups represents the effectiveness of the policy.

Unfortunately, there many circumstances in which governments introduce policies and are not able to measure how are well, if at all, they work. A good example of this is childhood vaccinations against disease, since it is unethical to deny some children a potentially life-saving treatment. This makes evaluating these policies difficult, because there is no counterfactual event, i.e. everyone gets the treatment (in this case vaccination).

However, it is possible to evaluate the effectiveness of certain policies with time-series econometrics. For this example, I am going to use a series of monthly road death statistics in Ireland over the period 2003 to present. In November 2011, a new drink driving law was introduced which lowered the blood-alcohol limits. How has the introduction of this law changed road fatalities?

To evaluate this policy, I estimated a simple state space model using the R package sspir. The code I used for this example is almost exactly like the van drivers example in this paper. The following generalized linear model is assumed: Po(y_{t}) = a_{t} + b(LAW) + S_{t}, where the number of deaths, y_{t}, is a Poisson process that depends on a random walk, a_{t}, time-varying seasonal patterns, S_{t}, and the introduction of the law. The b parameter is the measure of the policy’s effect.

The usefulness of this model for policy evaluation, compared to an OLS linear regression for example, stems from the time-varying intercept a_{t}. In effect, this parameter substitutes the counterfactual or control group that we would need in a randomized control trial because it takes into account the underlying trend in the time-series before the policy change. The law dummy variable adds a shift to the underlying trend, acting as the relevant treatment effect.

A full overview of the state space system, and estimation using the Kalman filter is beyond the scope of this blog post. Interested readers should consult this paper and the references therein. The code to generate these results is below.

> head(road)
  deaths drinkdrive ind
1     20          0   1
2     21          0   2
3     33          0   3
4     23          0   4
5     28          0   5
6     37          0   6

mle <- function(psi) {
  m1 <- ssm(deaths ~ tvar(1) + drinkdrive + tvar(sumseason(time,12)),
            family=poisson(link="log"), data=road,
  m1fit <- getFit(m1)

res1 <- optim(par=c(0,0),fn=mle,method=c("L-BFGS-B"),

The first part of the code sets up the state space system inside a function. The reason for this is to calculate the signal-to-noise ratio, which I call psi, using maximum likelihood. The signal-to-noise ratio dictates the ‘wiggliness’ of the random walk a_{t}. Too much signal will result in a_{t} matching the series exactly. Too little, and the random walk will collapse to a constant term, as in OLS. A similar argument can be made for the time-varying seasonal patterns.

After calculating the signal-to-noise ratio, I estimated the model with the maximum likelihood parameters. The function ssm automatically runs the iterated Kalman smoother, which provides the relevant information.

# estimate with mle pars
m2 <- ssm(deaths ~ tvar(1) + drinkdrive + tvar(sumseason(time,12)),
          family=poisson(link="log"), data=road,
m2fit <- getFit(m2)
100 * (exp(m2fit$m[1,2])-1) # effect of drink driving law

# calculate sd
m2sd <- NULL
for (i in 1:length(road$deaths)) {
  thisone <- m2fit$C[[i]][1:2,1:2]  
  if (road$drinkdrive[i]==0) { 
    m2sd <- c(m2sd,sqrt(thisone[1,1])) 
    m2sd <- c(m2sd,sqrt(sum(thisone)))

# plot
road$ind <- 1:112
     xlab="Year",ylab="Death Count",main="Monthly Road Deaths, Ireland")
lines(exp(road$drinkdrive*m2fit$m[,2] + m2fit$m[,1]),col="blue")
lines(exp(road$drinkdrive*m2fit$m[,2] + m2fit$m[,1]+2*m2sd),lty=2,col="blue")
lines(exp(road$drinkdrive*m2fit$m[,2] + m2fit$m[,1]-2*m2sd),lty=2,col="blue")
axis(1, at=c(1,25,49,73,97), lab=c("2003","2005","2007","2009","2011"))

The following is a plot showing the underlying series (red), the policy relevant effect (shift in the blue line, black vertical line represents the policy change), and uncertainty (blue dotted lines). Perhaps it is too early to suggest this policy was ineffective. However, as is evident in the graph, this policy change did not result in less road deaths. In fact, this policy actually led to a 5.6% increase in road deaths, although a large amount of uncertainty surrounds these estimates, as shown by the 95% confidence intervals.

While I use this as an example, I should be clear that I do not condone driving under the influence of alcohol, nor do I think this was a bad policy. These data may not be a perfect measure of the policy’s effectiveness. Perhaps there are other time-series, like head-on collisions, for which the policy did make a discernable difference. Additionally, there may be previous policy changes in the series which I am unaware of.

Simple Spatial Correlograms for Cross-Country Analysis in R

Accounting for temporal dependence in econometric analysis is important, as the presence of temporal dependence violates the assumption that observations are independent units. Historically, much less attention has been paid to correcting for spatial dependence, which, if present, also violates this independence assumption.

The comparability of temporal and spatial dependence is useful for illustrating why spatial dependence may matter. For example, a time-series analysis of a county’s GDP would examine how related this year’s figure was to preceding years, whereas a spatial analysis would examine how related one country’s figure was to that of their neighbours.

However, these temporal and spatial issues are not completely analogous. This difference is highlighted with the creation of lagged variables. In time-series, the lag structure is intuitively easy to comprehend. A variable y_{t} has lags of y_{t-1}, …, y_{t-k}.

Lagging a spatial variable introduces more ambiguity. A spatially lagged variable of y_{i} can be calculated as \rho(y_{i}) where \rho is a spatial weighting matrix. Interested readers should consult further sources on this, but needless to say there are a number of ways (nearest neighbour, distance etc.) in which we could define \rho.

The difference between spatial and temporal processes in the creation of lagged variables is somewhat compounded when we consider that it is not possible to create ‘lags of lags’ in the same manner as in time-series (so \rho^2 is not the valid). Your neighbour’s neighbour might be your neighbour.

Given the difficulty in creating spatial lags of spatial lags, is there an easy way of recreating a simple time-series correlogram for a spatial variable? Yes, indeed it is. The solution I present here is not an exclusive way to analyze a spatially dependent variable.

For this analysis I use different countries spatial coordinates and create a correlogram of population density in 1900 (these data are taken from this paper). The first step involves loading these data into the R workspace such that the data frame contains the latitude and longitude coordinates alongside the potentially spatially autocorrelated variable.

> pop <- read.csv("dat2.csv")
> head(pop)
        lon       lat              country        dens
1  66.02554  33.83319          Afghanistan  79.2543748
2  17.55091 -12.29862               Angola  24.2375278
3  20.07006  41.14267              Albania 282.7854408
4  54.33089  23.91317 United Arab Emirates   5.0438742
5  44.93819  40.29368              Armenia   0.2136146
6 134.48679 -25.73251            Australia   4.8896695

With the data frame, I calculated a distance matrix with the ‘rdist.earth’ function from the package ‘fields’. This function measures the distance between each set of data points based on their coordinates. This function recognizes that the world is not flat, and as such calculates what are known as great-circle distances.

# calculate great cirlce distances with fields package
# popdists matrix
popdists <- as.matrix(rdist.earth(cbind(pop$lon, pop$lat), miles = F, R = NULL))
diag(popdists) <- 0

With this distance based matrix it is possible to calculate a correlogram. This is performed through the function ‘autocorr’. There are three arguments. The first is w is the distance-based weights matrix. The second, x is the variable of interest (note: the order of x and w should be the same as in the original data frame), and the third are the distance bands.

# autocorr function
autocorr <- function(w,x,dist=500){
  aa <- ceiling(max(w)/dist)
  dists <- seq(0,aa*dist,dist)
  cors <- NULL
  for(i in 1:aa){
    w1 <- ifelse(w > dists[i] & w <= dists[i+1], 1, 0) 
    w2 <- w1
    for(j in 1:dim(w1)[1]){
      nu <- sum(w1[j,])
        w2[j,] <- w1[j,]/nu
    lag <- w2 %*% x
    cors <- c(cors,cor(x,lag))

The function works in the following manner. For each step there is a distance band. Here, I have specified that the distance bands are 500km. For the first-step, the function calculates the correlation of the variable of interest (x) and its spatially lagged neighbours – who are defined as having a centre-point within a 500km radius. The next step calculates this correlation for neighbours falling into the 500km to 1,000 km distance band, so there will be a new spatial weighting matrix \rho. This is an iterative procedure, and continues up until the bands exceed the furthest neighbours in the data frame.

# at 500km
ac1 <- autocorr(w=popdists,x=pop$dens,dist=500)
# 1000 km
ac2 <- autocorr(w=popdists,x=pop$dens,dist=1000)

# MC analysis
it <- 1000
mc <- matrix(NA,nrow=it,ncol=length(ac1))
for(i in 1:it){
  pop$rand <- sample(pop$dens,length(pop$dens),replace=F)
  mc[i,] <- autocorr(w=popdists,x=pop$rand,dist=500)

ac1 <- data.frame(cbind(ac1,seq(500,20000,500)))
ac1 <- cbind(ac1,t(apply(mc,2,quantile, probs = c(0.025,0.975))))
names(ac1) <- c("ac","dist","lci","uci")

ggplot(ac1, aes(dist, ac)) +
  geom_point(colour = "darkblue", size = 3) +
  geom_line(colour = "red") +
  scale_x_continuous('Great Circle Distance (KM)',limits=c(0,20000)) + 
  scale_y_continuous('Autocorrelation',limits=c(-0.4,0.7)) +
  theme_bw() + 
  geom_hline(yintercept=0) +   
  geom_smooth(aes(ymin = lci, ymax = uci), stat="identity",fill="blue",colour="darkblue") 

I also calculated 95% confidence intervals using a simple Monte Carlo technique. This code, and that to generate the nice ggplot2 figure is shown below.

The analysis in the above provides a simple example of how R can be used to generate descriptive spatial statistics. Those interested in implementing spatial methodologies should examine the wide variety of packages on CRAN which can be used for spatial analysis. The ‘spdep‘ package provides most of the essentials. The recent release of the ‘splm‘ package, containing functions for panel regression analysis, is also noteworthy, and potentially very important, contribution.

An ivreg2 function for R

The ivreg2 command is one of the most popular routines in Stata. The reason for this popularity is its simplicity. A one-line ivreg2 command generates not only the instrumental variable regression coefficients and their standard errors, but also a number of other statistics of interest.

I have come across a number of functions in R that calculate instrumental variable regressions. However, none appear to (and correct me if I am wrong) offer an output similar to the ivreg2 command in Stata. The function below is my first attempt to replicate Stata’s ivreg2.


There are four required arguments. The ‘form’ argument is the second stage regression, written in the same manner as any regression model in R. The ‘endog’ argument is a character object with the name of the endogenous variable. The user should specify the instrumental variable(s) with the ‘iv’ argument. These instruments should be contained in ‘data’ – a data frame object. Note, the function in its current state only allows of one endogenous variable (which is usually more than enough for the researcher to contend with). Furthermore, make sure that there are no ‘NA’ values in the data frame being passed through the function.

This function performs a 2SLS regression calculating the usual regression output, a weak identification F-statistic, the Wu-Hausman test of endogeneity, and, in the case where there is more than one-instrument, a Sargan test. The weak identification statistic is used to determine whether the instrument(s) is(are) sufficiently correlated with the endogenous variable of interest. The ‘rule-of-thumb’ critical statistic here is ten. A Wu-Hausman test examines the difference between the IV and OLS coefficients. Rejecting the null hypothesis indicates the presence of endogeneity. Finally, the Sargan over-identification test is used in the cases where there are more instruments than endogenous regressors. A rejection of the null in this test means that the instruments are not exclusively affecting the outcome of interest though the endogenous variable.

The code for this function, alongside an example with the well known Mroz data, is shown below.

> mroz <- read.dta("mroz.dta")
> mroz <- mroz[,c("hours","lwage","educ","age","kidslt6","kidsge6","nwifeinc","exper")]
> ivreg2(form=hours ~ lwage + educ + age + kidslt6 + kidsge6 + nwifeinc,
+       endog="lwage",iv=c("exper"),data=na.omit(mroz))
                Coef    S.E. t-stat p-val
(Intercept) 2478.435 655.207  3.783 0.000
lwage       1772.323 594.185  2.983 0.003
educ        -201.187  69.910 -2.878 1.996
age          -11.229  10.537 -1.066 1.713
kidslt6     -191.659 195.761 -0.979 1.672
kidsge6      -37.732  63.635 -0.593 1.447
nwifeinc      -9.978   7.174 -1.391 1.836

     First Stage F-test
[1,]             12.965

     Wu-Hausman F-test p-val
[1,]             36.38     0

     Sargan test of over-identifying restrictions 
[1,] "No test performed. Model is just identified"
ivreg2 <- function(form,endog,iv,data,digits=3){
  # library(MASS)
  # model setup
  r1 <- lm(form,data)
  y <- r1$fitted.values+r1$resid
  x <- model.matrix(r1)
  aa <- rbind(endog == colnames(x),1:dim(x)[2])  
  z <- cbind(x[,aa[2,aa[1,]==0]],data[,iv])  
  colnames(z)[(dim(z)[2]-length(iv)+1):(dim(z)[2])] <- iv  
  # iv coefficients and standard errors
  z <- as.matrix(z)
  pz <- z %*% (solve(crossprod(z))) %*% t(z)
  biv <- solve(crossprod(x,pz) %*% x) %*% (crossprod(x,pz) %*% y)
  sigiv <- crossprod((y - x %*% biv),(y - x %*% biv))/(length(y)-length(biv))
  vbiv <- as.numeric(sigiv)*solve(crossprod(x,pz) %*% x)
  res <- cbind(biv,sqrt(diag(vbiv)),biv/sqrt(diag(vbiv)),(1-pnorm(biv/sqrt(diag(vbiv))))*2)
  res <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),res)),nrow=dim(res)[1])
  rownames(res) <- colnames(x)
  colnames(res) <- c("Coef","S.E.","t-stat","p-val")
  # First-stage F-test
  y1 <- data[,endog]
  z1 <- x[,aa[2,aa[1,]==0]]
  bet1 <- solve(crossprod(z)) %*% crossprod(z,y1)
  bet2 <- solve(crossprod(z1)) %*% crossprod(z1,y1)
  rss1 <- sum((y1 - z %*% bet1)^2)
  rss2 <- sum((y1 - z1 %*% bet2)^2)
  p1 <- length(bet1)
  p2 <- length(bet2)
  n1 <- length(y)
  fs <- abs((rss2-rss1)/(p2-p1))/(rss1/(n1-p1))
  firststage <- c(fs)
  firststage <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),firststage)),ncol=length(firststage))
  colnames(firststage) <- c("First Stage F-test")
  # Hausman tests
  bols <- solve(crossprod(x)) %*% crossprod(x,y) 
  sigols <- crossprod((y - x %*% bols),(y - x %*% bols))/(length(y)-length(bols))
  vbols <- as.numeric(sigols)*solve(crossprod(x))
  sigml <- crossprod((y - x %*% bols),(y - x %*% bols))/(length(y))
  x1 <- x[,!(colnames(x) %in% "(Intercept)")]
  z1 <- z[,!(colnames(z) %in% "(Intercept)")]
  pz1 <- z1 %*% (solve(crossprod(z1))) %*% t(z1)
  biv1 <- biv[!(rownames(biv) %in% "(Intercept)"),]
  bols1 <- bols[!(rownames(bols) %in% "(Intercept)"),]
  # Durbin-Wu-Hausman chi-sq test:
  # haus <- t(biv1-bols1) %*% ginv(as.numeric(sigml)*(solve(crossprod(x1,pz1) %*% x1)-solve(crossprod(x1)))) %*% (biv1-bols1)
  # hpvl <- 1-pchisq(haus,df=1)
  # Wu-Hausman F test
  resids <- NULL
  resids <- cbind(resids,y1 - z %*% solve(crossprod(z)) %*% crossprod(z,y1))
  x2 <- cbind(x,resids)
  bet1 <- solve(crossprod(x2)) %*% crossprod(x2,y)
  bet2 <- solve(crossprod(x)) %*% crossprod(x,y)
  rss1 <- sum((y - x2 %*% bet1)^2)
  rss2 <- sum((y - x %*% bet2)^2)
  p1 <- length(bet1)
  p2 <- length(bet2)
  n1 <- length(y)
  fs <- abs((rss2-rss1)/(p2-p1))/(rss1/(n1-p1))
  fpval <- 1-pf(fs, p1-p2, n1-p1)
  #hawu <- c(haus,hpvl,fs,fpval)
  hawu <- c(fs,fpval)
  hawu <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),hawu)),ncol=length(hawu))
  #colnames(hawu) <- c("Durbin-Wu-Hausman chi-sq test","p-val","Wu-Hausman F-test","p-val")
  colnames(hawu) <- c("Wu-Hausman F-test","p-val")  
  # Sargan Over-id test
  ivres <- y - (x %*% biv)
  oid <- solve(crossprod(z)) %*% crossprod(z,ivres)
  sstot <- sum((ivres-mean(ivres))^2)
  sserr <- sum((ivres - (z %*% oid))^2)
  rsq <- 1-(sserr/sstot)
  sargan <- length(ivres)*rsq
  spval <- 1-pchisq(sargan,df=length(iv)-1)
  overid <- c(sargan,spval)
  overid <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),overid)),ncol=length(overid))
  colnames(overid) <- c("Sargan test of over-identifying restrictions","p-val")
    overid <- t(matrix(c("No test performed. Model is just identified")))
    colnames(overid) <- c("Sargan test of over-identifying restrictions")
  full <- list(results=res, weakidtest=firststage, endogeneity=hawu, overid=overid)