Malthus in 21st Century Europe

Reverend Thomas Malthus is well known for his pessimistic views on population growth and economic welfare. The ubiquitous ‘Malthusian model’ is simple and lucid tool which offers an explanation as to why living standards showed no substantive improvement between the Neolithic revolution and the 19th century.

A consequence of the Malthusian model’s popularity is that many people have overlooked the motivation and core point of Malthus’ analysis. Malthus’ aim, outlined in various editions of An Essay on the Principle of Population (notably the second), was to encourage social reforms which promoted later marriage, particularly amongst the poorest in society.

His argument was simple. Sex outside marriage was socially unacceptable, so marriage marked the beginning of sexual relations. Malthus, a clergyman, did not consider fertility control within marriage as an option. Therefore, the earlier couples married, the higher their fertility would be. If early marriage was common across society, this would (exponentially) increase the next generation and thereby reduce income per person because population has increased more than income. Ultimately, Malthus promoted later/delayed marriage as a tool for economic growth and encouraged the foundations of social institutions capable of enforcing this.

While Malthus is one of the most studied and remarked upon character in the history of economic thought, the consensus is that he got it wrong. The purpose of this blog post is to ask: Can any of Malthus’ ideas help us understand economic and demographic trends in the modern Europe?

In the 200 plus years since Malthus wrote the first essay fertility, the institution of marriage and economic conditions have changed immensely in Europe. We might expect economic conditions and marriage to be unrelated, since marriage is no longer a prerequisite for childbirth. Similarly, the economic cost of marriage is (or at least can be) lower than in preindustrial societies where dowries are involved. In addition, modern welfare systems offer child support which helps the poorest in society, something which Malthus, as an opponent of the English Poor Laws, may have disagreed with (although his opposition to poor relief softened somewhat over time).

The above graphic is consistent with Malthus’ hopes. When economic conditions (measured by unemployment change) decline people delay or postpone marital unions. Assuming that there is a delay in the proposition of marriage, it is better to use the lagged unemployment rate. However, this does not matter as there appears to also be a strong relationship between contemporaneous unemployment rate and the marriage rate.

While Malthus might have found the first plot somewhat comforting, we also know that it is socially acceptable for couples to have births outside of wedlock. The figure above demonstrates that despite the social acceptance of out of wedlock births, crude birth and marriage rates are still highly correlated modern Europe. With this in mind, the link between unemployment and fertility is illustrated in the below. Once again the link is clear.

One of the most striking aspects of these graphics is the Crisis. Since 2008, most European nations have experienced a collapse in economic conditions, and inevitable rise in unemployment. Despite this, the demographic trends still obey what would seem to be ‘Malthusian’ logic.

Probit/Logit Marginal Effects in R

The common approach to estimating a binary dependent variable regression model is to use either the logit or probit model. Both are forms of generalized linear models (GLMs), which can be seen as modified linear regressions that allow the dependent variable to originate from non-normal distributions.

The coefficients in a linear regression model are marginal effects, meaning that they can be treated as partial derivatives. This makes the linear regression model very easy to interpret. For example, the fitted linear regression model y=x*b tells us that a one unit increase in x increases y by b units.

Unfortunately, this is not the case in GLMs, because fitted GLMs take the form, y=G (x*b), where G(.) is the known link function (i.e. inverse logistic for logit). Since we want to calculate the slope of x which is inside the function G(.), calculating marginal effects that are comparable to their linear model counterparts involves using the chain rule.

Empirical economic research typically cites the marginal effects since they are intuitive and easy to digest. Therefore, it is important for researchers to be able to compute these results. A more formal treatment of this problem can be found in the following paper, where we see that the solution is to multiply the estimated GLM coefficients by the probability density function of the linked distribution (which is the derivative of the cumulative density function). Interestingly, the linked paper also supplies some R code which calculates marginal effects for both the probit or logit models. In the code below, I demonstrate a similar function that calculates ‘the average of the sample marginal effects’.

mfxboot <- function(modform,dist,data,boot=1000,digits=3){
  x <- glm(modform, family=binomial(link=dist),data)
  # get marginal effects
  pdf <- ifelse(dist=="probit",
                mean(dnorm(predict(x, type = "link"))),
                mean(dlogis(predict(x, type = "link"))))
  marginal.effects <- pdf*coef(x)
  # start bootstrap
  bootvals <- matrix(rep(NA,boot*length(coef(x))), nrow=boot)
  for(i in 1:boot){
    samp1 <- data[sample(1:dim(data)[1],replace=T,dim(data)[1]),]
    x1 <- glm(modform, family=binomial(link=dist),samp1)
    pdf1 <- ifelse(dist=="probit",
                   mean(dnorm(predict(x1, type = "link"))),
                   mean(dlogis(predict(x1, type = "link"))))
    bootvals[i,] <- pdf1*coef(x1)
  res <- cbind(marginal.effects,apply(bootvals,2,sd),marginal.effects/apply(bootvals,2,sd))
    res1 <- res[2:nrow(res),]
    res2 <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),res1)),nrow=dim(res1)[1])     
    rownames(res2) <- rownames(res1)
    } else {
    res2 <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep="")),nrow=dim(res)[1]))
    rownames(res2) <- rownames(res)
  colnames(res2) <- c("marginal.effect","standard.error","z.ratio")  

This command also provides bootstrapped standard errors, which account for both the uncertainty in the predicted values and the estimated coefficients. The R package ‘erer’ also has a function that calculates these marginal effects. Illustrations of the above function, alongside code for a nice ggplot2 figure are displayed below.

mfx1 <- mfxboot(participation ~ . + I(age^2),"probit",SwissLabor)
mfx2 <- mfxboot(participation ~ . + I(age^2),"logit",SwissLabor)
mfx3 <- mfxboot(participation ~ . + I(age^2),"probit",SwissLabor,boot=100,digits=4)

mfxdat <- data.frame(cbind(rownames(mfx1),mfx1))
mfxdat$me <- as.numeric(as.character(mfxdat$marginal.effect))
mfxdat$se <- as.numeric(as.character(mfxdat$standard.error))

# coefplot
ggplot(mfxdat, aes(V1, marginal.effect,ymin = me - 2*se,ymax= me + 2*se)) +
  scale_x_discrete('Variable') +
  scale_y_continuous('Marginal Effect',limits=c(-0.5,1)) +
  theme_bw() + 
  geom_errorbar(aes(x = V1, y = me),size=.3,width=.2) + 
  geom_point(aes(x = V1, y = me)) +
  geom_hline(yintercept=0) + 
  coord_flip() +
  opts(title="Marginal Effects with 95% Confidence Intervals")

Dummies for Dummies

Most R functions used in econometrics convert factor variables into a set of dummy/binary variables automatically. This is useful when estimating a linear model, saving the user from the laborious activity of manually including the dummy variables as regressors. However, what if you want to reshape your dataframe so that it contains such dummy variables?

The following function, datdum(.), is a simple workaround. The first argument is the factor variable (which can also be a character), the second is the dataframe and the third is the name you want to call these dummy variables.

datdum <- function(x, data, name){
  data$rv <- rnorm(dim(data)[1],1,1)
  mm <- data.frame(model.matrix(lm(data$rv~-1+factor(data[,x]))))
  names(mm) <- paste(name,1:dim(mm)[2],sep=".")
  data$rv <- NULL
  data <- cbind(data,mm)

# simple example
dat <- c("A","B","C")
dat <- data.frame(dat)
# output
> dat
1   A
2   B
3   C
> datdum(x="dat",data=dat,name="category")
  dat category.1 category.2 category.3
1   A          1          0          0
2   B          0          1          0
3   C          0          0          1

Instrumental Variables without Traditional Instruments

Typically, regression models in empirical economic research suffer from at least one form of endogeneity bias.

The classic example is economic returns to schooling, where researchers want to know how much increased levels of education affect income. Estimation using a simple linear model, regressing income on schooling, alongside a bunch of control variables, will typically not yield education’s true effect on income. The problem here is one of omitted variables – notably unobserved ability. People who are more educated may be more motivated or have other unobserved characteristics which simultaneously affect schooling and future lifetime earnings.

Endogeneity bias plagues empirical research. However, there are solutions, the most common being instrumental variables (IVs). Unfortunately, the exclusion restrictions needed to justify the use of traditional IV methodology may be impossible to find.

So, what if you have an interesting research question, some data, but endogeneity with no IVs. You should give up, right? Wrong. According to Lewbel (forthcoming in Journal of Business and Economic Statistics), it is possible to overcome the endogeneity problem without the use of a traditional IV approach.

Lewbel’s paper demonstrates how higher order moment restrictions can be used to tackle endogeneity in triangular systems. Without going into too much detail (interested readers can consult Lewbel’s paper), this method is like the traditional two-stage instrumental variable approach, except the first-stage exclusion restriction is generated by the control, or exogenous, variables which we know are heteroskedastic (interested practitioners can test for this in the usual way, i.e. a White test).

In the code below, I demonstrate how one could employ this approach in R using the GMM framework outlined by Lewbel. My code only relates to a simple example with one endogenous variable and two exogenous variables. However, it would be easy to modify this code depending on the model.

# gmm function for 1 endog variable with 2 hetero exogenous variable
# outcome in the first column of 'datmat', endog variable in second
# constant and exog variables in the next three
# hetero exog in the last two (i.e no constant)
g1 <- function(theta, datmat) {
  #set up data
  y1 <- matrix(datmat[,1],ncol=1)
  y2 <- matrix(datmat[,2],ncol=1)
  x1 <- matrix(datmat[,3:5],ncol=3)
  z1 <- matrix(datmat[,4:5],ncol=2)
  # if the variable in the 4th col was not hetero
  # this could be modified so:
  # z1 <- matrix(datmat[,5],ncol=1)

  #set up moment conditions
  in1 <- (y1 -theta[1]*x1[,1]-theta[2]*x1[,2]-theta[3]*x1[,3])
  M <- NULL
  for(i in 1:dim(z1)[2]){
    M <- cbind(M,(z1[,i]-mean(z1[,i])))
  in2 <- (y2 -theta[4]*x1[,1]-theta[5]*x1[,2]-theta[6]*x1[,3]-theta[7]*y1)
  for(i in 1:dim(x1)[2]){M <- cbind(M,in1*x1[,i])}
  for(i in 1:dim(x1)[2]){M <- cbind(M,in2*x1[,i])}
  for(i in 1:dim(z1)[2]){M <- cbind(M,in2*((z1[,i]-mean(z1[,i]))*in1))}
# so estimation is easy
# gmm(function(...), data matrix, initial values vector)
# e.g : gmm(g1, x =as.matrix(dat),c(1,1,1,1,1,1,1))

I also tested the performance of Lewbel’s GMM estimator in comparison a mis-specified OLS estimator. In the code below, I perform 500 simulations of a triangular system containing an omitted variable. For the GMM estimator, it is useful to have good initial starting values. In this simple example, I use the OLS coefficients. In more complicated settings, it is advisable to use the estimates from the 2SLS procedure outlined in Lewbel’s paper. The distributions of the coefficient estimates are shown in the plot below. The true value, indicated by the vertical line, is one. It is pretty evident that the Lewbel approach works very well. I think this method could be very useful in a number of research disciplines.

beta1 <- beta2 <- NULL
for(k in 1:500){
  #generate data (including intercept)
  x1 <- rnorm(1000,0,1)
  x2 <- rnorm(1000,0,1)
  u <- rnorm(1000,0,1)
  s1 <- rnorm(1000,0,1)
  s2 <- rnorm(1000,0,1)
  ov <- rnorm(1000,0,1)
  e1 <- u + exp(x1)*s1 + exp(x2)*s1
  e2 <- u + exp(-x1)*s2 + exp(-x2)*s2
  y1 <- 1 + x1 + x2 + ov + e2 
  y2 <- 1 + x1 + x2 + y1 + 2*ov + e1
  x3 <- rep(1,1000)
  dat <- cbind(y1,y2,x3,x1,x2)
  #record ols estimate
  beta1 <- c(beta1,coef(lm(y2~x1+x2+y1))[4])
  #init values for iv-gmm
  init <- c(coef(lm(y2~x1+x2+y1)),coef(lm(y1~x1+x2)))
  #record gmm estimate
  beta2 <- c(beta2,coef(gmm(g1, x =as.matrix(dat),init))[7])

d <- data.frame(rbind(cbind(beta1,"OLS"),cbind(beta2,"IV-GMM")))
d$beta1 <- as.numeric(as.character(d$beta1))$beta1, d$V2,xlab=("Endogenous Coefficient"))
 title("Lewbel and OLS Estimates")
 legend("topright", levels(d$V2),lty=c(1,2,3),col=c(2,3,4),bty="n")

Temperature Change in Ireland

Has Ireland gotten any warmer? Ask any punter on the street and they will happily inform you of wild swings, trends and dips. “Back when I was a child”, “when I was younger”, or “years ago” are the usual refrains.

What’s the evidence? To answer this, I will use the temperature data from my previous post alongside the R package bcp. The bcp package stands for Bayesian Change Point, and it implements the change point methodology of Barry & Hartigan (1993). A good overview of how to use the bcp R package is offered by Erdman & Emerson (2007).

These series are monthly and run from 1866 to 2011. There are two measures for each month, the average daily high or low. I use both here, and perform the multivariate bcp procedure for each month separately, with average high and low for each month.

The figure above shows the output produced when I look at the January temperatures. Clearly, there are no major changes. The January temperature in Ireland is the same now as it was in 1866. The results for the other months are very similar. However, there are some changes in the October series, with both the posterior means for both max and min temperatures increasing by around 1 degree Celsius (see below).

Has Ireland gotten warmer? Yes, but only in October, for some reason unknown to me.

# with web-scraped data from the previous post


cold <- NULL
for (i in 1:12){
  mins <- armagh[armagh[,2]==i,4]
  cold <- cbind(cold,mins)
cold <- cold[14:159,]

warm <- NULL
for (i in 1:12){
  maxs <- armagh[armagh[,2]==i,3]
  warm <- cbind(warm,maxs)
warm <- warm[14:159,]

bcp.0 <- bcp(cbind(cold[,1],warm[,1]), burnin = 1000, mcmc = 5000)
bcp.0 <- bcp(cbind(cold[,10],warm[,10]), burnin = 1000, mcmc = 5000)

Web-Scraping in R

Web-scraping, or web-crawling, sounds like a seedy activity worthy of an Interpol investigative department. The reality, however, is far less nefarious. Web-scraping is any procedure by which someone extracts data from the internet. Given that it’s possible to get the internet on computers these days; web-scrapping opens an array of interesting possibilities to social-science researchers as it is possible to harvest massive datasets in short periods of times.

In the following code, I illustrate a very simple web-scraping routine. The object of this exercise is to scrape some Irish weather time-series, which I will look at in a future post. The main function of interest here is:


which reads these data into the R workspace. The url object in my example is a .txt file, however if the url address is written in html the readLines command will read all of the lines of the html. In effect, the readLines object will be a character vector with a length equal to the number of lines in the html code. In this case, extracting the data you need from the crap you don’t requires a bit of jiggery-pokery formally referred to as parsing. Thankfully, there are a whole bunch of functions in the Hmisc and other packages which can be used to do this in a systematic way. I have found the following particularly useful for parsing:

substring.location(...) ; grep(...) ; substr(...) ; strsplit(...)
# clean data

# web url
site <- ""

# call in data with try command in while loop
i <- 1
while (i < 2){
  aa <- try(read.table(site,sep="\t"))
  if (class(aa) == "try-error") {
    } else {
      i <- i + 1

# grand! now inspect and trim off crap
aa <- aa[6:dim(aa)[1],]

# data is melted together so some tidying required
bb <- cc <- dd <- c()
for (i in (1:length(aa))){
	bb <- unlist(strsplit(as.character(aa[i]), " "))
    cc <- bb[nchar(bb)>0] ; cc <- cc[1:7]
    dd <- rbind(dd,cc)

row.names(dd) <- dd[,1]

colnm <- c(dd[1,1],dd[1,2],paste(dd[1,3],dd[2,1],sep=" "), paste(dd[1,4],
		dd[2,2],sep=" "), paste(dd[1,5],dd[2,3],sep=" "), 
		paste(dd[1,6],dd[2,4],sep=" "), paste(dd[1,7],dd[2,5],sep=" "))

colnames(dd) <- colnm

armagh <- data.frame(dd[-c(1,2),])

for (i in (1:dim(armagh)[2])){
	armagh[,i] <- as.numeric(as.character(armagh[,i]))

decmin <- armagh[armagh[,2]==12,4]
year <- armagh[armagh[,2]==12,1]
wh1 <- data.frame(cbind(armagh$tmin.degC[armagh$mm==12],armagh$yyyy[armagh$mm==12]))
wh1 <- na.omit(wh1)

# nice plot
ggplot(wh1, aes(X2,X1)) + 
  geom_line(colour="red") + 
  theme_bw() +
  scale_x_continuous('Year') + 
  scale_y_continuous('Minimum Temperature - Degree Celsius') +
  opts(title = expression("December Average Daily Minimum Temperature - Armagh 1865-2011"))

In the script above, I call in these data, tidy them up and then do a pretty graph with the excellent ggplot2 package. I use a while loop with the try function to call in the data. This can be very important for those interested in scraping data systematically. Sometimes the readLines function will not be able to establish a connection to the url address of interest. The try function in the while loop here ensures that in the event that R is not able to make the connection, it will try again until a connection is established. The equivalent to this is pressing refresh in your internet browser. When scraping data iteratively from a large number of url addresses, connection difficulties are inevitable, and therefore using the try function in while loop can save a lot of hassle.

Happy scraping, ya filthy animal.