Gibbs Samplers in R and Notes

I uploaded several R files and notes on Gibbs samplers to my website's resource page. They cover factor analysis, finite mixtures of linear regression models, and probit regression. I wrote all of this a while ago when I was reading about these models. Most of the samplers are fairly standard and already implemented efficiently in R packages that call C++ in the background; these packages are also referenced in the R files. Nevertheless, the R code might be useful for pedagogical purposes or as a building block for more complex schemes.
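For readers who have not worked through one of these samplers before, here is a minimal sketch of the usual data-augmentation Gibbs sampler for probit regression (the Albert–Chib latent-utility scheme). It is not the code from the resource page; the function name, prior, and defaults are illustrative only.

# Minimal sketch of a Gibbs sampler for probit regression via data augmentation.
# Model: y_i ~ Bernoulli(pnorm(x_i' beta)), prior beta ~ N(b0, B0).
# Illustrative only; no safeguards for extreme tail probabilities.
probit_gibbs <- function(y, X, n_iter = 5000,
                         b0 = rep(0, ncol(X)), B0 = diag(100, ncol(X))) {
  n <- nrow(X); p <- ncol(X)
  B0_inv <- solve(B0)
  V <- solve(B0_inv + crossprod(X))        # posterior covariance of beta given z
  beta  <- rep(0, p)
  draws <- matrix(NA_real_, n_iter, p)
  for (t in seq_len(n_iter)) {
    # 1. latent utilities z_i | beta, y_i: truncated normals via the inverse CDF
    mu <- drop(X %*% beta)
    u  <- runif(n,
                min = ifelse(y == 1, pnorm(0, mu, 1), 0),
                max = ifelse(y == 1, 1, pnorm(0, mu, 1)))
    z  <- qnorm(u, mu, 1)
    # 2. beta | z: draw from the multivariate normal full conditional
    m    <- V %*% (B0_inv %*% b0 + crossprod(X, z))
    beta <- drop(m + t(chol(V)) %*% rnorm(p))
    draws[t, ] <- beta
  }
  coda::mcmc(draws)
}

Called as probit_gibbs(y, X) with a 0/1 response vector and a design matrix that includes an intercept column, it returns an mcmc object that coda's summary and plot functions understand.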

Running Parallel MCMC without specific R packages

Running multiple, distinct Markov chains that have the posterior density as their equilibrium density is useful for convergence diagnostics such as the Gelman-Rubin diagnostic (Gelman & Rubin, 1992). To save time, one would like to run the chains in parallel. There are a number of R packages (parallel, snow, …) for running code in parallel. However, I find it more convenient to write a script for a single chain and then dispatch it multiple times from the terminal. Here is an example using MCMCpack:

Save the following as “sim.R”:

# read the substream number passed on the command line
substream <- as.numeric(commandArgs(trailingOnly = TRUE))

library(MCMCpack)
data(birthwt)  # low birth weight data (from the MASS package)

# each chain writes its draws to its own file
filename <- paste("posterior_", substream, ".Rdata", sep = "", collapse = "")
# seed = list(NA, substream) selects the L'Ecuyer RNG with the given substream
out <- MCMCprobit(low ~ as.factor(race) + smoke, data = birthwt,
                  b0 = 0, B0 = 10, seed = list(NA, substream))
save(out, file = filename)

The first line grabs the substream argument from the command line and passes it to the MCMCprobit function, which generates a single chain. The script can then be executed from the terminal with four different substream numbers as follows:

Rscript --arch=x86_64 sim.R 1 &
Rscript --arch=x86_64 sim.R 2 &
Rscript --arch=x86_64 sim.R 3 &
Rscript --arch=x86_64 sim.R 4

If you are not on a 64-bit machine, adjust the --arch flag accordingly (or simply drop it). Provided that you have four cores, each of these jobs gets its own core. Next, you can combine the output into an mcmc.list object and run the Gelman-Rubin diagnostic:

library(coda)

# each file contains a single mcmc object called 'out'
load("posterior_1.Rdata")
out1 <- out
load("posterior_2.Rdata")
out2 <- out
load("posterior_3.Rdata")
out3 <- out
load("posterior_4.Rdata")
out4 <- out

posteriors <- as.mcmc.list(list( as.mcmc(out1), as.mcmc(out2), 
 as.mcmc(out3), as.mcmc(out4)))
gelman.diag(posteriors)
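With more than a handful of chains, the repeated load() calls get tedious; a loop does the same job. This is a minimal sketch that assumes the file-naming scheme from sim.R and that each file stores its chain as out:

# requires coda, loaded above
nchains <- 4
chains <- vector("list", nchains)
for (s in seq_len(nchains)) {
  load(paste("posterior_", s, ".Rdata", sep = ""))  # creates the object 'out'
  chains[[s]] <- as.mcmc(out)
}
posteriors <- as.mcmc.list(chains)
gelman.diag(posteriors)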

The great benefit of this approach is that the R code is not cluttered with parallelization functions, and the progress of each chain can be monitored because its output is printed directly in the terminal window.

References
Gelman, Andrew & Donald B. Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science 7(4):457–472.


Get a slice of an mcmc.list faster

Say you sampled a large number of parameters from a posterior and stored the draws in an mcmc.list object. To analyze your results, you want to slice the mcmc.list and select only a subset of the parameters. One way to do this is:

mcmcsample[,i,]

where mcmcsample is your mcmc.list holding the posterior sample and i indexes the parameters you are interested in. It turns out that this subsetting method is very slow. A faster way is to use this little function:

# extract column(s) i from each chain and rebuild the mcmc.list
getMCMCSample <- function(mcmclist, i) {
  chainext <- function(x, i) return(x[, i])
  return(as.mcmc.list(lapply(mcmclist, chainext, i = i)))
}

Example:

# Run a toy model
library(MCMCpack)
counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
posterior1 <- MCMCpoisson(counts ~ outcome + treatment)
posterior2 <- MCMCpoisson(counts ~ outcome + treatment)
posterior3 <- MCMCpoisson(counts ~ outcome + treatment)
mcmclist <- mcmc.list(posterior1,posterior2,posterior3)

system.time(mcmclist[,2:3,])
system.time(getMCMCSample(mcmclist, 2:3))

The built-in way takes 0.003 seconds on my machine, while getMCMCSample gets it done in 0.001 seconds. For this little example the difference is negligible, but for large posterior samples it really makes a difference.
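To see the gap on something larger, you can time both approaches on a simulated mcmc.list; this is only a sketch, and the chain and parameter dimensions are arbitrary:

library(coda)
set.seed(1)

# three chains of 5,000 iterations over 200 parameters
fakechain <- function() mcmc(matrix(rnorm(5000 * 200), nrow = 5000, ncol = 200))
biglist <- mcmc.list(fakechain(), fakechain(), fakechain())

system.time(biglist[, 2:3, ])              # built-in "[" on the mcmc.list
system.time(getMCMCSample(biglist, 2:3))   # per-chain column extraction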