Gnuplot and the Beta Distribution

I recently discovered gnuplot, a command-line plotting tool, while trying to get some work done with Maxima, a computer algebra system. Gnuplot doesn’t make plots as pretty as R’s ggplot2 or d3, but it is flexible and gives you a quick visual impression of your mathematical functions. Here is a little animation I made with the Beta distribution: code and result.


Running Parallel MCMC without specific R packages

Running multiple, distinct Markov chains that have the posterior density as their equilibrium density is useful for convergence diagnostics such as the Gelman-Rubin diagnostic (Gelman & Rubin, 1992). To save time, one would like to run these chains in parallel. There are a number of R packages (parallel, snow, …) for running code in parallel. However, I find it more convenient to write a script for a single chain and then dispatch it multiple times from the terminal. Here is an example using MCMCpack:

Save the following as “sim.R”:

# Grab the substream seed passed in from the terminal
substream <- as.numeric(commandArgs(trailingOnly = TRUE))

library(MCMCpack)
data(birthwt)

# Each chain writes its draws to its own file, e.g. posterior_1.Rdata
filename <- paste("posterior_", substream, ".Rdata", sep = "")

# Probit model for low birth weight; passing a list as seed selects the
# L'Ecuyer generator, with the second element as the substream number
out <- MCMCprobit(low ~ as.factor(race) + smoke, data = birthwt,
  b0 = 0, B0 = 10, seed = list(NA, substream))
save(out, file = filename)

The first line grabs the substream seed from the terminal; it is then passed to the MCMCprobit function, which generates a single chain. The script can be executed from the terminal with four different substream seeds as follows:

Rscript --arch=x86_64 sim.R 1 & Rscript --arch=x86_64 sim.R 2 & Rscript --arch=x86_64 sim.R 3 & Rscript --arch=x86_64 sim.R 4

If you don’t have a 64-bit machine, use “x86_32” instead. Provided that you have four cores, one core will be used for each of these jobs. Afterwards, you can combine the output into an mcmc.list object and run the Gelman-Rubin diagnostic:

library(coda)
load("posterior_1.Rdata")
out1 <- out
load("posterior_2.Rdata")
out2 <- out
load("posterior_3.Rdata")
out3 <- out
load("posterior_4.Rdata")
out4 <- out

posteriors <- as.mcmc.list(list( as.mcmc(out1), as.mcmc(out2), 
 as.mcmc(out3), as.mcmc(out4)))
gelman.diag(posteriors)

The great benefit of this approach is that the R code is not cluttered with parallelization functions, and each chain can be monitored because its output is printed directly to its own terminal window.
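
If you want to watch the sampler’s progress in each window, the MCMCpack functions also take a verbose argument: setting it to a positive number prints the state of the chain every verbose-th iteration. A minimal variation of the call in sim.R (the value 1000 is arbitrary):

# Print the sampler's progress every 1000 iterations, so each chain
# can be watched in its own terminal window
out <- MCMCprobit(low ~ as.factor(race) + smoke, data = birthwt,
  b0 = 0, B0 = 10, seed = list(NA, substream), verbose = 1000)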

References
Gelman, Andrew & Donald B. Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science 7(4):457–472.


Political Science Job Market – We need a model!

APSA recently published some ugly plots of the number of job postings in APSA e-Jobs. I wonder why we don’t have a predictive model for that? It should be possible to predict the number of job postings from a few indicators (US GDP growth, the NSF budget, etc.). Anyone?
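
For what it is worth, here is a minimal sketch of what such a model could look like in R. Everything in it is hypothetical: the data frame ejobs, its variables, and the simulated values are placeholders, and the real yearly series of e-Jobs postings and indicators would have to be assembled first.

# Entirely hypothetical sketch with simulated placeholder data
set.seed(1)
ejobs <- data.frame(
  postings   = rpois(20, lambda = 500),        # placeholder: postings per year
  gdp_growth = rnorm(20, mean = 2, sd = 1.5),  # placeholder: GDP growth (%)
  nsf_budget = rnorm(20, mean = 6, sd = 0.5)   # placeholder: NSF budget (bn USD)
)

# Simple count model: expected number of postings as a function of the indicators
fit <- glm(postings ~ gdp_growth + nsf_budget, family = poisson, data = ejobs)
summary(fit)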


Social Science vs Activism?

But superficial activism does little to actually help the world relative to more sustainable policies. What the world needs is not just another good cause to rally around, but rigorously tested solutions that actually benefit people.

Source: Yale Daily News

I agree.


A Puzzle

All scientific knowledge is tentative.

Science is like a giant puzzle.

Have you ever solved a puzzle with pieces that disappear after some time?


Standard error / quantiles for percentage change in hazard rate

In their book “Event History Modeling”, Box-Steffensmeier and Jones (2004, 60) suggest calculating the “percentage change in hazard rate” to simplify the interpretation of estimates from Cox models. The quantity is defined as:

\%h(t) = \frac{ \mathrm{exp}(b X_1) - \mathrm{exp}(b X_2) }{\mathrm{exp}(b X_2)}
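
A purely illustrative example: with a coefficient of b = 0.5 and a binary covariate (X_1 = 1 versus X_2 = 0), the quantity is (\exp(0.5) - \exp(0)) / \exp(0) \approx 0.65, i.e., the estimated hazard is about 65 percent higher when the covariate is present.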

I wrote a little R function for calculating this quantity. Since the coefficients (b) are unknown quantities and can only be estimated, the function takes into account their uncertainty and simulates a distribution of the “percentage change in hazard rate”. The output is the mean and the quantiles of the simulated distribution. Licht (2011) suggested something similar in the context of non-proportional hazard modeling.

The logic of the code is pretty simple: 1) estimate the model, 2) get the coefficient vector and the variance-covariance matrix of the estimates (obtained from the inverse of the Hessian), 3) use them to generate draws from a multivariate normal distribution, 4) calculate the hazard-change quantity for each draw, and 5) compute the mean and quantiles of this simulated distribution. See King et al. (2000) for a similar approach in the context of other models.
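
Since the original function is not reproduced here, below is a minimal sketch of these steps. It assumes a Cox model fitted with coxph() from the survival package and uses MASS::mvrnorm() for the simulation; the function name hazard.change, its arguments, and the example covariate values are made up for illustration.

library(survival)
library(MASS)  # for mvrnorm()

# Sketch (not the original function): simulate the percentage change in the
# hazard rate when covariate 'var' moves from x2 to x1
hazard.change <- function(model, var, x1, x2, nsim = 1000, probs = c(0.025, 0.975)) {
  b <- coef(model)   # 2) coefficient vector ...
  V <- vcov(model)   #    ... and variance-covariance matrix of the estimates
  draws <- mvrnorm(nsim, mu = b, Sigma = V)   # 3) multivariate normal draws
  # 4) percentage change in the hazard for each draw
  pct <- (exp(draws[, var] * x1) - exp(draws[, var] * x2)) /
    exp(draws[, var] * x2) * 100
  # 5) mean and quantiles of the simulated distribution
  c(mean = mean(pct), quantile(pct, probs = probs))
}

# 1) Estimate the model (illustration with the built-in lung data), then
# simulate the change for a 70-year-old relative to a 50-year-old
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
hazard.change(fit, var = "age", x1 = 70, x2 = 50)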

References
Box-Steffensmeier, J. M., & Jones, B. S. (2004). Event History Modeling: A Guide for Social Scientists. Cambridge: Cambridge University Press.
Licht, Amanda A. (2011). Change Comes with Time: Substantive Interpretation of Nonproportional Hazards in Event History Analysis. Political Analysis, 19(1), 227-243.
King, G., Tomz, M., & Wittenberg, J. (2000). Making the Most of Statistical Analyses: Improving Interpretation and Presentation. American Journal of Political Science, 44(2), 347-361.


Update: Version Control for LaTeX files

GitHub can be used to version control your tex files, which is handy if you are working on a large writing project, e.g. a dissertation. Here is a quick tutorial on how to include a footnote in your compiled tex files that shows the git commit ID and the commit date. See also my first post here.

Install the gitinfo package for LaTeX. Next, initialize a new git repository, e.g. via the terminal, and cd into its hooks folder:

git init
cd ./.git/hooks

Now, replace the files in this folder with these hooks (zip). I built them following the gitinfo package documentation. Next, include this header in your LaTeX file:

\usepackage{gitinfo}
\title{Selection, Aggregation and Missing data}
\author{sumtxt\footnote{Last commit: \gitCommitterIsoDate\ Git-Hash: \gitAbbrevHash. Please do not distribute without the permission of the author.}}

and compile the tex files. The footnote (should) now read: “Last commit: 2013-01-03 22:09:06 +0100 Git-Hash: 4b23635. Please do not distribute without the permission of the author.”

Update (6/19/13): Some other useful advice here.