How to set up R and JAGS on Amazon EC2

Yesterday I set up an Amazon EC2 machine and installed R, JAGS and rjags. Unfortunately, I was unable to compile the latest version of JAGS: the configuration step failed with an error saying that a specific LAPACK library was missing. I tried to install the missing library manually, but for some reason JAGS didn't recognize the new version. After an hour, I decided to go with an older JAGS version. Here is how I installed all three packages.

[If you have never worked on an EC2 machine: here are two helpful tutorials [1,2] on how to start an EC2 instance and install the command-line tools on your computer to access it via SSH.]

Log in via SSH. In a first step, we install an older version of R; the older version is necessary to get JAGS 1.0.4 running later. After downloading the R source file, we unpack it, cd into the folder and install a couple of additional libraries and compilers that we need to compile R (and later JAGS). We then configure the R installer without any X-window support and build everything. The last line makes sure that you can start R by simply typing 'R'.

cd ~
wget http://cran.r-project.org/src/base/R-2/R-2.10.0.tar.gz 
tar xf R-2.10.0.tar.gz
cd ~/R-2.10.0
sudo yum install gcc
sudo yum install gcc-c++
sudo yum install gcc-gfortran
sudo yum install readline-devel
sudo yum install make
./configure --with-x=no
make
PATH=$PATH:~/R-2.10.0/bin/

Next, try to start R and install the coda package.

R
install.packages('coda')
q()

Now we proceed with installing JAGS. First, get some libraries and the JAGS source code. Unpack it, cd into the folder and run configure, specifying the shared library folder. Specifying the shared library path is crucial; otherwise rjags fails to load in R.

sudo yum install blas-devel lapack-devel
cd ~
wget -O JAGS-1.0.4.tar.gz http://sourceforge.net/projects/mcmc-jags/files/JAGS/1.0/JAGS-1.0.4.tar.gz/download
tar xf JAGS-1.0.4.tar.gz
cd ~/JAGS-1.0.4
./configure --libdir=/usr/local/lib64 
make 
sudo make install

In a last step, download the appropriate rjags version and install it.

cd ~
wget http://cran.r-project.org/src/contrib/Archive/rjags/rjags_1.0.3-13.tar.gz
R CMD INSTALL --configure-args='--with-jags-modules=/usr/local/lib/JAGS/modules/' rjags_1.0.3-13.tar.gz
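
To check that everything worked, you can start R, load rjags and run a tiny test model. This is only a minimal sketch; the coin-flip model, the file name coin.bug and the data are made up for illustration.

R
library(rjags)

# write a minimal Bernoulli model to a file and let JAGS compile it
model_string <- "model {
  for (i in 1:N) { y[i] ~ dbern(theta) }
  theta ~ dbeta(1, 1)
}"
writeLines(model_string, "coin.bug")

y  <- c(1, 0, 1, 1, 0, 1, 1, 1)                                    # toy data
jm <- jags.model("coin.bug", data = list(y = y, N = length(y)), n.chains = 2)
update(jm, 1000)                                                   # burn-in
samp <- coda.samples(jm, "theta", n.iter = 5000)                   # posterior draws
summary(samp)
q()

If the installation is fine, summary(samp) prints posterior statistics for theta.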

Sources of inspiration: here, here, here and here.

Visualizing networks with ggplot2 in R

When I had to visualize some network data last semester in my social network analysis class, I wasn't happy with the plot function in R's sna package. It is not very flexible and makes it hard to modify the look of the graph. Thus, I decided to write a little function to visualize network data with the ggplot2 engine.

The biggest challenge in network visualization is usually to come up with the coordinates of the nodes in two-dimensional space. The sna package relies on a set of functions that calculate optimal coordinates with respect to some criterion. Two of the most prominent algorithms (Fruchterman and Reingold's force-directed placement algorithm and Kamada-Kawai's) are implemented in the sna functions gplot.layout.fruchtermanreingold and gplot.layout.kamadakawai. Both can be used in my function below.
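
To get a feel for what these layout functions return, here is a minimal sketch on a small random network: the result is simply a matrix with one (x, y) coordinate pair per node. The toy network and its parameters are only for illustration.

library(network)
library(sna)

toy <- network(10, directed = FALSE, density = 0.2)   # small random graph
m   <- as.matrix.network.adjacency(toy)               # sociomatrix
head(gplot.layout.fruchtermanreingold(m, NULL))       # one row of coordinates per node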

In the first part of the function, the layout function calculates the coordinates for every node in two-dimensional space. The function then takes these node coordinates and combines them with the edge list to get the pairs of coordinates that characterize the edges in the network.
In the middle part, the data are passed to ggplot and used to plot the nodes (a set of points) and the edges (a set of segments). The opts() calls at the end discard the default grid and other default layout elements from the ggplot figure. The last part of the code generates a random network and passes it to the plot function.

library(network)
library(ggplot2)
library(sna)
library(ergm)


plotg <- function(net, value=NULL) {
	m <- as.matrix.network.adjacency(net) # get sociomatrix
	# get coordinates from Fruchterman and Reingold's force-directed placement algorithm.
	plotcord <- data.frame(gplot.layout.fruchtermanreingold(m, NULL)) 
	# or get it them from Kamada-Kawai's algorithm: 
	# plotcord <- data.frame(gplot.layout.kamadakawai(m, NULL)) 
	colnames(plotcord) = c("X1","X2")
	edglist <- as.matrix.network.edgelist(net)
	edges <- data.frame(plotcord[edglist[,1],], plotcord[edglist[,2],])
	plotcord$elements <- as.factor(get.vertex.attribute(net, "elements"))
	colnames(edges) <-  c("X1","Y1","X2","Y2")
	edges$midX  <- (edges$X1 + edges$X2) / 2  # edge midpoints (not used in the plot,
	edges$midY  <- (edges$Y1 + edges$Y2) / 2  # but handy for placing edge labels)
	pnet <- ggplot()  + 
			geom_segment(aes(x=X1, y=Y1, xend = X2, yend = Y2), 
				data=edges, size = 0.5, colour="grey") +
			geom_point(aes(X1, X2,colour=elements), data=plotcord) +
			scale_colour_brewer(palette="Set1") +
			scale_x_continuous(breaks = NA) + scale_y_continuous(breaks = NA) +
			# discard default grid, axis titles and legend in ggplot2
			opts(panel.background = theme_rect(fill = "white", colour = NA),
				panel.grid.minor = theme_blank(), panel.grid.major = theme_blank(),
				axis.title.x = theme_blank(), axis.title.y = theme_blank(),
				legend.background = theme_rect(colour = NA),
				legend.position = "none")
	return(print(pnet))
}


# generate a random undirected network with 150 nodes
g <- network(150, directed=FALSE, density=0.03)
# assign every node to one of four classes (0-3) used for colouring
classes <- rbinom(150,1,0.5) + rbinom(150,1,0.5) + rbinom(150,1,0.5)
set.vertex.attribute(g, "elements", classes)

plotg(g)

I was too lazy to make this function more general (and user friendly). That's why, for most practical purposes, it needs to be modified to produce pretty visualizations, but I nevertheless hope that it provides a useful jumping-off point for others. Some of my plots from the class are below. I included them to show the flexibility you get from the ggplot2 engine compared with sna's default plot function. Unfortunately, I can't post the data for these networks.
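
One example of such a modification: plotg() hard-codes layout.par = NULL, but the sna layout functions accept tuning parameters. The snippet below is only a sketch (it reuses the g generated above, and niter = 1000 is an arbitrary choice); running the force-directed placement for more iterations often untangles denser graphs.

m <- as.matrix.network.adjacency(g)
# run Fruchterman-Reingold for more iterations than the default
plotcord <- data.frame(gplot.layout.fruchtermanreingold(m, layout.par = list(niter = 1000)))
colnames(plotcord) <- c("X1", "X2")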

Update: Another interesting approach to visualizing geocoded network data in R is explained in the FlowingData blog.


Chrome Browser History in R

Visualizing the number of visits per website over the months is pretty easy. I took the following steps to extract the data and visualize them.

1) Find the data. Google Chrome stores the browsing history in SQLite files. If you have a Mac, you can easily look into them by first navigating in the Terminal to the folder with the files (usually
~/Library/Application\ Support/Google/Chrome/Default ) and then typing: sqlite3 Archived\ History . This opens the SQLite client. You can now look at the layout of the visits table in the Archived History file via:

.tables
.schema visits

2) Extract the data. Next, I wanted a csv file with the timestamp of the visit and the URL. I extracted the data with a modified version of a Python script that I found here.

import os
import datetime
import sqlite3
import codecs, re

pattern = "(((http)|(https))(://)(www.)|().*?)\.[a-z]*/"

SQL_STATEMENT = 'SELECT urls.url, visit_time FROM visits, urls WHERE visits.url=urls.id;'

storage = codecs.open('out.csv', 'w', 'utf-8')

def date_from_webkit(webkit_timestamp):
    epoch_start = datetime.datetime(1601,1,1)
    delta = datetime.timedelta(microseconds=int(webkit_timestamp))
    return epoch_start + delta

# the history files live in Chrome's profile folder (see step 1); expand the ~ for sqlite3
chrome_dir = os.path.expanduser("~/Library/Application Support/Google/Chrome/Default")
paths = [os.path.join(chrome_dir, "Archived History"), os.path.join(chrome_dir, "History")]

for path in paths:
	c = sqlite3.connect(path)
	for row in c.execute(SQL_STATEMENT):
		date_time = date_from_webkit(row[1])
		url = re.search(pattern, row[0])
		try: urlc = url.group(0)
		except: urlc = "ERROR"
		storage.write(str(date_time)[0:19] + "\t" + urlc + "\n")

storage.close()

The script opens the two history files of the Google Chrome browser and selects two variables from them (urls.url and visit_time). Using the function date_from_webkit, it converts the timestamp variable into a more readable format ("%Y-%m-%d %H:%M:%S"). It also extracts the domain of each URL using the regular expression defined in the variable pattern. The last step writes a tab-separated file with a timestamp column and the shortened URL.

3) Visualize. The output of the Python script can easily be imported into R for any kind of analysis. I made the graphic above with the following code:

library(plyr)
library(ggplot2)
library(reshape2)  # provides melt()

# Import the data
data <- read.csv("out.csv", sep="\t", header=FALSE)  # the Python script writes no header row
colnames(data) <- c("time","url")
data$time <- as.POSIXct(data$time)
data$day <- format(data$time, format="%Y-%m-%d")

# Count visits per day
lgrep <- function(x, pat){ length(grep(pat, x$url)) }  # visits in x whose URL matches pat

counts.google <- ddply(data, .(day), "lgrep", pat="www.google", .progress="text")
counts.mail <- ddply(data, .(day), "lgrep", pat="mail.google", .progress="text")
counts.facebook <- ddply(data, .(day), "lgrep", pat="facebook", .progress="text")
counts.spiegel <- ddply(data, .(day), "lgrep", pat="spiegel", .progress="text")
counts.nytimes <- ddply(data, .(day), "lgrep", pat="nytimes", .progress="text")
counts.wiki <- ddply(data, .(day), "lgrep", pat="wikipedia", .progress="text")
counts.leo <- ddply(data, .(day), "lgrep", pat="dict.leo", .progress="text")
counts.hulu <- ddply(data, .(day), "lgrep", pat="hulu", .progress="text")

# Make new data.frame
df <- data.frame(day = counts.google$day,
	Google    = counts.google$lgrep,
	GMail     = counts.mail$lgrep,
	Facebook  = counts.facebook$lgrep,
	SPIEGEL   = counts.spiegel$lgrep,
	NYTimes   = counts.nytimes$lgrep,
	Wikipedia = counts.wiki$lgrep,
	Leo       = counts.leo$lgrep,
	hulu      = counts.hulu$lgrep)
em <- melt(df, id = "day")

# Plot it 
ggplot(aes(as.Date(day), value, color = variable), data = em) + 
	scale_x_date('') + 
	stat_smooth() + 
	scale_y_continuous('visits') + 
	geom_line(alpha=0.10) +  
	geom_point(alpha=0.20) + 
	opts(legend.title = theme_text(colour = 'white', size = 0)) + 
	scale_colour_brewer(palette="Set1")
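
To write the figure to a file, ggplot2's ggsave() saves the last plot that was drawn. A minimal sketch; the file name and size are just examples:

ggsave(filename = "chrome-visits.png", width = 8, height = 5)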

Apps, packages and services for Political Scientists

Apps

  • Papers – organizes your PDF library + citation database
  • Mendeley – manages citation database (desktop and online app)
  • GitHub Mac – simple version control
  • RStudio – a nice IDE for R if you don’t like plain coding

Packages

  • countrycode – R package that “can convert to and from 7 different country coding schemes. It uses regular expressions to convert long country names (e.g. Sri Lanka) into any of those coding schemes [Correlates of War character, CoW-numeric, ISO3-character, ISO3-numeric, ISO2-character, IMF, FIPS 10-4, official English short country names (ISO), continent, region], or into standardized country names (official short English).” A quick usage example follows this list.
  • datasciencetoolkit – Python toolbox to extract various data from texts
  • BeautifulSoup – helps to scrape websites with a few lines of Python code
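
A quick look at countrycode, as promised above. This is only a sketch; it assumes the package's standard code names (country.name, iso3c, cown).

library(countrycode)

# long country name to ISO3 character code
countrycode("Sri Lanka", origin = "country.name", destination = "iso3c")
# Correlates of War numeric code to standardized country name
countrycode(2, origin = "cown", destination = "country.name")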

Services

  • AWS/EC2 – run R and Python in the cloud to save time
  • TileMill – makes nice maps to visualize geocoded data
  • Detexify – translates handwritten symbols into LaTeX code
  • RegEr – instant regular expression tester

The current issue of The Political Methodologist discusses how to organize research, especially the data analysis workflow. Worth reading!