Nuclear power plants and danger zones worldwide

A couple of days ago, The Guardian’s Data Blog published data on the location and status of nuclear power plants around the globe. I used their data to create a Google Map that draws two zones around each active plant: a 20 km zone (the size of the Fukushima evacuation zone) and a 30 km zone (the size of the Chernobyl exclusion zone). Look at the map here.
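To give an idea of what the two circles mean for a single location, here is a minimal Python sketch that checks whether a point falls inside the 20 km or 30 km zone of a plant. It is not the code behind the map. The function names are made up for this illustration, the plant coordinates are only approximate values for Fukushima Daiichi, and the distance is a great-circle (haversine) approximation on a spherical Earth.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def zones_for(plant, point):
    """Report which of the two zones around `plant` contain `point`.

    Both arguments are (latitude, longitude) tuples (hypothetical helper).
    """
    d = haversine_km(plant[0], plant[1], point[0], point[1])
    return {"distance_km": d, "within_20km": d <= 20.0, "within_30km": d <= 30.0}


# Approximate coordinates, for illustration only.
plant = (37.42, 141.03)

near = zones_for(plant, (37.52, 141.03))    # 0.10 degrees further north
middle = zones_for(plant, (37.67, 141.03))  # 0.25 degrees further north
far = zones_for(plant, (38.42, 141.03))     # 1.00 degree further north
```

One degree of latitude is roughly 111 km, so the first point lies inside both zones, the second only inside the 30 km zone, and the third outside both.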

Another interesting map was made by Robert Johnston. He collected data on worldwide nuclear test zones. Look at the data on his website or look at this map generated from Johnston’s Google Earth KML file.

Graphics with Processing

When I want to visualize data, I mostly use the ggplot2 library in R. My impression is that this package provides the most advanced data visualization environment in R. But there is one thing you can’t do with ggplot2: create interactive graphics.

If I need an interactive graphic, Processing is my choice. Processing is a dialect of Java and provides a (simple) programming environment for creating data visualizations. The cool thing about Processing is that you can display your interactive graphics on a website using the JavaScript library processing.js and the HTML5 canvas-tag.

There is a fair number of tutorials on the web. I recommend getting your hands on Visualizing Data, written by Ben Fry, one of the developers of Processing. It has some very useful examples and is written in a very accessible way.

Other tutorials I found useful:

One thing took me a while to figure out: if you use processing.js, you need to put the Processing code inside the HTML body tag! If it is outside (in the head tag), it won’t work.

How MySQL supports data collection and analysis

Frequently I scrape textual data from the Internet or from digital documents (e.g. PDFs) and combine it with other data for an analysis. I usually use both Python and R for scraping and refining, and I switch between the two environments during data collection, refining and analysis. A typical workflow goes like this: scrape the data and do a basic cleanup (strip HTML etc.), send the data to R for aggregation and restructuring (merging with other data etc.), send the data to Processing to make some nice interactive graphics (if necessary), and finally go back to R and run a model. The key question is: how do you exchange the data between these three environments? The most obvious answer is to use CSV files. But in my experience, a local MySQL database is more useful, since it makes it easier to:

  • subset the data and selectively import / export data to / from each environment
  • search the data directly without importing the full dataset (e.g. R is super slow at searching text vectors)
  • make every refining step reversible by keeping each step in a separate table
  • migrate the data to the web

Of course, all these advantages only apply if you work with big datasets. On the other hand, if you work with textual data, you are certain to approach “big” quickly.
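The points listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the sqlite3 module from the standard library as a stand-in for a local MySQL server so that it runs without any setup, and the table names, column names and sample rows are made up. With MySQL-python the SQL would look much the same, except that query placeholders are written %s instead of ?.

```python
import re
import sqlite3

# Stand-in for a local MySQL database (assumption: in-memory SQLite).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Step 1: the raw scraped text goes into its own table (hypothetical schema).
cur.execute("CREATE TABLE raw_docs (id INTEGER PRIMARY KEY, source TEXT, body TEXT)")
cur.executemany(
    "INSERT INTO raw_docs (source, body) VALUES (?, ?)",
    [
        ("web", "<p>Budget debate in parliament</p>"),
        ("pdf", "Minutes of the energy committee"),
        ("web", "<p>Energy policy speech</p>"),
    ],
)

# Step 2: the cleaned text goes into a separate table, so the raw data
# stays untouched and the cleanup step can be redone at any time.
cur.execute("CREATE TABLE clean_docs (id INTEGER PRIMARY KEY, source TEXT, body TEXT)")
rows = cur.execute("SELECT id, source, body FROM raw_docs").fetchall()
for doc_id, source, body in rows:
    cleaned = re.sub(r"<[^>]+>", "", body).strip()  # crude HTML tag removal
    cur.execute("INSERT INTO clean_docs VALUES (?, ?, ?)", (doc_id, source, cleaned))
conn.commit()

# Subset: fetch only the web documents instead of the full dataset.
web_docs = cur.execute(
    "SELECT id, body FROM clean_docs WHERE source = ? ORDER BY id", ("web",)
).fetchall()

# Search: let the database find matching texts instead of scanning them
# in R or Python (LIKE ignores ASCII case here).
energy_ids = [
    row[0]
    for row in cur.execute(
        "SELECT id FROM clean_docs WHERE body LIKE ? ORDER BY id", ("%energy%",)
    )
]
```

Here web_docs holds only the two cleaned web documents and energy_ids holds the ids of the two texts that mention energy, while raw_docs still contains the original, uncleaned text.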

There are tons of tutorials on the web explaining how to set up MySQL and use it with Python, R and Processing. Here is a list of those that I found most helpful at the beginning:

Some hints:

  • Install version 5.1 of MySQL, not 5.5! It looks like the new version has some bugs (see also here). With 5.5, my MySQL server didn’t start at all after installation.
  • If you get an error while installing MySQL-python (the driver for connecting from Python to MySQL) via easy_install, use this (replace XYZ with your MySQL version!):
      PATH=$PATH:/usr/local/mysql-XYZ/bin sudo easy_install -Z MySQL-python
    No worries, this modifies your PATH only for this one command, not permanently! (Source)
  • If you want to play around without installing MySQL, download XAMPP and create a symbolic link for the MySQL socket using:
      sudo ln -s /tmp/mysql.sock /Applications/XAMPP/xamppfiles/var/mysql/mysql.sock
    That way R/Python/Processing can connect to it.

Crowd Forecasting and Elections

DIE ZEIT, a German weekly newspaper, asked people to forecast the results of the state election in Hamburg. 810 people participated from Feb 16 to 19 on Facebook and Twitter. Below is the distribution of the forecasts for each party, along with the final election results from Feb 20 (red line) and the result of a telephone survey (green line) conducted between Feb 15 and 17 by GMS, a commercial public opinion polling service. Data source: here, here and here.

          CDU     SPD     GAL     Linke   FDP     Others
Crowd     25.41   41.54   15.77   6.54    5.20    5.54
GMS       25.00   43.00   15.00   6.00    5.00    6.00
Result    21.90   48.30   11.20   6.40    6.60    5.50

It is pretty amazing that the crowd forecast is almost identical to the survey results from GMS. Although both failed to predict the election results with reasonable accuracy, the experiment shows that crowd forecasting might be an interesting method for polling the population at lower cost and with almost the same accuracy as conventional polling. I am somewhat surprised by the high correlation between the GMS results and the crowd forecast, since I had expected them to diverge due to self-selection biases in the crowd forecast experiment.
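To put a number on "almost identical", here is a small Python sketch that computes the mean absolute difference in percentage points between the three rows of the table above. The choice of this measure and the function name are mine, for illustration only; it is one simple way to compare the rows, not the only one.

```python
# Values copied from the table above, in the order:
# CDU, SPD, GAL, Linke, FDP, Sonstige (others)
crowd = [25.41, 41.54, 15.77, 6.54, 5.20, 5.54]
gms = [25.00, 43.00, 15.00, 6.00, 5.00, 6.00]
result = [21.90, 48.30, 11.20, 6.40, 6.60, 5.50]


def mean_abs_diff(a, b):
    """Mean absolute difference between two equally long lists of shares."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


crowd_vs_gms = mean_abs_diff(crowd, gms)
crowd_vs_result = mean_abs_diff(crowd, result)
gms_vs_result = mean_abs_diff(gms, result)

print("crowd vs GMS:    %.2f" % crowd_vs_gms)
print("crowd vs result: %.2f" % crowd_vs_result)
print("GMS vs result:   %.2f" % gms_vs_result)
```

On this measure the crowd and GMS differ by about 0.64 percentage points per party, while they miss the actual result by about 2.74 (crowd) and 2.45 (GMS) points per party.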