I spent the morning writing a short R script that transforms the UCDP/PRIO Armed Conflict dataset into a panel dataset in which each conflict is observed from the year it started to the year it ended (or the last year in the dataset). The unit of observation in the original dataset is an active conflict observed in a year t; more precisely, it is an ongoing conflict episode (for a particular conflict) in a year t. Inactive years between conflict episodes are not recorded (see the figure for an example from Ethiopia with two variables from the dataset). The purpose of the code below (especially the second part) is to fill in the observations between conflict episodes. In addition to the panel structure, I wanted a more natural variable structure and variable names (this is what the first part does).
The code below might not be optimal for all applications, but it should still be useful for others who need a panel version of the UCDP/PRIO. If you spot mistakes or have ideas for improvements, please share them in the comments. For an introduction to the dataset, definitions of the variables, etc., please read the original codebook.
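The fill-in step can be sketched in a few lines. The snippet below is a Python/pandas illustration of the same idea on a made-up toy table, not the actual R script, and the column names are placeholders rather than the real UCDP/PRIO variables:

```python
import pandas as pd

# Toy stand-in for the UCDP/PRIO structure: one row per conflict-year
# in which a conflict episode was active; inactive years are missing.
active = pd.DataFrame({
    "conflict_id": [1, 1, 1, 2],
    "year":        [1990, 1991, 1995, 2000],
})

# Expand each conflict into a full panel from its first to its last
# observed year, flagging the filled-in years as inactive.
rows = []
for cid, grp in active.groupby("conflict_id"):
    observed = set(grp["year"])
    for year in range(grp["year"].min(), grp["year"].max() + 1):
        rows.append({"conflict_id": cid,
                     "year": year,
                     "active": year in observed})
panel = pd.DataFrame(rows)
# Conflict 1 now covers 1990-1995, with 1992-1994 marked as inactive.
```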
A couple of days ago The Guardian’s Data Blog published data about the location and status of nuclear plants around the globe. I used their data to create a Google Map that shows a 20km Fukushima evacuation zone and a 30km Chernobyl exclusion zone around each active plant. Look at the map here.
Another interesting map was made by Robert Johnston. He collected data on worldwide nuclear test zones. Look at the data on his website or look at this map generated from Johnston’s Google Earth KML file.
Frequently I scrape textual data from the Internet or from digital documents (e.g. PDFs) and combine it with other data for an analysis. I use Python for scraping and refining, but also R, and I usually switch between the two environments during data collection, refining and analysis. A typical workflow goes like this: scrape the data and do a basic clean-up (strip HTML etc.), send the data to R for some aggregation and restructuring (merging with other data etc.), send the data to Processing to make some nice interactive graphics (if necessary), and finally go back to R and run a model. The key question is: how to exchange the data between these three environments? The most obvious answer is CSV files. But in my experience, a local MySQL database is more useful, since it is easier to:
- subset the data and selectively import/export data to/from each environment
- search the data directly without importing the full dataset (R is very slow at searching text vectors)
- keep separate tables for each refining step, making every step reversible
- migrate the data to the web
Of course, all these advantages only apply if you work with big datasets. On the other hand, if you work with textual data, you quickly approach “big”.
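To make the “selectively import” point concrete, here is a minimal sketch. It uses Python’s built-in sqlite3 module as a zero-setup stand-in for MySQL (the table and queries are made up for illustration); against a local MySQL database the SQL would look the same, only the connect call would go through a MySQL driver instead:

```python
import sqlite3

# sqlite3 as a zero-setup stand-in: with MySQL, only the connection
# step differs, the SQL below stays the same.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, source TEXT, body TEXT)")
con.executemany(
    "INSERT INTO docs (source, body) VALUES (?, ?)",
    [("web", "armed conflict in 1991"),
     ("pdf", "nuclear plants dataset"),
     ("web", "panel data in R")],
)
con.commit()

# Pull only what the current analysis step needs, instead of
# loading the full dataset into R or Python first.
subset = con.execute(
    "SELECT id, body FROM docs WHERE source = 'web' AND body LIKE '%conflict%'"
).fetchall()
```

This is exactly the pattern that makes the database approach pay off: the text search happens inside the database, and only the matching rows travel to the analysis environment.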
There are tons of tutorials on the web explaining how to set up MySQL and use it with Python, R and Processing. Here is a list of those that I found most helpful at the beginning:
- Getting started with MySQL
- Access MySQL from R (MAC)
- Connecting from Python to MySQL
- Using MySQL from Processing
- Install the 5.1 version of MySQL, not the 5.5! It looks like the new version has some bugs (see also here). After installation, my MySQL server didn’t start at all.
- If you get an error while installing MySQL-python (the driver to connect from Python to MySQL) via easy_install, use this (replace XYZ with your MySQL version!):
PATH=$PATH:/usr/local/mysql-XYZ/bin sudo easy_install -Z MySQL-python
No worries, this only modifies your PATH for this single command, not permanently! (Source)
- If you want to play around without installing MySQL, download XAMPP and create a socket using:
sudo ln -s /tmp/mysql.sock /Applications/XAMPP/xamppfiles/var/mysql/mysql.sock
That way R/Python/Processing can connect to it.
In a recent issue of TIME magazine, James Poniewozik makes an interesting observation about Web 2.0: “The ability to share information is changing our lives in tiny ways (see your Facebook news feed) and huge ones — witness the WikiLeaks phenomenon (…) Information has its limits, however. We’re becoming a society of constant documenters, but recording problems is not always enough to fix them” (source).
I agree with that.