How MySQL supports data collection and analysis

Frequently I scrape textual data from the Internet or digital documents (e.g. pdfs) and combine these data with some other data for an analysis. I usually use not only Python for scraping/refining, but also R. And usually I also switch between both environments during data collection, refining and analysis. A typical workflow goes like this: Scrape the data and make a basic cleanup (strip html etc.), send the data to R for some data aggregation and re-structuring (merge with other data etc.), send the data to Processing to make some nice interactive graphics (if necessary) and finally go back to R and run a model. The key question is: How to exchange the data between these three environments? The most obvious answer: Use csv-files. But in my experience, a local MySQL database is more useful, since then it is easier to:

  • subset the data and selectively import / export data to / from each environment
  • directly search the data without importing the full dataset (e.g –> R is super-slow in searching text vectors)
  • separate tables in a MySQL database can be used to make every refining step reversible
  • easy to migrate the data to the web

Of course, all these advantages only apply if you work with big datasets. One the other hand, if you work with textual data, you certainly quickly approaching “big”.

There are tons of tutorials in the web, explaining how to use set-up MySQL and use it with Python, R and Processing. Here is a list of those that I found most helpful at the beginning:

Some hints:

  • Install the 5.1. Version of MySQL – not the 5.5! Looks like that the new version has some bugs (see also here). After installation my MySQL server didn’t start at all.
  • If you get an error while installing MySQL-Python (the driver to connect from Python to MySQL) via easy_install, use this (replace XYZ with your MySQL Version!): PATH=$PATH:/usr/local/mysql-XYZ/bin sudo easy_install -Z MySQL-python No worries, this is only modifying your PATH once – not permanently! (Source)
  • if you want to play around without installing MySQL, download XAMPP and create a socket using sudo ln -s /tmp/mysql.sock /Applications/XAMPP/xamppfiles/var/mysql/mysql.sock That way R/Python/Processing can connect to it.