Crowd Forecasting and Elections

DIE ZEIT, a German weekly newspaper, asked people to forecast the results of the state election in Hamburg. 810 people participated via Facebook and Twitter from Feb 16 to 19. Below is the distribution of the forecasts for each party, along with the final election results from Feb 20 (red line) and the results of a telephone survey (green line) conducted between Feb 15 and 17 by GMS, a commercial public opinion polling service. Data source: here, here and here.

Share (%)    CDU    SPD    GAL  Linke    FDP  Others
Crowd      25.41  41.54  15.77   6.54   5.20    5.54
GMS        25.00  43.00  15.00   6.00   5.00    6.00
Result     21.90  48.30  11.20   6.40   6.60    5.50

It is pretty amazing that the crowd forecast is almost identical to the GMS survey results. Although both failed to predict the election results with reasonable accuracy (both underestimated the SPD by more than five percentage points and overestimated the CDU and GAL by more than three), the experiment shows that crowd forecasting might be an interesting method to poll the population at lower cost and with almost the same accuracy as conventional polling. I am somewhat surprised by how closely the GMS results and the crowd forecast agree, since I had expected them to diverge due to self-selection bias in the crowd forecast experiment.
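
To put a number on "almost identical", here is a minimal sketch that computes the mean absolute difference (in percentage points) between the three rows of the table. The choice of mean absolute error as the measure and the helper name mean_abs_error are my own assumptions for illustration; nothing here comes from DIE ZEIT or GMS.

```python
# Vote shares in percent, copied from the table above.
PARTIES = ["CDU", "SPD", "GAL", "Linke", "FDP", "Others"]
CROWD = [25.41, 41.54, 15.77, 6.54, 5.20, 5.54]
GMS = [25.00, 43.00, 15.00, 6.00, 5.00, 6.00]
RESULT = [21.90, 48.30, 11.20, 6.40, 6.60, 5.50]


def mean_abs_error(a, b):
    """Mean absolute difference in percentage points (illustrative helper)."""
    if len(a) != len(b):
        raise ValueError("both series must cover the same parties")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    print(f"Crowd vs. GMS:    {mean_abs_error(CROWD, GMS):.2f}")
    print(f"Crowd vs. result: {mean_abs_error(CROWD, RESULT):.2f}")
    print(f"GMS vs. result:   {mean_abs_error(GMS, RESULT):.2f}")
```

By this measure the crowd and GMS differ by about 0.6 points per party on average, while each misses the actual result by roughly 2.5 to 2.7 points, with GMS slightly closer.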

Numbers don’t speak for themselves

This article in Wired (“The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”) is a bit older, but I was pointed to it only a few days ago. The author claims that “faced with massive data, this approach to science – hypothesize, model, test – is becoming obsolete”. And further: “With enough data, the numbers speak for themselves”.

I don’t think so. I suspect that what the author actually means is that we may no longer need to think much about sampling, since we have a complete dataset for the population of interest (e.g. all customers of firm X).

But even if that’s true (which I also doubt), numbers don’t speak for themselves. We still need statistics to test competing theoretical models, to discover patterns in data (e.g. via clustering/classification), or simply to reduce the massive amount of data to something that we can actually process in a reasonable amount of time (e.g. dimensionality reduction via scaling).