Chrome Browser History in R

Visualizing the number of visits per websites over months is pretty easy. I took the following steps to extract the data and visualize them.

1) Find the data. Google Chrome stores the browsing history in SQLite files. If you have Mac, you can easily look into them by first navigating in the Terminal to the folder with the files (usually:
~/Library/Application\ Support/Google/Chrome/Default ) and then type: sqlite3 Archived\ History . This opens the SQLite client. You can now look at the table layout visits in the file Archived History via:

.tables
.schema visits

2.) Extract the data. Next I wanted to have a csv-file with the timestamp of the visit and the URL. I extracted the data with a modified Python script, that I found here.

import os
import datetime
import sqlite3
import codecs, re

pattern = "(((http)|(https))(://)(www.)|().*?)\.[a-z]*/"

SQL_STATEMENT = 'SELECT urls.url, visit_time FROM visits, urls WHERE visits.url=urls.id;'

storage = codecs.open('out.csv', 'w', 'utf-8')

def date_from_webkit(webkit_timestamp):
    epoch_start = datetime.datetime(1601,1,1)
    delta = datetime.timedelta(microseconds=int(webkit_timestamp))
    return epoch_start + delta

paths = ["~/Archived History", "~/History"] 

for path in paths:
	c = sqlite3.connect(path)
	for row in c.execute(SQL_STATEMENT):
		date_time = date_from_webkit(row[1])
		url = re.search(pattern, row[0])
		try: urlc = url.group(0)
		except: urlc = "ERROR"
		storage.write(str(date_time)[0:19] + "\t" + urlc + "\n")

The script opens two history files from the Google Chrome Browser and selects two variables from them (urls.url, visit_time). Using the function date_from_webkit it walks through the timestamp variable and converts it to a more readable format (“%Y-%m-%d %H:%M:%S). It also extracts the domain of the URL using a regular expression defined in the variable pattern. The last step outputs a csv file with a timestamp column and a short URL.

3.) Visualize. The output of the Python script can be easily imported into R for any kind of analysis. I made the graphic above with the following code:

library(plyr)
library(ggplot2)

# Import the data
data <- read.csv("out.csv", sep="\t")
colnames(data) <- c("time","url")
data$time <- as.POSIXct(data$time)
data$day <- format(data$time, format="%Y-%m-%d")

# Count visits per day
lgrep <- function(x,pat){ c <- grep(pat, x$url); return(length(c)) }

counts.google <- ddply(data, .(day), "lgrep", pat="www.google", .progress="text")
counts.mail <- ddply(data, .(day), "lgrep", pat="mail.google", .progress="text")
counts.facebook <- ddply(data, .(day), "lgrep", pat="facebook", .progress="text")
counts.spiegel <- ddply(data, .(day), "lgrep", pat="spiegel", .progress="text")
counts.nytimes <- ddply(data, .(day), "lgrep", pat="nytimes", .progress="text")
counts.wiki <- ddply(data, .(day), "lgrep", pat="wikipedia", .progress="text")
counts.leo <- ddply(data, .(day), "lgrep", pat="dict.leo", .progress="text")
counts.hulu <- ddply(data, .(day), "lgrep", pat="hulu", .progress="text")

# Make new data.frame
df <- data.frame(day=counts.google$day, Google=counts.google$lgrep, GMail = counts.mail$lgrep, Facebook=counts.facebook$lgrep, SPIEGEL=counts.spiegel$lgrep, NYTimes=counts.nytimes$lgrep, Wikipedia=counts.wiki$lgrep, Leo=counts.leo$lgrep, hulu=counts.hulu$lgrep) 
em <- melt(df, id = "day")

# Plot it 
ggplot(aes(as.Date(day), value, color = variable), colour=clarity , data=em) + 
	scale_x_date('') + 
	stat_smooth() + 
	scale_y_continuous('visits') + 
	geom_line(alpha=0.10) +  
	geom_point(alpha=0.20) + 
	opts(legend.title = theme_text(colour = 'white', size = 0)) + 
	scale_colour_brewer(palette="Set1")
Advertisements

One Comment on “Chrome Browser History in R”

  1. […] You still need to categorize sites into positive/negative productivity, but it’s a start. Chrome History in R is a good place to start if you’re interested in taking that route. I don’t currently […]


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s