The goal of this post is to export the Mozilla Firefox browser history and import it into R as a dataframe.
Browser history data
Firefox saves your browsing history in a file called places.sqlite. This file contains several tables, e.g. for bookmarks, favicons and the browsing history.
To get a dataframe with the visited websites, you need two tables from the sqlite file:
- moz_historyvisits: contains all websites you visited, with time and date. Each visit stores an id instead of a readable URL.
- moz_places: contains the translation between a website's id and its actual URL.
More on the database schema can be found in Mozilla's documentation.
Import the data into R
sqlite files can be imported with the RSQLite package.
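If the packages used in this post are not installed yet, a one-time install covers them all (a standard setup step, assuming none of them are already present):

# install all packages used in this post
install.packages(c("RSQLite", "purrr", "here", "urltools", "dplyr", "stringr"))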
First, find the places.sqlite file on your computer. You can get the path by visiting about:support in Firefox and looking for the Profile Directory.
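Since Firefox keeps the database locked while it is running, it is safer to work on a copy of the file. The following sketch locates and copies it on Linux; the ~/.mozilla/firefox location and the copy destination are assumptions, and the profile layout differs on Windows and macOS:

# hypothetical sketch: find places.sqlite in the default Linux profile folder
# and work on a copy, since Firefox locks the original while running
profile_dir <- file.path(Sys.getenv("HOME"), ".mozilla", "firefox")
places <- list.files(profile_dir, pattern = "^places\\.sqlite$",
                     recursive = TRUE, full.names = TRUE)[1]
file.copy(places, "places.sqlite", overwrite = TRUE)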
library(RSQLite)
library(purrr)
library(here)

# connect to the database
con <- dbConnect(drv = RSQLite::SQLite(),
                 dbname = "path/to/places.sqlite",
                 bigint = "character")
# get all tables
tables <- dbListTables(con)
# remove internal tables
tables <- tables[tables != "sqlite_sequence"]
# create a list of dataframes
list_of_df <- purrr::map(tables, ~{
  dbGetQuery(conn = con, statement = paste0("SELECT * FROM '", .x, "'"))
})
# give the list of dataframes some names
names(list_of_df) <- tables
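A quick way to check what was read is to count the rows per table; this is a small sketch assuming the list built above:

# sanity check: number of rows in each imported table
purrr::map_int(list_of_df, nrow)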
Extract browser history
Next, we extract the two tables with the information we need, join them, and keep only the visited URL, the visit time and the URL id.
There are two caveats:
- The timestamps are saved in the PRTime format, which is basically a Unix timestamp in microseconds, so you have to convert it into a human-readable format (see the quick example after this list).
- The domain of a URL can be extracted with the urltools package, e.g. getting twitter.com instead of twitter.com/cutterkom.
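As a quick sanity check of the PRTime conversion: dividing by one million turns microseconds into seconds, which as.POSIXct understands. The value below is made up for illustration:

# an arbitrary PRTime value: microseconds since the Unix epoch
as.POSIXct(1600000000000000 / 1000000, origin = "1970-01-01", tz = "GMT")
# [1] "2020-09-13 12:26:40 GMT"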
library(urltools)
library(dplyr)    # for left_join(), select(), mutate() and the pipe
library(stringr)  # for str_remove()

# get the two dataframes
history <- list_of_df[["moz_historyvisits"]]
urls <- list_of_df[["moz_places"]]

df <- left_join(history, urls, by = c("place_id" = "id")) %>%
  select(place_id, url, visit_date) %>%
  # convert the PRTime timestamp (microseconds since the Unix epoch)
  mutate(date = as.POSIXct(as.numeric(visit_date) / 1000000, origin = "1970-01-01", tz = "GMT"),
         # extract the domain from the URL, e.g. `twitter.com` instead of `twitter.com/cutterkom`
         domain = str_remove(urltools::domain(url), "www\\."))
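As an aside, the same join could also be done directly in SQLite before reading anything into R. This is only a sketch, reusing the con connection from above and mirroring the columns of the dplyr pipeline; afterwards the connection can be closed:

# alternative: let SQLite do the join
df_sql <- dbGetQuery(con, "
  SELECT v.place_id, p.url, v.visit_date
  FROM moz_historyvisits AS v
  LEFT JOIN moz_places AS p ON v.place_id = p.id
")

# close the database connection when done
dbDisconnect(con)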
2 comments
Thank you for this interesting post.
Minor comment: For the very last command to execute, I had to load two more packages.
Package dplyr is needed for the command mutate.
Package stringr is needed for the command str_remove.
Indeed! `tidyverse` is my base R 😀