How to Export Your Mozilla Firefox History as a Dataframe in R

The goal of this post is to export your Mozilla Firefox browsing history and import it into R as a dataframe.

Browser history data

Firefox saves your browsing history in a file called places.sqlite. This file contains several tables, for example for bookmarks, favicons and the browsing history.

To get a dataframe of visited websites, you need two tables from this file:

  1. moz_historyvisits: contains one row per visit with its date and time. The visited websites are referenced by an id instead of a readable URL.
  2. moz_places: maps each website id to its actual URL.

More on the database schema can be found in Mozilla's documentation of the Places database.
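To make the relationship between the two tables concrete, the join can also be written directly in SQL. This is just a sketch, assuming a database connection con as created in the next section:

# peek at the joined tables, assuming a DBI connection `con` (see below)
dbGetQuery(con, "
  SELECT v.place_id, v.visit_date, p.url
  FROM moz_historyvisits AS v
  LEFT JOIN moz_places AS p ON v.place_id = p.id
  LIMIT 5
")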

Import the data into R

SQLite files can be imported into R with the RSQLite package.

First, find places.sqlite on your computer. You can get the path by visiting about:support in Firefox and looking for the Profile Directory.
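If you prefer to search for the file from R, here is a minimal sketch. The profile locations below are typical defaults and are an assumption on my part; they can differ by operating system and Firefox installation, so about:support remains the authoritative source.

# a sketch for locating places.sqlite from R
# NOTE: these profile paths are typical defaults and may differ on your system
profile_root <- switch(Sys.info()[["sysname"]],
                       Linux   = "~/.mozilla/firefox",
                       Darwin  = "~/Library/Application Support/Firefox/Profiles",
                       Windows = file.path(Sys.getenv("APPDATA"), "Mozilla", "Firefox", "Profiles"))

# search all profile folders for places.sqlite
list.files(path.expand(profile_root), pattern = "^places\\.sqlite$",
           recursive = TRUE, full.names = TRUE)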

library(RSQLite)
library(purrr)
library(here)

# connect to database
con <- dbConnect(drv = RSQLite::SQLite(), 
                 dbname = "path/to/places.sqlite",
                 bigint="character")

# get all tables
tables <- dbListTables(con)

# remove internal tables
tables <- tables[tables != "sqlite_sequence"]
# create a list of dataframes
list_of_df <- purrr::map(tables, ~{
  dbGetQuery(conn = con, statement=paste0("SELECT * FROM '", .x, "'"))
})
# give the list of dataframes some names
names(list_of_df) <- tables
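
With all tables in memory, a quick look at the row counts shows whether the import worked, and the database connection can be closed again:

# how many rows does each table have?
purrr::map_int(list_of_df, nrow)

# the connection is no longer needed once everything is in memory
dbDisconnect(con)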

Extract browser history

Next, we extract the two tables with the information we need, join them and keep only the URL id, the visited URL and the visit time.

There are two caveats:

  1. The timestamps are saved in the PRTime format, which is basically a Unix timestamp in microseconds, so you have to convert it into a human-readable format.
  2. The URLs contain the full path. To get just the domain, e.g. twitter.com instead of twitter.com/cutterkom, you can extract it with the urltools package.

library(dplyr)
library(stringr)
library(urltools)

# get the two dataframes
history <- list_of_df[["moz_historyvisits"]]
urls <- list_of_df[["moz_places"]]

df <- left_join(history, urls, by = c("place_id" = "id")) %>% 
  select(place_id, url, visit_date) %>% 
  # convert the PRTime timestamp (microseconds since 1970) into a date
  mutate(date = as.POSIXct(as.numeric(visit_date) / 1000000, origin = "1970-01-01", tz = "GMT"),
         # extract the domain from the URL, e.g. `twitter.com` instead of `twitter.com/cutterkom`
         domain = str_remove(urltools::domain(url), "www\\."))
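
As a quick example of what you can now do with the dataframe, you can count how often each domain shows up in your history:

# the ten most visited domains
df %>% 
  count(domain, sort = TRUE) %>% 
  head(10)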
