Blog

Sorry, Even Data Analyses Are Not the Holy Grail of Objectivity

Data analyses are not neutral: every decision about variables or methods is ultimately also a substantive decision. A study covered by Spektrum magazine illustrates this vividly:

Do Black football players receive red cards more often than non-Black players? That was the question to which researchers gave 29 different answers. The results differed considerably in places and even contradicted each other, although everyone had exactly the same dataset at their disposal.

The differences stem, for example, from the following points:

  • What are the assumptions about the distribution of the data?
  • Can referees and players influence each other?
  • Are red cards independent of one another?
  • Are all variables included in the analysis? "A good two thirds of the teams had, for example, taken the player's position on the pitch into account, but only three percent the total number of dismissals a referee had issued."

And what follows from this? Can analyses no longer be trusted? Of course they can, but as so often, it helps to be aware that even data analyses do not produce results carved in stone. As in journalism, the same applies here: transparency increases credibility.

The best defense against subjectivity in science is to expose it. Transparency in data, methods, and process gives the rest of the community opportunity to see the decisions, question them, offer alternatives, and test these alternatives in further research.

Study: "Many Analysts, One Data Set"

So does black skin color influence dismissals after all? Two thirds of the analyses say "yes", one third "no".

via WZB Data Science Blog

How to Export Your Mozilla Firefox History as a Dataframe in R

The goal of this post is to export the Mozilla Firefox browser history and import it into R as a dataframe.

Browser history data

Firefox saves your browsing history in a file called places.sqlite. This file contains several tables, such as bookmarks, favicons, and the history.

To get a dataframe with visited websites, you need two tables from the SQLite file:

  1. moz_historyvisits: contains every website visit with time and date. Each visit stores a place id instead of a readable URL.
  2. moz_places: contains the mapping from that id to the actual URL.
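
In SQL terms, the join between the two tables looks roughly like this (a sketch; it assumes the connection con that is opened in the next section):

# join visits to their readable URLs via the place id
dbGetQuery(con, "
  SELECT v.visit_date, p.url
  FROM moz_historyvisits v
  JOIN moz_places p ON v.place_id = p.id
  LIMIT 5
")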

Import the data into R

SQLite files can be imported with the RSQLite package.

First, find places.sqlite on your computer. You can get the path by visiting about:support in Firefox and looking for the Profile Directory entry.

library(RSQLite)
library(purrr)
library(here)

# connect to the database
con <- dbConnect(drv = RSQLite::SQLite(),
                 dbname = "path/to/places.sqlite",
                 bigint = "character")

# get all tables
tables <- dbListTables(con)

# remove internal tables
tables <- tables[tables != "sqlite_sequence"]

# create a list of dataframes
list_of_df <- purrr::map(tables, ~{
  dbGetQuery(conn = con, statement = paste0("SELECT * FROM '", .x, "'"))
})

# give the list of dataframes some names
names(list_of_df) <- tables
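
The tables are now copied into R, so the database connection can be closed. A quick sanity check (a sketch):

# the data now lives in R, so the connection can be closed
dbDisconnect(con)

# which tables did we get?
names(list_of_df)

# peek at the raw visit data
head(list_of_df[["moz_historyvisits"]])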

Extract browser history

Next, we extract the two tables with the information we need, join them, and keep only the visited URL, the visit time, and the place id.

There are two caveats:

  1. The timestamps are saved in the PRTime format, which is essentially a Unix timestamp in microseconds, so you have to convert it into a human-readable format.
  2. The domain of a URL is extracted with the urltools package, e.g. twitter.com instead of twitter.com/cutterkom.

library(dplyr)
library(stringr)
library(urltools)
# get the two dataframes 
history <- list_of_df[["moz_historyvisits"]]
urls <- list_of_df[["moz_places"]]

df <- left_join(history, urls, by = c("place_id" = "id")) %>% 
  select(place_id, url, visit_date) %>% 
  # convert the PRTime microsecond timestamp into a date
  mutate(date = as.POSIXct(as.numeric(visit_date) / 1000000, origin = '1970-01-01', tz = 'GMT'),
         # extract the domain from the URL, e.g. `twitter.com` instead of `twitter.com/cutterkom`
         domain = str_remove(urltools::domain(url), "www\\."))
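
As a quick check on the result, you can count the most visited domains (a sketch with dplyr's count(), using the df from above):

# the ten most visited domains
df %>% 
  count(domain, sort = TRUE) %>% 
  head(10)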

When Two Points on a Circle Form a Line

There are many ways to produce computer-created abstract images. I will show you one of them, written in R, that leads to images like these:

First of all, let’s set the stage with a config part:

#### load packages
#### instead of tidyverse you can also use just ggplot2, dplyr, purrr and magrittr
library(here)
library(tidyverse)

####
#### Utility functions necessary: 
#### You can find them in the generativeart package on Github: github.com/cutterkom/generativeart.
#### Here they are stored in `src/generate_img.R`.
####
source(here("src/generate_img.R"))
NR_OF_IMG <- 1
LOGFILE_PATH <- "logfile/logfile.csv"

The basic concept is:

  • form a starting distribution of the points
  • transform the data

In this case, our starting point is a circle. I create the data with a function called get_circle_data(). The function was proposed on Stack Overflow by Joran Elias.

get_circle_data <- function(center = c(0, 0), radius = 1, npoints = 100) {
  # angles evenly spaced around the circle
  tt <- seq(0, 2 * pi, length.out = npoints)
  # convert the angles into x/y coordinates around the center
  xx <- center[1] + radius * cos(tt)
  yy <- center[2] + radius * sin(tt)
  return(data.frame(x = xx, y = yy))
}
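
To see the starting distribution on its own, you can plot the raw circle points (a minimal sketch):

circle <- get_circle_data(center = c(0, 0), radius = 1, npoints = 100)

ggplot(circle, aes(x, y)) +
  geom_point(size = 0.5) +
  coord_equal() +
  theme_void()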

The circle dataframe goes straight into generate_data(), where every point on the circle is connected to exactly one other point. Which pairs of coordinates are connected is based on randomness, see sample(nrow(df2)):

generate_data <- function() {
  print("generate data")
  df <- get_circle_data(c(0, 0), 1, npoints = 100)
  # duplicate the points as potential segment end points
  df2 <- df %>% 
    mutate(xend = x,
           yend = y) %>% 
    select(-x, -y)
  # shuffle the rows so every point gets a random partner
  df2 <- df2[sample(nrow(df2)), ]
  df <- bind_cols(df, df2)
  return(df)
}
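
Setting a seed makes the random pairing reproducible. A quick look at the generated segments (a sketch):

set.seed(42)  # fix the random pairing so a run can be reproduced
df <- generate_data()
head(df, 3)   # columns x, y plus their randomly assigned xend, yend partners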

The dataframe is the input to a plotting function built around ggplot2::geom_segment():

generate_plot <- function(df, file_name, coord) {
  print("generate plot")
  plot <- df %>% 
    ggplot() +
    geom_segment(aes(x = x, y = y, xend = xend, yend = yend), color = "black", size = 0.25, alpha = 0.6) +
    theme_void() +
    coord_equal()
  
  # writing the image to disk is handled by the generate_img() wrapper
  plot
}
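
The two functions can also be run by hand (a minimal sketch; file_name and coord are unused in this stripped-down version of generate_plot()):

df <- generate_data()
# returns the ggplot object; writing it to disk is left to the wrapper
generate_plot(df, file_name = "circle.png", coord = NULL)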

Now we have gathered all the parts to run the wrapper function generate_img() from the generativeart package, which actually creates an image:

generate_img()

From here, you can play with the input parameters to generate different-looking images. You can change these variables in get_circle_data(), as sketched after this list:

  • center = c(0,0): changes nothing when you draw only one circle; the center can be anywhere
  • radius = 1: numbers greater than 1 for rings within the circle
  • npoints = 100: higher numbers for denser circle lines
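
For example, a denser variant with a larger radius (a sketch; generate_data_variant() is an illustrative name, not part of the package):

# a hypothetical variant of generate_data() with changed parameters
generate_data_variant <- function() {
  df <- get_circle_data(center = c(0, 0), radius = 2, npoints = 500)
  df2 <- df %>% 
    mutate(xend = x, yend = y) %>% 
    select(-x, -y)
  bind_cols(df, df2[sample(nrow(df2)), ])
}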

You can find the code in an .Rmd script on Github.

Automated Journalism: Writing by Numbers

Radar is a press agency from Great Britain whose source is open data. With the help of software, its journalists write not one text, but many texts at the same time:

Our journalists select the most promising data, mine the data to find the story, develop the different angles and then compose a template that instructs the technology on what sentence to write as it computes the numbers in the spread sheet. We are writing stories as mini-algorithms for each new set of data.


More on this in this piece: How RADAR became front page news: Lessons from the first year of an automated news agency

At Süddeutsche Zeitung, we did the same for the state elections in the autumn: a statistical model compared each constituency's result with all the other constituencies and generated texts that depended on the outcome. For example, for München-Mitte:

Our tweets back then, after the election:

When the Past Greets You from the Chat Window

"Kein Hallo, kein Tschüss" ("No hello, no goodbye") was the title of a talk I gave together with Elisabeth Gamperl at Netzkongress 2017. It was about friendship in digital times, that endless stream of messages, half-sentences, and emojis.

All the messages sit on servers, retrievable and re-readable at any time. Many people on Facebook have done just that in recent days, because a bug in Facebook's software drew users' attention to old messages. The Atlantic has written about it: the piece is called The Infinite Weirdness of Never-Ending Chat Histories, and it picks up on both sides: the cozy feeling of diving into your own past, that digital box of letters, tickets, and other souvenirs of your own life. And the hard landing in a past:

But these threads are just as often unnerving. Chat provides an immediate portal into your past in a way that a photo doesn’t. When you look at an old picture, you’re never remembering things the way they really were—you’re projecting your own memory of that event or day. Revisiting the same period through an old chat history is different. Chat records offer concrete evidence of the way things really felt in that moment: the embarrassing slang you used, the plans you made, the idle thoughts you shared with friends. A chat history forces you to confront a version of who you are that you probably forgot about. Part of what made Facebook users affected by the bug so uncomfortable was seeing an old version of themselves pop up without warning.


The tools we use shape our behavior. And with digital message services, some new things are happening, for example:

“I switch chat platforms to avoid ever getting back to that context,” says Anushk Mittal, a developer and student in Georgia. Mittal says that if he has a bad interaction or ghosted someone on Instagram DM, for instance, he’ll often just add them on a different platform to start fresh instead of reopening the old wound. Facebook, for its part, appears to have realized how awkward these eternal histories can be. Now, when you click to message someone via their profile, a new chat window, devoid of history, appears. When that person responds, however, you’re forced back into the thread.

Being pushed back into the past also happens with other services, such as Amazon: have you ever scrolled through all the addresses you have had things delivered to?

The R package to Create Generative Art


Do you want to create #generativeart with #rstats? I made a package for this purpose. It is called generativeart and you can find it on Github.

You can find more images on my Instagram account @cutterkom.

Description

One overly simple but useful definition is that generative art is art programmed using a computer that intentionally introduces randomness as part of its creation process.
Why Love Generative Art? – Artnome

The R package generativeart lets you create images based on many thousands of points. The position of every single point is calculated by a formula with random parameters. Because of the random numbers, every image looks different.

In order to make an image reproducible, generativeart writes a log file that saves the file_name, the seed, and the formula.


Generative Art: How thousands of points can form beautiful images

These images are based on simple points. This post explains how it works.

Step 1: Points, points, points …

The starting point is a rectangle, a grid that is populated with many thousands of points, in this case 3,969.

The rectangle is placed in a coordinate system, so every point has two coordinates (x, y).

Step 2: Transform every point

Now, the position of every single point is transformed. The new position is calculated by a formula with random parameters. Because of the random numbers, every image looks different.

For example, using a combination of sine, cosine, and a random factor:
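
A minimal sketch of the idea (the exact formula in the generativeart package differs; the grid bounds and the parameters a and b are illustrative):

library(ggplot2)
library(dplyr)

# a 63 x 63 grid = 3,969 points
grid <- expand.grid(x = seq(-1, 1, length.out = 63),
                    y = seq(-1, 1, length.out = 63))

# random parameters, drawn once per image
a <- runif(1, -1, 1)
b <- runif(1, -1, 1)

# move every point with a sine/cosine formula
transformed <- grid %>% 
  mutate(x_new = x + a * sin(y * pi),
         y_new = y + b * cos(x * pi))

ggplot(transformed, aes(x_new, y_new)) +
  geom_point(alpha = 0.1, size = 0.5) +
  theme_void()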

Circle-resembling shapes are created by using a polar coordinate system:
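
In ggplot2, this amounts to adding coord_polar() to the plot (a sketch, reusing the transformed points from above):

ggplot(transformed, aes(x_new, y_new)) +
  geom_point(alpha = 0.1, size = 0.5) +
  coord_polar() +
  theme_void()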

Do it yourself

I wrote a package called generativeart which helps to create these kinds of images with R.

You can get the package on Github.