
AI, ML, etc.: It's going to take a while, and that's perfectly normal

AI here, AI there. But who is actually using it? MIT Technology Review does a reality check and confirms much of what those who work with it in practice already know.

It’s one thing to see breakthroughs in artificial intelligence that can outplay grandmasters of Go, or even to have devices that turn on music at your command. It’s another thing to use AI to make more than incremental changes in businesses that aren’t inherently digital.

Google, Amazon, Facebook, Netflix and the other big companies have huge numbers of employees who do nothing but this, and their business models are inherently built on data. In other industries, that is not the case.

Data scientists at IBM and Fluor didn’t need long to mock up algorithms the system would use, says Leslie Lindgren, Fluor’s vice president of information management. What took much more time was refining the technology with the close participation of Fluor employees who would use the system. In order for them to trust its judgments, they needed to have input into how it would work, and they had to carefully validate its results, Lindgren says.


To develop a system like this, “you have to bring your domain experts from the business—I mean your best people,” she says. “That means you have to pull them off other things.” Using top people was essential, she adds, because building the AI engine was “too important, too long, and too expensive” for them to do otherwise.

The conclusion: it will take a while before artificial intelligence and machine learning arrive at scale in non-tech industries as well. That is not unusual:

What (…) economists confirmed, is that the spread of technologies is shaped less by the intrinsic qualities of the innovations than by the economic situations of the users. The users’ key question is not, as it is for technologists, “What can the technology do?” but “How much will we benefit from investing in it?”


The illusion of the Cloud

  • „[The] “cloud” is a massive interconnected physical infrastructure which exists across the world.“
  • By using cloud services from Amazon, Google or Microsoft, one can outsource one’s own infrastructure setup with all its challenges
  • now: Infrastructure-as-a-Service
  • super-cheap hosting with a price that depends on usage and is scalable
  • „The actual infrastructure at the heart of AWS’ infrastructure-as-a-service isn’t the thing that makes it important to developers; it’s the services and APIs built on top of that infrastructure.“ (Ingrid Burrington)


Pros and Cons of a Social Index

Heather Krause writes one of my favorite newsletters. She works at Datassist, a company working with NGOs and data journalists.

Recently, she wrote about social indices:

A social index is a summary of a complex issue (or issues). Generally, social indexes take a large number of variables related to a specific topic or situation and combine them to get one number. It’s often a single number, but can also be a rank (#1 country out of 180) or a category (“high performing”).

Heather Krause

Pros of social indices:

  • attract public interest
  • allow comparisons over time
  • provide a big picture
  • „reduce vast amounts of information to a manageable size“

Cons:

  • „disguise a massive amount of inequality in the data“
  • invite simplistic interpretations
  • hide emerging problems in individual variables

So, should we use them?

Krause says, „yes“, but …

If we’re using an index to understand a trend or situation, we also need to look at the individual elements that make up that index.

Datassist published a list with various indicators here.
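
To make this concrete, here is a minimal R sketch of how several variables can be collapsed into a single number and a rank. All indicator names and values are invented for illustration; real indices usually add weighting and normalisation decisions on top of this.

library(dplyr)

# three invented indicators, already scaled to 0-1
countries <- tibble(
  country   = c("A", "B", "C"),
  education = c(0.7, 0.9, 0.5),
  health    = c(0.6, 0.8, 0.9),
  income    = c(0.4, 0.7, 0.6)
)

countries %>%
  mutate(index = (education + health + income) / 3) %>%  # unweighted mean as the index
  arrange(desc(index)) %>%
  mutate(rank = row_number())  # the same index reported as a rank

Keeping the individual indicator columns next to the combined index is exactly the kind of check Krause recommends.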


Zuboff and the Age of Surveillance Capitalism

Shoshana Zuboff has been observing and interpreting digitisation for 40 years. The Harvard Business School professor has written a book that the Guardian places in a line with Adam Smith, Karl Marx, Max Weber, Karl Polanyi and Thomas Piketty. It is called The Age of Surveillance Capitalism, and I think it should soon be sitting on my bookshelf.

An excerpt from an interview with the Guardian:

In my early fieldwork in the computerising offices and factories of the late 1970s and 80s, I discovered the duality of information technology: its capacity to automate but also to “informate”, which I use to mean to translate things, processes, behaviours, and so forth into information. This duality set information technology apart from earlier generations of technology: information technology produces new knowledge territories by virtue of its informating capability, always turning the world into information. The result is that these new knowledge territories become the subject of political conflict. The first conflict is over the distribution of knowledge: “Who knows?” The second is about authority: “Who decides who knows?” The third is about power: “Who decides who decides who knows?”

(…)

Surveillance capitalists were the first movers in this new world. They declared their right to know, to decide who knows, and to decide who decides. In this way they have come to dominate what I call “the division of learning in society”, which is now the central organising principle of the 21st-century social order, just as the division of labour was the key organising principle of society in the industrial age.

Shoshana Zuboff


A few links

Year Five at Stamen: Some interesting projects and short overviews from the mapmakers at Stamen

Podcast: Let’s Talk About Natural Language Processing (Data Skeptic)

A small cog in an endless machinery of checks and balances: funding through small contributions is still important, even if Blendle and the like haven't really taken off. Micro-funding is something for small niche media.

There is not just one future but many futures. They look like a funnel: the further out you look, the more possible scenarios there are. Zukünfte gestalten

Pen Plotter Artwork | Gunther Kleinert

„Am I crazy to think that they used nothing but my work?“ – SZ: Suppose someone uses open source code to create art with it – who is the author then? The person who wrote the code? Or the person who executed it?


An economist’s view on AI

Michaela Schmöller writes in „Secular stagnation: A false alarm in the euro area?“ for the Bank of Finland:

It is important to note that productivity growth evolves in a two-stage process: the initial invention of new technologies through research and development, subsequently followed by technological diffusion, i.e. the incorporation of these new technologies in the production processes of firms. As a result, even though many important technology advances may have been invented in recent times, they will only exert an effect on output and productivity once firms utilise these technologies in production. Potential productivity gains from technologies that have yet to be widely adopted may be sizable. A central example is the field of artificial intelligence in which future productivity gains may be considerable once AI-related technologies diffuse to the wider economy. (…)

AI may represent — as did the steam engine, the internal combustion engine and personal computers — a general purpose technology, meaning that it is far-reaching, holds the potential for further future improvements and has the capability of spurring other major, complementary innovations over time with the power of drastically boosting productivity. Incorporating AI in production requires substantial changes on the firm-level, including capital stock adjustments, the revision of internal processes and infrastructures, as well as adapting supply and value chains to enable the absorption of these new technologies. Consequently, this initial adjustment related to the incorporation of general purpose technologies in firms‘ production may take time and may initially even be accompanied by a drop in labour productivity before delivering positive productivity gains.



Sorry, data analyses are not the holy grail of objectivity either

Data analyses are not neutral: every decision about variables or methods is ultimately also a substantive decision. This is vividly shown by a study that Spektrum magazine writes about:

Do Black football players get red cards more often than non-Black players? That was the question to which researchers gave 29 different answers. The results differed considerably in part and even contradicted each other, even though everyone had exactly the same data set at their disposal.

The differences stem, for example, from the following points:

  • What assumptions are made about the distribution of the data?
  • Can referees and players influence each other?
  • Are red cards independent of each other?
  • Are all variables included in the analysis? „A good two thirds of the teams, for example, had taken the player’s position on the pitch into account, but only three percent the total number of dismissals a referee had handed out.“ (A small simulation sketch of this point follows the list.)
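
A hedged illustration of that last point with simulated data (unrelated to the real study): whether a second variable is included in a regression can noticeably change the estimated effect of the first one.

# simulated data: x and y are only connected through the confounder z
set.seed(1)
n <- 1000
z <- rnorm(n)            # e.g. a referee-level characteristic
x <- 0.5 * z + rnorm(n)  # x is correlated with z
y <- 0.3 * z + rnorm(n)  # y depends on z, not on x

coef(lm(y ~ x))["x"]      # without z: x seems to have an effect
coef(lm(y ~ x + z))["x"]  # with z: the apparent effect largely disappears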

So what follows from this? Should we stop trusting analyses? Of course not, but as so often it helps to be aware that data analyses do not produce results carved in stone. As in journalism, the same applies here: transparency increases credibility.

The best defense against subjectivity in science is to expose it. Transparency in data, methods, and process gives the rest of the community opportunity to see the decisions, question them, offer alternatives, and test these alternatives in further research.

Study „Many Analysts, One Data Set“

So does a dark skin colour influence dismissals? Two thirds of the analyses say „yes“, one third „no“.

via WZB Data Science Blog


How to Export Your Mozilla Firefox History as a Dataframe in R

The goal of this post is to export the Mozilla Firefox browser history and import it into R as a dataframe.

Browser history data

Firefox saves your browsing history in a file called places.sqlite. This file contains several tables, for example for bookmarks, favicons or the browsing history.

To get a dataframe with visited websites, you need two tables from the sqlite file:

  1. moz_historyvisits: it contains every visit you made, with date and time. The websites are referenced by an id instead of a readable URL.
  2. moz_places: it contains the mapping between a website’s id and its actual URL.

More on the database schema:

Import the data into R

SQLite files can be imported with the RSQLite package.

First, find the places.sqlite file on your computer. You can get the path by visiting about:support in Firefox and looking for the profile directory.

library(RSQLite)
library(purrr)
library(here)

# connect to database
con <- dbConnect(drv = RSQLite::SQLite(), 
                 dbname = "path/to/places.sqlite",
                 bigint="character")

# get all tables
tables <- dbListTables(con)

# remove internal tables
tables <- tables[tables != "sqlite_sequence"]
# create a list of dataframes
list_of_df <- purrr::map(tables, ~{
  dbGetQuery(conn = con, statement=paste0("SELECT * FROM '", .x, "'"))
})
# give the list of dataframes some names
names(list_of_df) <- tables

Extract browser history

Next, we extract the two tables with the information we need, join them and keep only the visited url, the time and the URL id.

There are two caveats:

  1. The timestamps are saved in the PRTime format, which is basically a Unix timestamp in microseconds, so you have to convert it into a human-readable format.
  2. We extract the domain of a URL using the urltools package, e.g. getting twitter.com instead of twitter.com/cutterkom.

library(urltools)
library(dplyr)    # for left_join, select, mutate and the pipe
library(stringr)  # for str_remove

# get the two dataframes 
history <- list_of_df[["moz_historyvisits"]]
urls <- list_of_df[["moz_places"]]

df <- left_join(history, urls, by = c("place_id" = "id")) %>% 
  select(place_id, url, visit_date) %>% 
  # convert the unix timestamp
  mutate(date = as.POSIXct(as.numeric(visit_date)/1000000, origin = '1970-01-01', tz = 'GMT'),
  # extract the domains from the URL, e.g. `twitter.com` instead of `twitter.com/cutterkom`
         domain = str_remove(urltools::domain(url), "www\\."))
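
As a quick usage example, assuming the df built above, you can count which domains you visit most often:

# the ten most frequently visited domains
df %>% 
  count(domain, sort = TRUE) %>% 
  head(10)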

Useful Links for Installing R and Shiny on a Cloud Server

  1. How To Install R on Ubuntu 18.04
  2. Install the Shiny package on your remote computer with install.packages("shiny").
  3. Follow instructions on the official Shiny website
  4. Done!

When Two Points on a Circle Form a Line

There are many ways to produce computer-generated abstract images. I will show you one of them, written in R, that leads to images like these:

First of all, let’s set the stage with a config part:

#### load packages
#### instead of tidyverse you can also use just ggplot2, purrr and magrittr
library(here)
library(tidyverse)

####
#### Utility functions necessary: 
#### You can find them in the generativeart package on Github: github.com/cutterkom/generativeart.
#### Here they are stored in `src/generate_img.R`.
####
source(here("src/generate_img.R"))
NR_OF_IMG <- 1
LOGFILE_PATH <- "logfile/logfile.csv"

The base concept is:

  • form a starting distribution of the points
  • transform the data

In this case, our starting point is a circle. I create the data with a function called get_circle_data(). The function was proposed on Stack Overflow by Joran Elias.

get_circle_data <- function(center = c(0,0), radius = 1, npoints = 100){
  tt <- seq(0, 2*pi, length.out = npoints)
  xx <- center[1] + radius * cos(tt)
  yy <- center[2] + radius * sin(tt)
  return(data.frame(x = xx, y = yy))
}

The circle dataframe goes straight into generate_data(), where every point on the circle is connected to exactly one other point. The connection between a pair of coordinates is based on randomness, see sample(nrow(df2)):

generate_data <- function() {
  print("generate data")
  df <- get_circle_data(c(0,0), 1, npoints = 100)
  df2 <- df %>% 
    mutate(xend = x,
           yend = y) %>% 
    select(-x, -y)
  df2 <- df2[sample(nrow(df2)),]
  df <- bind_cols(df, df2)
  return(df)
} 

The dataframe is the input to a ggplot2::geom_segment() plotting function:

generate_plot <- function(df, file_name, coord) {
  print("generate plot")
  plot <- df %>% 
    ggplot() +
    geom_segment(aes(x = x, y = y, xend = xend, yend = yend), color = "black", size = 0.25, alpha = 0.6) +
    theme_void() +
    coord_equal()
  
  print("image saved...")
  plot
}

Now we have gathered all the parts to run the wrapper function generate_img() from the generativeart package, which actually creates an image:

generate_img()

From here, you can play with the input parameters to generate different-looking images. You can change these variables in get_circle_data() (see the example after this list):

  • center = c(0,0): changes nothing when you draw only one circle, the center can be anywhere
  • radius = 1: numbers greater than 1 for rings within the circle
  • npoints = 100: Higher numbers for denser circle lines
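
For example, a denser circle with a larger radius can be generated like this (a small sketch reusing get_circle_data() from above; the values are arbitrary):

# a denser circle with a larger radius
df_dense <- get_circle_data(center = c(0, 0), radius = 2, npoints = 500)
head(df_dense)

To feed it into the pipeline above, you would change the corresponding call inside generate_data().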

You can find the code in an .Rmd script on Github.