Get your data into Wikidata or Wikibase with R: An import workflow derived from a real-world project

This tutorial is a walkthrough to import data into any Wikibase instance using R.

What is Wikibase?

Wikibase is open-source software for storing and retrieving structured, linked data. The biggest Wikibase-based website is Wikidata. More on Wikibase and its use cases can be found here.

But there are other Wikibases around, e.g. Factgrid, a database for historians. It is my platform and community for persisting and publishing the data derived in my Remove NA project, in which I construct a knowledge graph on queer history.

Data Modeling and Data Import

I needed to develop a workflow for data modeling and importing data into a Wikibase instance directly from R. Over time, this has evolved into a stable process that I can now recommend as a workflow. Iterations <3

Basically, it is an ETL process that needs to be customized depending on the entity. There are (at least) 17 different versions I built for entities that needed special data modeling. If you want to follow along, choose a script with a high prefix on GitHub; thanks to the iterations, the later scripts have become more and more sophisticated.

The following tutorial is a condensed version of this, showing the core elements of an ETL process that imports data into Wikidata or Wikibase.

In my Remove NA project, I also included some functionality beyond that, which you can find in the scripts on GitHub.

Minimum knowledge about Wikidata’s and Wikibase’s data structure

Everything in Wikidata/Wikibase has the form of a statement:

item -> property -> value

In terms of identifiers this can be translated into:

QID -> PID -> QID

That already tells us something about the data structure: at the end of the data modeling, we need a long-form table with three columns. If you consult the documentation of ?write_wikidata() or ?write_wikibase(), you'll see three parameters that expect exactly this form: items, properties, values.

This is more or less the mental model when programming with tidy data principles in mind. Quite convenient!
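
To make this concrete, here is a minimal sketch of such a long-form table (Q42 is Wikidata's item for Douglas Adams, P31 "instance of", Q5 "human"):

library(tibble)

# one statement in the long-form shape those parameters expect
statement <- tibble(
  item     = "Q42",  # Douglas Adams
  property = "P31",  # instance of
  value    = "Q5"    # human
)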

Want to know more? This article on Wikidata’s data model is a good starting point.

Model some data: A very basic example

For the purpose of this tutorial, we pretend we want to import data into an arbitrary Wikibase installation, but in the end import it into Wikidata. Why? It's the easiest way to show the workflow and follow along. What you need: a Wikidata/Wikimedia account.

We only update data on Wikidata's sandbox item Q4115189, which exists exactly for this purpose: testing stuff.

We want to add data that says:

  • The Sandbox item is a human
  • The Sandbox item is located in Thailand
  • The Sandbox item's field of work is climate change

We need to describe each of the three statements in terms of Wikidata/Wikibase QIDs and PIDs. Searching on Wikidata, we can easily find the corresponding IDs.

Statement                                            Item (QID)  Property (PID)  Value (QID)
The Sandbox item is a human                          Q4115189    P31             Q5
The Sandbox item is located in Thailand              Q4115189    P131            Q869
The Sandbox item's field of work is climate change   Q4115189    P101            Q125928

As R users we recognize this data structure: Nice tabular data in long format.

Write some code

Dependencies

As always, I work in the {tidyverse} dialect. The go-to package for importing data into Wikidata from R is {WikidataR} with its ?write_wikidata() function.

library(tidyverse)
library(WikidataR)

Unfortunately, it didn't work out of the box with any other Wikibase website. I did some refactoring and made a pull request: if you install {WikidataR} from GitHub, you have access to ?write_wikibase(). The change is not yet on CRAN as far as I know.
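
If you want to try it, installing the development version could look like this (a sketch; the repository path is an assumption, so check the package's GitHub page):

# install the development version of {WikidataR}
# (the repository path is an assumption)
# install.packages("remotes")
remotes::install_github("TS404/WikidataR")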

Now we build the three statements from the table above as a tibble in long format:

import <- tibble(
  item = "Q4115189",
  property = c("P31", "P131", "P101"),
  value = c("Q5", "Q869", "Q125928")
)

import %>% knitr::kable()
item property value
Q4115189 P31 Q5
Q4115189 P131 Q869
Q4115189 P101 Q125928

Import Data

Basically, there are two different ways to import data into Wikibase/Wikidata: the API or the Quickstatements UI. In the end, it's the same: one way is copy-pasting the long-form data into a web interface, the other is importing it via the API.

More on Quickstatements can be found in the extensive Wikidata docs. Most of it is pretty straightforward, but there are some tricky parts, e.g. importing times and dates. For those, look into the docs; everything is written down there.
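
One example of such a tricky part: dates combine an ISO-like timestamp with a precision suffix. A hedged illustration (verify the exact syntax in the Quickstatements docs):

# a date value with day precision (the /11 suffix means day precision)
date_value <- "+1967-01-17T00:00:00Z/11"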

Next step: Importing!

csv_file <- "my_import_data.csv"

write_wikibase( # also write_wikidata possible
  items = import$item,
  properties = import$property,
  values = import$value,
  format = "csv",
  format.csv.file = csv_file
)

# copy it to clipboard
# read_delim(file = csv_file, delim = "\t") %>% clipr::write_clip()

You can then open my_import_data.csv, copy its contents, and paste them into Wikidata's Quickstatements page. The big advantage: you get a preview of what you want to import, feedback on possible errors, a progress bar, and so on.

[Screenshot: Quickstatements preview]
[Screenshot: Quickstatements page after import]
[Screenshot: The sandbox item's revision history shows the imported data.]

But if you want to go directly via the API, you can do that as well, after configuring:

  • change the format to api
  • add the connection information api.username, api.token, api.format (you get these after creating a Wikimedia account)
  • add the URL to push the data to; in the case of Wikidata it's https://quickstatements.toolforge.org/

The documentation of ?write_wikibase() is pretty extensive, so go ahead and read through it.

write_wikibase(
  items = import$item,
  properties = import$property,
  values = import$value,
  format = "api",
  api.username = "YOUR WIKIMEDIA USERNAME",
  api.token = "YOUR WIKIMEDIA API TOKEN",
  api.format = "v1",
  api.batchname = "name of import",
  api.submit = FALSE, 
  quickstatements.url = "https://quickstatements.toolforge.org"
)

Model your data: A more realistic example

We can safely assume that the data will not already be available in the exact shape we need.

Assumptions:

  • We want to create new items
  • Translate some data from its input form to PIDs/QIDs
  • Sort and import

Creating new items

Creating a new item is possible through the Quickstatements API, of course. It needs a special format that looks like this:

CREATE my_new_item
LAST P31    Q5
LAST P131   Q869
CREATE my_next_new_item
LAST P31    Q5
LAST P331   Q522

While just updating existing items is relatively easy, building that multiline output is tedious – but thankfully {WikidataR} does that for us.

For example: instead of a Q… ID in the item column, there needs to be CREATE_something_unique.

Creating statements

Working with cryptic Pxxx or Qxxx identifiers is error-prone. Therefore, I developed a workflow to add statements in a more human-accessible way. Basically, I create a table with three mandatory columns: statement, pid, qid.

My statements sheet for Remove NA looks like this (I import that file with {googlesheets4} and use it).
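
Reading such a sheet could look roughly like this (a sketch; the sheet URL is a placeholder):

library(googlesheets4)

# a sketch: read the statements sheet (the URL is a placeholder)
statements <- read_sheet("https://docs.google.com/spreadsheets/d/<sheet-id>")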

Using the above example, the statements dataframe can look like this:

statement                     pid   qid
instance_is_human             P31   Q5
located_in_thailand           P131  Q869
field_of_work_climate_change  P101  Q125928

statements <- tibble(
  statement = c("instance_is_human", "located_in_thailand", "field_of_work_climate_change", "gender"),
  pid = c("P31", "P131", "P101", "P21"),
  qid = c("Q5", "Q869", "Q125928", NA_character_)
)

statements %>% knitr::kable()
statement                     pid   qid
instance_is_human             P31   Q5
located_in_thailand           P131  Q869
field_of_work_climate_change  P101  Q125928
gender                        P21   NA

In order to streamline adding statements, I created the function add_statement(), which is available in my personal utility package {kabrutils}. So that you don't have to install it just to follow the tutorial, I've copied it here:

#' Add a statement to an item
#'
#' This function helps to add certain statements. A statement consists of a PID-QID combination that is added to a dataframe. The PID will be the column name, QID the row content.
#' It assumes there is a dataframe that has at least three columns: (1) `statement`, (2) `pid`, (3) `qid`.
#' @param data dataframe to add the new column
#' @param available_statements dataframe with all statements that can be added with this method. It assumes there is a dataframe that has at least three columns: (1) `statement`, (2) `pid`, (3) `qid`.
#' @param new_statement string of new statement. Must exist in `available_statements$statement`. If the statement is `coordinates`, a column `longitude` and a column `latitude` are expected.
#' @param verbose show in terminal what was added
#' @param qid_from_row boolean; default `FALSE`, meaning the QID is taken from the dataframe `statements`; if `TRUE`, the QID value is taken from a column of `data` (see `col_for_row_content`)
#' @param col_for_row_content string; name of column in dataframe `data` that contains the QID values
#' @export
#' @examples
#' statements <- data.frame(statement = c("my_statement"), pid = c("P2"), qid = c("Q1"))
#' data <- data.frame(item = "my item")
#' data %>% add_statement(available_statements = statements, new_statement = "my_statement")

add_statement <- function(data = NULL,
                          available_statements = statements,
                          new_statement = NULL,
                          verbose = TRUE,
                          qid_from_row = FALSE,
                          col_for_row_content = NULL) {

  new_statement_df <- dplyr::filter(available_statements, statement == new_statement)

  if(nrow(new_statement_df) == 0) {
    stop("Your statement can't be found. Please check if it exists in the table `available_statements` or if there's a typo.")
  }

  pid_as_column_name <- rlang::sym(new_statement_df$pid)

  if(qid_from_row == TRUE) {

    if (new_statement == "coordinates") {
      if (!is.null(data$latitude) & !is.null(data$longitude)) {
        latitude <- "latitude"
        longitude <- "longitude"
        data <- data %>% dplyr::mutate(
          !!pid_as_column_name :=
            dplyr::case_when(
              !is.na(.data[[latitude]]) ~
                paste0('"@', .data[[latitude]], '/', .data[[longitude]], '"'),
              TRUE ~ NA_character_
            ))
      } else {
        stop("The input data frame `data` needs a column `latitude` and a column `longitude` in order to add coordinates.")
      }
    } else {
      data <- dplyr::mutate(data, !!pid_as_column_name := .data[[col_for_row_content]])
    }

  } else {
    data <- dplyr::mutate(data, !!pid_as_column_name := new_statement_df$qid)
  }

  if (verbose == TRUE) {
    cli::cli_h1('add "{new_statement}" statement')
    cli::cli_bullets(
      c(
        "*" = "PID = {new_statement_df$pid}",
        "*" = "QID = {new_statement_df$qid}"
      )
    )
  }

  return(data)
}

Let’s assume this is our input data and we want to create a new item per internal_id that has the three statements from above (human, Thailand, climate change) plus sex/gender.

We need to translate sex or gender to QIDs. {WikidataR} has functionality to search for items and properties by plain text, but it didn't work for me, so you have to look the QIDs up yourself.
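
For reference, that plain-text search looks like this in principle (treat it as a sketch, since it didn't work reliably for me):

# searching by label, in principle:
# find_item("non-binary")
# find_property("sex or gender")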

input_data <- tibble(internal_id = c("1", "2"), sex_gender = c("female", "non-binary")) %>% 
  mutate(sex_gender_qid = 
           case_when(
             sex_gender == "female" ~ "Q6581072",
             sex_gender == "non-binary" ~ "Q48270",
             TRUE ~ NA_character_)
  )

input_data %>% knitr::kable()
internal_id  sex_gender  sex_gender_qid
1            female      Q6581072
2            non-binary  Q48270

Now we build a new wide dataframe that contains the statements we want to import:

input_statements <- input_data %>% 
  mutate(item = paste0("CREATE_", internal_id)) %>% 
  # every new item needs a label and should have a description
  mutate(Len = "my item label in english",
         Lde = "my item label in german",
         Dde = "my item description in german",
         Den = "my item description in english") %>% 
  # add a certain statement that is available in the statements df
  add_statement(available_statements = statements, new_statement = "instance_is_human", verbose = TRUE) %>% 
  add_statement(available_statements = statements, new_statement = "located_in_thailand", verbose = TRUE) %>% 
  add_statement(available_statements = statements, new_statement = "field_of_work_climate_change", verbose = TRUE) %>% 
    # the gender qid is already in another column, you can get that from there
    add_statement(available_statements = statements, new_statement = "gender", qid_from_row = TRUE, col_for_row_content = "sex_gender_qid", verbose = TRUE)
#> 
#> ── add "instance_is_human" statement ───────────────────────────────────────────
#> • PID = P31
#> • QID = Q5
#> 
#> ── add "located_in_thailand" statement ─────────────────────────────────────────
#> • PID = P131
#> • QID = Q869
#> 
#> ── add "field_of_work_climate_change" statement ────────────────────────────────
#> • PID = P101
#> • QID = Q125928
#> 
#> ── add "gender" statement ──────────────────────────────────────────────────────
#> • PID = P21
#> • QID = NA

The data now looks like this:

Sorry, this table looks awful! I can't easily add support for DataTables in WordPress from Rmd …

internal_id  sex_gender  sex_gender_qid  item      Len                       Lde                      Dde                            Den                             P31  P131  P101     P21
1            female      Q6581072        CREATE_1  my item label in english  my item label in german  my item description in german  my item description in english  Q5   Q869  Q125928  Q6581072
2            non-binary  Q48270          CREATE_2  my item label in english  my item label in german  my item description in german  my item description in english  Q5   Q869  Q125928  Q48270

Of course, we don't want to import everything here, just the relevant columns. So next, we select the columns we need and transform them to long form. I have a small function for that last step as well.

import <- input_statements %>% 
  select(item, 
         matches("^L", ignore.case = FALSE), 
         matches("^D", ignore.case = FALSE), 
         # if there are some Sitelinks to other Wiki pages
         #matches("^S", ignore.case = FALSE), 
         matches("^P", ignore.case = FALSE)) %>% 
  pivot_longer(cols = 2:last_col(), names_to = "property", values_to = "value") %>% 
  # fix helper with two columns referring to the same property
  mutate(property = str_remove(property, "_.*")) %>% 
  filter(!is.na(value)) %>% 
  distinct()

import %>% knitr::kable()
item      property  value
CREATE_1  Len       my item label in english
CREATE_1  Lde       my item label in german
CREATE_1  Dde       my item description in german
CREATE_1  Den       my item description in english
CREATE_1  P31       Q5
CREATE_1  P131      Q869
CREATE_1  P101      Q125928
CREATE_1  P21       Q6581072
CREATE_2  Len       my item label in english
CREATE_2  Lde       my item label in german
CREATE_2  Dde       my item description in german
CREATE_2  Den       my item description in english
CREATE_2  P31       Q5
CREATE_2  P131      Q869
CREATE_2  P101      Q125928
CREATE_2  P21       Q48270

Now we use {WikidataR} to create the file to import and copy it to the clipboard:

write_wikibase(
  items = import$item,
  properties = import$property,
  values = import$value,
  format = "csv",
  format.csv.file = csv_file
)

# copy it to clipboard
#read_delim(file = csv_file, delim = "\t") %>% clipr::write_clip()

Be aware that I don't actually upload this to Wikidata, because I would be importing fake data. But I can show the preview page on Quickstatements:

[Screenshot: Wikidata's Quickstatements preview page]

Some tricks

  • An item can have more than one value per property, e.g. two geographic locations. But the initial wide dataframe cannot store two columns with the same name. What I do then is rename one with an underscore suffix and remove that suffix later (see the sketch below).
  • The import order defines the order on the website. So if I don't want any minor statements at the beginning, I have to take care of that. My workflow is to define a sorting order, join it onto input_statements, and use it for reordering.
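
Here is the sketch for the first trick (the duplicated column name and its value are made up):

library(tidyverse)

# two values for the same property: suffix one column with an underscore ...
wide <- tibble(
  item   = "Q4115189",
  P131   = "Q869",
  P131_2 = "Q869"  # made-up duplicate value
)

# ... and strip the suffix after pivoting to long form
wide %>%
  pivot_longer(-item, names_to = "property", values_to = "value") %>%
  mutate(property = str_remove(property, "_.*"))  # "P131_2" -> "P131"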
