This tutorial walks through importing data into any Wikibase instance using R.
What is Wikibase?
Wikibase is open source software for storing and retrieving structured, linked data. The biggest Wikibase-based website is Wikidata. More on Wikibases and their use cases can be found here.
But there are other Wikibases around, e.g. Factgrid, a database for historians. It is my platform and community for persisting and publishing data derived in my Remove NA project, in which I construct a knowledge graph on queer history.
Data Modeling and Data Import
I needed to develop a workflow for data modeling and importing data into a Wikibase instance directly from R. Over time, this has evolved into a stable process that I can now recommend. Iterations <3
Basically, it is an ETL process that needs to be customized depending on the entity. There are (at least) 17 different versions I built for different entities that needed special data modeling. If you want to follow along, choose a script with a high number prefix on Github; due to the iterations, they have become more and more sophisticated over time.
The following tutorial is a condensed version of this, showing the core elements of an ETL process that imports data into Wikidata or a Wikibase.
In my Remove NA project, I also included some functionality beyond that, which I briefly list here, referring to the scripts where I used it:
- Timestamps
- If I have a Wikidata ID for an entity, I get the corresponding Wikipedia articles in different languages with {tidywikidatar} and import them into a different Wikibase instance, see an example here.
- I fetch descriptions and labels from Wikidata, translate them with DeepL and use them for importing, see here.
- Also, you probably need to check if an entity already exists in your Wikibase of choice. OpenRefine is a tool that does this really well, especially the whole human decision-making process. So in these lines of code I get the entities that need reconciliation, then do that in OpenRefine and export the results to use them for importing. More on reconciliation in OpenRefine’s documentation. There are also extensive video tutorials on Youtube for everything related to OpenRefine.
- Geocode addresses and upload them as latitude and longitude, see the sketch right after this list.
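For the geocoding part, a minimal sketch using {tidygeocoder} (my package pick for this tutorial, not necessarily what the project scripts use). The "@latitude/longitude" string mirrors the coordinate format that add_statement() builds further below:
library(tidyverse)
library(tidygeocoder)

addresses <- tibble(address = "Marienplatz 1, München")

addresses %>%
  # query the free OSM/Nominatim geocoder; adds lat and long columns
  geocode(address = address, method = "osm") %>%
  # Quickstatements expects coordinates as "@latitude/longitude"
  mutate(coordinates = paste0('"@', lat, "/", long, '"'))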
Minimum knowledge about Wikidata’s and Wikibase’s data structure
Everything in Wikidata/Wikibase has the form of a statement:
item -> property -> value
In terms of identifiers, this can be translated into:
QID -> PID -> QID
That already tells us something about the data structure: at the end of the data modeling, we need a long-form table with three columns. If you consult the documentation of ?write_wikidata() or ?write_wikibase(), you’ll see three parameters that expect exactly this form: items, properties and values.
This is more or less the mental model when programming with tidy data principles in mind. Quite convenient!
Want to know more? This article on Wikidata’s data model is a good starting point.
Model some data: A very basic example
For the purpose of this tutorial, we pretend we want to import data into an arbitrary Wikibase installation, but in the end import it into Wikidata. Why? It’s the easiest way to show the workflow and follow along. What you need: a Wikidata/Wikimedia account.
We only update data on Wikidata’s sandbox item Q4115189, which exists for exactly this purpose: testing stuff.
We want to add data that says:
- The Sandbox item is a human
- The Sandbox item is located in Thailand
- The Sandbox item field of work is climate change
We need to describe each of the three statements in terms of Wikidata/Wikibase QIDs and PIDs. Searching on Wikidata, we can easily find the corresponding IDs.
Statement | Item (QID) | Property (PID) | Value (QID) |
---|---|---|---|
The Sandbox item is a human | Q4115189 | P31 | Q5 |
The Sandbox item is located in Thailand | Q4115189 | P131 | Q869 |
The Sandbox item field of work is climate change | Q4115189 | P101 | Q125928 |
As R users we recognize this data structure: nice tabular data in long format.
Write some code
Dependencies
As always, I work in the {tidyverse} dialect. The go-to package to import data into Wikidata from R is {WikidataR} with its ?write_wikidata() function.
library(tidyverse)
library(WikidataR)
Unfortunately, it didn’t work out of the box with any other Wikibase website. I did some refactoring and made a pull request: if you install {WikidataR} from Github, you have access to ?write_wikibase(). The change is not yet on CRAN as far as I know.
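So to follow along, install the development version first. A minimal sketch, assuming {remotes} is available and the package lives at its usual Github home TS404/WikidataR:
# install.packages("remotes")
remotes::install_github("TS404/WikidataR")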
For the purpose of this tutorial, we pretend we want to import data to an arbitraty Wikibase installation but import it in the end to Wikidata. Why? It’s the easiest way to show the workflow and follow along.
import <- tibble(
item = "Q4115189",
property = c("P31", "P131", "P101"),
value = c("Q5", "Q869", "Q125928")
)
import %>% knitr::kable()
item | property | value |
---|---|---|
Q4115189 | P31 | Q5 |
Q4115189 | P131 | Q869 |
Q4115189 | P101 | Q125928 |
Import Data
Basically, there are two different ways to import data into Wikibase/Wikidata: the API or the Quickstatements UI. In the end, it’s the same. One way is copy-pasting the long-format data into a web interface, the other is importing it via the API.
More on Quickstatements in the extensive Wikidata docs. Most of it is pretty straightforward, but there are some tricky parts, e.g. importing times and dates. When in doubt, look into the docs; everything is written down there.
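One example of such a tricky part: dates. Here is a small helper of my own (not part of {WikidataR}), assuming the +YYYY-MM-DDT00:00:00Z/precision syntax from the Quickstatements docs:
# Quickstatements time format: precision 11 = day, 9 = year
as_quickstatements_time <- function(date, precision = 11) {
  paste0("+", format(as.Date(date), "%Y-%m-%d"), "T00:00:00Z/", precision)
}

as_quickstatements_time("1980-05-17")
#> [1] "+1980-05-17T00:00:00Z/11"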
Next step: Importing!
csv_file <- "my_import_data.csv"
write_wikibase( # also write_wikidata possible
items = import$item,
properties = import$property,
values = import$value,
format = "csv",
format.csv.file = csv_file
)
# copy it to clipboard
# read_delim(file = csv_file, delim = "\t") %>% clipr::write_clip()
You can then open my_import_data.csv, copy its contents and paste them into Wikidata’s Quickstatements page. The big advantage is that you get a preview of what you want to import, feedback on possible errors, a progress bar and so on.
But if you want to go directly via the API, you can do that as well, after some configuration:
- change the format to api
- add the connection information api.username, api.token, api.format. You get these after creating a Wikimedia account.
- add the URL to push the data to. In case of Wikidata it’s https://quickstatements.toolforge.org/.
The documentation of ?write_wikibase() is pretty extensive, so go ahead and read through it.
write_wikibase(
items = import$item,
properties = import$property,
values = import$value,
format = "csv",
api.username = "YOUR WIKIMEDIA USERNAME",
api.token = "YOUR WIKIMEDIA API TOKEN",
api.format = "v1",
api.batchname = "name of import",
api.submit = FALSE,
quickstatements.url = "https://quickstatements.toolforge.org"
)
Model your data: A more realistic example
It’s safe to assume that the data will not already be available in the exact form we need.
Assumptions:
- We want to create new items
- Translate some data from input form to PID/QIDs
- Sort and import
Creating new items
Creating a new item is possible through the Quickstatements API, of course. It needs a special format that looks like this:
CREATE my_new_item
LAST P31 Q5
LAST P131 Q869
CREATE my_next_new_item
LAST P31 Q5
LAST P331 Q522
While just updating is relatively easy, building that multiline output is tedious – but thankfully {WikidataR} does that for us.
For example: instead of a Q... id in the item column there needs to be CREATE_something_unique.
Creating statements
Working with cryptic Pxxx or Qxxx identifiers is error-prone. Therefore I developed a workflow to add statements in a more human-accessible way. Basically, I create a table with three mandatory columns: statement, pid, qid.
My statements sheet for Remove NA looks like this (I import that file with {googlesheets4} and use it).
Using the above example, the statements dataframe can look like this:
statement | pid | qid |
---|---|---|
instance_is_human | P31 | Q5 |
located_in_thailand | P131 | Q869 |
field_of_work_climate_change | P101 | Q125928 |
statements <- tibble(
statement = c("instance_is_human", "located_in_thailand", "field_of_work_climate_change", "gender"),
pid = c("P31", "P131", "P101", "P21"),
qid = c("Q5", "Q869", "Q125928", NA_character_)
)
statements %>% knitr::kable()
statement | pid | qid |
---|---|---|
instance_is_human | P31 | Q5 |
located_in_thailand | P131 | Q869 |
field_of_work_climate_change | P101 | Q125928 |
gender | P21 | NA |
In order to streamline the adding of statements, I created the function add_statement() that is available in my personal utility package {kabrutils}. So you don’t have to install it just to follow the tutorial, I’ve copied it here:
#' Add a statement to an item
#'
#' This function helps to add certain statements. A statement consists of a PID-QID combination that is added to a dataframe. The PID will be the column name, QID the row content.
#' It assumes there is a dataframe that has at least three columns: (1) `statement`, (2) `pid`, (3) `qid`.
#' @param data dataframe to add the new column
#' @param available_statements dataframe with all statements that can be added with this method. It assumes there is a dataframe that has at least three columns: (1) `statement`, (2) `pid`, (3) `qid`.
#' @param new_statement string of new statement. Must exist in `available_statements$statement`. If statement is `coordinates`, then a column `longitude` and a column `latitude` is expected.
#' @param verbose show in terminal what was added
#' @param qid_from_row boolean; default `FALSE` - then the QID is taken from the dataframe `statements`; if `TRUE`, the QID value is taken from another column of `data`
#' @param col_for_row_content string; name of column in dataframe `data` that contains the QID values
#' @export
#' @examples
#' statements <- data.frame(statement = c("my_statement"), pid = c("P2"), qid = c("Q1"))
#' data <- data.frame(item = "my item")
#' data %>% add_statement(available_statements = statements, new_statement = "my_statement")
add_statement <- function(data = NULL,
available_statements = statements,
new_statement = NULL,
verbose = TRUE,
qid_from_row = FALSE,
col_for_row_content = NULL) {
new_statement_df <- dplyr::filter(available_statements, statement == new_statement)
if(nrow(new_statement_df) == 0) {
stop("Your statement can't be found. Please check if it exists in the table `available_statements` or if there's a typo.")
}
pid_as_column_name <- rlang::sym(new_statement_df$pid)
if(qid_from_row == TRUE) {
if (new_statement == "coordinates") {
if (!is.null(data$latitude) & !is.null(data$longitude)) {
latitude <- "latitude"
longitude <- "longitude"
data <- data %>% dplyr::mutate(
!!pid_as_column_name :=
dplyr::case_when(
!is.na(.data[[latitude]]) ~
#paste0(.data[[latitude]], ",", .data[[longitude]]),
paste0('"@', .data[[latitude]], '/', .data[[longitude]], '"'),
TRUE ~ NA_character_
))
} else {
stop("The input data frame `data` needs a column `latitude` and a column `longitude` in order to add coordinates.")
}
} else {
data <- dplyr::mutate(data, !!pid_as_column_name := .data[[col_for_row_content]])
}
} else {
data <- dplyr::mutate(data, !!pid_as_column_name := new_statement_df$qid)
}
if (verbose == TRUE) {
cli::cli_h1('add "{new_statement}" statement')
cli::cli_bullets(
c(
"*" = "PID = {new_statement_df$pid}",
"*" = "QID = {new_statement_df$qid}"
)
)
}
return(data)
}
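A minimal usage example with the statements table from above:
tibble(item = "Q4115189") %>%
  add_statement(available_statements = statements, new_statement = "instance_is_human")
#> ── add "instance_is_human" statement ──────────────────────
#> • PID = P31
#> • QID = Q5
#> # A tibble: 1 × 2
#>   item     P31
#>   <chr>    <chr>
#> 1 Q4115189 Q5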
Let’s assume this is our input data and we want to create a new item per internal_id that has the three statements from above (human, Thailand, climate change) plus sex/gender.
We need to translate sex or gender to QIDs. {WikidataR} offers searching for plain-text labels, but that functionality didn’t work for me, so you have to figure the QIDs out yourself.
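For completeness, this is what those search helpers look like (the functions exist in {WikidataR}, but as said, this route didn’t work reliably for me):
# search items and properties by plain-text label
find_item("non-binary", language = "en", limit = 3)
find_property("sex or gender", language = "en", limit = 3)
So here, I simply hard-code the mapping: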
input_data <- tibble(internal_id = c("1", "2"), sex_gender = c("female", "non-binary")) %>%
mutate(sex_gender_qid =
case_when(
sex_gender == "female" ~ "Q6581072",
sex_gender == "non-binary" ~ "Q48270",
TRUE ~ NA_character_)
)
input_data %>% knitr::kable()
internal_id | sex_gender | sex_gender_qid |
---|---|---|
1 | female | Q6581072 |
2 | non-binary | Q48270 |
Now we build a new dataframe, in wide format, with the statements we want to import:
input_statements <- input_data %>%
mutate(item = paste0("CREATE_", internal_id)) %>%
# every new item needs a label and should have a description
mutate(Len = "my item label in english",
Lde = "my item label in german",
Dde = "my item description in german",
Den = "my item description in english") %>%
# add a certain statement that is available in the statements df
add_statement(available_statements = statements, new_statement = "instance_is_human", verbose = TRUE) %>%
add_statement(available_statements = statements, new_statement = "located_in_thailand", verbose = TRUE) %>%
add_statement(available_statements = statements, new_statement = "field_of_work_climate_change", verbose = TRUE) %>%
# the gender qid is already in another column, you can get that from there
add_statement(available_statements = statements, new_statement = "gender", qid_from_row = TRUE, col_for_row_content = "sex_gender_qid", verbose = TRUE)
#>
#> ── add "instance_is_human" statement ───────────────────────────────────────────
#> • PID = P31
#> • QID = Q5
#>
#> ── add "located_in_thailand" statement ─────────────────────────────────────────
#> • PID = P131
#> • QID = Q869
#>
#> ── add "field_of_work_climate_change" statement ────────────────────────────────
#> • PID = P101
#> • QID = Q125928
#>
#> ── add "gender" statement ──────────────────────────────────────────────────────
#> • PID = P21
#> • QID = NA
The data now looks like this:
Sorry, this table looks awful! I can’t easily add support for DataTables in WordPress from Rmd …
internal_id | sex_gender | sex_gender_qid | item | Len | Lde | Dde | Den | P31 | P131 | P101 | P21 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | female | Q6581072 | CREATE_1 | my item label in english | my item label in german | my item description in german | my item description in english | Q5 | Q869 | Q125928 | Q6581072 |
2 | non-binary | Q48270 | CREATE_2 | my item label in english | my item label in german | my item description in german | my item description in english | Q5 | Q869 | Q125928 | Q48270 |
Of course, we don’t want to import everything here, just the relevant columns. So next, we select the columns we need and transform them to long form. I have a small function for that last step as well.
import <- input_statements %>%
select(item,
matches("^L", ignore.case = FALSE),
matches("^D", ignore.case = FALSE),
# if there are some Sitelinks to other Wiki pages
#matches("^S", ignore.case = FALSE),
matches("^P", ignore.case = FALSE)) %>%
pivot_longer(cols = 2:last_col(), names_to = "property", values_to = "value") %>%
# fix helper with two columns referring to the same property
mutate(property = str_remove(property, "_.*")) %>%
filter(!is.na(value)) %>%
distinct()
import %>% knitr::kable()
item | property | value |
---|---|---|
CREATE_1 | Len | my item label in english |
CREATE_1 | Lde | my item label in german |
CREATE_1 | Dde | my item description in german |
CREATE_1 | Den | my item description in english |
CREATE_1 | P31 | Q5 |
CREATE_1 | P131 | Q869 |
CREATE_1 | P101 | Q125928 |
CREATE_1 | P21 | Q6581072 |
CREATE_2 | Len | my item label in english |
CREATE_2 | Lde | my item label in german |
CREATE_2 | Dde | my item description in german |
CREATE_2 | Den | my item description in english |
CREATE_2 | P31 | Q5 |
CREATE_2 | P131 | Q869 |
CREATE_2 | P101 | Q125928 |
CREATE_2 | P21 | Q48270 |
Now we use {WikidataR} to create the file to import and copy it to the clipboard:
write_wikibase(
items = import$item,
properties = import$property,
values = import$value,
format = "csv",
format.csv.file = csv_file
)
# copy it to clipboard
#read_delim(file = csv_file, delim = "\t") %>% clipr::write_clip()
Be aware that I don’t actually upload this to Wikidata, because I would be importing fake data. But I can show the preview page on Quickstatements:
(Screenshot: the Quickstatements preview page)
Some tricks
- An item can have more than one value per property, e.g. two geographic locations. But the initial wide dataframe cannot store two columns with the same name. What I do then is rename one with an underscore suffix and remove that suffix later (that’s the str_remove() step above).
- The import order defines the order on the website. So if I don’t want any minor statements at the beginning, I have to take care of that. My workflow is to define a sorting order, join it on input_statements and use it for reordering, see the sketch below.
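A minimal sketch of that reordering idea, here applied to the longform import table (the property_order lookup and its column names are my own invention, not from the project scripts):
# define the desired statement order per property
property_order <- tibble(
  property = c("Len", "Lde", "Den", "Dde", "P31", "P21", "P131", "P101"),
  sort_order = seq_along(property)
)

import %>%
  left_join(property_order, by = "property") %>%
  arrange(item, sort_order) %>%
  select(-sort_order)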