Status Update: How much data?

Author

Katharina Brunner

Published

May 31, 2022

It’s the end of May, halfway through the time I officially have for my Prototype Fund project. What is the status? Pretty good. I’ve consolidated a lot of the data from different, heterogeneous sources.

This post is intended to provide an overview of the current state of the linked open data base.

Short Recap

The data is stored and published in Factgrid, a database for historians. The software is Wikibase, the same software that runs Wikidata, but as a separate installation. This also means that the properties are different. An example: the elementary property “instance of” (such as person, event, organization…) is P31 in Wikidata, but P2 in Factgrid.
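To make the difference concrete, here is a minimal sketch of how a query fragment could be parameterised by installation. The lookup vector `instance_of_property` and the helper `instance_of_pattern()` are hypothetical names used only for illustration; the two property IDs are the ones mentioned above.

```r
# Property ID for "instance of" in each Wikibase installation (from the text above)
instance_of_property <- c(wikidata = "P31", factgrid = "P2")

# Hypothetical helper: build the SPARQL triple pattern for "instance of"
instance_of_pattern <- function(wikibase) {
  property <- instance_of_property[[wikibase]]
  sprintf("?item wdt:%s ?class .", property)
}

instance_of_pattern("wikidata")
instance_of_pattern("factgrid")
```

The same query text can then be reused against either endpoint, with only the property ID swapped.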

Everything else besides the specific data modeling works similarly, such as data import via QuickStatements or querying via a SPARQL endpoint. I now use this endpoint to give an overview of how much data is in Factgrid as of today.

Summary of imported data

Code
library(tidyverse)
library(kabrutils)
library(SPARQL)

config_file <- "../../data-publishing/factgrid/config.yml"
config <- yaml::read_yaml(config_file)

# fetch all items from RemoveNA Project and get the statement count
query <- '
SELECT DISTINCT ?item ?itemLabel ?statementcount
WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  ?item wdt:P131 wd:Q400012 .
  ?item wikibase:statements ?statementcount.
}'

query_res <- query %>%
  sparql_to_tibble(endpoint = config$connection$sparql_endpoint)

# Number of items ---------------------------------------------------------

nr_of_items <- length(unique(query_res$item))

# Number of relations -----------------------------------------------------

gross_relations <- sum(query_res$statementcount)

# redundant: 4098 for works (posters/books)
redundant_works <- 4098

# minus 3 metadata statements per item
netto_relations <- gross_relations - (nr_of_items * 3) - redundant_works

In summary:

As of today, I have imported 7170 items with 34902 relations between them.

The gross count of connections is higher, but since each item carries three standard metadata statements, I leave those out of the count. I also subtract the statements on published works (posters and books), because they can be interpreted as duplicate counts.

Type                               Value
Items                               7170
Gross relations                    60510
Redundant data on posters/books     4098
Standard metadata                  21510
Net relations                      34902
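The arithmetic behind the table can be checked with a short sketch that plugs in the reported figures directly instead of querying the endpoint. The variable names `standard_metadata` and `net_relations` are chosen here for illustration.

```r
# Figures reported in the table above
nr_of_items     <- 7170
gross_relations <- 60510
redundant_works <- 4098

# Three standard metadata statements per item
standard_metadata <- nr_of_items * 3

# Net relations: gross count minus metadata and redundant statements on works
net_relations <- gross_relations - standard_metadata - redundant_works

standard_metadata  # 21510
net_relations      # 34902
```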

A network of all data

Plotting all relations between items that belong to the Remove NA project:

Looks quite nice!

But besides looks, what does it mean?

It’s obvious that many, many items converge on a handful of items. You can see this in the lines that end in just a few points. For example, many items have author as a career statement or LGBT as a target group.
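This kind of convergence can be measured as the in-degree of each target, that is, the number of distinct items pointing at it. The following is a minimal sketch on an invented toy edge list; the item IDs and the data frame `edges` are assumptions for illustration and not real Factgrid data.

```r
# Toy edge list; the item IDs are invented for illustration
edges <- data.frame(
  from = c("item_a", "item_b", "item_c", "item_d", "item_d"),
  to   = c("author", "author", "LGBT", "LGBT", "author"),
  stringsAsFactors = FALSE
)

# In-degree: number of distinct source items pointing at each target
in_degree <- sapply(
  split(edges$from, edges$to),
  function(sources) length(unique(sources))
)

# Targets sorted by how many items converge on them
sort(in_degree, decreasing = TRUE)
```

Targets with a high in-degree correspond to the few points in the plot where many lines end.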

What are those items?

Main targeted objects

The table shows how often a statement points to a certain object (top 50):

Code
query <- '
# get a count of target items (objects)
# removing standard properties
SELECT ?value ?valueLabel ?count WHERE {
  {
    SELECT ?value (COUNT(DISTINCT ?item) AS ?count) WHERE {
       ?item wdt:P131 wd:Q400012 .
       ?item ?prop ?value . 
      # only entities as targets, no strings
      FILTER(STRSTARTS(STR(?value), "https://database.factgrid.de/entity/"))
      # without standard properties like instance of, and research project stuff
      MINUS {?item wdt:P2 ?value.}
      MINUS {?item wdt:P131 ?value.}
      MINUS {?item wdt:P97 ?value.}
            
    } GROUP BY ?value
  } . 
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
ORDER BY DESC(?count) 
LIMIT 50
'

query %>% 
  sparql_to_tibble(endpoint = config$connection$sparql_endpoint) %>% 
  DT::datatable()

A deeper network analysis is ongoing.

Count of “instance of” values in the data

“Instance of” is the most basic property in Factgrid as well as in Wikidata. It states what an item is: A human? A group? A demonstration? …

The table shows how often each “instance of” value occurs:

Code
query <- '
#  Query to find the most common values for the property ?prop
#  on items in the class ?class

SELECT ?value ?valueLabel ?count WHERE {
  {
    SELECT ?value (COUNT(DISTINCT ?a) AS ?count) WHERE {
       ?a ?prop ?value . 
       ?a wdt:P131 ?class .
       BIND (wdt:P2 AS ?prop) .
       BIND (wd:Q400012 AS ?class) .
    } GROUP BY ?value
  } . 
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en" .
  }
}
ORDER BY DESC(?count) ?valueLabel
'

query %>% 
  sparql_to_tibble(endpoint = config$connection$sparql_endpoint) %>% 
  DT::datatable()