It’s the end of May, halfway through the time I officially have for my Prototype Fund project. What is the status? Pretty good. I’ve consolidated a lot of the data from different, heterogeneous sources.
This post is intended to provide an overview of the current state of the linked open data base.
Short Recap
The data is stored and published in Factgrid, a database for historians. It runs on Wikibase, the same software as Wikidata, but as a separate installation. This also means that the properties are different. An example: the elementary property “instance of” (such as person, event, organization…) is P31 in Wikidata but P2 in Factgrid.
Apart from the specific data modeling, everything else works similarly, such as importing data via Quickstatements or querying via a SPARQL endpoint. I now use this endpoint to give an overview of how much data is in Factgrid as of today.
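To make the property difference concrete, here is a minimal sketch of how the same "instance of" query would be written for each installation. The helper `instance_of_query()` and the Factgrid class ID `Q_CLASS` are hypothetical and only there for illustration; only the property IDs P31 and P2 come from the text above, and Q5 is the Wikidata item for "human".

```r
# Build a minimal "instance of" query for a given Wikibase installation.
# `instance_of_query()` is a hypothetical helper, not part of any package.
# The property ID has to be passed in because it differs per installation:
# P31 in Wikidata, P2 in Factgrid.
instance_of_query <- function(property_id, class_id, limit = 10) {
  sprintf(
    "SELECT ?item WHERE { ?item wdt:%s wd:%s . } LIMIT %d",
    property_id, class_id, as.integer(limit)
  )
}

# Wikidata: "instance of" is P31, and Q5 is the item for "human".
wikidata_query <- instance_of_query("P31", "Q5")

# Factgrid: "instance of" is P2. "Q_CLASS" is a placeholder, not a real
# Factgrid item; replace it with the ID of the class you are looking for.
factgrid_query <- instance_of_query("P2", "Q_CLASS")

cat(wikidata_query, factgrid_query, sep = "\n")
```

The query shape stays the same; only the IDs change, which is why queries written for Wikidata cannot be copied to Factgrid without translating the properties and items first.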
Summary of imported data
library(tidyverse)
library(kabrutils)
library(SPARQL)

config_file <- "../../data-publishing/factgrid/config.yml"
config <- yaml::read_yaml(config_file)

# fetch all items from RemoveNA Project and get the statement count
query <- '
SELECT DISTINCT ?item ?itemLabel ?statementcount
WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  ?item wdt:P131 wd:Q400012 .
  ?item wikibase:statements ?statementcount.
}
'

query_res <- query %>%
  sparql_to_tibble(endpoint = config$connection$sparql_endpoint)

# Number of items ---------------------------------------------------------
nr_of_items <- length(unique(query_res$item))

# Number of relations -----------------------------------------------------
gross_relations <- sum(query_res$statementcount)

# redundant: 4098 for works (posters/books)
redundant_works <- 4098

# minus 3 metadata items per item
netto_relations <- gross_relations - (nr_of_items * 3) - redundant_works
In summary:
As of today, I have imported 7170 items with 34902 relations between them.
The gross number of connections is higher, but since each item has three metadata fields (1, 2, 3), I leave those out of the count. I also subtract the statements on published works (posters and books), because they can be interpreted as a duplicate count.
| Type | Value |
| Items | 7170 |
| Gross relations | 60510 |
| Redundant data on posters/books | 4098 |
| Standard metadata | 21510 |
| Net relations | 34902 |
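The figures above can be checked with a few lines of arithmetic. This is only a sketch that plugs in the numbers reported in this post instead of querying the endpoint; the variable names `metadata_fields_per_item`, `standard_metadata` and `net_relations` are my own choice for illustration.

```r
# Values as reported in this post (hard-coded, not fetched from Factgrid)
nr_of_items     <- 7170
gross_relations <- 60510
redundant_works <- 4098   # statements on posters/books counted as redundant

# Each item carries three standard metadata statements
metadata_fields_per_item <- 3
standard_metadata <- nr_of_items * metadata_fields_per_item

# Net relations: gross count minus metadata and redundant statements
net_relations <- gross_relations - standard_metadata - redundant_works

print(standard_metadata)  # 21510
print(net_relations)      # 34902
```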
A network of all data
Plotting all relations between items that belong to the Remove NA project: