18. June 2025

Connecting Architectural Datasets: Federated Queries in NFDI4Culture (with tutorial)

Manors in Poland: map of Poland with dots in different colours marking locations of manor houses; colours depend on data source (CC-0)

In NFDI4Culture’s subject areas, cultural data and research data are scattered across many organizations and many small, specialized projects. Often several of these sources hold data about the same topics or even the same research objects, which makes comprehensive analysis a challenge. The Culture Knowledge Graph already addresses part of this issue by integrating datasets from different organizations, but it only mirrors the basic information about each object. So while it helps users find objects across the culture space, they still have to go to the original source for deeper research. Federated queries offer another solution.

Federated queries use multiple data sources at once to answer a question. The data stays with the original institution, so nothing is lost in transformation and the results are always up to date. A typical use case might look like this: a researcher maintains a database of artists and the cities they were born in and wants to create a list of artists born in a certain country. However, the database does not record which country a city is located in. To solve this problem, the researcher can run a federated query between their database and Wikidata to get the country information and filter their data. Queries can also become much more complex. If the researcher’s database contains birth and death dates, they could query Wikidata for the heads of state of the artist’s country during the artist’s lifetime, or get city coordinates to see which artists were born close to each other.
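As a rough sketch of this scenario, the query below assumes a hypothetical local vocabulary (ex:bornIn, ex:wikidataItem) in which every city is linked to its Wikidata item. These names are illustrative only and not taken from a real database. The Wikidata identifiers wdt:P17 (country) and wd:Q183 (Germany) do exist. The SERVICE keyword that calls the second endpoint is explained in the tutorial further down.

```sparql
PREFIX ex: <https://example.org/vocab/>   # hypothetical local vocabulary
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>

SELECT ?artist ?city WHERE {
  # local part: artists and their cities of birth
  ?artist ex:bornIn ?city .
  # assumed link from the local city record to its Wikidata IRI
  ?city ex:wikidataItem ?wdCity .

  # remote part: keep only cities that Wikidata places in Germany
  SERVICE <https://query.wikidata.org/sparql> {
    ?wdCity wdt:P17 wd:Q183 .
  }
}
```

The local database never has to store the country itself. It only needs one reliable link per city, and the filtering happens against the data held by Wikidata.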

This blog post is a follow-up to the last meetings of NFDI4Culture’s Linked Open Data Working Group, where federated queries were discussed intensively and showcased. These meetings were in turn inspired by the latest Wikimedia workshop on federated queries, which some NFDI4Culture colleagues attended and where they presented. This post aims to give an introduction to federated queries specifically in the NFDI4Culture space. It will go over the requirements for federated queries, a basic tutorial, and a deeper dive into two examples.

Technical Details

Federated queries are made possible by SPARQL, a query language for linked data. Every database involved must offer a SPARQL endpoint that gives structured access to its data. The query is sent to one origin endpoint, which passes the relevant parts on to the other endpoints. For the query to work, the origin endpoint has to whitelist all the other endpoints.

Other requirements

Knowledge of the data models of all involved endpoints is required. Furthermore, items need to have a common identifier in both databases to ensure that the data being pulled is about the same entity. This identifier can be an ID from an authority file like the GND. It is also common to match entities between two databases directly (either manually or using OpenRefine) and add the identifier from the first to the second database. Using labels for matching items in federated queries is not recommended but can be possible if the labels are completely identical.
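To illustrate matching through a shared authority identifier, here is a minimal sketch that joins a local database to Wikidata via the GND ID. The property maindb:gndId and its namespace are assumptions made for this example. wdt:P227 is the Wikidata property for the GND ID.

```sparql
PREFIX maindb: <https://example.org/prop/>   # hypothetical local namespace
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?person ?gnd ?wikidataItem WHERE {
  # local part: persons and the GND ID stored for them (as a plain string)
  ?person maindb:gndId ?gnd .

  # remote part: the Wikidata item that carries the same GND ID
  SERVICE <https://query.wikidata.org/sparql> {
    ?wikidataItem wdt:P227 ?gnd .
  }
}
```

The join only succeeds if both databases store the identifier as exactly the same literal. If one side adds a datatype or a language tag, the values have to be normalised first, for example with STR().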

How to construct a basic federated query

The idea of federated queries is to get data from two or more databases at once. So in this example, a query will be made to compile a list of people from one database (main database) and their dates of birth from another (second database).

The first step is to compile the list of persons from the main database with their id from the second database:

SELECT * WHERE {
  # maindb:Human is a placeholder for the class that marks persons in the main database
  ?person a maindb:Human .
  ?person maindb:id ?id .
}

If the id is not an IRI (for example Q123 instead of http://www.wikidata.org/entity/Q123), it needs to be transformed into an IRI using BIND.

BIND(IRI(CONCAT("https://example.org/", ?id )) AS ?personID )

In the next step, the SPARQL endpoint of the second database is called:

SERVICE <https://example.org/sparql> {

Inside this block, a query to the second database is made. It should look exactly like a query that is executed directly in the second database, so the properties and prefixes of the second database have to be used. A closing brace ends the service call.

  ?personID example:birthDate ?birthDate .
}
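Put together, the steps above could be assembled into one query roughly as follows. All prefixes, the class maindb:Human and the endpoint URL are placeholders, so this sketch shows the structure only and will not return results as written.

```sparql
PREFIX maindb: <https://example.org/maindb/>   # placeholder namespace of the main database
PREFIX example: <https://example.org/prop/>    # placeholder namespace of the second database

SELECT ?person ?personID ?birthDate WHERE {
  # step 1: persons in the main database with their id from the second database
  ?person a maindb:Human .
  ?person maindb:id ?id .

  # step 2: turn the plain id into the IRI used by the second database
  BIND(IRI(CONCAT("https://example.org/", ?id)) AS ?personID)

  # step 3: ask the second database for the date of birth
  SERVICE <https://example.org/sparql> {
    ?personID example:birthDate ?birthDate .
  }
}
```

Everything outside the SERVICE block is evaluated by the origin endpoint. Only the pattern inside the block is sent to the second database.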

See below for example queries to real databases.

Example queries

Query for architects in Culture Knowledge Graph: table with results for the query; the table shows architects’ names and relations to works (CC-0)

Finding architects in the Culture Knowledge Graph

The first query is made from a database that collects data about architectural archives and their collections. The goal is to find out whether any institutions within NFDI4Culture hold materials about any of the architects in the architectural archives database. First, the query asks for all architects and their GND IDs from the first database. Then, it calls the Culture Knowledge Graph to see what information about the persons (identified by their GND ID) can be found there. Persons that could not be matched to a GND ID cannot be found using this query. See the comments within the query to find out more about the mechanics of each part.

 

#Prefixes (tibt: is not declared here and is assumed to be predefined by the origin endpoint)
PREFIX schema: <http://schema.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?person ?personLabel ?DNBitem ?work ?workLabel
WHERE {

  #find person with label and GND id
  ?person rdfs:label ?personLabel.
  ?person tibt:P102 ?gnd.

  #transform GND id into proper IRI
  BIND(IRI(CONCAT("https://d-nb.info/gnd/", ?gnd)) AS ?DNBitem)

  #call the NFDI4Culture SPARQL endpoint
  SERVICE <https://nfdi4culture.de/sparql> {

    #find persons (DNBitem) in the Culture Knowledge Graph, then find items they are related to
    ?DNBitem ?p1 ?obj.
    ?work ?p2 ?DNBitem.
    ?work rdfs:label ?workLabel.
  }
}

Try the query.

Query different data sources to show manors in Poland on a map

The second query finds manor houses in Poland and presents them on a map. It asks for this data from the “Herrenhäuser des Ostseeraums” Wikibase, as well as from Wikidata and the FOKO project (Kunstdenkmäler in Ostmitteleuropa). The difficulty lies in formulating the same question for data sources whose data models differ considerably. However, no entities need to be mapped to each other, since the query simply asks for every manor house in each database. Each dot on the map has a colour corresponding to its data source. Further details are in the comments within the query. As this query shows, federated queries can get complex quite quickly.

#prefixes
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX schema: <http://schema.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ecrm: <http://erlangen-crm.org/140617/>
PREFIX foko: <http://id.gnm.de/ont/foko/>
#geo: is needed for geo:wktLiteral below; tib: and tibt: are assumed to be predefined by the origin endpoint
PREFIX geo: <http://www.opengis.net/ont/geosparql#>

SELECT DISTINCT ?label_title ?FOKO_coords ?HH_coords ?WD_coords ?HH_label ?wd_label ?layer
WHERE {

  #first service call to the FOKO project
  {
    SERVICE <https://foko-project.eu/linked-data/> {
      SELECT ?function ?lang_title ?label_title ?sub ?coords WHERE {

        #the project uses the Erlangen CRM data model, which is deeply nested and leads to very long queries

        #first part of the query: filter for the function “Herrenhaus”
        ?sub ecrm:P19i_was_made_for ?made_for.
        FILTER (REGEX(STR(?made_for), "^http://foko"))
        ?made_for ecrm:P42_assigned ?assigned.
        ?assigned ecrm:P149_is_identified_by ?identified.
        ?identified ecrm:P72_has_language ?lang.
        ?lang ecrm:P1_is_identified_by ?lang_id.
        ?lang_id ecrm:P3_has_note ?lang_title.
        ?identified ecrm:P3_has_note ?function.
        FILTER (?lang_title = "DE")
        FILTER (?function = "Herrenhaus")

        #get the German title of the building
        ?sub ecrm:P102_has_title ?label.
        ?label ecrm:P3_has_note ?label_title.
        ?label ecrm:P72_has_language ?label_lang.
        ?label_lang ecrm:P1_is_identified_by ?label_lang_id.
        ?label_lang_id ecrm:P3_has_note ?label_lang_title.
        FILTER (?label_lang_title = "DE")

        #get coordinates
        ?sub ecrm:P53_has_former_or_current_location ?loc.
        ?loc ecrm:P1_is_identified_by ?loc_id.
        ?loc_id ecrm:P3_has_note ?coords.

        #filter for country “Poland”
        ?loc ecrm:P87_is_identified_by ?address.
        ?address foko:has_country ?country.
        FILTER (?country = "Polen")
      }
    } # end service call

    #coordinates are returned as strings; the next few lines turn them into a WKT point literal
    #assumption: the string has the form "latitude, longitude", while WKT expects "POINT(longitude latitude)"
    BIND(REPLACE(STRBEFORE(?coords, ","), " ", "") AS ?lat)
    BIND(REPLACE(STRAFTER(?coords, ","), " ", "") AS ?lon)
    BIND(CONCAT("POINT(", ?lon, " ", ?lat, ")") AS ?FOKO_point_string)
    BIND(STRDT(?FOKO_point_string, geo:wktLiteral) AS ?FOKO_coords)

    #marks “FOKO” as the source for the results in this part of the query
    BIND("FOKO" AS ?layer)
  }

  #results from all three sources are combined with the UNION keyword so that there is one line per result (instead of one item from each of the three sources in one line, which would be an incorrect representation of the data)
  UNION
  {
    #this part calls no service and therefore looks for results in the origin endpoint, the Herrenhäuser Wikibase; it finds all items with the classification “manor house” and country “Poland” and returns their coordinates and labels
    ?item tibt:P97 tib:Q200.
    ?item tibt:P127 tib:Q7549.
    ?item tibt:P37 ?HH_coords.
    ?item rdfs:label ?HH_label.

    #marks “Herrenhäuser” as the source for the results in this part of the query
    BIND("Herrenhäuser_WB" AS ?layer)
  }
  UNION
  {
    #call to the Wikidata SPARQL service; the query is essentially the same as above
    SERVICE <https://query.wikidata.org/sparql> {
      ?Wikidata_item wdt:P31 wd:Q879050.
      ?Wikidata_item wdt:P625 ?WD_coords.
      ?Wikidata_item wdt:P17 wd:Q36.
      ?Wikidata_item rdfs:label ?wd_label.

      #marks “Wikidata” as the source for the results in this part of the query
      BIND("Wikidata" AS ?layer)
    }
  }
}

#Because a source was assigned to each item using the ?layer variable, each result has a colour corresponding to its source in the map view. One dot on the map can only have one colour, meaning there is no way to see if an item exists in multiple databases.

Try the query.

Future scenarios

The example queries in this blog post only query data from very few data sources because most databases do not have a SPARQL endpoint. This limits the usefulness of such queries. In an ideal world, it would be possible to use a GND ID to find the complete data about a person and their works from all collections. Projects like the Culture Knowledge Graph have taken steps in the right direction by making more linked open data available. More work in this area is needed so that federated queries can be a meaningful tool for researchers beyond very specific cases.

Another issue for federated queries is identifiers. As can be seen in the first query example, finding data about persons in the Culture Knowledge Graph was unproblematic because they were identified by their GND ID. The GND has broad coverage of persons, and institutions can add new persons. Additionally, this identifier is used by many cultural organisations.

For works, especially buildings, the situation is different. There is no established authority file in the German language area that comes close to the coverage and use of the GND. Wikidata does have a lot of data about cultural works, but it is not an authority file in the strict sense since anyone can edit its data. A work authority file that also covers buildings is needed not just for federated queries but in general, so that researchers can rely on stable identifiers when searching for information and when communicating with other researchers.