Brief Report | 11 May 2023
Extension of Wikibase support in the OpenRefine data cleaning tool
By Dr. Lozana Rossenova
Project proposal context
Wikibase is a popular open-source tool used by cultural and research institutions to store and structure Linked Open Data, as well as various media files. It is part of the media viewing environment (Semantic Kompakkt) developed in the context of Task Area 1: Data capture and enrichment, and part of the portfolio of Knowledge Graph Services developed in Task Area 5: Technical, Ethical and Legal activities at NFDI4Culture.
OpenRefine is a widely used data wrangling tool for cleaning tabular data and connecting it with knowledge bases, including Wikidata and Wikibase. Managers of Wikibase instances regularly need to upload and edit data and media files in batches. Prior to this project, OpenRefine’s Wikibase extension already supported batch uploads and batch edits of metadata on Wikidata and arbitrary Wikibases. The Flex Funds award allows the OpenRefine team to extend this existing functionality with support for local media upload in arbitrary Wikibases and support for custom data types.
This project builds upon earlier work on the Wikibase extension, funded by a grant from the Wikimedia Foundation. That work added support for uploading and batch editing files on Wikimedia Commons (Wikimedia’s media repository), but not yet on individual Wikibase installations.
Deliverables
1) Batch upload and batch editing of media files in Wikibases through OpenRefine
The OpenRefine Reconciliation Service for Wikimedia Commons was modified and abstracted so that it can connect OpenRefine to any Wikibase instance, not just Wikimedia Commons. As a result, managers of, and contributors to, a Wikibase instance are able to upload large batches (up to tens of thousands) of media files to an arbitrary Wikibase. They can also use OpenRefine to edit (modify, add to, delete) the structured metadata of the media files stored in their Wikibase.
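To illustrate what connecting OpenRefine to a reconciliation service involves, the following minimal sketch builds a batch of queries in the general shape used by the Reconciliation Service API (a `queries` form field holding a JSON object keyed `q0`, `q1`, ...) and reads candidates back from a response. It makes no network calls. The file names, the `M101` identifier, the scores and all function names are hypothetical and chosen for illustration only; this is not the code of the services described in this report.

```python
import json
from urllib.parse import urlencode, parse_qs


def build_queries(names, type_id=None, limit=3):
    """Build one reconciliation query per cell value, keyed q0, q1, ..."""
    queries = {}
    for index, name in enumerate(names):
        query = {"query": name, "limit": limit}
        if type_id is not None:
            # Optional restriction to one entity type of the target Wikibase.
            query["type"] = type_id
        queries[f"q{index}"] = query
    return queries


def encode_form_body(queries):
    """Serialise the batch as a URL-encoded `queries` form field."""
    return urlencode({"queries": json.dumps(queries)})


def best_matches(response):
    """Return, per query key, the id of the first candidate flagged as a match, else None."""
    matches = {}
    for key, payload in response.items():
        candidates = payload.get("result", [])
        confident = [c for c in candidates if c.get("match")]
        matches[key] = confident[0]["id"] if confident else None
    return matches


# Hypothetical file names taken from a column of an OpenRefine project.
queries = build_queries(["Portrait of a woman.jpg", "City map 1890.tif"])
body = encode_form_body(queries)

# Round trip: decoding the form body gives back the original batch.
decoded = json.loads(parse_qs(body)["queries"][0])

# Hypothetical response: one confident match, one query without a match.
sample_response = {
    "q0": {"result": [{"id": "M101", "name": "Portrait of a woman.jpg",
                       "score": 100, "match": True}]},
    "q1": {"result": [{"id": "M202", "name": "City map 1891.tif",
                       "score": 62, "match": False}]},
}
matches = best_matches(sample_response)
```

In this sketch, a query without a confident match maps to `None`, which corresponds to a cell the user would review by hand or mark as a new file to upload.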
To make this possible, the media file upload functionality in OpenRefine has also been modified and made more flexible. This includes adding support in OpenRefine for the new media-specific data type ‘local media file’ that is used in the target Wikibase.
Further achievements in this regard include a clearer interface for adding Wikibase instance connections in OpenRefine and for switching between multiple connected instances. The schema building interface (where metadata is added to files before upload) has also been adapted to the use case of working with a Wikibase instance rather than Wikimedia Commons.
2) Batch editing of data stored in custom (non-Wikidata) data types in Wikibases through OpenRefine
Managers of, and contributors to, a Wikibase instance can now also use OpenRefine to batch edit data in any custom data type defined in their Wikibase.
To make this possible, Wikibase data type support in OpenRefine has been modified and made more flexible. In early 2022, OpenRefine’s Wikibase extension already supported all data types that are used inside Wikidata. However, there are cases where Wikibase managers want to deploy custom data types that differ from the ones used in Wikidata. One example is the local media file data type (mentioned above). Another example is the EDTF data type in Wikibase, which is more specific than Wikidata’s own Time data type and which was commissioned and deployed by the Luxembourg Ministry of Culture. With support through the Flex Funds, OpenRefine’s data type support has been made extensible so that, in the future, OpenRefine can support more data types, even ones that have not yet been developed.
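The idea of extensible data type support can be sketched as a registry in which each data type name maps to its own handler, so that a new type is added without touching existing ones. The sketch below is an illustration under stated assumptions, not OpenRefine’s actual implementation (OpenRefine itself is not written in Python). The names `HANDLERS`, `register` and `convert` are hypothetical, and the EDTF handler accepts only a deliberately small subset of EDTF: a year, year-month or full date, optionally followed by one qualifier.

```python
import re
from typing import Callable, Dict

# Hypothetical registry: data type name -> function that validates a cell
# value and returns the value to upload.
HANDLERS: Dict[str, Callable[[str], str]] = {}


def register(datatype: str):
    """Decorator that registers a handler under a data type name."""
    def decorator(func: Callable[[str], str]) -> Callable[[str], str]:
        HANDLERS[datatype] = func
        return func
    return decorator


@register("string")
def handle_string(value: str) -> str:
    value = value.strip()
    if not value:
        raise ValueError("empty string value")
    return value


# Small subset of EDTF only: YYYY, YYYY-MM or YYYY-MM-DD, optionally followed
# by one qualifier (? uncertain, ~ approximate, % both). Calendar validity
# (e.g. 30 February) is not checked.
_EDTF_SUBSET = re.compile(
    r"^[0-9]{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12][0-9]|3[01]))?)?[?~%]?$"
)


# A custom type is added by registering one more handler.
@register("edtf")
def handle_edtf(value: str) -> str:
    value = value.strip()
    if not _EDTF_SUBSET.match(value):
        raise ValueError(f"not in the supported EDTF subset: {value!r}")
    return value


def convert(datatype: str, value: str) -> str:
    """Dispatch a cell value to the handler registered for its data type."""
    try:
        handler = HANDLERS[datatype]
    except KeyError:
        raise ValueError(f"no handler registered for data type {datatype!r}") from None
    return handler(value)
```

With this design, `convert("edtf", "1984?")` is passed through unchanged, while a value such as `"2004-13"` or an unregistered data type name raises a `ValueError` before anything is uploaded.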
3) Thumbnail support for media files
In addition to the planned features 1) and 2), support for showing thumbnails of media files (stored in a Wikibase) inside an OpenRefine project grid was newly implemented. This highly visible feature lets users compare the metadata stored in the grid with the actual media file and spot inconsistencies more easily.
This feature is available both on Wikimedia Commons and on third-party Wikibases. Implementing it also provided the opportunity to introduce an extension point in OpenRefine that lets plugins customise the way project cells are displayed, giving users a further way to adapt the tool to their project-specific needs.
4) Documentation and dissemination
Following the completion of the technical deliverables above, OpenRefine’s team has produced documentation for end users and developers covering the features described above. The documentation is accessible at https://en.wikiversity.org/wiki/Uploading_media_files_to_a_Wikibase_with_OpenRefine and will soon also be available as guidelines via the NFDI4Culture Portal.
The new features were also presented to interested NFDI stakeholders during a meeting of the Linked Open Data Working Group organized by Task Area 5. Meeting notes and slides were later made available to all members of the LOD WG and subscribers to the mailing list.
In the context of this project and in collaboration with the OpenRefine team, TIB’s Open Science Lab Wikibase maintainers took on the long-term hosting and maintenance of the two git repositories for the OpenRefine Wikibase reconciliation services required for data editing and media file upload, respectively. Both are publicly available on the NFDI4Culture GitLab: https://gitlab.com/nfdi4culture/ta1-data-enrichment/openrefine-wikibase and https://gitlab.com/nfdi4culture/ta1-data-enrichment/openrefine-wikibase-media.