NFDI4Culture Metadata API

Metadata of all resources as Linked Open Data

Persistent Identifier: <https://nfdi4culture.de/id/E5371>

Twittertagung@DNB – Nachhaltige Archivierung, Erschließung, Bereitstellung dynamischer Daten aus sozialen Medien – Twitter und danach

type
schema : NewsArticle
fabio : Report
image
Low-angle photography of a metal structure
url
https://nfdi4culture.de/news/twittertagungdnb-long-term-archiving.html
name
Twittertagung@DNB: Long-term archiving, cataloguing, and provision of dynamic data from social media – Twitter and beyond
headline
Twittertagung@DNB: Long-term archiving, cataloguing, and provision of dynamic data from social media – Twitter and beyond
text

Long-term archiving, cataloguing, and provision of dynamic data from social media – Twitter and beyond

Conference from 19 to 20 March 2024 and Datasprint from 21 to 22 March at the German National Library (Frankfurt am Main)

Social media is both a source of data for and the focus of a range of research approaches in the humanities, social sciences, IT, sciences and life sciences. Its development over time makes social media a part of our digital cultural heritage, but the processes by which institutions archive and document it in ways that fully reflect its detail and complexity are still only rudimentary. One key reason for this is the unique characteristics of the data in terms of media technology, economics, social factors and aesthetics. This confronts researchers, research institutions and cultural heritage institutions with many different challenges in how to archive, catalogue and provide the data for later use. One example is Twitter (now known as “X”). The monetisation of the platform’s internal archive, part of the ongoing restructuring of the platform, has had a radical impact on research and archiving. While flexible APIs and access opportunities before early 2023 led to a boom in research activity and the creation of comprehensive collections, access for research and archiving has become increasingly difficult since then.

Archiving, cataloguing and providing dynamic data from social media present challenges which affect researchers, research institutions, libraries and archives in equal measure, and the best way to solve these problems is through collaboration and partnership. This requires wide-ranging efforts which would be impossible for a single data community or discipline.

The aim of the conference was to facilitate networking between libraries, archives, research institutes and researchers in German-speaking countries who are involved in archiving and long-term use of data and digital objects from social media.

This was followed on 21 and 22 March by a Datasprint to work with a long-term corpus of 'German' Twitter data. Both the calls and general participation in the conference and Datasprint met with broad interest. There were over 60 participants on site and over 100 online. They came from research institutes, universities, libraries and large and small archives at national and municipal level, with around a third of the participants coming from neighbouring European countries.

The two parts of the programme were held over four days under the leadership of the Deutsche Nationalbibliothek and the Kompetenzwerkstatt of the UB HU Berlin, with strong involvement of GESIS, and not least thanks to the support and cooperation of the participating consortia BERD@NFDI, KonsortSWD, NFDI4Culture, NFDI4Data Science, NFDI4Memory and Text+.

After a welcome by Director General Frank Scholze and an introduction by Claus-Michael Schlesinger and Britta Woldering on behalf of the programme committee, the first day featured two panels, on large data sets and on archives and libraries, which discussed initial problem outlines, long-term archiving projects and research projects. Participants included both municipal and national libraries as well as recognised research projects that were also involved in the collection of the Twitter corpora. These panels were accompanied by poster presentations, during which the participating NFDI consortia were also able to present their infrastructure offerings and communities.

The second day provided insights into platform and research design from a methodological, technical and historical perspective. The afternoon and conclusion of the conference focussed on ethics and law in social media research and, in particular, on the sustainable handling of the data collected. Following the presentation of a GESIS survey, the final discussion (supported by an open Etherpad) was also guided by the debate on the Digital Services Act (DSA). This was followed by central questions on long-term archiving (LTA) and community participation in conjunction with infrastructure facilities, focussing on data formats, metadata standards and the selective archival value of accounts and platforms. These were accompanied by legal and technical problems, especially in data collection and indexing in research data centres and local archives, for which both law and platforms set barriers. In conclusion, initiatives were identified for continued engagement with social media research and Twitter corpora, such as mailing lists, forums or a continuation of this conference.

These then also formed the focus of the second part, the Datasprint on 21 and 22 March. The data sets of German-language tweets were explored, adapted and visualised on 14 terminals for individual research projects in the humanities, social sciences and life sciences. This was made possible not only by the massive computing capacity provided and the data sets prepared by the mentors, but also by the legal framework.
Two of the corpora provided contain German-language Twitter data from 2006-2011 and 2014-2023, while the third corpus represents a one per cent sample of all tweets over a period of ten years.

Corpus 1 (2006 – 2011) encompasses approx. 220 million tweets posted between March 2006 (platform launch) and June 2011. The data were collected using a search function that filtered all tweets labelled by Twitter as German-language. The corpus contains all the metadata that were available for each tweet via the Twitter API. The data are stored in multiple files in JSONL (line-oriented JSON, one tweet per line).
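
As a minimal sketch of how a file in this layout could be read, the following Python snippet parses a line-oriented JSON stream one tweet at a time. The field names (`id_str`, `lang`, `text`) and the two sample records are assumptions made for illustration; they are not taken from the corpus itself.

```python
import io
import json
from typing import Iterator, TextIO


def iter_tweets(lines: TextIO) -> Iterator[dict]:
    """Yield one parsed tweet per non-empty line of a JSONL stream."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Hypothetical two-record sample; real records would carry many more fields.
sample = io.StringIO(
    '{"id_str": "1", "lang": "de", "text": "Guten Morgen"}\n'
    "\n"
    '{"id_str": "2", "lang": "de", "text": "Hallo Welt"}\n'
)

texts = [tweet["text"] for tweet in iter_tweets(sample)]
print(texts)
```

Reading line by line keeps memory use flat, which matters when a corpus is spread over many large files; for a file on disk, the same function can be given the handle returned by `open(path, encoding="utf-8")`.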

Corpus 2 (2014 – 2023) contains approx. 2 billion German-language tweets that were collected in real time with no content filtering. The data were collected using the Scheffler criteria (2014), i.e. tweets which contain German function words ('und', 'sie', 'dass') and pass through a language filter. Besides the text, the corpus contains only selected metadata, i.e. the tweet and user ID, the date and time of posting, the reply-to ID and (for the majority of the data) the geographic coordinates. The corpus thus constitutes a representative cross-section of all German-language tweets between July 2014 and mid-March 2023. The data are available in CSV files (one tweet per line, metadata in columns).
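
The function-word criterion described above can be illustrated with a short Python sketch. It is not the collection pipeline itself: it uses only the three example words named in the text, it omits the separate language filter, and the column names (`tweet_id`, `user_id`, `created_at`, `reply_to_id`, `text`) as well as the sample rows are hypothetical.

```python
import csv
import io
import re

# Only the three example words named in the text; the full list is not given here.
FUNCTION_WORDS = {"und", "sie", "dass"}


def contains_function_word(text: str) -> bool:
    """Return True if any whole word in the text is a listed function word."""
    tokens = re.findall(r"\w+", text.lower())
    return any(token in FUNCTION_WORDS for token in tokens)


def filter_rows(csv_text: str) -> list[dict]:
    """Keep only rows whose (hypothetical) 'text' column passes the check."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if contains_function_word(row["text"])]


# Hypothetical sample in a one-tweet-per-line, metadata-in-columns layout.
SAMPLE_CSV = (
    "tweet_id,user_id,created_at,reply_to_id,text\n"
    "1,10,2014-07-01T12:00:00,,Ich glaube dass es regnet\n"
    "2,11,2014-07-01T12:01:00,1,Hello world\n"
    "3,12,2014-07-01T12:02:00,,Sieben Tage Regen\n"
)

kept = filter_rows(SAMPLE_CSV)
print([row["tweet_id"] for row in kept])
```

Matching on whole tokens rather than substrings means that a word such as "Sieben" does not count as a hit for "sie".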

Corpus 3 (2013 – 2023) encompasses TweetsKB, a Twitter archive based on Twitter's 1% random sample API, that contains a total of 14 billion tweets together with their metadata. Along with the texts and metadata, which are available in JSON format, the archive also contains annotated features such as entities and sentiments.

The organising institutions provided customised subsets of the corpora on request. It was also possible to create certain derivatives and pre-processing steps (e.g. tokenisation, N-grams) and compilations of tweets (e.g. on one or more hashtags, a list of accounts, extraction of hashtags, links, etc.). Mentors from GESIS, RUB, HU and DNB supported the participants' research projects with their knowledge of the data sets and various programming languages.
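
The kinds of derivatives mentioned above (tokenisation, N-grams, hashtag extraction) can be sketched in a few lines of Python. This is an illustrative sketch under simple assumptions, not the tooling used at the Datasprint: the regular-expression tokeniser is deliberately naive, and the two sample tweets are invented.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens, keeping a leading '#'."""
    return re.findall(r"#?\w+", text.lower())


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return all contiguous n-token windows; empty if the input is too short."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def hashtags(text: str) -> list[str]:
    """Return the hashtags in the text, lowercased and without the '#'."""
    return re.findall(r"#(\w+)", text.lower())


# Invented sample tweets for illustration only.
tweets = [
    "Heute Tagung in Frankfurt #DNB #Twitter",
    "Morgen Datasprint in Frankfurt #DNB",
]

hashtag_counts = Counter(tag for tweet in tweets for tag in hashtags(tweet))
bigrams = ngrams(tokenize(tweets[1]), 2)
print(hashtag_counts.most_common())
print(bigrams)
```

A compilation of tweets on one hashtag could then be built by keeping only those tweets for which `hashtags(tweet)` contains the tag of interest.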

The Datasprint was characterised not only by an almost cosy atmosphere in the DNB's training rooms but above all by stimulating discussions. Central to it were the project presentations, including initial results, and the discussion among all participants. Topics ranged from politics, public opinion, journalistic framing, information and publication, and LLMs to more artistic-literary approaches and early computational linguistics in dealing with text corpora.

Further reports and projects are to be published on the GESIS and DNB blogs.

Some presentations and slides of the conference lectures can be found in an overview of the programme on the DNB-Wiki.

CONTACT

Dr. Britta Woldering, Letitia Mölck, German National Library

twarchiv(at)dnb(dot)de

dnb.de/twittertagung

PARTNERS

Deutsche Nationalbibliothek
BERD@NFDI
KDH UB HU Berlin
KonsortSWD
NFDI4Culture
NFDI4Data Science
NFDI4Memory
Text+

datePublished
2024-04-18T13:00:00+02:00
keywords
Art History
Computational Linguistics
Digital Humanities
German Studies
History
Information Science
Information Technology
Law
Library Science
Media Informatics
Media Studies
Political Science
Sociology
Publication & Preservation
Qualification & Reuse
Conference
Archive
Library
Report
contributor
Task Area 6: Cultural Research Data Academy
about
Challenges of long-term archiving of social media platforms: Lecture and discussion series "Show & Tell – Social Media Data in Research Practice"
about
Lecture and discussion series "Show & Tell – Social Media Data in Research Practice"
