Tiny But Mighty: A Crowdsourced Benchmark Dataset for Triple Extraction from Unstructured Text

Muhammad Salman*, Armin Haller, Sergio J. Rodríguez Méndez, Usman Naseem

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

In the context of Natural Language Processing (NLP) and Semantic Web applications, constructing Knowledge Graphs (KGs) from unstructured text plays a vital role. Several techniques have been developed for KG construction from text, but the lack of standardized datasets hinders the evaluation of triple extraction methods. The evaluation of existing KG construction approaches is based on structured data or manual investigations. To overcome this limitation, this work introduces a novel dataset specifically designed to evaluate KG construction techniques from unstructured text. Our dataset consists of a diverse collection of compound and complex sentences meticulously annotated by human annotators with potential triples (subject, predicate, object). The annotations underwent further scrutiny by expert ontologists to ensure accuracy and consistency. For evaluation purposes, the proposed F-measure criterion offers a robust approach to quantify the relatedness and assess the alignment between extracted triples and the ground-truth triples, providing a valuable tool for evaluating the performance of triple extraction systems. By providing a diverse collection of high-quality triples, our proposed benchmark dataset offers a comprehensive training and evaluation set for refining the performance of state-of-the-art language models on a triple extraction task. Furthermore, this dataset encompasses various KG-related tasks, such as named entity recognition, relation extraction, and entity linking.

Original language: English
Title of host publication: ISA 2024
Subtitle of host publication: 20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation at LREC-COLING 2024, Workshop Proceedings
Editors: Harry Bunt
Publisher: European Language Resources Association (ELRA)
Pages: 71-81
Number of pages: 11
Publication status: Published - 2024
Event: 20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation, ISA 2024 - Torino, Italy
Duration: 20 May 2024 → …

Publication series

Name: ISA 2024: 20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation at LREC-COLING 2024, Workshop Proceedings

Conference

Conference: 20th Joint ACL - ISO Workshop on Interoperable Semantic Annotation, ISA 2024
Country/Territory: Italy
City: Torino
Period: 20/05/24 → …
