Doc-KG: Unstructured documents to knowledge graph construction, identification and validation with Wikidata

Muhammad Salman*, Armin Haller, Sergio J.Rodríguez Méndez, Usman Naseem

*Corresponding author for this work

Research output: Contribution to journalArticlepeer-review

Abstract

The exponential growth of textual data in the digital era underlines the pivotal role of Knowledge Graphs (KGs) in effectively storing, managing, and utilizing this vast reservoir of information. Despite the copious amounts of text available on the web, a significant portion remains unstructured, presenting a substantial barrier to the automatic construction and enrichment of KGs. To address this issue, we introduce an enhanced Doc-KG model, a sophisticated approach designed to transform unstructured documents into structured knowledge by generating local KGs and mapping these to a target KG, such as Wikidata. Our model innovatively leverages syntactic information to extract entities and predicates efficiently, integrating them into triples with improved accuracy. Furthermore, the Doc-KG model's performance surpasses existing methodologies by utilizing advanced algorithms for both the extraction of triples and their subsequent identification within Wikidata, employing Wikidata's Unified Resource Identifiers for precise mapping. This dual capability not only facilitates the construction of KGs directly from unstructured texts but also enhances the process of identifying triple mentions within Wikidata, marking a significant advancement in the domain. Our comprehensive evaluation, conducted using the renowned WebNLG benchmark dataset, reveals the Doc-KG model's superior performance in triple extraction tasks, achieving an unprecedented accuracy rate of 86.64%. In the domain of triple identification, the model demonstrated exceptional efficacy by mapping 61.35% of the local KG to Wikidata, thereby contributing 38.65% of novel information for KG enrichment. A qualitative analysis based on a manually annotated dataset further confirms the model's excellence, outshining baseline methods in extracting high-fidelity triples. This research embodies a novel contribution to the field of knowledge extraction and management, offering a robust framework for the semantic structuring of unstructured data and paving the way for the next generation of KGs.

Original languageEnglish
Article number9
Number of pages17
JournalExpert Systems
Volume41
Issue number9
DOIs
Publication statusPublished - Sept 2024

Fingerprint

Dive into the research topics of 'Doc-KG: Unstructured documents to knowledge graph construction, identification and validation with Wikidata'. Together they form a unique fingerprint.

Cite this