Automatic identification of the most important elements in an XML collection

Alexander Krumpholz*, Nina Studeny, David Hawking, Amir Hadad, Tom Gedeon

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

An important problem in XML retrieval is determining the most useful element types to retrieve, e.g. book, chapter, section, paragraph or caption. An automated system for doing this could be based on features of element types related to size, depth, frequency of occurrence, etc. We consider a large number of such features and assess their usefulness in predicting the types of elements judged relevant in INEX evaluations for the IEEE and Wikipedia 2006 corpora. For each feature we automatically assign Useful / Not-Useful labels to element types using Fuzzy c-Means Clustering. We then rank the features by the accuracy with which they predict the manual judgments. We find strong overlap between the top-ten most predictive features for the two collections, and that seven features achieve high average accuracy (F-measure > 65%) across them. We hypothesize that an XML retrieval system working on an unlabelled corpus could use these features to decide which retrieval units are most appropriate to return to the user.
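
The pipeline summarised in the abstract (cluster element types on one feature, label one cluster Useful, score the labels against manual judgments, rank features by score) can be illustrated with a minimal sketch. This is not the authors' implementation: the element-type names, feature names, feature values and judged set below are invented for illustration, the fuzzifier m = 2 is an assumed default, and treating the cluster with the higher centre as "Useful" is an assumption that the abstract does not state.

```python
def fuzzy_c_means_1d(values, m=2.0, iterations=100, tol=1e-9):
    """Two-cluster fuzzy c-means on scalars; returns (centres, memberships)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("need at least two distinct values")
    centres = [lo, hi]  # deterministic start at the extremes (an assumption)
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iterations):
        memberships = []
        for x in values:
            dists = [abs(x - c) for c in centres]
            if 0.0 in dists:
                # The point sits exactly on a centre: full membership there.
                row = [1.0 if d == 0.0 else 0.0 for d in dists]
            else:
                # Standard membership update: u_i = 1 / sum_k (d_i / d_k)^(2/(m-1))
                row = [1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
                       for d_i in dists]
            memberships.append(row)
        # Centre update: mean of the points weighted by membership^m.
        new_centres = []
        for k in range(2):
            weights = [row[k] ** m for row in memberships]
            total = sum(w * x for w, x in zip(weights, values))
            new_centres.append(total / sum(weights))
        shift = max(abs(a - b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift < tol:
            break
    return centres, memberships


def label_useful(feature_by_type, m=2.0):
    """Return the element types whose main membership is in the higher cluster."""
    types = list(feature_by_type)
    values = [float(feature_by_type[t]) for t in types]
    centres, memberships = fuzzy_c_means_1d(values, m=m)
    high = 0 if centres[0] > centres[1] else 1  # assumption: higher centre = Useful
    return {t for t, row in zip(types, memberships) if row[high] >= 0.5}


def f_measure(predicted, judged):
    """Balanced F-measure of predicted Useful types against judged ones."""
    tp = len(predicted & judged)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(judged)
    return 2 * precision * recall / (precision + recall)


def rank_features(features, judged):
    """Sort feature names by how well their clustering predicts the judgments."""
    scores = {name: f_measure(label_useful(vals), judged)
              for name, vals in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: two invented features over six invented element types.
features = {
    "mean_length": {"sec": 900, "p": 800, "fig": 850, "it": 10, "b": 12, "ref": 15},
    "frequency": {"sec": 50, "p": 4000, "fig": 60, "it": 3900, "b": 4100, "ref": 45},
}
judged_useful = {"sec", "p", "ref"}  # stand-in for manual relevance judgments

ranking = rank_features(features, judged_useful)
for name, score in ranking:
    print(f"{name}: F = {score:.3f}")
```

A single feature is clustered at a time here because the abstract assigns labels "for each feature"; a feature whose higher cluster matches the judged types more closely ends up higher in the ranking.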

Original language: English
Title of host publication: ADCS 2011 - Proceedings of the Sixteenth Australasian Document Computing Symposium
Pages: 14-17
Number of pages: 4
Publication status: Published - 2011
Event: 16th Australasian Document Computing Symposium, ADCS 2011 - Canberra, ACT, Australia
Duration: 2 Dec 2011 → 2 Dec 2011

Publication series

Name: ADCS 2011 - Proceedings of the Sixteenth Australasian Document Computing Symposium

Conference

Conference: 16th Australasian Document Computing Symposium, ADCS 2011
Country/Territory: Australia
City: Canberra, ACT
Period: 2/12/11 → 2/12/11
