TY - GEN
T1 - Automatic identification of the most important elements in an XML collection
AU - Krumpholz, Alexander
AU - Studeny, Nina
AU - Hawking, David
AU - Hadad, Amir
AU - Gedeon, Tom
PY - 2011
Y1 - 2011
N2 - An important problem in XML retrieval is determining the most useful element types to retrieve - e.g. book, chapter, section, paragraph or caption. An automated system for doing this could be based on features of element types related to size, depth, frequency of occurrence, etc. We consider a large number of such features and assess their usefulness in predicting the types of elements judged relevant in INEX evaluations for the IEEE and Wikipedia 2006 corpora. For each feature we automatically assign Useful / Not-Useful labels to element types using Fuzzy c-Means Clustering. We then rank the features by the accuracy with which they predict the manual judgments. We find strong overlap between the top-ten most predictive features for the two collections and that seven features achieve high average accuracy (F-measure > 65%) across them. We hypothesize that an XML retrieval system working on an unlabelled corpus could use these features to decide which retrieval units are most appropriate to return to the user.
AB - An important problem in XML retrieval is determining the most useful element types to retrieve - e.g. book, chapter, section, paragraph or caption. An automated system for doing this could be based on features of element types related to size, depth, frequency of occurrence, etc. We consider a large number of such features and assess their usefulness in predicting the types of elements judged relevant in INEX evaluations for the IEEE and Wikipedia 2006 corpora. For each feature we automatically assign Useful / Not-Useful labels to element types using Fuzzy c-Means Clustering. We then rank the features by the accuracy with which they predict the manual judgments. We find strong overlap between the top-ten most predictive features for the two collections and that seven features achieve high average accuracy (F-measure > 65%) across them. We hypothesize that an XML retrieval system working on an unlabelled corpus could use these features to decide which retrieval units are most appropriate to return to the user.
KW - F-Measure
KW - Fuzzy C-Means Clustering
KW - XML Retrieval
UR - http://www.scopus.com/inward/record.url?scp=84872860710&partnerID=8YFLogxK
M3 - Conference contribution
SN - 9781921426926
T3 - ADCS 2011 - Proceedings of the Sixteenth Australasian Document Computing Symposium
SP - 14
EP - 17
BT - ADCS 2011 - Proceedings of the Sixteenth Australasian Document Computing Symposium
T2 - 16th Australasian Document Computing Symposium, ADCS 2011
Y2 - 2 December 2011 through 2 December 2011
ER -