TY - JOUR
T1 - Engineering a multi-purpose test collection for Web retrieval experiments
AU - Bailey, Peter
AU - Craswell, Nick
AU - Hawking, David
PY - 2003/11
Y1 - 2003/11
N2 - Past research into text retrieval methods for the Web has been restricted by the lack of a test collection capable of supporting experiments which are both realistic and reproducible. The 1.69 million document WT10g collection is proposed as a multi-purpose testbed for experiments with these attributes, in distributed IR, hyperlink algorithms and conventional ad hoc retrieval. WT10g was constructed by selecting from a superset of documents in such a way that desirable corpus properties were preserved or optimised. These properties include: a high degree of inter-server connectivity, integrity of server holdings, inclusion of documents related to a very wide spread of likely queries, and a realistic distribution of server holding sizes. We confirm that WT10g contains exploitable link information using a site (homepage) finding experiment. Our results show that, on this task, Okapi BM25 works better on propagated link anchor text than on full text. WT10g was used in TREC-9 and TREC-2001 and both topic relevance and homepage finding queries and judgments are available.
AB - Past research into text retrieval methods for the Web has been restricted by the lack of a test collection capable of supporting experiments which are both realistic and reproducible. The 1.69 million document WT10g collection is proposed as a multi-purpose testbed for experiments with these attributes, in distributed IR, hyperlink algorithms and conventional ad hoc retrieval. WT10g was constructed by selecting from a superset of documents in such a way that desirable corpus properties were preserved or optimised. These properties include: a high degree of inter-server connectivity, integrity of server holdings, inclusion of documents related to a very wide spread of likely queries, and a realistic distribution of server holding sizes. We confirm that WT10g contains exploitable link information using a site (homepage) finding experiment. Our results show that, on this task, Okapi BM25 works better on propagated link anchor text than on full text. WT10g was used in TREC-9 and TREC-2001 and both topic relevance and homepage finding queries and judgments are available.
KW - Distributed information retrieval
KW - Link-based ranking
KW - Test collections
KW - Web retrieval
UR - http://www.scopus.com/inward/record.url?scp=0042766369&partnerID=8YFLogxK
U2 - 10.1016/S0306-4573(02)00084-5
DO - 10.1016/S0306-4573(02)00084-5
M3 - Article
SN - 0306-4573
VL - 39
SP - 853
EP - 871
JO - Information Processing and Management
JF - Information Processing and Management
IS - 6
ER -