TY - GEN
T1 - Server selection methods in hybrid portal search
AU - Hawking, David
AU - Thomas, Paul
PY - 2005
Y1 - 2005
N2 - The TREC .GOV collection makes a valuable web testbed for distributed information retrieval methods because it is naturally partitioned and includes 725 web-oriented queries with judged answers. It can usefully model aspects of government and large corporate portals. Analysis of the .gov data shows that a purely distributed approach would not be feasible for providing search on a .gov portal because of the large number (17,000+) of web sites and the high proportion that do not provide a search interface. An alternative hybrid approach, combining both distributed and centralized techniques, is proposed and server selection methods are evaluated within this framework using web-oriented evaluation methodology. A number of well-known algorithms are compared against representatives (highest anchor ranked page (HARP) and anchor weighted sum (AWSUM)) of a family of new selection methods which use link anchortext extracted from an auxiliary crawl to provide descriptions of sites which are not themselves crawled. Of the previously published methods, ReDDE substantially outperformed three variants of CORI and also outperformed a method based on Kullback-Leibler Divergence (extended) except on topic distillation. HARP and AWSUM performed best overall but were outperformed on the topic distillation task by extended KL Divergence.
AB - The TREC .GOV collection makes a valuable web testbed for distributed information retrieval methods because it is naturally partitioned and includes 725 web-oriented queries with judged answers. It can usefully model aspects of government and large corporate portals. Analysis of the .gov data shows that a purely distributed approach would not be feasible for providing search on a .gov portal because of the large number (17,000+) of web sites and the high proportion that do not provide a search interface. An alternative hybrid approach, combining both distributed and centralized techniques, is proposed and server selection methods are evaluated within this framework using web-oriented evaluation methodology. A number of well-known algorithms are compared against representatives (highest anchor ranked page (HARP) and anchor weighted sum (AWSUM)) of a family of new selection methods which use link anchortext extracted from an auxiliary crawl to provide descriptions of sites which are not themselves crawled. Of the previously published methods, ReDDE substantially outperformed three variants of CORI and also outperformed a method based on Kullback-Leibler Divergence (extended) except on topic distillation. HARP and AWSUM performed best overall but were outperformed on the topic distillation task by extended KL Divergence.
KW - Experimentation
KW - H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - selection process
KW - H.3.4 [Information Storage and Retrieval]: Systems and Software - performance evaluation (efficiency and effectiveness)
KW - Measurement
KW - Performance
KW - distributed systems
UR - http://www.scopus.com/inward/record.url?scp=84885572144&partnerID=8YFLogxK
U2 - 10.1145/1076034.1076050
DO - 10.1145/1076034.1076050
M3 - Conference contribution
SN - 1595930345
SN - 9781595930347
T3 - SIGIR 2005 - Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 75
EP - 82
BT - SIGIR 2005 - Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
T2 - 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2005
Y2 - 15 August 2005 through 19 August 2005
ER -