TY - JOUR
T1 - Query-independent evidence in home page finding
AU - Upstill, Trystan
AU - Craswell, Nick
AU - Hawking, David
PY - 2003/7
Y1 - 2003/7
N2 - Hyperlink recommendation evidence, that is, evidence based on the structure of a web's link graph, is widely exploited by commercial Web search systems. However, there is little published work to support its popularity. Another form of query-independent evidence, URL-type, has been shown to be beneficial on a home page finding task. We compared the usefulness of these types of evidence on the home page finding task, combined with both content and anchor text baselines. Our experiments made use of five query sets spanning three corpora: one enterprise crawl, and the WT10g and VLC2 Web test collections. We found that, in optimal conditions, all of the query-independent methods studied (in-degree, URL-type, and two variants of PageRank) offered a better than random improvement on a content-only baseline. However, only URL-type offered a better than random improvement on an anchor text baseline. In realistic settings, for either baseline, only URL-type offered consistent gains. In combination with URL-type, the anchor text baseline was more useful for finding popular home pages, but URL-type with content was more useful for finding randomly selected home pages. We conclude that a general home page finding system should combine evidence from document content, anchor text, and URL-type classification.
AB - Hyperlink recommendation evidence, that is, evidence based on the structure of a web's link graph, is widely exploited by commercial Web search systems. However, there is little published work to support its popularity. Another form of query-independent evidence, URL-type, has been shown to be beneficial on a home page finding task. We compared the usefulness of these types of evidence on the home page finding task, combined with both content and anchor text baselines. Our experiments made use of five query sets spanning three corpora: one enterprise crawl, and the WT10g and VLC2 Web test collections. We found that, in optimal conditions, all of the query-independent methods studied (in-degree, URL-type, and two variants of PageRank) offered a better than random improvement on a content-only baseline. However, only URL-type offered a better than random improvement on an anchor text baseline. In realistic settings, for either baseline, only URL-type offered consistent gains. In combination with URL-type, the anchor text baseline was more useful for finding popular home pages, but URL-type with content was more useful for finding randomly selected home pages. We conclude that a general home page finding system should combine evidence from document content, anchor text, and URL-type classification.
KW - Citation and link analysis
KW - Connectivity
KW - Web information retrieval
UR - http://www.scopus.com/inward/record.url?scp=2442488148&partnerID=8YFLogxK
U2 - 10.1145/858476.858479
DO - 10.1145/858476.858479
M3 - Article
SN - 1046-8188
VL - 21
SP - 286
EP - 313
JO - ACM Transactions on Information Systems
JF - ACM Transactions on Information Systems
IS - 3
ER -