Nullification test collections for web spam and SEO

Timothy Jones*, Ramesh Sankaranarayana, David Hawking, Nick Craswell

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

    6 Citations (Scopus)

    Abstract

    Research in the area of adversarial information retrieval (AIR) has been facilitated by the availability of the UK-2006/UK-2007 collections, comprising crawl data, link graph, and spam labels. However, these resources do not adequately support research into nullifying the negative effect of spam or excessive search engine optimisation (SEO) on the ranking of non-spam pages, nor do they support the study of cloaking techniques or click spam. Finally, the domain-restricted nature of a .uk crawl means that only parts of link-farm icebergs may be visible in these crawls. We introduce the term nullification, which we define as "preventing problem pages from negatively affecting search results". We show some important differences between properties of current .uk-restricted crawls and those previously reported for the Web as a whole. We identify a need for an adversarial IR collection which is not domain-restricted and which is supported by appropriate query sets and (optimistically) user-behaviour data. The billion-page unrestricted crawl being conducted by CMU (web09-bst), which will be used in the 2009 TREC Web Track, is assessed as a possible basis for a new AIR test collection. We discuss the pros and cons of its scale, and the feasibility of adding resources such as query lists to enhance the utility of the collection for AIR research.

    Original language: English
    Title of host publication: AIRWeb 2009 - Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web
    Pages: 53-60
    Number of pages: 8
    Publication status: Published - 2009
    Event: 5th International Workshop on Adversarial Information Retrieval on the Web, AIRWeb 2009 - Madrid, Spain
    Duration: 21 Apr 2009

    Publication series

    Name: ACM International Conference Proceeding Series

    Conference

    Conference: 5th International Workshop on Adversarial Information Retrieval on the Web, AIRWeb 2009
    Country/Territory: Spain
    City: Madrid
    Period: 21/04/09
