Evaluating sampling methods for uncooperative collections

Paul Thomas*, David Hawking

*Corresponding author for this work

    Research output: Chapter in Book/Report/Conference proceeding; Conference Paper; peer-review

    20 Citations (Scopus)

    Abstract

    Many server selection methods suitable for distributed information retrieval applications rely, in the absence of cooperation, on the availability of unbiased samples of documents from the constituent collections. We describe a number of sampling methods which depend only on the normal query-response mechanism of the applicable search facilities. We evaluate these methods on a number of collections typical of a personal metasearch application. Results demonstrate that biases exist for all methods, particularly toward longer documents, and that in some cases these biases can be reduced but not eliminated by choice of parameters. We also introduce a new sampling technique, "multiple queries", which produces samples of similar quality to the best current techniques but with significantly reduced cost.

    Original language: English
    Title of host publication: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07
    Pages: 503-510
    Number of pages: 8
    Publication status: Published - 2007
    Event: 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07 - Amsterdam, Netherlands
    Duration: 23 Jul 2007 to 27 Jul 2007

    Publication series

    Name: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07

    Conference

    Conference: 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07
    Country/Territory: Netherlands
    City: Amsterdam
    Period: 23/07/07 to 27/07/07
