Text segmentation and Chinese site search

Liyuan Zhou, David Hawking, Paul Thomas

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

    Abstract

    Automatic segmentation and overlapping bigrams are the most common methods for overcoming the lack of explicit word boundaries in Chinese text. Past studies have compared their effectiveness, but findings have been equivocal and site search has been little studied. We compare representatives of the two approaches using a 465,000 page crawl and test queries applicable to the university context. 503 pairs of result sets were judged by 56 Chinese students. Although there are differences on certain queries, we find no overall advantage to either method. To understand the merits of each approach, we analyze cases where they performed differently. Our analysis enumerates situations which favour segmentation, and those which favour bigrams. We observe that further improvements in segmentation accuracy will not improve retrieval effectiveness.
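
To make the two indexing approaches named in the abstract concrete, here is a minimal sketch in Python. It is an illustration only and not the method used in the paper: the function names, the toy lexicon, and the choice of greedy forward maximum matching as a stand-in for "automatic segmentation" are all assumptions, and production segmenters are typically statistical rather than purely dictionary-based.

```python
def overlapping_bigrams(text: str) -> list[str]:
    """Index every pair of adjacent characters; no lexicon is needed."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        # A single character (or nothing) cannot form a bigram.
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def forward_max_match(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """Toy segmenter (hypothetical stand-in): greedily take the longest
    lexicon entry starting at each position, falling back to one character."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical query and lexicon, chosen only for illustration.
query = "大学图书馆"  # "university library"
toy_lexicon = {"大学", "图书", "图书馆"}

print(overlapping_bigrams(query))             # ['大学', '学图', '图书', '书馆']
print(forward_max_match(query, toy_lexicon))  # ['大学', '图书馆']
```

In this toy case the bigram index contains the spurious term 学图, which straddles a word boundary, while the segmenter's output depends entirely on what its lexicon happens to contain.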

    Original language: English
    Title of host publication: ADCS 2015 - Proceedings of the 20th Australasian Document Computing Symposium
    Editors: Sarvnaz Karimi, Laurence A. F. Park
    Publisher: Association for Computing Machinery (ACM)
    ISBN (Electronic): 9781450340403
    Publication status: Published - 8 Dec 2015
    Event: 20th Australasian Document Computing Symposium, ADCS 2015 - Parramatta, Australia
    Duration: 8 Dec 2015 to 9 Dec 2015

    Publication series

    Name: ACM International Conference Proceeding Series
    Volume: 08-09-Dec-2015

    Conference

    Conference: 20th Australasian Document Computing Symposium, ADCS 2015
    Country/Territory: Australia
    City: Parramatta
    Period: 8/12/15 to 9/12/15
