Big data small data, in domain out-of domain, known word unknown word: The impact of word representations on sequence labelling tasks

Lizhen Qu, Gabriela Ferraro, Liyuan Zhou, Weiwei Hou, Nathan Schneider, Timothy Baldwin

    Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

    22 Citations (Scopus)

    Abstract

    Word embeddings — distributed word representations that can be learned from unlabelled data — have been shown to have high utility in many natural language processing applications. In this paper, we perform an extrinsic evaluation of four popular word embedding methods in the context of four sequence labelling tasks: part-of-speech tagging, syntactic chunking, named entity recognition, and multiword expression identification. A particular focus of the paper is analysing the effects of task-based updating of word representations. We show that when using word embeddings as features, as few as several hundred training instances are sufficient to achieve competitive results, and that word embeddings lead to improvements over out-of-vocabulary words and also out of domain. Perhaps more surprisingly, our results indicate there is little difference between the different word embedding methods, and that simple Brown clusters are often competitive with word embeddings across all tasks we consider.

    Original language: English
    Title of host publication: CoNLL 2015 - 19th Conference on Computational Natural Language Learning, Proceedings
    Publisher: Association for Computational Linguistics (ACL)
    Pages: 83-93
    Number of pages: 11
    ISBN (Electronic): 9781941643778
    Publication status: Published - 2015
    Event: 19th Conference on Computational Natural Language Learning, CoNLL 2015 - Beijing, China
    Duration: 30 Jul 2015 - 31 Jul 2015

    Publication series

    Name: CoNLL 2015 - 19th Conference on Computational Natural Language Learning, Proceedings

    Conference

    Conference: 19th Conference on Computational Natural Language Learning, CoNLL 2015
    Country/Territory: China
    City: Beijing
    Period: 30/07/15 - 31/07/15
