Unsupervised software-specific morphological forms inference from informal discussions

Chunyang Chen, Zhenchang Xing, Ximing Wang

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

    57 Citations (Scopus)

    Abstract

    Informal discussions on social platforms (e.g., Stack Overflow) accumulate a large body of programming knowledge in natural language text. Natural language processing (NLP) techniques can be exploited to harvest this knowledge base for software engineering tasks. To make effective use of NLP techniques, a consistent vocabulary is essential. Unfortunately, the same concepts are often intentionally or accidentally mentioned in many different morphological forms in informal discussions, such as abbreviations, synonyms and misspellings. Existing techniques to deal with such morphological forms are either designed for general English or predominantly rely on domain-specific lexical rules. A thesaurus of software-specific terms and commonly used morphological forms is desirable for normalizing software engineering text, but very difficult to build manually. In this work, we propose an automatic approach to build such a thesaurus. Our approach identifies software-specific terms by contrasting software-specific and general corpora, and infers morphological forms of software-specific terms by combining distributed word semantics, domain-specific lexical rules and transformations, and graph analysis of morphological relations. We evaluate the coverage and accuracy of the resulting thesaurus against community-curated lists of software-specific terms, abbreviations and synonyms. We also manually examine the correctness of the identified abbreviations and synonyms in our thesaurus. We demonstrate the usefulness of our thesaurus in a case study of normalizing questions from Stack Overflow and CodeProject.
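The pipeline outlined in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the smoothed frequency-ratio test, the subsequence rule for abbreviations, the edit-distance threshold for misspellings, and all function names below are assumptions chosen for illustration. The distributed-semantics step is not implemented; `candidate_pairs` stands in for pairs that a word-embedding model would have ranked as semantically similar.

```python
from collections import Counter, defaultdict


def domain_specific_terms(domain_tokens, general_tokens, min_ratio=2.0, min_count=2):
    """Keep terms that are relatively more frequent in the domain corpus (add-one smoothing)."""
    d, g = Counter(domain_tokens), Counter(general_tokens)
    vocab = len(set(d) | set(g))
    terms = set()
    for term, count in d.items():
        if count < min_count:
            continue
        p_domain = (count + 1) / (len(domain_tokens) + vocab)
        p_general = (g[term] + 1) / (len(general_tokens) + vocab)
        if p_domain / p_general >= min_ratio:
            terms.add(term)
    return terms


def is_abbreviation(short, full):
    """Hypothetical lexical rule: same first letter, and `short` is a subsequence of `full`."""
    if not short or len(short) >= len(full) or short[0] != full[0]:
        return False
    remaining = iter(full)
    return all(ch in remaining for ch in short)


def edit_distance(a, b):
    """Levenshtein distance, used here as a crude misspelling test."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def group_forms(pairs):
    """Graph step: connected components over the accepted morphological relations."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    seen, groups = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, component = [node], set()
        while stack:
            cur = stack.pop()
            if cur not in component:
                component.add(cur)
                stack.extend(graph[cur] - component)
        seen |= component
        groups.append(component)
    return groups


def infer_thesaurus(candidate_pairs):
    """Filter semantically similar candidate pairs with lexical rules, then group them."""
    kept = [
        (a, b)
        for a, b in candidate_pairs
        if is_abbreviation(a, b) or is_abbreviation(b, a) or edit_distance(a, b) <= 1
    ]
    return group_forms(kept)
```

Lexical rules alone are too permissive in this sketch (for example, "java" passes the subsequence rule against "javascript"), which is one reason the abstract combines them with distributed word semantics rather than relying on either signal in isolation.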

    Original language: English
    Title of host publication: Proceedings - 2017 IEEE/ACM 39th International Conference on Software Engineering, ICSE 2017
    Publisher: Institute of Electrical and Electronics Engineers Inc.
    Pages: 450-461
    Number of pages: 12
    ISBN (Electronic): 9781538638682
    Publication status: Published - 19 Jul 2017
    Event: 39th IEEE/ACM International Conference on Software Engineering, ICSE 2017 - Buenos Aires, Argentina
    Duration: 20 May 2017 to 28 May 2017

    Publication series

    Name: Proceedings - 2017 IEEE/ACM 39th International Conference on Software Engineering, ICSE 2017

    Conference

    Conference: 39th IEEE/ACM International Conference on Software Engineering, ICSE 2017
    Country/Territory: Argentina
    City: Buenos Aires
    Period: 20/05/17 to 28/05/17
