TY - GEN
T1 - Unsupervised software-specific morphological forms inference from informal discussions
AU - Chen, Chunyang
AU - Xing, Zhenchang
AU - Wang, Ximing
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2017/7/19
Y1 - 2017/7/19
AB - Informal discussions on social platforms (e.g., Stack Overflow) accumulate a large body of programming knowledge in natural language text. Natural language processing (NLP) techniques can be exploited to harvest this knowledge base for software engineering tasks. To make effective use of NLP techniques, a consistent vocabulary is essential. Unfortunately, the same concepts are often intentionally or accidentally mentioned in many different morphological forms in informal discussions, such as abbreviations, synonyms and misspellings. Existing techniques to deal with such morphological forms are either designed for general English or rely predominantly on domain-specific lexical rules. A thesaurus of software-specific terms and commonly used morphological forms is desirable for normalizing software engineering text, but very difficult to build manually. In this work, we propose an automatic approach to build such a thesaurus. Our approach identifies software-specific terms by contrasting software-specific and general corpora, and infers morphological forms of software-specific terms by combining distributed word semantics, domain-specific lexical rules and transformations, and graph analysis of morphological relations. We evaluate the coverage and accuracy of the resulting thesaurus against community-curated lists of software-specific terms, abbreviations and synonyms. We also manually examine the correctness of the identified abbreviations and synonyms in our thesaurus. We demonstrate the usefulness of our thesaurus in a case study of normalizing questions from Stack Overflow and CodeProject.
KW - Abbreviation
KW - Morphological Form
KW - Stack Overflow
KW - Synonym
KW - Word embedding
UR - http://www.scopus.com/inward/record.url?scp=85027696498&partnerID=8YFLogxK
U2 - 10.1109/ICSE.2017.48
DO - 10.1109/ICSE.2017.48
M3 - Conference contribution
T3 - Proceedings - 2017 IEEE/ACM 39th International Conference on Software Engineering, ICSE 2017
SP - 450
EP - 461
BT - Proceedings - 2017 IEEE/ACM 39th International Conference on Software Engineering, ICSE 2017
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 39th IEEE/ACM International Conference on Software Engineering, ICSE 2017
Y2 - 20 May 2017 through 28 May 2017
ER -