Software-specific part-of-speech tagging: An experimental study on Stack Overflow

Deheng Ye, Zhenchang Xing, Jing Li, Nachiket Kapre

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

16 Citations (Scopus)

Abstract

Part-of-speech (POS) tagging performance degrades on out-of-domain data due to the lack of domain knowledge. Software engineering knowledge, embodied in textual documentation, bug reports and online forum discussions, is expressed in natural language, but is full of domain terms, software entities and software-specific informal language. Such software texts call for software-specific POS tagging. In the software engineering community, there have been several attempts to leverage POS tagging techniques to help solve software engineering tasks. However, little work has been done on POS tagging of software natural language texts. In this paper, we build a software-specific POS tagger, called S-POS, for processing the textual discussions on Stack Overflow. We target Stack Overflow because it has become an important developer-generated knowledge repository for software engineering. We define a POS tagset that is suitable for describing software engineering knowledge, select a corpus, develop a custom tokenizer, annotate data, design features for supervised model training, and demonstrate that the tagging accuracy of S-POS outperforms that of the Stanford POS Tagger when tagging software texts. Our work presents a feasible roadmap for building a software-specific POS tagger for the socio-professional content on Stack Overflow, and reveals challenges and opportunities for advanced software-specific information extraction.
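To make the "custom tokenizer" step concrete, the following is a minimal sketch of why software text needs its own tokenization: a general-purpose tokenizer would typically split `String.format()` or `C++` into several pieces before tagging. The patterns and the `tokenize` function are illustrative assumptions, not the rules used by S-POS, which the abstract does not describe.

```python
import re

# Hypothetical token patterns, ordered from most to least specific.
# These are assumptions for illustration, not the S-POS tokenizer rules.
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+                                 # URLs
  | [A-Za-z_]\w*(?:\.[A-Za-z_]\w*)+(?:\(\))?     # dotted names, optional call parens
  | [A-Za-z_]\w*\(\)                             # bare calls such as main()
  | [A-Za-z]+(?:\+\+|\#)                         # language names ending in ++ or a hash sign
  | \w+(?:[-']\w+)*                              # ordinary, hyphenated, or contracted words
  | [^\w\s]                                      # any other single punctuation mark
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens, keeping software entities as single tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("Use String.format() in C++, not printf."))
```

A sketch this small has obvious gaps; for example, it would treat an abbreviation such as "e.g" as a dotted name. A real tokenizer for forum text would need rules derived from the annotated corpus.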

Original language: English
Title of host publication: 2016 Symposium on Applied Computing, SAC 2016
Publisher: Association for Computing Machinery
Pages: 1378-1385
Number of pages: 8
ISBN (Electronic): 9781450337397
Publication status: Published - 4 Apr 2016
Externally published: Yes
Event: 31st Annual ACM Symposium on Applied Computing, SAC 2016 - Pisa, Italy
Duration: 4 Apr 2016 - 8 Apr 2016

Publication series

Name: Proceedings of the ACM Symposium on Applied Computing
Volume: 04-08-April-2016

Conference

Conference: 31st Annual ACM Symposium on Applied Computing, SAC 2016
Country/Territory: Italy
City: Pisa
Period: 4/04/16 - 8/04/16
