TY - GEN
T1 - Software-specific part-of-speech tagging
T2 - 31st Annual ACM Symposium on Applied Computing, SAC 2016
AU - Ye, Deheng
AU - Xing, Zhenchang
AU - Li, Jing
AU - Kapre, Nachiket
N1 - Publisher Copyright: © 2016 ACM.
PY - 2016/4/4
Y1 - 2016/4/4
N2 - Part-of-speech (POS) tagging performance degrades on out-of-domain data due to the lack of domain knowledge. Software engineering knowledge, embodied in textual documentations, bug reports and online forum discussions, is expressed in natural language, but is full of domain terms, software entities and software-specific informal languages. Such software texts call for software-specific POS tagging. In the software engineering community, there have been several attempts leveraging POS tagging technique to help solve software engineering tasks. However, little work is done for POS tagging on software natural language texts. In this paper, we build a software-specific POS tagger, called S-POS, for processing the textual discussions on Stack Overflow. We target at Stack Overflow because it has become an important developer-generated knowledge repository for software engineering. We define a POS tagset that is suitable for describing software engineering knowledge, select corpus, develop a custom tokenizer, annotate data, design features for supervised model training, and demonstrate that the tagging accuracy of S-POS outperforms that of the Stanford POS Tagger when tagging software texts. Our work presents a feasible roadmap to build software-specific POS tagger for the socio-professional contents on Stack Overflow, and reveals challenges and opportunities for advanced software-specific information extraction.
AB - Part-of-speech (POS) tagging performance degrades on out-of-domain data due to the lack of domain knowledge. Software engineering knowledge, embodied in textual documentations, bug reports and online forum discussions, is expressed in natural language, but is full of domain terms, software entities and software-specific informal languages. Such software texts call for software-specific POS tagging. In the software engineering community, there have been several attempts leveraging POS tagging technique to help solve software engineering tasks. However, little work is done for POS tagging on software natural language texts. In this paper, we build a software-specific POS tagger, called S-POS, for processing the textual discussions on Stack Overflow. We target at Stack Overflow because it has become an important developer-generated knowledge repository for software engineering. We define a POS tagset that is suitable for describing software engineering knowledge, select corpus, develop a custom tokenizer, annotate data, design features for supervised model training, and demonstrate that the tagging accuracy of S-POS outperforms that of the Stanford POS Tagger when tagging software texts. Our work presents a feasible roadmap to build software-specific POS tagger for the socio-professional contents on Stack Overflow, and reveals challenges and opportunities for advanced software-specific information extraction.
KW - Information extraction
KW - Mining software repositories
KW - Natural language processing
KW - Part-of-speech tagging
UR - http://www.scopus.com/inward/record.url?scp=84975824230&partnerID=8YFLogxK
U2 - 10.1145/2851613.2851772
DO - 10.1145/2851613.2851772
M3 - Conference contribution
T3 - Proceedings of the ACM Symposium on Applied Computing
SP - 1378
EP - 1385
BT - 2016 Symposium on Applied Computing, SAC 2016
PB - Association for Computing Machinery
Y2 - 4 April 2016 through 8 April 2016
ER -