TY - GEN
T1 - Learning to extract API mentions from informal natural language discussions
AU - Ye, Deheng
AU - Xing, Zhenchang
AU - Foo, Chee Yong
AU - Li, Jing
AU - Kapre, Nachiket
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2017/1/12
Y1 - 2017/1/12
N2 - When discussing programming issues on social platforms (e.g, Stack Overflow, Twitter), developers often mention APIs in natural language texts. Extracting API mentions in natural language texts is a prerequisite for effective indexing and searching for API-related information in software engineering social content. However, the informal nature of social discussions creates two fundamental challenges for API extraction: common-word polysemy and sentence-format variations. Common-word polysemy refers to the ambiguity between the API sense of a common word and the normal sense of the word (e.g., append, apply and merge). Sentence-format variations refer to the lack of consistent sentence writing format for inferring API mentions. Existing API extraction techniques fall short to address these two challenges, because they assume distinct API naming conventions (e.g., camel case, underscore) or structured sentence format (e.g., code-like phrase, API annotation, or full API name). In this paper, we propose a semi-supervised machine-learning approach that exploits name synonyms and rich semantic context of API mentions to extract API mentions in informal social text. The key innovation of our approach is to exploit two complementary unsupervised language models learned from the abundant un-labeled text to model sentence-format variations and to train a robust model with a small set of labeled data and an iterative self-training process. The evaluation of 1,205 API mentions of the three libraries (Pandas, Numpy, and Matplotlib) in Stack Overflow texts shows that our approach significantly outperforms existing API extraction techniques based on language-convention and sentence-format heuristics and our earlier machine-learning based method for named-entity recognition.
AB - When discussing programming issues on social platforms (e.g, Stack Overflow, Twitter), developers often mention APIs in natural language texts. Extracting API mentions in natural language texts is a prerequisite for effective indexing and searching for API-related information in software engineering social content. However, the informal nature of social discussions creates two fundamental challenges for API extraction: common-word polysemy and sentence-format variations. Common-word polysemy refers to the ambiguity between the API sense of a common word and the normal sense of the word (e.g., append, apply and merge). Sentence-format variations refer to the lack of consistent sentence writing format for inferring API mentions. Existing API extraction techniques fall short to address these two challenges, because they assume distinct API naming conventions (e.g., camel case, underscore) or structured sentence format (e.g., code-like phrase, API annotation, or full API name). In this paper, we propose a semi-supervised machine-learning approach that exploits name synonyms and rich semantic context of API mentions to extract API mentions in informal social text. The key innovation of our approach is to exploit two complementary unsupervised language models learned from the abundant un-labeled text to model sentence-format variations and to train a robust model with a small set of labeled data and an iterative self-training process. The evaluation of 1,205 API mentions of the three libraries (Pandas, Numpy, and Matplotlib) in Stack Overflow texts shows that our approach significantly outperforms existing API extraction techniques based on language-convention and sentence-format heuristics and our earlier machine-learning based method for named-entity recognition.
UR - http://www.scopus.com/inward/record.url?scp=85013036501&partnerID=8YFLogxK
U2 - 10.1109/ICSME.2016.11
DO - 10.1109/ICSME.2016.11
M3 - Conference contribution
T3 - Proceedings - 2016 IEEE International Conference on Software Maintenance and Evolution, ICSME 2016
SP - 389
EP - 399
BT - Proceedings - 2016 IEEE International Conference on Software Maintenance and Evolution, ICSME 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 32nd IEEE International Conference on Software Maintenance and Evolution, ICSME 2016
Y2 - 2 October 2016 through 10 October 2016
ER -