TY - GEN
T1 - HDSKG
T2 - 24th IEEE International Conference on Software Analysis, Evolution, and Reengineering, SANER 2017
AU - Zhao, Xuejiao
AU - Xing, Zhenchang
AU - Kabir, Muhammad Ashad
AU - Sawada, Naoya
AU - Li, Jing
AU - Lin, Shang Wei
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/3/21
Y1 - 2017/3/21
AB - Knowledge graphs are useful in many domains, such as search result ranking, recommendation, and exploratory search. A knowledge graph integrates structural information about concepts across multiple information sources and links these concepts together. The extraction of domain-specific relation triples (subject, verb phrase, object) is one of the important techniques for domain-specific knowledge graph construction. In this research, an automatic method named HDSKG is proposed to discover domain-specific concepts and their relation triples from the content of webpages. We combine a dependency parser with a rule-based method to chunk candidate relation triples, then extract advanced features of these candidates and estimate their domain relevance with a machine learning algorithm. To evaluate our method, we apply HDSKG to Stack Overflow (a Q&A website about computer programming). As a result, we construct a knowledge graph of the software engineering domain with 35279 relation triples, 44800 concepts, and 9660 unique verb phrases. The experimental results show that both the precision and recall of HDSKG (0.78 and 0.7, respectively) are much higher than those of openIE (0.11 and 0.6, respectively). HDSKG performs particularly well on complex sentences. Furthermore, with the self-training technique used in the classifier, HDSKG can be easily applied to other domains with less training data.
KW - Dependency Parse
KW - Knowledge Graph
KW - Stack Overflow
KW - Structural Information Extraction
KW - openIE
UR - http://www.scopus.com/inward/record.url?scp=85018390307&partnerID=8YFLogxK
U2 - 10.1109/SANER.2017.7884609
DO - 10.1109/SANER.2017.7884609
M3 - Conference contribution
T3 - SANER 2017 - 24th IEEE International Conference on Software Analysis, Evolution, and Reengineering
SP - 56
EP - 67
BT - SANER 2017 - 24th IEEE International Conference on Software Analysis, Evolution, and Reengineering
A2 - Bavota, Gabriele
A2 - Pinzger, Martin
A2 - Marcus, Andrian
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 21 February 2017 through 24 February 2017
ER -