TY - JOUR
T1 - Comparative analysis of feature representations and machine learning methods in Android family classification
AU - Bai, Yude
AU - Xing, Zhenchang
AU - Ma, Duoyuan
AU - Li, Xiaohong
AU - Feng, Zhiyong
N1 - Publisher Copyright: © 2020 Elsevier B.V.
PY - 2021/1/15
Y1 - 2021/1/15
AB - To cope with the continuing growth of Android malware, malware family classification, which groups malware with the same features into one family, has been proposed as an efficient approach to malware analysis. Several machine-learning-based approaches have been proposed for this task. However, because these approaches adopt very different features and learning methods, it remains an open question which approach works better for malware family classification. In this paper, we conduct extensive experiments to answer this question. On three widely known Android malware datasets, we design five multi-class classification methods for predicting Android malware families. Based on a survey of the Android malware analysis literature and observation of a large number of Android malware samples, we construct a set of 250 common features shared by Android malware. For comparison, we also collect 16,873 documentary features from Android Developer. Furthermore, we investigate the effects of transfer learning for adapting the model across the three malware datasets of different scales. Our empirical results show that (i) the classification methods perform very similarly, with the neural network model performing marginally better (by 1% to 3% in F1-score), (ii) features contribute most to classification, particularly the enhancement of API features on larger datasets, and (iii) the models are transferable across different malware datasets in various transfer learning tasks.
KW - Android Malware Family
KW - Machine learning
KW - Multi-classification
KW - Transfer Learning
UR - http://www.scopus.com/inward/record.url?scp=85094888514&partnerID=8YFLogxK
U2 - 10.1016/j.comnet.2020.107639
DO - 10.1016/j.comnet.2020.107639
M3 - Article
SN - 1389-1286
VL - 184
JO - Computer Networks
JF - Computer Networks
M1 - 107639
ER -