TY - GEN
T1 - MASCOT: Fast and Highly Scalable SVM Cross-Validation Using GPUs and SSDs
T2 - 14th IEEE International Conference on Data Mining, ICDM 2014
AU - Wen, Zeyi
AU - Zhang, Rui
AU - Ramamohanarao, Kotagiri
AU - Qi, Jianzhong
AU - Taylor, Kerry
N1 - Publisher Copyright:
© 2014 IEEE.
PY - 2014/1/1
Y1 - 2014/1/1
N2 - Cross-validation is a commonly used method for evaluating the effectiveness of Support Vector Machines (SVMs). However, existing SVM cross-validation algorithms are not scalable to large datasets because they have to (i) hold the whole dataset in memory and/or (ii) perform a very large number of kernel value computations. In this paper, we propose a scheme to dramatically improve the scalability and efficiency of SVM cross-validation through the following key ideas. (i) To avoid holding the whole dataset in memory and performing repeated kernel value computations, we precompute the kernel values and reuse them. (ii) We store the precomputed kernel values in a high-speed storage framework, consisting of CPU memory extended by solid state drives (SSDs) and GPU memory as a cache, so that reusing (i.e., reading) kernel values takes much less time than computing them on the fly. (iii) To further improve the efficiency of SVM training, we apply a number of techniques to the extreme example search algorithm, design a parallel kernel value read algorithm, propose a caching strategy well suited to the characteristics of the storage framework, and parallelize the tasks on the GPU and the CPU. For datasets of sizes that existing algorithms can handle, our scheme achieves a speedup of several orders of magnitude. More importantly, our scheme enables SVM cross-validation on very large datasets that existing algorithms are unable to handle.
AB - Cross-validation is a commonly used method for evaluating the effectiveness of Support Vector Machines (SVMs). However, existing SVM cross-validation algorithms are not scalable to large datasets because they have to (i) hold the whole dataset in memory and/or (ii) perform a very large number of kernel value computations. In this paper, we propose a scheme to dramatically improve the scalability and efficiency of SVM cross-validation through the following key ideas. (i) To avoid holding the whole dataset in memory and performing repeated kernel value computations, we precompute the kernel values and reuse them. (ii) We store the precomputed kernel values in a high-speed storage framework, consisting of CPU memory extended by solid state drives (SSDs) and GPU memory as a cache, so that reusing (i.e., reading) kernel values takes much less time than computing them on the fly. (iii) To further improve the efficiency of SVM training, we apply a number of techniques to the extreme example search algorithm, design a parallel kernel value read algorithm, propose a caching strategy well suited to the characteristics of the storage framework, and parallelize the tasks on the GPU and the CPU. For datasets of sizes that existing algorithms can handle, our scheme achieves a speedup of several orders of magnitude. More importantly, our scheme enables SVM cross-validation on very large datasets that existing algorithms are unable to handle.
KW - Cross-validation
KW - GPUs
KW - SSDs
KW - SVM
KW - Training
UR - http://www.scopus.com/inward/record.url?scp=84936947596&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2014.35
DO - 10.1109/ICDM.2014.35
M3 - Conference contribution
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 580
EP - 589
BT - Proceedings - 14th IEEE International Conference on Data Mining, ICDM 2014
A2 - Kumar, Ravi
A2 - Toivonen, Hannu
A2 - Pei, Jian
A2 - Zhexue Huang, Joshua
A2 - Wu, Xindong
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 14 December 2014 through 17 December 2014
ER -