TY - GEN
T1 - Comparison of cutoff strategies for geometrical features in machine learning-based scoring functions
AU - Siu, Shirley W.I.
AU - Wong, Thomas K.F.
AU - Fong, Simon
PY - 2013
Y1 - 2013
N2 - Counts of protein-ligand contacts are popular geometrical features in scoring functions for structure-based drug design. When extracting features, cutoff values are used to define the range of distances within which a protein-ligand atom pair is considered to be in contact. However, the effects of the number of ranges and the choice of cutoff values on the predictive ability of scoring functions are unclear. Here, we compare five cutoff strategies (one-, two-, three-, six-range and soft boundary) with four machine learning methods. Prediction models are constructed using the latest PDBbind v2012 data sets and assessed by correlation coefficients. Our results show that the optimal one-range cutoff value lies between 6 and 8 Å instead of the customary choice of 12 Å. In general, two-range models improve predictive performance in terms of correlation coefficients by 3-5%, but introducing more cutoff ranges does not always help improve the prediction accuracy.
AB - Counts of protein-ligand contacts are popular geometrical features in scoring functions for structure-based drug design. When extracting features, cutoff values are used to define the range of distances within which a protein-ligand atom pair is considered to be in contact. However, the effects of the number of ranges and the choice of cutoff values on the predictive ability of scoring functions are unclear. Here, we compare five cutoff strategies (one-, two-, three-, six-range and soft boundary) with four machine learning methods. Prediction models are constructed using the latest PDBbind v2012 data sets and assessed by correlation coefficients. Our results show that the optimal one-range cutoff value lies between 6 and 8 Å instead of the customary choice of 12 Å. In general, two-range models improve predictive performance in terms of correlation coefficients by 3-5%, but introducing more cutoff ranges does not always help improve the prediction accuracy.
KW - geometrical features
KW - machine learning
KW - protein-ligand binding affinity
KW - scoring function
KW - structure-based drug design
UR - http://www.scopus.com/inward/record.url?scp=84893056839&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-53917-6_30
DO - 10.1007/978-3-642-53917-6_30
M3 - Conference contribution
SN - 9783642539169
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 336
EP - 347
BT - Advanced Data Mining and Applications - 9th International Conference, ADMA 2013, Proceedings
T2 - 9th International Conference on Advanced Data Mining and Applications, ADMA 2013
Y2 - 14 December 2013 through 16 December 2013
ER -