TY - JOUR
T1 - Statistical modeling of a ligand knowledge base
AU - Mansson, Ralph A.
AU - Welsh, Alan H.
AU - Fey, Natalie
AU - Orpen, A. Guy
PY - 2006
Y1 - 2006
N2 - A range of different statistical models has been fitted to experimental data for the Tolman electronic parameter (TEP) based on a large set of calculated descriptors in a prototype ligand knowledge base (LKB) of phosphorus(III) donor ligands. The models have been fitted by ordinary least squares using subsets of descriptors, principal component regression, and partial least squares which use variables derived from the complete set of descriptors, least angle regression, and the least absolute shrinkage and selection operator. None of these methods is robust against outliers, so we also applied a robust estimation procedure to the linear regression model. Criteria for model evaluation and comparison have been discussed, highlighting the importance of resampling methods for assessing the robustness of models and the scope for making predictions in chemically intuitive models. For the ligands covered by this LKB, ordinary least squares models of descriptor subsets provide a good representation of the data, while partial least squares, principal component regression, and least angle regression models are less suitable for our dual aims of prediction and interpretation. A linear regression model with robustly fitted parameters achieves the best model performance over all classes of models fitted to TEP data, and the weightings assigned to ligands during the robust estimation procedure are chemically intuitive. The increased model complexity when compared to the ordinary least squares linear model is justified by the reduced influence of individual ligands on the model parameters and predictions of new ligands. Robust linear regression models therefore represent the best compromise for achieving statistical robustness in simple, chemically meaningful models.
UR - http://www.scopus.com/inward/record.url?scp=33845807472&partnerID=8YFLogxK
U2 - 10.1021/ci600212t
DO - 10.1021/ci600212t
M3 - Article
SN - 1549-9596
VL - 46
SP - 2591
EP - 2600
JO - Journal of Chemical Information and Modeling
JF - Journal of Chemical Information and Modeling
IS - 6
ER -