TY - JOUR
T1 - Regional Explanations and Diverse Molecular Representations in Cheminformatics
T2 - A Comparative Study
AU - Wang, Xin
AU - Barnard, Amanda S.
AU - Li, Sichao
N1 - Publisher Copyright:
© 2025 Xin Wang et al.
PY - 2025
Y1 - 2025
N2 - In cheminformatics, the explainability of machine learning models is important for interpreting complex chemical data, deriving new chemical insights, and building trust in predictive models. However, cheminformatics datasets often exhibit clustered distributions, and traditional explanation methods may overlook intra-cluster variations, complicating the extraction of meaningful explanations. Additionally, diverse representations (tabular, sequence, image, and graph) yield divergent explanations. To address these issues, we propose a novel approach termed regional explanation, designed as an intermediate-level interpretability method that bridges the gap between local and global explanations. This approach systematically reveals how explanations and feature importance vary across data clusters. Using 2 public datasets with natural clustering properties, a graphene oxide nanoflakes dataset and QM9, we comprehensively evaluate 4 molecular representations through tabular, sequence, image, and graph regional explanations, providing practical guidelines for representation selection. Our analysis illuminates complex, nonlinear relationships between molecular structures and predicted properties within clusters; explores the interplay among molecular features, feature importance, and target properties across distinct regions of chemical space; and advances the interpretability of machine learning models for complex molecular systems.
AB - In cheminformatics, the explainability of machine learning models is important for interpreting complex chemical data, deriving new chemical insights, and building trust in predictive models. However, cheminformatics datasets often exhibit clustered distributions, and traditional explanation methods may overlook intra-cluster variations, complicating the extraction of meaningful explanations. Additionally, diverse representations (tabular, sequence, image, and graph) yield divergent explanations. To address these issues, we propose a novel approach termed regional explanation, designed as an intermediate-level interpretability method that bridges the gap between local and global explanations. This approach systematically reveals how explanations and feature importance vary across data clusters. Using 2 public datasets with natural clustering properties, a graphene oxide nanoflakes dataset and QM9, we comprehensively evaluate 4 molecular representations through tabular, sequence, image, and graph regional explanations, providing practical guidelines for representation selection. Our analysis illuminates complex, nonlinear relationships between molecular structures and predicted properties within clusters; explores the interplay among molecular features, feature importance, and target properties across distinct regions of chemical space; and advances the interpretability of machine learning models for complex molecular systems.
UR - http://www.scopus.com/inward/record.url?scp=105007896810&partnerID=8YFLogxK
U2 - 10.34133/icomputing.0126
DO - 10.34133/icomputing.0126
M3 - Article
AN - SCOPUS:105007896810
SN - 2771-5892
VL - 4
JO - Intelligent Computing
JF - Intelligent Computing
M1 - 0126
ER -