Unlocking the Potential of Clustering and Classification Approaches: Navigating Supervised and Unsupervised Chemical Similarity


Environmental Health Perspectives (IF 10.1), Pub Date: 2024-08-06, DOI: 10.1289/ehp14001
Kamel Mansouri, Kyla Taylor, Scott Auerbach, Stephen Ferguson, Rachel Frawley, Jui-Hua Hsieh, Gloria Jahnke, Nicole Kleinstreuer, Suril Mehta, José T Moreira-Filho, Fred Parham, Cynthia Rider, Andrew A Rooney, Amy Wang, Vicki Sutherland

BACKGROUND: The field of toxicology has witnessed substantial advancements in recent years, particularly with the adoption of new approach methodologies (NAMs) to understand and predict chemical toxicity. Class-based methods such as clustering and classification are key to NAMs development and application, aiding the understanding of hazard and risk concerns associated with groups of chemicals without additional laboratory work. Advances in computational chemistry, data generation and availability, and machine learning algorithms represent important opportunities for continued improvement of these techniques to optimize their utility for specific regulatory and research purposes. However, due to their intricacy, deep understanding and careful selection are imperative to align the adequate methods with their intended applications.

OBJECTIVES: This commentary aims to deepen the understanding of class-based approaches by elucidating the pivotal role of chemical similarity (structural and biological) in clustering and classification approaches (CCAs). It addresses the dichotomy between general end point-agnostic similarity, often entailing unsupervised analysis, and end point-specific similarity necessitating supervised learning. The goal is to highlight the nuances of these approaches, their applications, and common misuses.

DISCUSSION: Understanding similarity is pivotal in toxicological research involving CCAs. The effectiveness of these approaches depends on the right definition and measure of similarity, which varies based on context and objectives of the study. This choice is influenced by how chemical structures are represented and the respective labels indicating biological activity, if applicable. The distinction between unsupervised clustering and supervised classification methods is vital, requiring the use of end point-agnostic vs. end point-specific similarity definition. Separate use or combination of these methods requires careful consideration to prevent bias and ensure relevance for the goal of the study. Unsupervised methods use end point-agnostic similarity measures to uncover general structural patterns and relationships, aiding hypothesis generation and facilitating exploration of datasets without the need for predefined labels or explicit guidance. Conversely, supervised techniques demand end point-specific similarity to group chemicals into predefined classes or to train classification models, allowing accurate predictions for new chemicals. Misuse can arise when unsupervised methods are applied to end point-specific contexts, like analog selection in read-across, leading to erroneous conclusions. This commentary provides insights into the significance of similarity and its role in supervised classification and unsupervised clustering approaches. https://doi.org/10.1289/EHP14001.
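
The contrast between end point-agnostic and end point-specific similarity can be illustrated with a minimal Python sketch. It is not taken from the commentary: the four chemicals, their fingerprint bits, and their activity labels are hypothetical toy data, and the helper names (`tanimoto`, `threshold_clusters`, `label_informative_bits`) are made up for illustration. Keeping only the bits whose carriers all share one label is a deliberately crude stand-in for supervised learning, used here only to show how labels can change which chemicals count as similar.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bits."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def threshold_clusters(fps, threshold):
    """Single-linkage grouping: link every pair with similarity >= threshold."""
    names = sorted(fps)
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if tanimoto(fps[a], fps[b]) >= threshold:
            parent[find(b)] = find(a)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


def label_informative_bits(fps, labels):
    """Keep bits whose carriers all share one label (toy supervised selection)."""
    seen_labels = {}
    for name, bits in fps.items():
        for bit in bits:
            seen_labels.setdefault(bit, set()).add(labels[name])
    return {bit for bit, seen in seen_labels.items() if len(seen) == 1}


# Hypothetical toy data: sets of structural-feature bits and activity labels.
fingerprints = {
    "chem_A": {1, 2, 3, 4},
    "chem_B": {1, 2, 3, 5},
    "chem_C": {4, 6, 7, 8},
    "chem_D": {5, 6, 7, 8},
}
labels = {"chem_A": 1, "chem_B": 0, "chem_C": 1, "chem_D": 0}  # 1 = active

# End point-agnostic: cluster on the full fingerprint, ignoring labels.
agnostic_groups = threshold_clusters(fingerprints, 0.5)

# End point-specific: compare chemicals only on label-informative bits.
informative = label_informative_bits(fingerprints, labels)
specific_fps = {n: bits & informative for n, bits in fingerprints.items()}
specific_groups = threshold_clusters(specific_fps, 0.5)

print("agnostic:", agnostic_groups)
print("specific:", specific_groups)
```

In this toy setup the agnostic clusters follow the shared scaffold bits and each one mixes an active with an inactive chemical, whereas the label-informed grouping pairs chemicals by the bits that track activity. That mirrors the misuse the abstract warns about: picking read-across analogs from an unsupervised cluster would pair chem_A with chem_B even though their labels differ.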


Updated: 2024-08-06
