Classification-Based Detection and Quantification of Cross-Domain Data Bias in Materials Discovery.
Journal of Chemical Information and Modeling (IF 5.6), Pub Date: 2024-12-16, DOI: 10.1021/acs.jcim.4c01766
Giovanni Trezza, Eliodoro Chiavazzo
It stands to reason that the amount and the quality of data are of key importance for setting up accurate artificial intelligence (AI)-driven models. Among others, a fundamental aspect to consider is the bias introduced during sample selection in database generation. This is particularly relevant when a model is trained on a specialized data set to predict a property of interest and then applied to forecast the same property over samples having a completely different genesis. Indeed, the resulting biased model will likely produce unreliable predictions for many of those out-of-the-box samples, i.e., samples out of the training set. Neglecting such an aspect may hinder the AI-based discovery process, even when high-quality, sufficiently large, and highly reputable data sources are available. To address this challenge, we propose a new method that detects and quantifies data bias, reducing its impact on materials discovery. Our approach, aimed at identifying and excluding those out-of-the-box materials for which the predictions of a pretrained model are likely unreliable, leverages a classification strategy and is validated by means of superconductor and thermoelectric materials as two representative case studies. This methodology, designed to be simple, flexible, and easily adaptable to any architecture, including modern graph equivariant neural networks, aims to enhance the reliability of AI models when applied to diverse and previously unseen materials, thereby contributing to more reliable AI-driven materials discovery.
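The abstract does not spell out the classification strategy, so the following is only a minimal sketch of one common way such an idea can be realized, not the authors' implementation. It assumes a "domain classifier": a logistic regression trained to tell training-set (source) samples from samples of a different origin (target). Under that assumption, the classifier's accuracy serves as a rough bias measure (about 0.5 when the two sets overlap, close to 1.0 when they are easily separated), and candidates that look target-like are excluded before a pretrained property model is applied. All function names, the 0.5 threshold, and the synthetic 2-D features are hypothetical and chosen for illustration.

```python
import math
import random


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_domain_classifier(source, target, epochs=500, lr=0.2):
    """Fit a logistic regression by batch gradient descent.

    Assumption: source samples get label 0, target samples label 1.
    Returns the weights and the intercept as a (w, b) tuple.
    """
    data = [(x, 0.0) for x in source] + [(x, 1.0) for x in target]
    n, dim = len(data), len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def p_target(model, x):
    """Estimated probability that x resembles the target domain."""
    w, b = model
    return _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def bias_score(model, source, target):
    """Domain-classifier accuracy on the two sets (hypothetical bias measure)."""
    hits = sum(p_target(model, x) < 0.5 for x in source)
    hits += sum(p_target(model, x) >= 0.5 for x in target)
    return hits / (len(source) + len(target))


def keep_in_domain(model, candidates, threshold=0.5):
    """Keep only candidates that the classifier does not flag as target-like."""
    return [x for x in candidates if p_target(model, x) < threshold]


# Synthetic 2-D descriptors standing in for real material features.
random.seed(0)
source = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
target = [(random.gauss(3, 1), random.gauss(3, 1)) for _ in range(200)]

model = fit_domain_classifier(source, target)
print("bias score:", bias_score(model, source, target))
print("kept:", keep_in_domain(model, [(0.0, 0.0), (3.0, 3.0)]))
```

In practice the score would be computed on held-out samples rather than on the data used to fit the classifier, and the descriptors would come from the same featurization used by the property model; both points are left out here to keep the sketch short.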
Updated: 2024-12-16