Suitability of large language models for extraction of high-quality chemical reaction dataset from patent literature
Journal of Cheminformatics (IF 7.1), Pub Date: 2024-11-26, DOI: 10.1186/s13321-024-00928-8
Sarveswara Rao Vangala, Sowmya Ramaswamy Krishnan, Navneet Bung, Dhandapani Nandagopal, Gomathi Ramasamy, Satyam Kumar, Sridharan Sankaran, Rajgopal Srinivasan, Arijit Roy

With the advent of artificial intelligence (AI), it is now possible to design diverse and novel molecules from previously unexplored chemical space. However, a challenge for chemists is the synthesis of such molecules. Recently, there have been attempts to develop AI models for retrosynthesis prediction, which rely on the availability of a high-quality training dataset. In this work, we explore the suitability of large language models (LLMs) for extraction of high-quality chemical reaction data from patent documents. A comparative study on the same set of patents from an earlier study showed that the proposed automated approach can enhance the current datasets by addition of 26% new reactions. Several challenges were identified during reaction mining, and for some of them alternative solutions were proposed. A detailed analysis was also performed wherein several wrong entries were identified in the previously curated dataset. Reactions extracted using the proposed pipeline over a larger patent dataset can improve the accuracy and efficiency of synthesis prediction models in future.

Scientific contribution
In this work we evaluated the suitability of large language models for mining a high-quality chemical reaction dataset from patent literature. We showed that the proposed approach can significantly improve the quantity of the reaction database by identifying more chemical reactions and improve the quality of the reaction database by correcting previous errors/false positives.

Updated: 2024-11-26