Application of machine reading comprehension techniques for named entity recognition in materials science, Journal of Cheminformatics


Application of machine reading comprehension techniques for named entity recognition in materials science
Journal of Cheminformatics ( IF 7.1 ) Pub Date : 2024-07-02 , DOI: 10.1186/s13321-024-00874-5
Zihui Huang ₁ , Liqiang He ₁ , Yuhang Yang ₁ , Andi Li ₁ , Zhiwen Zhang ₁ , Siwei Wu ₁ , Yang Wang ₁ , Yan He ₁ , Xujie Liu ₁

Materials science is an interdisciplinary field that studies the properties, structures, and behaviors of different materials. The scientific literature contains a wealth of materials science knowledge, but manually analyzing papers to find material-related data is a daunting task. Named entity recognition (NER) plays a crucial role in information processing because it can automatically extract materials science entities, which are valuable for tasks such as building knowledge graphs. The sequence labeling methods typically used for named entity recognition in materials science (MatNER) often fail to fully utilize the semantic information in the dataset and cannot effectively extract nested entities. Herein, we propose converting the sequence labeling task into a machine reading comprehension (MRC) task. The MRC method can effectively solve the challenge of extracting multiple overlapping entities by transforming the task into answering multiple independent questions. Moreover, by integrating prior knowledge from the queries, the MRC framework allows for a more comprehensive understanding of the contextual information and semantic relationships within materials science literature. The MRC approach achieved state-of-the-art (SOTA) performance on the Matscholar, BC4CHEMD, NLMChem, SOFC, and SOFC-Slot datasets, with F1-scores of 89.64%, 94.30%, 85.89%, 85.95%, and 71.73%, respectively. By effectively utilizing semantic information and extracting nested entities, this approach holds great significance for knowledge extraction and data analysis in materials science, thereby accelerating the development of the field.
Scientific contribution: We have developed an innovative NER method that enhances the efficiency and accuracy of automatic entity extraction in materials science by transforming the sequence labeling task into an MRC task. This approach provides robust support for constructing knowledge graphs and other data analysis tasks.
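
To make the reformulation in the abstract concrete, the following is a minimal sketch of how one annotated sentence could be turned into MRC-style instances, one question per entity type. It is not the authors' implementation: the label names (`MATERIAL`, `SAMPLE`, `PROPERTY`), the query wording, the example sentence, and the helper `to_mrc_examples` are all assumptions made for illustration, since the abstract does not give the actual queries or data format.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical queries: the abstract does not state the real query wording.
QUERIES: Dict[str, str] = {
    "MATERIAL": "Which materials are mentioned in the text?",
    "SAMPLE": "Which sample descriptions are mentioned in the text?",
    "PROPERTY": "Which material properties are mentioned in the text?",
}


@dataclass
class MRCExample:
    label: str                    # entity type this question targets
    query: str                    # natural-language question for that type
    context: List[str]            # the tokenized sentence
    spans: List[Tuple[int, int]]  # answer spans as inclusive token indices


def to_mrc_examples(
    tokens: List[str],
    entities: List[Tuple[int, int, str]],
    queries: Dict[str, str] = QUERIES,
) -> List[MRCExample]:
    """Build one (query, context, answer spans) instance per entity type."""
    examples = []
    for label, query in queries.items():
        # Keep only the spans of this type; spans of other types, even if
        # they overlap, belong to a different question.
        spans = sorted((s, e) for s, e, lab in entities if lab == label)
        examples.append(MRCExample(label, query, list(tokens), spans))
    return examples


# Hypothetical annotated sentence with a nested entity:
# "TiO2" (MATERIAL) lies inside "TiO2 thin films" (SAMPLE).
tokens = ["The", "band", "gap", "of", "TiO2", "thin", "films", "was", "measured"]
entities = [(1, 2, "PROPERTY"), (4, 4, "MATERIAL"), (4, 6, "SAMPLE")]

examples = to_mrc_examples(tokens, entities)
for ex in examples:
    answers = [" ".join(ex.context[s:e + 1]) for s, e in ex.spans]
    print(ex.label, "|", ex.query, "|", answers)
```

Because each entity type is asked about separately, the two overlapping spans in this sentence never compete for the same per-token tag, which is the difficulty a single flat tagging scheme has with nested entities. A real system would feed each query and context pair to a span-extraction model; that part is omitted here.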

Updated: 2024-07-03
