Developing a More Accurate Biomedical Literature Retrieval Method using Deep Learning and Citations in PubMed Central Full-text Articles
bioRxiv - Bioinformatics, Pub Date: 2021-10-23, DOI: 10.1101/2021.10.21.465340
Chun-chao Lo, Shubo Tian, Yuchuan Tao, Jie Hao, Jinfeng Zhang

Most queries submitted to a literature search engine could be written more precisely as sentences, which give the search engine more specific information. In principle, sentence queries should be more effective than short queries containing only a few keywords. Querying with full sentences is also a key step in question-answering and citation recommendation systems. Despite the considerable progress in natural language processing (NLP) in recent years, sentence queries on current search engines do not yield satisfactory results. In this study, we developed a deep learning-based method for sentence queries, called DeepSenSe, using citation data available in full-text articles obtained from PubMed Central (PMC). A large amount of labeled data was generated from millions of matched citing sentences and cited articles, making it possible to train high-quality predictive models with modern deep learning techniques. We designed a two-stage approach: in the first stage, a modified BM25 algorithm retrieves the top 1000 relevant articles; in the second stage, DeepSenSe re-ranks those articles. We tested our method using a large number of sentences extracted from real scientific articles in PMC. Our method performed substantially better than PubMed and Google Scholar for sentence queries.
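
The two-stage retrieve-then-re-rank design described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify how BM25 was modified, so the first stage uses the standard Okapi BM25 formula, and `overlap_reranker` is a hypothetical stand-in (plain token overlap) for the trained DeepSenSe model, which is not reproduced here. The names `BM25`, `two_stage_search`, and all parameters are likewise invented for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real systems do more)."""
    return text.lower().split()


class BM25:
    """Standard Okapi BM25, not the modified variant mentioned in the abstract."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lens = [sum(tf.values()) for tf in self.tfs]
        self.avg_len = sum(self.lens) / len(docs)
        n = len(docs)
        # Document frequency: number of documents containing each term.
        df = Counter(term for tf in self.tfs for term in tf)
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query, i):
        """BM25 score of document i for the query."""
        tf = self.tfs[i]
        norm = 1 - self.b + self.b * self.lens[i] / self.avg_len
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + self.k1 * norm)
            for t in tokenize(query)
            if t in tf
        )


def overlap_reranker(query, doc):
    """Hypothetical placeholder for a learned re-ranker: Jaccard token overlap."""
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q | d) if q | d else 0.0


def two_stage_search(query, docs, bm25, reranker, first_stage_k=1000, top_n=10):
    """Return indices of the top_n documents after retrieval and re-ranking."""
    # Stage 1: keep the first_stage_k highest-scoring candidates under BM25.
    candidates = sorted(
        range(len(docs)), key=lambda i: bm25.score(query, i), reverse=True
    )[:first_stage_k]
    # Stage 2: re-order only those candidates with the (more expensive) re-ranker.
    reranked = sorted(candidates, key=lambda i: reranker(query, docs[i]), reverse=True)
    return reranked[:top_n]
```

The point of the split is cost: a cheap lexical scorer narrows the whole collection to a fixed candidate pool (1000 articles in the abstract), so the slower model only has to score that pool. In practice `reranker` would be replaced by a call into a trained model.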

Updated: 2021-10-26