Pre-training Methods in Information Retrieval
Foundations and Trends in Information Retrieval (IF 8.3). Pub Date: 2022-08-17. DOI: 10.1561/1500000100
Yixing Fan, Xiaohui Xie, Yinqiong Cai, Jia Chen, Xinyu Ma, Xiangsheng Li, Ruqing Zhang, Jiafeng Guo
The core of information retrieval (IR) is to identify relevant information from large-scale resources and return it as a ranked list in response to a user's information need. In recent years, the resurgence of deep learning has greatly advanced this field and led to a hot topic named NeuIR (i.e., neural information retrieval), especially the paradigm of pre-training methods (PTMs). Owing to sophisticated pre-training objectives and large model sizes, pre-trained models can learn universal language representations from massive textual data, which benefit the ranking task of IR. Recently, a large number of works dedicated to the application of PTMs in IR have been introduced to improve retrieval performance. Considering the rapid progress in this direction, this survey aims to provide a systematic review of pre-training methods in IR. Specifically, we present an overview of PTMs applied in different components of an IR system, including the retrieval component, the re-ranking component, and other components. In addition, we introduce PTMs specifically designed for IR, and summarize available datasets as well as benchmark leaderboards. Moreover, we discuss some open challenges and highlight several promising directions, with the hope of inspiring and facilitating more work on these topics in future research.
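The retrieval and re-ranking components mentioned above can be illustrated with a toy two-stage pipeline. This is only a minimal sketch of the general "retrieve then re-rank" architecture the survey organizes its review around; the function names and the simple term-overlap scoring here are illustrative assumptions, not from the survey — real systems would use, e.g., BM25 or a PTM-based bi-encoder for stage one and a PTM-based cross-encoder for stage two.

```python
def retrieve(query, corpus, k=3):
    """Stage 1: cheap, recall-oriented scoring over the whole corpus.
    Term overlap stands in for BM25 / dense retrieval (assumed scorer)."""
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(doc.lower().split())), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep only the top-k candidates that matched at least one query term.
    return [doc for score, doc in scored[:k] if score > 0]

def rerank(query, candidates):
    """Stage 2: a more expensive, precision-oriented scorer applied only
    to the small candidate set. Length-normalized overlap stands in for a
    PTM cross-encoder relevance model (assumed scorer)."""
    q_terms = set(query.lower().split())
    def score(doc):
        terms = doc.lower().split()
        return len(q_terms & set(terms)) / (1 + len(terms))
    return sorted(candidates, key=score, reverse=True)

corpus = [
    "neural information retrieval with pre-trained models",
    "pre-trained language models learn universal representations",
    "a history of library card catalogues",
    "ranking documents for information retrieval",
]
query = "pre-trained models for information retrieval"
candidates = retrieve(query, corpus)   # fast first pass over all documents
ranking = rerank(query, candidates)    # slow second pass over few candidates
```

The design point is the cost asymmetry: the first stage must scan the entire collection, so it stays cheap; the second stage sees only a handful of candidates, so it can afford a heavier PTM-based model.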