Pattern Recognition Letters (IF 3.9), Pub Date: 2023-06-02, DOI: 10.1016/j.patrec.2023.05.033
Dan Anitei, Joan Andreu Sánchez, José Miguel Benedí, Ernesto Noya
Searching for information in printed scientific documents is a challenging problem that has recently received special attention from the pattern recognition research community. Mathematical expressions are complex elements of scientific documents, and developing techniques for locating and recognizing them requires datasets that can serve as benchmarks. Most current techniques for handling mathematical expressions rely on machine learning, which requires a large amount of annotated data. These datasets must include ground-truth information for automatic training and testing. However, preparing large datasets with ground truth is very expensive and time-consuming. This paper introduces the IBEM dataset, consisting of scientific documents that have been prepared for mathematical expression recognition and searching. This dataset consists of 600 documents, more than page images with more than mathematical expressions. It has been automatically generated from the


The IBEM dataset: a large printed scientific image dataset for indexing and searching mathematical expressions

