The IBEM dataset: A large printed scientific image dataset for indexing and searching mathematical expressions, Pattern Recognition Letters



The IBEM dataset: A large printed scientific image dataset for indexing and searching mathematical expressions
Pattern Recognition Letters (IF 3.9), Pub Date: 2023-06-02, DOI: 10.1016/j.patrec.2023.05.033
Dan Anitei, Joan Andreu Sánchez, José Miguel Benedí, Ernesto Noya

Searching for information in printed scientific documents is a challenging problem that has recently received special attention from the Pattern Recognition research community. Mathematical expressions are complex elements that appear in scientific documents, and developing techniques for locating and recognizing them requires the preparation of datasets that can be used as benchmarks. Most current techniques for dealing with mathematical expressions are based on Machine Learning, which requires a large amount of annotated data. These datasets must be prepared with ground-truth information for automatic training and testing. However, preparing large datasets with ground-truth is a very expensive and time-consuming task. This paper introduces the IBEM dataset, consisting of scientific documents that have been prepared for mathematical expression recognition and searching. The dataset consists of 600 documents comprising more than 8,200 page images with more than 160,000 mathematical expressions. It has been automatically generated from the LaTeX version of the documents and can be enlarged easily. The ground-truth includes the position at the page level and the LaTeX transcript for mathematical expressions, both embedded in the text and displayed. This paper also reports a baseline classification experiment with mathematical symbols and a baseline experiment of Mathematical Expression Recognition performed on the IBEM dataset. These experiments aim to provide benchmarks for comparison purposes so that future users of the IBEM dataset have a baseline framework.
