The IBEM dataset: A large printed scientific image dataset for indexing and searching mathematical expressions
Pattern Recognition Letters (IF 3.9), Pub Date: 2023-06-02, DOI: 10.1016/j.patrec.2023.05.033
Dan Anitei, Joan Andreu Sánchez, José Miguel Benedí, Ernesto Noya

Searching for information in printed scientific documents is a challenging problem that has recently received special attention from the Pattern Recognition research community. Mathematical expressions are complex elements that appear in scientific documents, and developing techniques for locating and recognizing them requires datasets that can be used as benchmarks. Most current techniques for dealing with mathematical expressions are based on Machine Learning and require a large amount of annotated data. These datasets must be prepared with ground-truth information for automatic training and testing. However, preparing large datasets with ground truth is a very expensive and time-consuming task. This paper introduces the IBEM dataset, which consists of scientific documents prepared for mathematical expression recognition and searching. The dataset contains 600 documents, comprising more than 8,200 page images and more than 160,000 mathematical expressions. It has been automatically generated from the LaTeX version of the documents and can be enlarged easily. The ground truth includes the position at the page level and the LaTeX transcript for mathematical expressions, both those embedded in the text and those displayed. This paper also reports a baseline classification experiment with mathematical symbols and a baseline Mathematical Expression Recognition experiment performed on the IBEM dataset. These experiments provide benchmarks for comparison, so that future users of the IBEM dataset have a baseline framework.




Updated: 2023-06-07