Egyptian Informatics Journal (IF 5.0), Pub Date: 2022-02-22, DOI: 10.1016/j.eij.2022.02.006
Hamood Alshalabi 1,2, Sabrina Tiun 1, Nazlia Omar 1, Elham Abdulwahab Anaam 1, Yazid Saif 3
Stemming is one of the most important phases in text processing: it groups all variants of a word under a single form to aid natural language processing. Arabic morphology is more complex than English morphology, so effective Arabic stemmers require more sophisticated algorithms. One challenge is the irregular broken plural, a long-standing problem in Arabic natural language processing that degrades the performance of Arabic information retrieval and other Arabic language engineering applications. Several studies have attempted to solve the irregular plural problem, but it remains open, especially when it comes to extracting correct Arabic roots. This paper proposes the broken plural rule (BPR) algorithm, which targets cases where existing root-based methods fail to extract the correct root with their own rules. The BPR algorithm introduces several rules (main rules and subrules) for extracting the correct roots of Arabic irregular broken plural words. To evaluate its effectiveness, we extracted roots from a standard Arabic dataset and applied the BPR algorithm as an enhancement to a root-based Arabic stemmer, ISRI. Both evaluations gave encouraging results: (i) only a small number of words in the large Arabic word dataset were stemmed to incorrect roots; (ii) the enhanced root-based stemmer, ISRI + BPR, outperformed both the original ISRI stemmer and a well-known Arabic stemmer, ARLS 2. Thus, the proposed BPR algorithm solves some of the irregular broken plural problems and thereby improves the performance of a root-based Arabic stemmer.
Title: BPR algorithm: New broken plural rules for an Arabic stemmer
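The abstract describes rules that recover the root of a broken plural and are applied in front of a root-based stemmer. The sketch below illustrates that general idea only. It is not the BPR algorithm: the three templates are common textbook broken plural patterns chosen for illustration, not the paper's main rules or subrules, and the names `TOY_PATTERNS`, `toy_broken_plural_root`, and `stem_with_plural_rules` are hypothetical. It assumes undiacritized input and will overgenerate on words that merely look like these patterns.

```python
from typing import Callable, Optional

# Illustrative templates only; NOT the rules from the paper.
# Each entry: (pattern name, word length, {index: required letter}, root indices)
TOY_PATTERNS = [
    ("af'al", 5, {0: "أ", 3: "ا"}, (1, 2, 4)),  # e.g. أقلام (pens)
    ("fu'ul", 4, {2: "و"}, (0, 1, 3)),          # e.g. قلوب (hearts)
    ("fi'al", 4, {2: "ا"}, (0, 1, 3)),          # e.g. رجال (men)
]


def toy_broken_plural_root(word: str) -> Optional[str]:
    """Return a three-letter root if the word fits a toy template, else None."""
    for _name, length, fixed, root_indices in TOY_PATTERNS:
        if len(word) != length:
            continue
        # Every fixed (non-root) letter of the template must be present.
        if all(word[i] == letter for i, letter in fixed.items()):
            return "".join(word[i] for i in root_indices)
    return None


def stem_with_plural_rules(word: str, fallback: Callable[[str], str]) -> str:
    """Try the plural templates first, then defer to a root-based stemmer."""
    root = toy_broken_plural_root(word)
    return root if root is not None else fallback(word)


if __name__ == "__main__":
    # Identity function stands in for a real root-based stemmer.
    for w in ["أقلام", "قلوب", "رجال", "في"]:
        print(w, stem_with_plural_rules(w, lambda x: x))
```

The `fallback` argument is where a root-based stemmer would plug in; for example, the `stem` method of NLTK's `ISRIStemmer` could be passed there if NLTK is installed. Running the rules before the fallback mirrors the "ISRI + BPR" arrangement only loosely, since the abstract does not specify how the two are combined.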