Approaches to improve preprocessing for Latent Dirichlet Allocation topic modeling
Decision Support Systems ( IF 6.7 ) Pub Date : 2024-08-27 , DOI: 10.1016/j.dss.2024.114310
Jamie Zimmermann , Lance E. Champagne , John M. Dickens , Benjamin T. Hazen

As a part of natural language processing (NLP), topic modeling aims to identify topics in textual corpora with limited human input. Current topic modeling techniques, such as Latent Dirichlet Allocation (LDA), are limited by their preprocessing steps, which require human judgment and thereby increase analysis time and opportunities for error. The purpose of this research is to allay some of those limitations by introducing new approaches that improve coherence without adding computational complexity, and by providing an objective method for determining the number of topics in a corpus. First, we identify the need for a more robust stop-word list and introduce a new dimensionality-reduction heuristic that exploits the number of words in a document to infer the importance of word choice. Second, we develop an eigenvalue technique for determining the number of topics in a corpus. Third, we combine these techniques into the Zimm Approach, which produces higher-quality results than LDA in determining the number of topics in a corpus. When tested against various subsets of the 20newsgroup dataset, the Zimm Approach produced the correct number of topics in 7 of 9 subsets, versus 0 of 9 when selecting the topic count that maximizes LDA coherence.
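The abstract does not specify the paper's eigenvalue criterion, but the underlying idea — inferring the number of topics from the dominant spectrum of the corpus matrix, analogous to a PCA scree test — can be sketched as follows. The function name, the use of singular values of the document-term matrix (whose squares are the eigenvalues of the term co-occurrence matrix), and the energy threshold are all illustrative assumptions, not the authors' method:

```python
import numpy as np

def estimate_num_topics(doc_term_matrix, energy_threshold=0.85):
    """Illustrative sketch: estimate the number of topics as the number of
    singular values needed to capture a given fraction of the spectral
    energy of the document-term matrix. The actual eigenvalue criterion
    used in the paper is not described in the abstract."""
    # Singular values of X; their squares are the eigenvalues of X^T X,
    # the term co-occurrence matrix.
    s = np.linalg.svd(doc_term_matrix, compute_uv=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Smallest k such that the top-k components exceed the threshold.
    return int(np.searchsorted(energy, energy_threshold) + 1)

# Toy corpus: 6 documents over 6 terms, drawn from two disjoint
# vocabularies, i.e. two well-separated "topics".
X = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [1, 2, 3, 0, 0, 0],
    [0, 0, 0, 3, 2, 1],
    [0, 0, 0, 2, 3, 2],
    [0, 0, 0, 1, 2, 3],
], dtype=float)

print(estimate_num_topics(X))  # → 2
```

On this block-diagonal toy matrix the two dominant singular values carry most of the spectral energy, so the estimate recovers the two underlying topics without any human-chosen topic count.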
