Machine Translation vs. Multilingual Dictionaries: Assessing Two Strategies for the Topic Modeling of Multilingual Text Collections
Communication Methods and Measures (IF 11.4), Pub Date: 2021-08-17, DOI: 10.1080/19312458.2021.1955845
Daniel Maier 1, Christian Baden 2, Daniela Stoltenberg 1, Maya De Vries-Kedem 3, Annie Waldherr 4
ABSTRACT

The goal of this paper is to evaluate two methods for the topic modeling of multilingual document collections: (1) machine translation (MT), and (2) the coding of semantic concepts using a multilingual dictionary (MD) prior to topic modeling. We empirically assess the consequences of these approaches based on both a quantitative comparison of models and a qualitative validation of each method’s potentials and weaknesses. Our case study uses two text collections (of tweets and news articles) in three languages (English, Hebrew, Arabic), covering the ongoing local conflicts between Israeli authorities, settlers, and Palestinian Bedouins in the West Bank. We find that both methods produce a large share of equivalent topics, especially in the context of fairly homogeneous news discourse, yet show limited but systematic differences when applied to highly heterogeneous social media discourse. While the MD model delivers a more nuanced picture of conflict-related topics, it misses several more peripheral topics, especially those unrelated to the dictionary’s focus, which are picked up by the MT model. Our study is a first step toward instrument validation, indicating that both methods yield valid, comparable results, while method-specific differences remain.
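
To make the MD strategy concrete, the following is a minimal sketch of dictionary-based concept coding, not the authors' implementation. The dictionary entries, the concept IDs, and the function names `build_lookup` and `code_document` are all hypothetical and chosen for illustration only. The idea it shows is that terms in each language are mapped to shared concept IDs, so that documents in different languages end up in one common vocabulary before any topic model is fitted.

```python
import re
from collections import Counter

# Hypothetical toy dictionary: each concept ID lists surface terms per language.
# The entries are illustrative and are not taken from the paper's dictionary.
CONCEPTS = {
    "CONCEPT_DEMOLITION": {
        "en": ["demolition", "demolish"],
        "he": ["הריסה"],
        "ar": ["هدم"],
    },
    "CONCEPT_VILLAGE": {
        "en": ["village"],
        "he": ["כפר"],
        "ar": ["قرية"],
    },
}


def build_lookup(concepts):
    """Flatten the dictionary into {language: {term: concept_id}}."""
    lookup = {}
    for concept_id, terms_by_lang in concepts.items():
        for lang, terms in terms_by_lang.items():
            for term in terms:
                lookup.setdefault(lang, {})[term.lower()] = concept_id
    return lookup


def code_document(text, lang, lookup):
    """Return counts of the concepts matched by the document's tokens.

    Tokens without a dictionary entry are dropped, which is an assumption
    of this sketch rather than a documented step of the paper.
    """
    tokens = re.findall(r"\w+", text.lower())
    terms = lookup.get(lang, {})
    return Counter(terms[token] for token in tokens if token in terms)
```

Under these assumptions, an English, a Hebrew, and an Arabic document that mention the same two concepts are coded identically, so a topic model fitted on the concept counts would treat them as comparable. A document whose vocabulary lies outside the dictionary is coded as empty, which is consistent with the abstract's observation that the MD model misses topics unrelated to the dictionary's focus.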




Updated: 2021-08-17