Enhancing Theory-Informed Dictionary Approaches with “Glass-box” Machine Learning: The Case of Integrative Complexity in Social Media Comments
Communication Methods and Measures (IF 11.4), Pub Date: 2021-11-17, DOI: 10.1080/19312458.2021.1999913
Timo Dobbrick 1, Julia Jakob 1, Chung-Hong Chan 1, Hartmut Wessler 2

ABSTRACT

Dictionary-based approaches to computational text analysis have been shown to perform relatively poorly, particularly when the dictionaries rely on simple bags of words, are not specified for the domain under study, and add word scores without weighting. While machine learning approaches usually perform better, they offer little insight into (a) which of the assumptions underlying dictionary approaches (bag-of-words, domain transferability, or additivity) impedes performance most, and (b) which language features drive the algorithmic classification most strongly. To fill both gaps, we offer a systematic assumption-based error analysis, using the integrative complexity of social media comments as our case in point. We show that attacking the additivity assumption offers the strongest potential for improving dictionary performance. We also propose to combine off-the-shelf dictionaries with supervised “glass box” machine learning algorithms (as opposed to the usual “black box” machine learning approaches) to classify texts and learn about the most important features for classification. This dictionary-plus-supervised-learning approach performs about as well as classic full-text machine learning or deep learning approaches, but additionally yields interpretable results that can inform theory development while still enabling a valid classification.
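
The contrast between additive dictionary scoring and a dictionary-plus-supervised-learning setup can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-category toy dictionary, the six invented comments and their labels, and the choice of a hand-written logistic regression as a stand-in for a "glass box" learner. It is not the authors' dictionary, data, or algorithm; the abstract does not specify those.

```python
import math
import re

# Hypothetical toy dictionary (category -> words); not a published dictionary.
TOY_DICTIONARY = {
    "differentiation": {"but", "however", "although", "alternatively"},
    "integration": {"therefore", "because", "together", "tradeoff"},
    "certainty": {"always", "never", "obviously", "definitely"},
}
CATEGORIES = sorted(TOY_DICTIONARY)  # fixed feature order


def category_counts(text):
    """Bag-of-words features: number of dictionary hits per category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [sum(tok in TOY_DICTIONARY[cat] for tok in tokens) for cat in CATEGORIES]


def additive_score(text):
    """Classic dictionary scoring: add all hits with equal weight."""
    return sum(category_counts(text))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent for logistic regression (one weight per category)."""
    weights, bias, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, xi))) - yi
            grad_b += err
            for j, v in enumerate(xi):
                grad_w[j] += err * v
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


def predict(text, weights, bias):
    """Return 1 if the weighted category counts suggest the 'complex' class."""
    x = category_counts(text)
    return int(sigmoid(bias + sum(w * v for w, v in zip(weights, x))) >= 0.5)


# Invented comments with invented labels: 1 = more complex, 0 = less complex.
TRAIN = [
    ("I like it, but the cost is high; however, it pays off because we share it", 1),
    ("Although risky, it works because both sides gain together", 1),
    ("There is a tradeoff; however, I see why, therefore I agree", 1),
    ("This is always wrong, obviously", 0),
    ("They never listen, definitely never", 0),
    ("Obviously bad, always bad", 0),
]

X = [category_counts(text) for text, _ in TRAIN]
y = [label for _, label in TRAIN]
weights, bias = fit_logistic(X, y)

# The learned weights are directly readable, one per dictionary category.
learned = dict(zip(CATEGORIES, weights))
for cat in CATEGORIES:
    print(f"{cat:16s} weight = {learned[cat]:+.2f}")
```

In this toy setup the first and fifth comments receive the same unweighted additive score even though their labels differ, whereas the learned per-category weights separate them and can be inspected directly. In practice one would more likely rely on an established library implementation of an interpretable model than on a hand-rolled optimizer.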




Updated: 2021-11-17