Communication Methods and Measures (IF 6.3), Pub Date: 2022-12-12, DOI: 10.1080/19312458.2022.2151579
Christina Viehmann 1, Tilman Beck 2, Marcus Maurer 1, Oliver Quiring 1, Iryna Gurevych 2
ABSTRACT
Supervised machine learning (SML) provides tools to efficiently scrutinize large corpora of communication texts. Yet setting up such a tool involves many decisions, from the data needed for training to the selection of an algorithm and the details of model training. We aim to establish a firm link between communication research tasks and the corresponding state of the art in natural language processing research by systematically comparing the performance of different automatic text analysis approaches. We do this for a challenging task: detecting the stance of opinions voiced on Twitter about policy measures to tackle the COVID-19 pandemic in Germany. Our results add evidence that pre-trained language models such as BERT outperform feature-based and other neural network approaches. Yet the achievable gains differ greatly depending on the specific merits of pre-training (i.e., on which language model is used). To add to the robustness of our conclusions, we run a generalizability check with a use case that differs in language and topic. Additionally, we illustrate how the amount and quality of training data affect model performance, pointing to potential compensation effects. Based on our results, we derive practical recommendations for setting up such SML tools to study communication texts.
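To make the term "feature-based approach" concrete, the following is a minimal sketch of a bag-of-words stance classifier, assuming a multinomial Naive Bayes model with add-one smoothing. It is not the authors' pipeline: the abstract does not name the feature-based algorithms that were compared, the example sentences and the labels "support" and "oppose" are invented for illustration, and the study itself used German tweets. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a real pipeline would do more."""
    return text.lower().split()


def train_naive_bayes(examples):
    """Count documents per label and word occurrences per label."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior: share of training documents carrying this label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # words never seen in training carry no signal
            count = word_counts[label][token]
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data (assumption): not taken from the study's corpus.
TOY_TRAINING_DATA = [
    ("masks protect everyone and the mandate is sensible", "support"),
    ("the lockdown was necessary to save lives", "support"),
    ("school closures are a sensible precaution", "support"),
    ("the mandate is an outrageous overreach", "oppose"),
    ("end the lockdown now it ruins small businesses", "oppose"),
    ("school closures harm children and must stop", "oppose"),
]

model = train_naive_bayes(TOY_TRAINING_DATA)
prediction = predict(model, "the mandate is sensible and will save lives")
print(prediction)
```

A model of this kind sees only word counts, so it cannot use word order or context. That limitation is one plausible reason why pre-trained language models such as BERT do better on stance detection, as the abstract reports.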
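Title: Investigating opinions on public policies in digital media: Setting up a supervised machine learning tool for stance classification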