Lifting the Veil on the Use of Big Data News Repositories: A Documentation and Critical Discussion of A Protest Event Analysis, Communication Methods and Measures



Communication Methods and Measures (IF 11.4), Pub Date: 2022-09-28, DOI: 10.1080/19312458.2022.2128099
Matthias Hoffmann, Felipe G. Santos, Christina Neumayer, Dan Mercea

ABSTRACT

This paper presents a critical discussion of the processing, reliability and implications of free big data repositories. We argue that big data is not only the starting point of scientific analyses but also the outcome of a long string of invisible or semi-visible tasks, often masked by the fetish of size that supposedly lends validity to big data. We unpack these notions by illustrating the process of extracting protest event data from the Global Database of Events, Language and Tone (GDELT) for six European countries over a period of seven years. To produce data that stand up to rigorous scientific scrutiny, we collected additional data by computational means and undertook large-scale neural-network translation tasks, dictionary-based content analyses, machine-learning classification tasks, and human coding. In documenting and critically discussing this process, we render visible the opaque procedures that inevitably shape any dataset and show how this type of freely available dataset requires significant additional resources of knowledge, labor, money, and computational power. We conclude that while these processes can ultimately yield more valid datasets, the supposedly free and ready-to-use big news data repositories should not be taken at face value.

Last updated: 2022-09-28
