Large Language Models Outperform Expert Coders and Supervised Classifiers at Annotating Political Social Media Messages
Social Science Computer Review (IF 3.0). Pub Date: 2024-09-23. DOI: 10.1177/08944393241286471
Petter Törnberg

Instruction-tuned Large Language Models (LLMs) have recently emerged as a powerful new tool for text analysis. Because these models are capable of zero-shot annotation based on instructions written in natural language, they obviate the need for large sets of training data, and thus carry potentially paradigm-shifting implications for using text as data. While the models show substantial promise, their performance relative to human coders and supervised models remains poorly understood and is the subject of significant academic debate. This paper assesses the strengths and weaknesses of popular instruction-tuned AI models compared to both conventional supervised classifiers and manual annotation by experts and crowd workers. The task is to identify a politician's political affiliation from a single X/Twitter message, using data from 11 different countries. The paper finds that GPT-4 achieves higher accuracy than both supervised models and human coders across all languages and country contexts. In the US context, it achieves an accuracy of 0.934 and an inter-coder reliability of 0.982. Examining the cases where the models fail, the paper finds that the LLM, unlike the supervised models, correctly annotates messages that require interpreting implicit or unspoken references, or reasoning on the basis of contextual knowledge: capacities that have traditionally been understood to be distinctly human. The paper thus contributes to our understanding of the revolutionary implications of LLMs for text analysis within the social sciences.
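
To make the zero-shot setup concrete, the sketch below shows one way such an annotation call could look with the OpenAI Python SDK. The prompt wording, the two-party label set, and the example message are illustrative assumptions for the US case; they are not the actual instructions or data used in the paper.

```python
# Minimal sketch of zero-shot annotation with an instruction-tuned LLM.
# The prompt, label set, and example tweet are illustrative assumptions,
# not the paper's actual instructions or data.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You will read a single X/Twitter message written by a US politician. "
    "Classify the author's most likely political affiliation. "
    "Answer with exactly one word: Democrat or Republican."
)

def annotate(message: str) -> str:
    """Return the model's one-word party label for one message."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # reduce sampling variability for annotation
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": message},
        ],
    )
    return response.choices[0].message.content.strip()

# Hypothetical example message, for illustration only.
print(annotate("We must protect the Second Amendment and secure our border."))
```

No labeled training examples appear anywhere in the call: the classification criteria live entirely in the natural-language instruction, which is what distinguishes this setup from a supervised classifier.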

Updated: 2024-09-23