Analyzing the unrestricted web: The Finnish Corpus of Online Registers
Nordic Journal of Linguistics (IF 0.5), Pub Date: 2023-03-13, DOI: 10.1017/s0332586523000021
Valtteri Skantsi, Veronika Laippala

This article introduces the Finnish Corpus of Online Registers (FinCORE), which represents the full range of registers – situationally defined text varieties such as news and blogs – on the Finnish Internet. The extreme range of language use found online has challenged the study of registers. It has been unclear which registers the entire Internet includes and whether they can be defined well enough to allow for their analysis or classification, as previous studies have focused on restricted sets of registers and on English. FinCORE features 10,754 texts from the unrestricted web, manually annotated for their register using a scheme originally established for the Corpus of Online Registers of English (CORE). We present the FinCORE registers and compare them to CORE. Finally, we show that the FinCORE registers are sufficiently well-defined to allow for their automatic identification, thus opening novel possibilities for both linguistics and web-as-corpus research. FinCORE is published under an open license.


