Language Resources and Evaluation (IF 1.7), Pub Date: 2023-03-31, DOI: 10.1007/s10579-022-09631-2
Natalie Weber, Tyler Brown, Joshua Celli, McKenzie Denham, Hailey Dykstra, Rodrigo Hernandez-Merlin, Evan Hochstein, Pinyu Hwang, Nico Kidd, Diana Kulmizev, Hannah Morrison, Matty Norris, Lena Venkatraman
This paper describes the structure and creation of Blackfoot Words, a new relational database of lexical forms (inflected words, stems, and morphemes) in Blackfoot (Algonquian; ISO 639-3: bla). To date, we have digitized 63,493 individual lexical forms from 30 sources, representing all four major dialects, and spanning the years 1743–2017. Version 1.1 of the database includes lexical forms from nine of these sources. This project has two aims. The first is to digitize and provide access to the lexical data in these sources, many of which are difficult to access and discover. The second is to organize the data so that connections can be made between instances of the “same” lexical form across all sources, despite variation across sources in the dialect recorded, orthographic conventions, and the depth of morpheme analysis. The database structure was developed in response to these aims. The database comprises five tables: Sources, Words, Stems, Morphemes, and Lemmas. The Sources table contains bibliographic information and commentary on the sources. The Words table contains inflected words in the source orthography. Each word is broken down into stems and morphemes which are entered into the Stems and Morphemes tables in the source orthography. The Lemmas table contains abstract versions of each stem or morpheme in a standardized orthography. Instances of the same stem or morpheme are linked to a common lemma. We expect that the database will support projects by the language community and other researchers.
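The five-table layout described in the abstract can be illustrated with a small sketch. The table names follow the abstract, but every column name, the foreign-key layout, and the placeholder rows are assumptions made for illustration; this is not the project's actual schema or data. The sketch shows how linking stems from different sources to a shared lemma would let one query gather all recorded spellings of the "same" stem.

```python
import sqlite3

# Hypothetical schema: table names come from the abstract,
# all column names and relationships are assumptions.
SCHEMA = """
CREATE TABLE sources (
    source_id INTEGER PRIMARY KEY,
    citation  TEXT NOT NULL,   -- bibliographic information
    notes     TEXT             -- commentary on the source
);
CREATE TABLE lemmas (
    lemma_id INTEGER PRIMARY KEY,
    form     TEXT NOT NULL     -- standardized orthography
);
CREATE TABLE words (
    word_id   INTEGER PRIMARY KEY,
    source_id INTEGER NOT NULL REFERENCES sources(source_id),
    form      TEXT NOT NULL    -- inflected word, source orthography
);
CREATE TABLE stems (
    stem_id  INTEGER PRIMARY KEY,
    word_id  INTEGER NOT NULL REFERENCES words(word_id),
    lemma_id INTEGER REFERENCES lemmas(lemma_id),
    form     TEXT NOT NULL     -- source orthography
);
CREATE TABLE morphemes (
    morpheme_id INTEGER PRIMARY KEY,
    word_id     INTEGER NOT NULL REFERENCES words(word_id),
    lemma_id    INTEGER REFERENCES lemmas(lemma_id),
    form        TEXT NOT NULL  -- source orthography
);
"""


def build_demo() -> sqlite3.Connection:
    """Create an in-memory database filled with placeholder rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO sources VALUES (?, ?, ?)",
        [(1, "Source A (placeholder)", None), (2, "Source B (placeholder)", None)],
    )
    conn.execute("INSERT INTO lemmas VALUES (1, 'LEMMA-X')")
    conn.executemany(
        "INSERT INTO words VALUES (?, ?, ?)",
        [(1, 1, "word-in-A"), (2, 2, "word-in-B")],
    )
    # Two differently spelled stems from two sources point at the same lemma.
    conn.executemany(
        "INSERT INTO stems VALUES (?, ?, ?, ?)",
        [(1, 1, 1, "stem-spelling-A"), (2, 2, 1, "stem-spelling-B")],
    )
    conn.commit()
    return conn


def stem_instances(conn: sqlite3.Connection, lemma_id: int) -> list:
    """Return (stem form, source citation) pairs for every stem linked to a lemma."""
    cursor = conn.execute(
        """
        SELECT s.form, src.citation
        FROM stems AS s
        JOIN words AS w ON w.word_id = s.word_id
        JOIN sources AS src ON src.source_id = w.source_id
        WHERE s.lemma_id = ?
        ORDER BY src.source_id
        """,
        (lemma_id,),
    )
    return cursor.fetchall()
```

Calling `stem_instances(build_demo(), 1)` returns both placeholder spellings together with their sources. Keeping source-orthography forms in Words, Stems, and Morphemes and the standardized form only in Lemmas is what would let variation in dialect and spelling coexist with a single point of comparison.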
Blackfoot Words: a database of Blackfoot lexical forms