T-LLaMA: a Tibetan large language model based on LLaMA2
Complex & Intelligent Systems (IF 5.0), Pub Date: 2024-12-19, DOI: 10.1007/s40747-024-01641-7
Hui Lv, Chi Pu, La Duo, Yan Li, Qingguo Zhou, Jun Shen
The advent of ChatGPT and GPT-4 has generated substantial interest in large language model (LLM) research, showcasing remarkable performance in applications such as conversation systems, machine translation, and research-paper summarization. However, their efficacy diminishes when applied to low-resource languages such as Tibetan, particularly in academic research contexts. In this study, we trained Tibetan LLaMA (T-LLaMA), a model built on efficient pre-training techniques, for three downstream tasks: text classification, news text generation, and automatic text summarization. To address the scarcity of Tibetan corpora, we constructed a Tibetan dataset comprising 2.2 billion characters. Furthermore, we expanded the vocabulary of Meta AI's LLaMA2 with Tibetan tokens learned via SentencePiece. Notably, the text classification task attains a state-of-the-art (SOTA) accuracy of 79.8% on the publicly available Tibetan News Classification Corpus. In addition, manual review of 500 generated samples indicates satisfactory results on both the news text generation and text summarization tasks. To our knowledge, T-LLaMA is the first large-scale language model for Tibetan natural language processing (NLP) with parameters in the billion range. We openly release our trained models, anticipating that this contribution not only fills a gap in the Tibetan LLM domain but also provides foundation models for researchers with limited computational resources in the Tibetan NLP community. The T-LLaMA model is available at https://huggingface.co/Pagewood/T-LLaMA.
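To make the vocabulary-expansion step concrete, the sketch below shows one common way to merge a newly trained Tibetan SentencePiece model into LLaMA2's tokenizer (the approach popularized by projects such as Chinese-LLaMA). This is a hypothetical reconstruction: the file names, vocabulary size, and training options are illustrative assumptions, not the authors' actual settings.

```python
# Hypothetical sketch of Tibetan vocabulary expansion for the LLaMA2 tokenizer.
# Assumes sentencepiece, protobuf, and transformers are installed, and that
# tibetan_corpus.txt holds the raw Tibetan training text (illustrative path).
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2
from transformers import LlamaTokenizer

# 1) Train a Tibetan SentencePiece model on the corpus (sizes are illustrative).
spm.SentencePieceTrainer.train(
    input="tibetan_corpus.txt",
    model_prefix="tibetan_sp",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,
)

# 2) Load LLaMA2's tokenizer model and the new Tibetan model as protobufs.
llama_tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama_proto = sp_pb2.ModelProto()
llama_proto.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())

tibetan_proto = sp_pb2.ModelProto()
with open("tibetan_sp.model", "rb") as f:
    tibetan_proto.ParseFromString(f.read())

# 3) Append Tibetan pieces that LLaMA2 does not already contain.
existing_pieces = {p.piece for p in llama_proto.pieces}
for piece in tibetan_proto.pieces:
    if piece.piece not in existing_pieces:
        new_piece = sp_pb2.ModelProto.SentencePiece()
        new_piece.piece = piece.piece
        new_piece.score = 0.0
        llama_proto.pieces.append(new_piece)

with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_proto.SerializeToString())
```

After such a merge, the model's token embedding matrix would need to be resized to the expanded vocabulary (e.g. with `model.resize_token_embeddings(len(tokenizer))` in transformers) before continued pre-training on the Tibetan corpus.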
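The released checkpoint can presumably be loaded with the standard Hugging Face LLaMA classes, as in the minimal sketch below. The repository id comes from the link above; the generation settings and the placeholder prompt are assumptions, not the authors' documented usage.

```python
# Minimal usage sketch for the released T-LLaMA checkpoint (assumes it is a
# standard LLaMA-architecture model; requires torch, transformers, accelerate).
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("Pagewood/T-LLaMA")
model = LlamaForCausalLM.from_pretrained(
    "Pagewood/T-LLaMA",
    torch_dtype=torch.float16,
    device_map="auto",  # spread layers across available devices
)

prompt = "..."  # replace with a Tibetan input, e.g. a news text to classify
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```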