Semantic Textual Similarity in Japanese Clinical Domain Texts Using BERT
Methods of Information in Medicine (IF 1.3). Pub Date: 2021-07-08, DOI: 10.1055/s-0041-1731390
Faith Wavinya Mutinda 1, Shuntaro Yada 1, Shoko Wakamiya 1, Eiji Aramaki 1

Background Semantic textual similarity (STS) captures the degree of semantic similarity between texts. It plays an important role in many natural language processing applications such as text summarization, question answering, machine translation, information retrieval, dialog systems, plagiarism detection, and query ranking. STS has been widely studied in the general English domain. However, there exist few resources for STS tasks in the clinical domain and in languages other than English, such as Japanese.

Objective The objective of this study is to capture semantic similarity between Japanese clinical texts (Japanese clinical STS) by creating a Japanese dataset that is publicly available.

Materials We created two datasets for Japanese clinical STS: (1) Japanese case reports (CR dataset) and (2) Japanese electronic medical records (EMR dataset). The CR dataset was created from publicly available case reports extracted from the CiNii database. The EMR dataset was created from Japanese electronic medical records.

Methods We used an approach based on bidirectional encoder representations from transformers (BERT) to capture the semantic similarity between the clinical domain texts. BERT is a popular transfer-learning approach that has proven effective at achieving high accuracy even on small datasets. We implemented two Japanese pretrained BERT models: a general Japanese BERT and a clinical Japanese BERT. The general Japanese BERT is pretrained on Japanese Wikipedia texts, while the clinical Japanese BERT is pretrained on Japanese clinical texts.
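In the standard BERT setup for sentence-pair tasks such as STS, the two sentences are packed into one input sequence, [CLS] A [SEP] B [SEP], with segment (token-type) ids distinguishing the pair, and a regression head on the [CLS] representation predicts the similarity score. A minimal sketch of the pair packing, using a toy whitespace tokenizer (real Japanese BERT models tokenize with a morphological analyzer plus a subword vocabulary):

```python
def pack_sentence_pair(sent_a: str, sent_b: str):
    """Build a BERT-style paired input: tokens plus segment (token-type) ids."""
    tokens_a = sent_a.split()  # toy tokenizer; real models use a subword vocab
    tokens_b = sent_b.split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment ids: 0 covers [CLS], sentence A, and the first [SEP]; 1 covers B.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segs = pack_sentence_pair("fever for three days", "three day fever")
```

The only task-specific addition on top of the pretrained encoder is then a single linear output unit trained with a regression loss against the gold similarity scores.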

Results The BERT models performed well in capturing semantic similarity in our datasets. The general Japanese BERT outperformed the clinical Japanese BERT and achieved a high correlation with human scores (0.904 on the CR dataset and 0.875 on the EMR dataset). It was unexpected that the general Japanese BERT outperformed the clinical Japanese BERT on clinical-domain datasets. This could be because the general Japanese BERT is pretrained on a wider range of texts than the clinical Japanese BERT.
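STS performance is conventionally reported as the correlation between model scores and human gold scores; assuming the figures above are Pearson correlations (the usual STS evaluation metric), the computation can be sketched as:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical model scores vs. human gold scores on a 0-5 STS scale.
human = [0.0, 1.0, 2.5, 4.0, 5.0]
model = [0.2, 0.8, 2.9, 3.7, 4.9]
print(round(pearson(model, human), 3))
```

A value near 1.0 means the model ranks and scales sentence pairs almost exactly as human annotators do, which is how the reported 0.904 and 0.875 should be read.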




Updated: 2021-07-09