Language model-based automatic prefix abbreviation expansion method for biomedical big data analysis, Future Generation Computer Systems

Language model-based automatic prefix abbreviation expansion method for biomedical big data analysis
Future Generation Computer Systems (IF 6.2), Pub Date: 2019-03-28, DOI: 10.1016/j.future.2019.01.016
Xiaokun Du, Rongbo Zhu, Yanhong Li, Ashiq Anjum

In the biomedical domain, abbreviations appear more and more frequently in various data sets, which poses a significant obstacle to biomedical big data analysis. Dictionary-based approaches have been adopted to process abbreviations, but they cannot handle ad hoc abbreviations, and no dictionary can cover all abbreviations. To overcome these drawbacks, this paper proposes an automatic abbreviation expansion method called LMAAE (Language Model-based Automatic Abbreviation Expansion). In this method, the abbreviation is first divided into blocks; then expansion candidates are generated by restoring each block; finally, the expansion candidates are filtered and clustered, according to a language model and a clustering method, to obtain the final expansion result. Restricting the abbreviations to prefix abbreviations sharply reduces the search space of expansions. The search space is then further reduced by constraining the validity and the length of the partitions. To validate the effectiveness of the method, two types of experiments are designed. For standard abbreviations, the expansion results include most of the expansions in the dictionary, so the method has high precision. For ad hoc abbreviations, the precision of schema matching and knowledge fusion increases when this method is used to handle the abbreviations. Although the recall for standard abbreviations needs to be improved, this does not diminish the method's value as a complement to the dictionary method.
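The pipeline described in the abstract (partition the abbreviation into blocks, restore each block to candidate words, then rank the candidates) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the vocabulary `VOCAB`, its frequencies, and the helper names `partitions`, `restore`, and `expand` are hypothetical, a simple product of unigram frequencies stands in for the language model, and the clustering step is omitted.

```python
from itertools import product

# Hypothetical toy vocabulary with made-up relative frequencies (not from the paper).
VOCAB = {
    "blood": 0.30,
    "pressure": 0.25,
    "press": 0.05,
    "patient": 0.20,
    "identifier": 0.10,
    "identity": 0.10,
}


def partitions(abbr, max_blocks=3):
    """Yield every split of abbr into 1..max_blocks contiguous, non-empty blocks."""
    def split(rest, blocks_left):
        if not rest:
            yield []
            return
        if blocks_left == 0:
            return
        for i in range(1, len(rest) + 1):
            for tail in split(rest[i:], blocks_left - 1):
                yield [rest[:i]] + tail
    yield from split(abbr.lower(), max_blocks)


def restore(block):
    """Return the vocabulary words that start with block (prefix abbreviation)."""
    return [word for word in VOCAB if word.startswith(block)]


def expand(abbr, max_blocks=3, top_k=3):
    """Rank candidate expansions of abbr by the product of unigram frequencies."""
    if not abbr:
        return []
    scored = {}
    for blocks in partitions(abbr, max_blocks):
        options = [restore(block) for block in blocks]
        if not all(options):
            continue  # prune partitions containing a block that cannot be restored
        for words in product(*options):
            score = 1.0
            for word in words:
                score *= VOCAB[word]
            phrase = " ".join(words)
            scored[phrase] = max(score, scored.get(phrase, 0.0))
    return sorted(scored, key=scored.get, reverse=True)[:top_k]


print(expand("bp"))
print(expand("patid"))
```

Limiting `max_blocks` and discarding partitions with an unrestorable block mirrors, in a very simplified form, the idea of shrinking the search space by constraining the partitions. A unigram score ignores word order and context, so a real system would need a stronger language model than this stand-in.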

Updated: 2019-03-28
