Accented Speech Recognition Based on End-to-End Domain Adversarial Training of Neural Networks, Applied Sciences

Applied Sciences (IF 2.5), Pub Date: 2021-09-10, DOI: 10.3390/app11188412
Hyeong-Ju Na, Jeong-Sik Park


Accented Speech Recognition Based on End-to-End Domain Adversarial Training of Neural Networks

The performance of automatic speech recognition (ASR) may degrade on accented speech because such speech differs linguistically from standard speech. Conventional accented speech recognition studies have used the accent embedding method, in which accent embedding features are fed directly into the ASR network. Although this method improves accented speech recognition, it has some restrictions, such as increased computational cost. This study proposes an efficient method of training the ASR model for accented speech in a domain adversarial way, based on the Domain Adversarial Neural Network (DANN). The DANN performs domain adaptation for settings in which the training data and test data have different distributions. Thus, our approach is expected to construct a reliable ASR model for accented speech by reducing the distribution differences between accented speech and standard speech. The DANN has three sub-networks: the feature extractor, the domain classifier, and the label predictor. To adapt the DANN to accented speech recognition, we constructed these three sub-networks independently, considering the characteristics of accented speech. In particular, we used an end-to-end framework based on Connectionist Temporal Classification (CTC) to develop the label predictor, a very important module that directly affects ASR results. To verify the efficiency of the proposed approach, we conducted several accented speech recognition experiments on four English accents: Australian, Canadian, British (England), and Indian. The experimental results showed that the proposed DANN-based model outperformed the baseline model for all accents, indicating that the end-to-end domain adversarial training effectively reduced the distribution differences between accented speech and standard speech.
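
The interaction between the three sub-networks can be illustrated with a deliberately tiny sketch. It is not the authors' implementation: every weight is a single hypothetical scalar, the CTC-based label predictor is replaced by a binary cross-entropy stand-in, and the function name `dann_step` and the parameters `lam` and `lr` are assumptions made for illustration. What it does show is the gradient reversal that DANN-style training generally relies on: the domain classifier descends on its own loss, while the feature extractor receives that gradient with the sign flipped and scaled by `lam`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dann_step(w_f, w_y, w_d, x, y, d, lam=0.1, lr=0.1):
    """One SGD step on a scalar toy DANN (illustrative assumption only).

    feature extractor : f   = w_f * x
    label predictor   : p_y = sigmoid(w_y * f)  (stand-in for the CTC head)
    domain classifier : p_d = sigmoid(w_d * f)  (e.g. accented vs. standard)
    """
    f = w_f * x
    p_y = sigmoid(w_y * f)
    p_d = sigmoid(w_d * f)

    # For sigmoid + binary cross-entropy, dLoss/dlogit = prediction - target.
    g_y = p_y - y
    g_d = p_d - d

    # Each head minimizes its own loss in the usual way.
    grad_w_y = g_y * f
    grad_w_d = g_d * f

    # Gradients of each loss with respect to the shared feature f.
    grad_f_label = g_y * w_y
    grad_f_domain = g_d * w_d

    # Gradient reversal: the domain gradient is multiplied by -lam before it
    # reaches the feature extractor, pushing features toward domain invariance.
    grad_w_f = (grad_f_label - lam * grad_f_domain) * x

    return (w_f - lr * grad_w_f, w_y - lr * grad_w_y, w_d - lr * grad_w_d)


if __name__ == "__main__":
    # lam = 0 disables the adversarial term; lam > 0 enables it.
    print(dann_step(1.0, 1.0, 1.0, x=1.0, y=1, d=1, lam=0.0))
    print(dann_step(1.0, 1.0, 1.0, x=1.0, y=1, d=1, lam=1.0))
```

Setting `lam` to 0 reduces the step to ordinary supervised training of the feature extractor and label predictor; larger values trade label accuracy against how strongly the shared features are pushed to hide the accent domain.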

Updated: 2021-09-10