End-to-End-Based Tibetan Multitask Speech Recognition

To date, speech recognition technology for majority languages has been applied successfully in wireless communication devices. However, as a minority language, Tibetan has very limited resources for conventional automatic speech recognition. It lacks sufficient data, sub-word units, lexicons, and wor...


Bibliographic Details
Main Authors: Yue Zhao, Jianjian Yue, Xiaona Xu, Licheng Wu, Xiali Li
Format: Article
Language: English
Published: IEEE 2019-01-01
Series: IEEE Access
Subjects: End-to-end model; multitask speech recognition; Tibetan language; WaveNet model
Online Access: https://ieeexplore.ieee.org/document/8894432/
id doaj-90fe08bc87f745689b50ebe605a9421b
record_format Article
spelling doaj-90fe08bc87f745689b50ebe605a9421b
2021-03-30T00:53:48Z
eng
IEEE
IEEE Access
ISSN 2169-3536
2019-01-01
Volume 7, pages 162519-162529
DOI 10.1109/ACCESS.2019.2952406
IEEE document 8894432
End-to-End-Based Tibetan Multitask Speech Recognition
Yue Zhao (https://orcid.org/0000-0002-4007-7016), Jianjian Yue, Xiaona Xu, Licheng Wu, Xiali Li
School of Information and Engineering, Minzu University of China, Beijing, China (all authors)
https://ieeexplore.ieee.org/document/8894432/
collection DOAJ
language English
format Article
sources DOAJ
author Yue Zhao
Jianjian Yue
Xiaona Xu
Licheng Wu
Xiali Li
title End-to-End-Based Tibetan Multitask Speech Recognition
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2019-01-01
description To date, speech recognition technology for majority languages has been applied successfully in wireless communication devices. However, as a minority language, Tibetan has very limited resources for conventional automatic speech recognition. It lacks sufficient data, sub-word units, lexicons, and word inventories for some dialects. In this paper, we present a multitask end-to-end model that simultaneously performs Tibetan speech content recognition, dialect identification, and speaker recognition. This model avoids the need for a pronunciation dictionary and word segmentation for new dialects while allowing three tasks to be trained in a single model. We build the multitask recognition framework on WaveNet-CTC. The dialect information and speaker ID are used in the output for training. The experimental results show that our method performs better than a task-specific model.
topic End-to-end model
multitask speech recognition
Tibetan language
WaveNet model
url https://ieeexplore.ieee.org/document/8894432/
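
Illustrative note: the description states that dialect information and speaker ID are used in the output for training a single WaveNet-CTC model. One possible reading, sketched below, is that each training target is the character sequence of the utterance extended with a dialect token and a speaker token, so that one CTC output layer covers all three tasks. This is a minimal sketch under that assumption, not the authors' implementation. The function names, the token format (`<dialect:...>`, `<speaker:...>`), the position of the extra tokens, and the placeholder Latin transcripts are all hypothetical; a real system would use Tibetan characters and the label layout defined in the paper.

```python
from typing import Dict, List, Optional, Tuple

BLANK_ID = 0  # index conventionally reserved for the CTC blank symbol


def build_vocab(transcripts: List[str], dialects: List[str],
                speakers: List[str]) -> Dict[str, int]:
    """Assign integer ids to characters and to dialect/speaker tokens.

    Ids start at 1 so that BLANK_ID stays free for the CTC blank.
    """
    symbols = sorted({ch for text in transcripts for ch in text})
    symbols += [f"<dialect:{d}>" for d in sorted(set(dialects))]
    symbols += [f"<speaker:{s}>" for s in sorted(set(speakers))]
    return {sym: idx + 1 for idx, sym in enumerate(symbols)}


def encode_target(transcript: str, dialect: str, speaker: str,
                  vocab: Dict[str, int]) -> List[int]:
    """Build one multitask target: dialect token, characters, speaker token.

    Placing the dialect first and the speaker last is an arbitrary choice
    made for this sketch.
    """
    tokens = [f"<dialect:{dialect}>", *transcript, f"<speaker:{speaker}>"]
    return [vocab[tok] for tok in tokens]


def split_prediction(ids: List[int],
                     vocab: Dict[str, int]) -> Tuple[Optional[str], str, Optional[str]]:
    """Split a decoded id sequence back into (dialect, text, speaker)."""
    inverse = {idx: sym for sym, idx in vocab.items()}
    dialect: Optional[str] = None
    speaker: Optional[str] = None
    chars: List[str] = []
    for idx in ids:
        sym = inverse[idx]
        if sym.startswith("<dialect:"):
            dialect = sym[len("<dialect:"):-1]
        elif sym.startswith("<speaker:"):
            speaker = sym[len("<speaker:"):-1]
        else:
            chars.append(sym)
    return dialect, "".join(chars), speaker


if __name__ == "__main__":
    # Placeholder data; the labels below are invented for illustration only.
    vocab = build_vocab(["ab", "ba"], ["lhasa", "amdo"], ["s1", "s2"])
    target = encode_target("ab", "lhasa", "s1", vocab)
    print(target)
    print(split_prediction(target, vocab))
```

With targets built this way, a single sequence model trained with a CTC loss would emit content, dialect, and speaker labels from one output vocabulary, which is consistent with the single-model training the description mentions.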