Sequence-to-Sequence Emotional Voice Conversion With Strength Control

This paper proposes an improved emotional voice conversion (EVC) method with emotional strength and duration controllability. EVC methods without duration mapping generate emotional speech whose duration is identical to that of the neutral input speech. In reality, the same sentence would be spoken with different speeds and rhythms depending on the emotion. To address this, the proposed method adopts a sequence-to-sequence network with an attention module that enables the network to learn which part of the neutral input sequence should be attended to for each part of the emotional output sequence. In addition, to capture the multi-attribute nature of emotional variation, an emotion encoder transforms acoustic features into emotion embedding vectors. By aggregating the emotion embedding vectors for each emotion, a representative vector for the target emotion is obtained and weighted to reflect emotion strength. A speaker encoder allows the proposed method to preserve speaker identity even after emotion conversion. Objective and subjective evaluation results confirm that the proposed method outperforms previous works, and emotion strength control in particular is achieved successfully.
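The strength-control idea described in the abstract (aggregate emotion embeddings into a representative vector, scale it by a strength weight, and condition a sequence-to-sequence decoder on it together with a speaker embedding) can be illustrated with a minimal sketch. Everything below, including the GRU-based emotion encoder, the embedding sizes, and the concatenation-style conditioning, is an assumption made for illustration only and is not the paper's actual implementation.

```python
# Minimal sketch (assumptions only) of emotion-strength control:
# average per-utterance emotion embeddings into a representative vector,
# scale it by a strength factor, and attach it plus a speaker embedding
# to every frame of the content encoder output before decoding.
import torch
import torch.nn as nn

class EmotionEncoder(nn.Module):
    """Hypothetical encoder mapping an acoustic feature sequence to one emotion embedding."""
    def __init__(self, feat_dim=80, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, emb_dim, batch_first=True)

    def forward(self, feats):                      # feats: (B, T, feat_dim)
        _, h = self.rnn(feats)                     # h: (1, B, emb_dim)
        return h.squeeze(0)                        # (B, emb_dim)

def representative_emotion(embeddings):
    """Aggregate per-utterance embeddings of one target emotion into a single vector."""
    return embeddings.mean(dim=0)                  # (emb_dim,)

def conditioned_decoder_input(encoder_out, emo_vec, spk_vec, strength=1.0):
    """Scale the target-emotion vector by the desired strength and
    concatenate it, together with the speaker embedding, to each encoder frame."""
    B, T, _ = encoder_out.shape
    cond = torch.cat([strength * emo_vec, spk_vec], dim=-1)   # (emo+spk,)
    cond = cond.unsqueeze(0).unsqueeze(0).expand(B, T, -1)    # (B, T, emo+spk)
    return torch.cat([encoder_out, cond], dim=-1)

if __name__ == "__main__":
    enc = EmotionEncoder()
    target_utts = torch.randn(4, 120, 80)          # 4 utterances of the target emotion
    emo_vec = representative_emotion(enc(target_utts))
    spk_vec = torch.randn(64)                      # placeholder speaker embedding
    encoder_out = torch.randn(1, 150, 256)         # placeholder content-encoder output
    dec_in = conditioned_decoder_input(encoder_out, emo_vec, spk_vec, strength=0.5)
    print(dec_in.shape)                            # torch.Size([1, 150, 448])
```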


Bibliographic Details
Main Authors: Heejin Choi, Minsoo Hahn
Format: Article
Language: English
Published: IEEE, 2021-01-01
Series: IEEE Access
Subjects: Voice conversion; emotional voice conversion; emotion strength; sequence-to-sequence learning; controllable emotional voice conversion
Online Access: https://ieeexplore.ieee.org/document/9374921/
id doaj-1eb9aa8ea3ee4c1ca29f2f2597ef2517
record_format Article
spelling doaj-1eb9aa8ea3ee4c1ca29f2f2597ef2517
last_indexed 2021-03-30T15:10:58Z
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2021-01-01
volume 9
pages 42674-42687
doi 10.1109/ACCESS.2021.3065460
ieee_document 9374921
title Sequence-to-Sequence Emotional Voice Conversion With Strength Control
author Heejin Choi (https://orcid.org/0000-0001-5093-2859), Department of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
author Minsoo Hahn (https://orcid.org/0000-0003-1953-6078), Department of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
url https://ieeexplore.ieee.org/document/9374921/
collection DOAJ
language English
format Article
sources DOAJ
author Heejin Choi
Minsoo Hahn
spellingShingle Heejin Choi
Minsoo Hahn
Sequence-to-Sequence Emotional Voice Conversion With Strength Control
IEEE Access
Voice conversion
emotional voice conversion
emotion strength
sequence-to-sequence learning
controllable emotional voice conversion
author_facet Heejin Choi
Minsoo Hahn
author_sort Heejin Choi
title Sequence-to-Sequence Emotional Voice Conversion With Strength Control
title_short Sequence-to-Sequence Emotional Voice Conversion With Strength Control
title_full Sequence-to-Sequence Emotional Voice Conversion With Strength Control
title_fullStr Sequence-to-Sequence Emotional Voice Conversion With Strength Control
title_full_unstemmed Sequence-to-Sequence Emotional Voice Conversion With Strength Control
title_sort sequence-to-sequence emotional voice conversion with strength control
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2021-01-01
description This paper proposes an improved emotional voice conversion (EVC) method with emotional strength and duration controllability. EVC methods without duration mapping generate emotional speech whose duration is identical to that of the neutral input speech. In reality, the same sentence would be spoken with different speeds and rhythms depending on the emotion. To address this, the proposed method adopts a sequence-to-sequence network with an attention module that enables the network to learn which part of the neutral input sequence should be attended to for each part of the emotional output sequence. In addition, to capture the multi-attribute nature of emotional variation, an emotion encoder transforms acoustic features into emotion embedding vectors. By aggregating the emotion embedding vectors for each emotion, a representative vector for the target emotion is obtained and weighted to reflect emotion strength. A speaker encoder allows the proposed method to preserve speaker identity even after emotion conversion. Objective and subjective evaluation results confirm that the proposed method outperforms previous works, and emotion strength control in particular is achieved successfully.
topic Voice conversion
emotional voice conversion
emotion strength
sequence-to-sequence learning
controllable emotional voice conversion
url https://ieeexplore.ieee.org/document/9374921/
work_keys_str_mv AT heejinchoi sequencetosequenceemotionalvoiceconversionwithstrengthcontrol
AT minsoohahn sequencetosequenceemotionalvoiceconversionwithstrengthcontrol
_version_ 1724179837327769600