Sequence-to-Sequence Emotional Voice Conversion With Strength Control

This paper proposes an improved emotional voice conversion (EVC) method with emotional strength and duration controllability. EVC methods without duration mapping generate emotional speech whose duration is identical to that of the neutral input speech. In reality, the same sentence would be spoken with different speeds and rhythms depending on the emotion. To address this, the proposed method adopts a sequence-to-sequence network with an attention module that enables the network to learn which part of the neutral input sequence should be attended to for each part of the emotional output sequence. In addition, to capture the multi-attribute nature of emotional variation, an emotion encoder transforms acoustic features into emotion embedding vectors. By aggregating the emotion embedding vectors for each emotion, a representative vector for the target emotion is obtained and weighted to reflect emotion strength. A speaker encoder allows the proposed method to preserve speaker identity even after emotion conversion. Objective and subjective evaluation results confirm that the proposed method outperforms previous works, and emotion strength control in particular is achieved successfully.
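The strength-control idea described in the abstract (aggregate emotion embeddings into a representative vector, scale it by a strength weight, and condition a sequence-to-sequence decoder on it together with a speaker embedding) can be illustrated with a minimal sketch. Everything below, including the GRU-based emotion encoder, the embedding sizes, and the concatenation-style conditioning, is an assumption made for illustration only and is not the paper's actual implementation.

```python
# Minimal sketch (assumptions only) of emotion-strength control:
# average per-utterance emotion embeddings into a representative vector,
# scale it by a strength factor, and attach it plus a speaker embedding
# to every frame of the content encoder output before decoding.
import torch
import torch.nn as nn

class EmotionEncoder(nn.Module):
    """Hypothetical encoder mapping an acoustic feature sequence to one emotion embedding."""
    def __init__(self, feat_dim=80, emb_dim=128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, emb_dim, batch_first=True)

    def forward(self, feats):                      # feats: (B, T, feat_dim)
        _, h = self.rnn(feats)                     # h: (1, B, emb_dim)
        return h.squeeze(0)                        # (B, emb_dim)

def representative_emotion(embeddings):
    """Aggregate per-utterance embeddings of one target emotion into a single vector."""
    return embeddings.mean(dim=0)                  # (emb_dim,)

def conditioned_decoder_input(encoder_out, emo_vec, spk_vec, strength=1.0):
    """Scale the target-emotion vector by the desired strength and
    concatenate it, together with the speaker embedding, to each encoder frame."""
    B, T, _ = encoder_out.shape
    cond = torch.cat([strength * emo_vec, spk_vec], dim=-1)   # (emo+spk,)
    cond = cond.unsqueeze(0).unsqueeze(0).expand(B, T, -1)    # (B, T, emo+spk)
    return torch.cat([encoder_out, cond], dim=-1)

if __name__ == "__main__":
    enc = EmotionEncoder()
    target_utts = torch.randn(4, 120, 80)          # 4 utterances of the target emotion
    emo_vec = representative_emotion(enc(target_utts))
    spk_vec = torch.randn(64)                      # placeholder speaker embedding
    encoder_out = torch.randn(1, 150, 256)         # placeholder content-encoder output
    dec_in = conditioned_decoder_input(encoder_out, emo_vec, spk_vec, strength=0.5)
    print(dec_in.shape)                            # torch.Size([1, 150, 448])
```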


Bibliographic Details
Main Authors: Heejin Choi, Minsoo Hahn
Format: Article
Language: English
Published: IEEE, 2021-01-01
Series: IEEE Access
Subjects: Voice conversion; emotional voice conversion; emotion strength; sequence-to-sequence learning; controllable emotional voice conversion
Online Access: https://ieeexplore.ieee.org/document/9374921/
id doaj-1eb9aa8ea3ee4c1ca29f2f2597ef2517
record_format Article
spelling doaj-1eb9aa8ea3ee4c1ca29f2f2597ef2517
last_indexed 2021-03-30T15:10:58Z
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2021-01-01
volume 9
pages 42674-42687
doi 10.1109/ACCESS.2021.3065460
ieee_document 9374921
title Sequence-to-Sequence Emotional Voice Conversion With Strength Control
author Heejin Choi (https://orcid.org/0000-0001-5093-2859), Department of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
author Minsoo Hahn (https://orcid.org/0000-0003-1953-6078), Department of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, South Korea
url https://ieeexplore.ieee.org/document/9374921/
collection DOAJ
language English
format Article
sources DOAJ
author Heejin Choi
Minsoo Hahn
spellingShingle Heejin Choi
Minsoo Hahn
Sequence-to-Sequence Emotional Voice Conversion With Strength Control
IEEE Access
Voice conversion
emotional voice conversion
emotion strength
sequence-to-sequence learning
controllable emotional voice conversion
author_facet Heejin Choi
Minsoo Hahn
author_sort Heejin Choi
title Sequence-to-Sequence Emotional Voice Conversion With Strength Control
title_short Sequence-to-Sequence Emotional Voice Conversion With Strength Control
title_full Sequence-to-Sequence Emotional Voice Conversion With Strength Control
title_fullStr Sequence-to-Sequence Emotional Voice Conversion With Strength Control
title_full_unstemmed Sequence-to-Sequence Emotional Voice Conversion With Strength Control
title_sort sequence-to-sequence emotional voice conversion with strength control
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2021-01-01
description This paper proposes an improved emotional voice conversion (EVC) method with emotional strength and duration controllability. EVC methods without duration mapping generate emotional speech whose duration is identical to that of the neutral input speech. In reality, the same sentence would be spoken with different speeds and rhythms depending on the emotion. To address this, the proposed method adopts a sequence-to-sequence network with an attention module that enables the network to learn which part of the neutral input sequence should be attended to for each part of the emotional output sequence. In addition, to capture the multi-attribute nature of emotional variation, an emotion encoder transforms acoustic features into emotion embedding vectors. By aggregating the emotion embedding vectors for each emotion, a representative vector for the target emotion is obtained and weighted to reflect emotion strength. A speaker encoder allows the proposed method to preserve speaker identity even after emotion conversion. Objective and subjective evaluation results confirm that the proposed method outperforms previous works, and emotion strength control in particular is achieved successfully.
topic Voice conversion
emotional voice conversion
emotion strength
sequence-to-sequence learning
controllable emotional voice conversion
url https://ieeexplore.ieee.org/document/9374921/
work_keys_str_mv AT heejinchoi sequencetosequenceemotionalvoiceconversionwithstrengthcontrol
AT minsoohahn sequencetosequenceemotionalvoiceconversionwithstrengthcontrol
_version_ 1724179837327769600