DEEP LEARNING METHODS FOR IMPROVING THE PERCEPTUAL QUALITY OF NOISY AND REVERBERANT SPEECH

Bibliographic Details
Main Author: Williamson, Donald S.
Language: English
Published: The Ohio State University / OhioLINK 2016
Subjects: Computer Science
Online Access:http://rave.ohiolink.edu/etdc/view?acc_num=osu1461018277
id ndltd-OhioLink-oai-etd.ohiolink.edu-osu1461018277
record_format oai_dc
collection NDLTD
language English
sources NDLTD
topic Computer Science
spellingShingle Computer Science
Williamson, Donald S.
DEEP LEARNING METHODS FOR IMPROVING THE PERCEPTUAL QUALITY OF NOISY AND REVERBERANT SPEECH
author Williamson, Donald S.
author_facet Williamson, Donald S.
author_sort Williamson, Donald S.
title DEEP LEARNING METHODS FOR IMPROVING THE PERCEPTUAL QUALITY OF NOISY AND REVERBERANT SPEECH
title_short DEEP LEARNING METHODS FOR IMPROVING THE PERCEPTUAL QUALITY OF NOISY AND REVERBERANT SPEECH
title_full DEEP LEARNING METHODS FOR IMPROVING THE PERCEPTUAL QUALITY OF NOISY AND REVERBERANT SPEECH
title_fullStr DEEP LEARNING METHODS FOR IMPROVING THE PERCEPTUAL QUALITY OF NOISY AND REVERBERANT SPEECH
title_full_unstemmed DEEP LEARNING METHODS FOR IMPROVING THE PERCEPTUAL QUALITY OF NOISY AND REVERBERANT SPEECH
title_sort deep learning methods for improving the perceptual quality of noisy and reverberant speech
publisher The Ohio State University / OhioLINK
publishDate 2016
url http://rave.ohiolink.edu/etdc/view?acc_num=osu1461018277
work_keys_str_mv AT williamsondonalds deeplearningmethodsforimprovingtheperceptualqualityofnoisyandreverberantspeech
_version_ 1719439704799051776
spelling ndltd-OhioLink-oai-etd.ohiolink.edu-osu14610182772021-08-03T06:35:45Z DEEP LEARNING METHODS FOR IMPROVING THE PERCEPTUAL QUALITY OF NOISY AND REVERBERANT SPEECH Williamson, Donald S. Computer Science Speech is a vital form of human communication, and it is important for many real-world applications: voice commands are used to interface with electronic devices, and hearing-impaired individuals use hearing aids to understand speech better. In realistic environments, background noise and reverberation are present, degrading performance. For this reason, it is crucial that speech be separated from interference. Many speech separation approaches have been proposed, but there is a considerable need to produce speech estimates that are both intelligible and of high quality, especially at low signal-to-noise ratios (SNRs).

Time-frequency (T-F) masking and model-based separation are two common ways to extract speech from a noisy observation. T-F masking involves the estimation of an oracle mask, which can be accomplished using supervised learning. Deep neural networks (DNNs) are well suited for T-F mask estimation due to their ability to learn mappings from noisy observations to a desired target. Likewise, model-based separation is suitable due to its ability to represent the spectral structure of speech. This dissertation presents work that develops speech separation systems using combinations of T-F masking, DNNs, and model-based reconstruction. The aim of each system is to improve the perceptual quality of the speech estimates.

Ideal binary mask (IBM) estimation has shown success in improving the intelligibility of separated speech, but it often results in poor quality due to estimation errors and the removal of speech. On the other hand, model-based separation approaches such as nonnegative matrix factorization (NMF) and sparse reconstruction improve the perceptual quality, but not the intelligibility, of separated speech.
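To make the two oracle masks above concrete, here is a minimal sketch of how the IBM and IRM are commonly defined from clean-speech and noise magnitude spectrograms. The function name, the local criterion `lc_db`, and the compression exponent `beta = 0.5` are illustrative conventions from the masking literature, not necessarily the dissertation's exact formulation:

```python
import numpy as np

def ideal_masks(clean_mag, noise_mag, lc_db=0.0, beta=0.5):
    """Compute the IBM and IRM from clean and noise magnitude spectrograms.

    clean_mag, noise_mag: arrays of shape (freq, time), nonnegative magnitudes.
    lc_db: local SNR criterion (dB) for the binary decision.
    beta: compression exponent for the ratio mask (0.5 is a common choice).
    """
    eps = np.finfo(float).eps
    # Local SNR in dB at each time-frequency unit.
    local_snr_db = 20.0 * np.log10((clean_mag + eps) / (noise_mag + eps))
    # IBM: keep (1) a T-F unit if speech dominates, discard (0) otherwise.
    ibm = (local_snr_db > lc_db).astype(float)
    # IRM: a soft gain in [0, 1] based on the speech-to-mixture energy ratio.
    irm = (clean_mag**2 / (clean_mag**2 + noise_mag**2 + eps)) ** beta
    return ibm, irm
```

Applying either mask element-wise to the noisy magnitude spectrogram attenuates noise-dominated units; the hard zeros of the IBM are what can hurt quality, while the IRM's soft gains tend to sound more natural.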
We start by studying the performance of speech separation when IBM estimation is combined with model-based reconstruction. We demonstrate that our system improves perceptual quality and intelligibility over performing T-F masking or model-based separation alone.

DNNs have successfully estimated a range of targets. We then present a method that uses a DNN to estimate the activations of a speech model. Initially, a DNN estimates the ideal ratio mask (IRM), and the estimated IRM separates the speech from noise with reasonable sound quality. Afterwards, a second DNN learns the mapping from ratio-masked speech to NMF model activations. The estimated activations linearly combine the elements of an NMF speech model to approximate clean speech. Experiments show that the proposed approach produces high-quality separated speech. In addition, we conduct a listening study whose results show that our output is preferred over comparison systems.

The above and most other speech separation systems operate on the magnitude response of noisy speech and reuse the noisy phase during signal reconstruction, a practice rooted in the belief that the phase spectrum is unimportant for speech enhancement. More recent studies, however, reveal that phase is important for perceptual quality. We present an approach that enhances the magnitude and phase spectra concurrently by operating in the complex domain. We start by introducing the complex ideal ratio mask (cIRM), which has real and imaginary components. A DNN is used to jointly estimate these components of the cIRM. Evaluation results demonstrate that the proposed system substantially improves perceptual quality over recent approaches in noisy environments.

Along with background noise, room reverberation is commonly encountered in real environments. The performance of many speech processing applications is severely degraded when both noise and reverberation are present.
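The cIRM follows directly from its defining property that multiplying the noisy STFT by the mask recovers the clean STFT, i.e. the mask is the complex ratio S/Y. The sketch below derives its real and imaginary components; the helper names are hypothetical, and in practice a DNN would predict `mask_r` and `mask_i` (often after compressing them to a bounded range) rather than computing them from the clean signal, which is only available during training:

```python
import numpy as np

def complex_ideal_ratio_mask(noisy_stft, clean_stft):
    """cIRM: the complex mask M with M * Y = S in the STFT domain.

    noisy_stft (Y), clean_stft (S): complex arrays of shape (freq, time).
    Returns the real and imaginary components separately, since a DNN is
    trained to predict each as a real-valued target.
    """
    eps = np.finfo(float).eps
    denom = noisy_stft.real**2 + noisy_stft.imag**2 + eps  # |Y|^2
    # M = S / Y = S * conj(Y) / |Y|^2, split into real and imaginary parts.
    mask_r = (noisy_stft.real * clean_stft.real + noisy_stft.imag * clean_stft.imag) / denom
    mask_i = (noisy_stft.real * clean_stft.imag - noisy_stft.imag * clean_stft.real) / denom
    return mask_r, mask_i

def apply_complex_mask(noisy_stft, mask_r, mask_i):
    """Enhance magnitude and phase jointly: S_hat = (M_r + j*M_i) * Y."""
    return (mask_r + 1j * mask_i) * noisy_stft
```

Because the mask is complex-valued, the product rotates as well as scales each T-F unit of Y, which is how the method modifies the phase spectrum in addition to the magnitude.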
We propose to perform dereverberation and denoising simultaneously with the cIRM. First, we redefine the cIRM for reverberant and noisy environments. A DNN is then trained to estimate it. The complex mask removes the interference caused by noise and reverberation, resulting in better predicted speech quality and intelligibility.

2016-09-12 English text The Ohio State University / OhioLINK http://rave.ohiolink.edu/etdc/view?acc_num=osu1461018277 unrestricted This thesis or dissertation is protected by copyright: some rights reserved. It is licensed for use under a Creative Commons license. Specific terms and permissions are available from this document's record in the OhioLINK ETD Center.