LIP-READING VIA DEEP NEURAL NETWORKS USING HYBRID VISUAL FEATURES

Lip-reading is commonly understood as visually interpreting a speaker's lip movements during speech. Experiments over many years have shown that speech intelligibility increases when visual facial information is available, an effect that becomes more pronounced in noisy environments. Automating this process raises several challenges, such as the coarticulation phenomenon, the choice of visual units, feature diversity, and inter-speaker dependency. While efforts have been made to overcome these challenges, a flawless lip-reading system remains under investigation. This paper seeks a lip-reading model with an efficiently designed arrangement of processing blocks that extracts highly discriminative visual features, highlighting the application of a properly structured Deep Belief Network (DBN)-based recognizer. Multi-speaker (MS) and speaker-independent (SI) tasks are performed on the CUAVE database, and phone recognition rates (PRRs) of 77.65% and 73.40% are achieved, respectively. The best word recognition rates (WRRs) achieved in the MS and SI tasks are 80.25% and 76.91%, respectively. The resulting accuracies demonstrate that the proposed method outperforms a conventional Hidden Markov Model (HMM) and competes well with state-of-the-art visual speech recognition work.
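The record's keywords point to a Deep Belief Network built from Restricted Boltzmann Machines (RBMs). As a rough, generic illustration only (not the paper's actual recognizer, whose architecture and hyperparameters are not given here), the following sketches an RBM trained with one-step contrastive divergence (CD-1), the standard building block from which DBNs are greedily pre-trained layer by layer:

```python
import numpy as np

# Minimal RBM sketch with CD-1 training. All class/variable names and
# hyperparameters are illustrative assumptions, not taken from the article.
class RBM:
    def __init__(self, n_visible, n_hidden, lr=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible-unit biases
        self.b_h = np.zeros(n_hidden)    # hidden-unit biases
        self.lr = lr

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def hidden_probs(self, v):
        # P(h=1 | v) for a batch of visible vectors
        return self._sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        # P(v=1 | h)
        return self._sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0):
        """One CD-1 update on a batch of binary visible vectors."""
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)          # one-step reconstruction
        h1 = self.hidden_probs(v1)
        batch = v0.shape[0]
        # Positive-phase minus negative-phase statistics
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / batch
        self.b_v += self.lr * (v0 - v1).mean(axis=0)
        self.b_h += self.lr * (h0 - h1).mean(axis=0)
        return np.mean((v0 - v1) ** 2)              # reconstruction error
```

A DBN is then pre-trained greedily: one RBM is trained on the input features, its hidden-unit probabilities become the training data for the next RBM, and so on, before a supervised fine-tuning stage.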

Bibliographic Details
Main Authors: Fatemeh Vakhshiteh, Farshad Almasganj, Ahmad Nickabadi
Format: Article
Language: English
Published: Slovenian Society for Stereology and Quantitative Image Analysis, 2018-07-01
Series: Image Analysis and Stereology
Subjects: Deep Belief Networks; Hidden Markov Model; lip-reading; Restricted Boltzmann Machine
Online Access: https://www.ias-iss.org/ojs/IAS/article/view/1859
ISSN: 1580-3139, 1854-5165
Volume 37, No. 2 (2018), pp. 159-171
DOI: 10.5566/ias.1859
Affiliation: Amirkabir University of Technology - Tehran Polytechnic (all authors)