LIP-READING VIA DEEP NEURAL NETWORKS USING HYBRID VISUAL FEATURES
Lip-reading is commonly understood as visually interpreting a speaker's lip movements during speech. Experiments over many years have shown that speech intelligibility increases when visual facial information is available, and this effect becomes more pronounced in noisy environments. Automating this process raises several challenges, such as the coarticulation phenomenon, the choice of visual units, the diversity of features, and their inter-speaker dependency. While efforts have been made to overcome these challenges, a flawless lip-reading system remains under investigation. This paper searches for a lip-reading model with an efficient incorporation and arrangement of processing blocks to extract highly discriminative visual features, highlighting the application of a properly structured Deep Belief Network (DBN)-based recognizer. Multi-speaker (MS) and speaker-independent (SI) tasks are performed on the CUAVE database, and phone recognition rates (PRRs) of 77.65% and 73.40% are achieved, respectively. The best word recognition rates (WRRs) achieved in the MS and SI tasks are 80.25% and 76.91%, respectively. The resulting accuracies demonstrate that the proposed method outperforms a conventional Hidden Markov Model (HMM) and competes well with state-of-the-art visual speech recognition systems.
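The paper's recognizer is built on a Deep Belief Network, i.e., a stack of Restricted Boltzmann Machines pretrained greedily layer by layer. The paper's exact architecture and training setup are not given in this record, so the following is only a minimal generic sketch of the technique: a binary RBM trained with one step of contrastive divergence (CD-1), stacked two layers deep on toy data. All dimensions and the toy input are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Minimal binary-binary Restricted Boltzmann Machine trained with CD-1."""
    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = rng.normal(0, 0.01, size=(n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)   # visible biases
        self.b_h = np.zeros(n_hidden)    # hidden biases
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0):
        # Positive phase: hidden probabilities and a sampled hidden state
        ph0 = self.hidden_probs(v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        # Negative phase: one Gibbs step back to visible, then to hidden
        pv1 = self.visible_probs(h0)
        ph1 = self.hidden_probs(pv1)
        # Contrastive-divergence parameter updates
        batch = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / batch
        self.b_v += self.lr * (v0 - pv1).mean(axis=0)
        self.b_h += self.lr * (ph0 - ph1).mean(axis=0)

# Greedy layer-wise pretraining of a 2-layer DBN: train the first RBM,
# then use its hidden activations as training data for the second.
X = (rng.random((64, 20)) < 0.5).astype(float)   # toy binary "feature" data
rbm1, rbm2 = RBM(20, 12), RBM(12, 6)
for _ in range(50):
    rbm1.cd1_step(X)
H1 = rbm1.hidden_probs(X)
for _ in range(50):
    rbm2.cd1_step(H1)
print(rbm2.hidden_probs(H1).shape)  # (64, 6)
```

In a DBN-based recognizer such as the one described, the pretrained stack would typically be fine-tuned discriminatively (e.g., with a softmax output layer over phone classes) rather than used directly as above.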
Main Authors: | Fatemeh Vakhshiteh, Farshad Almasganj, Ahmad Nickabadi |
---|---|
Format: | Article |
Language: | English |
Published: | Slovenian Society for Stereology and Quantitative Image Analysis, 2018-07-01 |
Series: | Image Analysis and Stereology |
Subjects: | Deep Belief Networks; Hidden Markov Model; lip-reading; Restricted Boltzmann Machine |
Online Access: | https://www.ias-iss.org/ojs/IAS/article/view/1859 |
DOI: | 10.5566/ias.1859 |
ISSN: | 1580-3139 (print); 1854-5165 (online) |
Citation: | Image Analysis and Stereology, Vol. 37, No. 2 (2018), pp. 159-171 |
Affiliation: | Amirkabir University of Technology - Tehran Polytechnic (all authors) |