Attention-based Sound Event Recognition using Weakly Labeled Data

Bibliographic Details
Main Authors: Szu-Yu Chou, 周思瑜
Other Authors: 張智星
Format: Others
Language: en_US
Published: 2019
Online Access: http://ndltd.ncl.edu.tw/handle/cfexpb
Description
Summary: PhD === National Taiwan University === Graduate Institute of Networking and Multimedia === 107 === Understanding the surrounding environment and ongoing events through acoustic cues, or so-called "sound intelligence," is a critical piece of the Artificial Intelligence (AI) puzzle. Humans are able to recognize not only the sounds of speech utterances and musical pieces, but also animal sounds, natural sounds, and common everyday environmental sounds. With sound intelligence, an AI can perform much better in applications such as smart surveillance, smart cities, smart cars, and smart factories. As a result, recent years have witnessed rapid progress in recognizing various sound events in daily environments. Most current research adopts frameworks based on fully supervised deep learning techniques that require strongly labeled data. However, labeled data for sound event recognition generally lack detailed temporal annotations because of the high cost of the labeling process. This dissertation makes the following four contributions to recognizing sound events using weakly labeled data.

First, we propose an attention-based model that recognizes transient sound events using only weakly labeled data. This task is challenging because weakly labeled data provide annotations only at the clip level, yet some sound events appear only for a short period of time within an audio clip. We address this lack of detailed annotations with a novel attentional supervision mechanism. The resulting model, dubbed M&mnet, outperforms all existing models on AudioSet, a collection of two million weakly labeled audio clips released by Google in 2017.

Second, we address the challenge of recognizing sound events with only a few training examples per class. This problem is critical because fully supervised learning algorithms cannot learn well when data are scarce. We propose a novel attentional similarity module that guides the model to attend to the specific segments of a long audio clip that are relevant to the target sound events. We show that this module greatly improves the performance of few-shot sound recognition.

Third, we propose FrameCNN, a novel weakly supervised learning framework that improves the performance of convolutional neural networks (CNNs) for acoustic event detection by attending to the details of each sound at multiple temporal levels. In the large-scale weakly supervised sound event detection challenge for smart cars, we obtained an F-score of 53.8% for sound event audio tagging, compared to a baseline of 19.8%, and an F-score of 32.8% for sound event detection, compared to a baseline of 11.4%.

Lastly, we build a noise-robust sound event detection model for mobile and embedded applications. We require the model to be usable in real-world environments, with low memory usage and low detection latency. By combining several state-of-the-art techniques for building deep learning models, we implement a baby cry detector on the Raspberry Pi that runs in real time. We find that our model can effectively detect baby cries under various noisy conditions, whereas the baby cry detector available on Samsung's flagship smartphone (as of late 2018) cannot.
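To make the first contribution more concrete, below is a minimal sketch (in PyTorch, not the dissertation's actual M&mnet code) of attention-based pooling for weakly labeled sound event recognition: frame-level class probabilities are aggregated into a clip-level prediction through learned attention weights, so the model can be trained with clip-level labels alone. The two linear heads, feature dimension, and class count are illustrative assumptions.

```python
# Hypothetical sketch: attention pooling over frame-level predictions so that a
# model can be trained with clip-level (weak) labels only. Layer shapes are
# illustrative, not the M&mnet definition.
import torch
import torch.nn as nn


class AttentionPooling(nn.Module):
    """Aggregate frame-level class probabilities into one clip-level prediction."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)  # per-frame class scores
        self.attention = nn.Linear(feat_dim, num_classes)   # per-frame importance

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, feat_dim) embeddings from any frame-level encoder.
        frame_probs = torch.sigmoid(self.classifier(frames))        # (B, T, C)
        att_weights = torch.softmax(self.attention(frames), dim=1)  # normalized over time
        # Attention-weighted average over time -> clip-level probabilities (B, C),
        # trainable against weak labels with, e.g., binary cross-entropy.
        return (frame_probs * att_weights).sum(dim=1)


if __name__ == "__main__":
    # Toy usage: 4 clips, 100 frames each, 64-dim frame features, 10 sound classes.
    pooling = AttentionPooling(feat_dim=64, num_classes=10)
    clip_probs = pooling(torch.randn(4, 100, 64))
    print(clip_probs.shape)  # torch.Size([4, 10])
```

Because the learned attention weights indicate which frames drive each clip-level decision, this style of pooling also gives a rough temporal localization of transient events, which is how clip-level supervision can still cope with short-lived sounds.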
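For the few-shot setting in the second contribution, the sketch below illustrates the general idea of an attentional similarity score between a query clip and a labeled support clip: frame-to-frame similarities are computed and the comparison is focused on the best-matching segments rather than on whole-clip averages. The temperature value, helper name, and class labels are made up for illustration and do not come from the dissertation.

```python
# Hypothetical sketch: attentional similarity for few-shot sound recognition.
# Frame-to-frame cosine similarities between a query clip and a support clip are
# re-weighted so that well-matching (event-bearing) query frames dominate the score.
import torch
import torch.nn.functional as F


def attentional_similarity(query: torch.Tensor, support: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """query: (Tq, D) frame embeddings; support: (Ts, D) frame embeddings."""
    q = F.normalize(query, dim=-1)
    s = F.normalize(support, dim=-1)
    sim = q @ s.t()                                          # (Tq, Ts) cosine similarities
    frame_scores = sim.max(dim=1).values                     # best support match per query frame
    att = torch.softmax(frame_scores / temperature, dim=0)   # focus on strong matches
    return (att * frame_scores).sum()                        # attention-weighted clip similarity


if __name__ == "__main__":
    # Toy 1-shot usage: pick the class whose single labeled example is most similar.
    query_clip = torch.randn(80, 64)
    support_set = {"dog_bark": torch.randn(120, 64), "siren": torch.randn(120, 64)}
    scores = {c: attentional_similarity(query_clip, ex).item()
              for c, ex in support_set.items()}
    print(max(scores, key=scores.get), scores)
```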
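The third contribution, FrameCNN, attends to sounds at several temporal resolutions inside a CNN. The sketch below conveys that idea only in broad strokes, assuming a log-mel spectrogram input and reusing the attention pooling from the first sketch; the convolutional stages, channel counts, and the simple averaging of the two levels are illustrative choices, not the published FrameCNN architecture.

```python
# Hypothetical sketch: attention pooling applied at two temporal resolutions of a
# CNN so that both short and long sound structures can contribute to the
# clip-level prediction. Not the actual FrameCNN architecture.
import torch
import torch.nn as nn


class AttentionPooling(nn.Module):
    """Attention-weighted average over time (same idea as the first sketch)."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)
        self.attention = nn.Linear(feat_dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:  # frames: (B, T, feat_dim)
        probs = torch.sigmoid(self.classifier(frames))
        att = torch.softmax(self.attention(frames), dim=1)
        return (probs * att).sum(dim=1)                        # (B, num_classes)


class MultiLevelAttentionCNN(nn.Module):
    def __init__(self, num_mels: int = 64, num_classes: int = 10):
        super().__init__()
        # Two convolutional stages; each halves the time and frequency resolution.
        self.stage1 = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.stage2 = nn.Sequential(
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        # One attention-pooling head per temporal level.
        self.head1 = AttentionPooling(32 * (num_mels // 2), num_classes)
        self.head2 = AttentionPooling(64 * (num_mels // 4), num_classes)

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, time, num_mels) log-mel spectrogram.
        f1 = self.stage1(spec)                    # (B, 32, T/2, M/2)
        f2 = self.stage2(f1)                      # (B, 64, T/4, M/4)
        # Flatten channels and frequency into one feature vector per time step.
        seq1 = f1.permute(0, 2, 1, 3).flatten(2)  # (B, T/2, 32 * M/2)
        seq2 = f2.permute(0, 2, 1, 3).flatten(2)  # (B, T/4, 64 * M/4)
        # Average the clip-level predictions obtained at the two temporal levels.
        return 0.5 * (self.head1(seq1) + self.head2(seq2))


if __name__ == "__main__":
    model = MultiLevelAttentionCNN(num_mels=64, num_classes=10)
    print(model(torch.randn(2, 1, 100, 64)).shape)  # torch.Size([2, 10])
```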