Deep Hierarchical Sequence Generation with Self-Attention

Bibliographic Details
Main Authors: Wang, Chun-Wei, 王俊煒
Other Authors: Chien, Jen-Tzung
Format: Others
Language: en_US
Published: 2018
Online Access: http://ndltd.ncl.edu.tw/handle/63b5zq
Description
Summary: Master's thesis === National Chiao Tung University === Institute of Communications Engineering === 107 === In recent years, deep generative models, which offer the promise of learning from unlabeled data and synthesizing realistic data, have been developing rapidly for image, speech, and text processing. Popular approaches such as the variational autoencoder (VAE), the generative adversarial network (GAN), and the autoregressive neural network have achieved remarkable performance in a variety of applications. One of the most attractive solutions is the latent variable model based on the VAE, which consists of an inference model (encoder) and a generative model (decoder). The encoder compresses the input data into a latent representation, while the decoder generates synthesized samples from the latent space. The VAE learns both the encoder and the decoder by optimizing a variational lower bound of the log-likelihood over a set of training data.

Despite the great success of the VAE, a crucial issue is the difficulty of learning a complicated latent variable structure, especially in the presence of large-scale image sets or highly structured sequential data. Given complicated natural images for training, the VAE tends to generate blurry images in prediction. On the other hand, a VAE for sequence generation is composed of two recurrent neural networks (RNNs), one for the encoder and one for the decoder. In general, the RNN decoder is trained by teacher forcing, where the model receives the ground-truth output as the input at the next time step during training. However, this leads to a training issue where the latent loss, a Kullback-Leibler (KL) divergence, vanishes so that the latent variables are not really used. This phenomenon is known as posterior collapse, and it also widely occurs in stochastic RNNs, where additional latent variables are introduced to represent the hidden states of the RNN. To avoid posterior collapse, the RNN has recently been replaced by a variant of the convolutional neural network (CNN) for sequential data representation, where the decoder is weakened to encourage the utilization of the latent variables for complex data modeling.

This study presents a new variational and hierarchical latent variable model for sequence generation. Sequential data is more complex than non-temporal data for representation learning, so a simple latent variable model for the VAE in sequence learning may be insufficient. Instead of a CNN implementation, we strengthen the capability of the encoder in a sequence-to-sequence model based on a hierarchical latent variable model. This model consists of two encoders with two meaningful latent variables. The global and local dependencies of the latent variables are discovered and involved in a sophisticated model for sequence generation. A long short-term memory (LSTM) network and a pyramid bidirectional LSTM (pBLSTM) are merged as the two encoders to capture complementary features for decoding the sequential outputs, and the issue of posterior collapse is accordingly alleviated by this framework. Traditionally, the sequence-to-sequence model encodes the input sequence with an RNN encoder and generates a variable-length output sequence with another RNN decoder. We further present a self-attention model that enhances the interaction between the inference model and the generative model by incorporating a self-attention scheme as an interface between the RNN-based encoder and decoder.
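For reference, the variational lower bound mentioned in the summary takes the standard VAE form (the notation here is assumed, not quoted from the thesis): for an observation x and latent variable z with prior p(z),

    \log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big).

Posterior collapse corresponds to the KL term being driven to nearly zero, so that q_\phi(z | x) matches the prior and the decoder effectively ignores z.

The pyramid BLSTM encoder can be illustrated with a minimal sketch. The code below is an assumption-laden illustration (PyTorch, with hypothetical class and variable names), not the thesis implementation: each pBLSTM layer concatenates adjacent time steps before a bidirectional LSTM, halving the temporal resolution, so a stack of such layers produces a shorter, more global summary that complements the step-by-step states of an ordinary LSTM encoder.

    # Minimal sketch of a pyramid BLSTM layer (PyTorch assumed; names hypothetical).
    import torch
    import torch.nn as nn

    class PyramidBLSTM(nn.Module):
        def __init__(self, input_size: int, hidden_size: int):
            super().__init__()
            # Input features are doubled because two neighboring frames are merged.
            self.blstm = nn.LSTM(input_size * 2, hidden_size,
                                 batch_first=True, bidirectional=True)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, time, features); drop the last frame if time is odd.
            batch, time, feat = x.size()
            if time % 2 == 1:
                x = x[:, :-1, :]
                time -= 1
            # Merge every pair of adjacent frames: (batch, time/2, 2*features).
            x = x.contiguous().view(batch, time // 2, feat * 2)
            output, _ = self.blstm(x)
            return output  # (batch, time/2, 2*hidden_size)

    # Hypothetical dual-encoder setup: a plain LSTM keeps the local, step-by-step
    # view, while stacked pBLSTM layers give a shorter, more global summary.
    local_encoder = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)
    global_encoder = nn.Sequential(PyramidBLSTM(128, 256), PyramidBLSTM(512, 256))

    tokens = torch.randn(8, 32, 128)          # (batch, time, embedding)
    local_states, _ = local_encoder(tokens)   # (8, 32, 256): one state per step
    global_states = global_encoder(tokens)    # (8, 8, 512): coarser summary

This is one plausible way to obtain the local and global views to which the summary attaches its two latent variables; the actual architecture of the thesis may differ.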
Developing an attention method under this setting is challenging, however, because the encoder is disregarded in the generative process in which we want to generate new samples from the latent space. In this thesis, we take advantage of a stochastic RNN as the decoder, whose additional latent variables allow us to model the context vector at each time step. During training, the context vector is computed as the attention information, i.e., the weighted sum of the hidden states of the encoder. During the test phase, we can sample from the prior of the decoder to reconstruct the context vector and thereby compensate for the missing attention information. In this way, the framework can generate sequential data that depends on the latent variables of both the encoder and the decoder at inference time.

In the experiments, the proposed models are evaluated on two popular natural language processing (NLP) datasets, Penn Treebank and Yelp 2013. Experimental results show that our models improve the generation performance in terms of perplexity and avoid the issue of posterior collapse. In addition, the hierarchical model is able to learn different meaningful latent representations through the global and local dependencies of the latent variables. As for the self-attention model, we visualize the attention weights to investigate how the attention mechanism works and find that the self-attention model can retrieve useful information that helps generation.
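The context-vector mechanism described above can be sketched in the same hedged spirit (PyTorch; the module name StochasticContext and all dimensions are hypothetical, not taken from the thesis). During training the context is the attention-weighted sum of encoder hidden states; at generation time, when the encoder is not part of the generative process, a context vector is instead sampled from a prior conditioned on the current decoder state.

    # Minimal sketch of attention with a learned prior over the context vector.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class StochasticContext(nn.Module):
        def __init__(self, enc_size: int, dec_size: int):
            super().__init__()
            self.score = nn.Linear(dec_size, enc_size, bias=False)  # bilinear score
            # Prior over the context vector, conditioned on the decoder state.
            self.prior_mu = nn.Linear(dec_size, enc_size)
            self.prior_logvar = nn.Linear(dec_size, enc_size)

        def forward(self, dec_state, enc_states=None):
            if enc_states is not None:
                # Training / reconstruction: weighted sum of encoder hidden states.
                scores = torch.bmm(enc_states, self.score(dec_state).unsqueeze(2))
                weights = F.softmax(scores.squeeze(2), dim=1)         # (batch, time)
                context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)
                return context, weights
            # Generation: no encoder states, so sample the context from the prior.
            mu, logvar = self.prior_mu(dec_state), self.prior_logvar(dec_state)
            context = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
            return context, None

    attend = StochasticContext(enc_size=512, dec_size=256)
    dec_state = torch.randn(8, 256)
    enc_states = torch.randn(8, 32, 512)
    ctx_train, weights = attend(dec_state, enc_states)  # training: uses the encoder
    ctx_gen, _ = attend(dec_state)                      # inference: samples from prior

The point of this design is that the downstream decoder consumes a context vector of the same form in both modes, so new sequences can be generated from the latent space alone while still benefiting from attention during training.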