Lyrics-to-audio Alignment of Chinese Pop Songs and Rap Songs

Chinese title: 中文流行歌曲與饒舌歌曲之歌詞對位

Master's thesis === National Taiwan University === Graduate Institute of Electrical Engineering === Academic year 104 === The system developed in this thesis takes two inputs: lyric text files and song WAV files. The goal is to mark the lyrics automatically with time codes so that each line can be displayed at the moment it is sung during playback. The core of the system is forced alignment based on hidden Markov models (HMMs). The HMMs are first trained on speech data and then adapted with songs, so that the adapted models are better suited to singing voice mixed with music; we adopt the maximum a posteriori (MAP) adaptation strategy.

Forced alignment requires some preprocessing of both the lyrics and the audio, as well as an initial set of HMMs. For the lyrics, we first perform word segmentation and then look up each word's phone sequence in a lexicon, which yields the phone sequence of the whole song. For the audio, we use HTK to extract Mel-frequency cepstral coefficients (MFCCs) from the WAV files.
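The abstract does not say which word segmenter or pronunciation lexicon is used, so the following is only a minimal sketch of the lyrics preprocessing step: it assumes the jieba segmenter and a tiny hand-written lexicon mapping words to Mandarin initial/final phone symbols, and it simply concatenates the looked-up phones for one lyric line.

# Minimal sketch of lyrics preprocessing: word segmentation followed by
# lexicon lookup. The segmenter (jieba) and the toy lexicon are assumptions;
# the thesis does not specify the actual tools or phone inventory.
import jieba

# Hypothetical lexicon: word -> list of Mandarin initial/final phone symbols.
LEXICON = {
    "我": ["uo"],                     # zero-initial syllable: final only
    "愛": ["ai"],
    "唱歌": ["ch", "ang", "g", "e"],
}

def lyrics_to_phones(line):
    """Segment one lyric line into words, then look up each word's phones."""
    phones = []
    for word in jieba.cut(line):
        if not word.strip():
            continue
        if word in LEXICON:
            phones.extend(LEXICON[word])
        else:
            # Fall back to character-by-character lookup for unknown words.
            for ch in word:
                phones.extend(LEXICON.get(ch, []))
    return phones

if __name__ == "__main__":
    print(lyrics_to_phones("我愛唱歌"))  # -> ['uo', 'ai', 'ch', 'ang', 'g', 'e']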
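Feature extraction is done with HTK's HCopy tool. The abstract does not list the analysis parameters, so the configuration below uses common HTKBook-style settings (12 cepstra plus C0 with deltas and accelerations, 25 ms Hamming window, 10 ms frame shift); treat the exact values as assumptions rather than the thesis's own settings.

# Sketch of MFCC extraction by driving HTK's HCopy from Python.
# Parameter values are common HTKBook-style defaults, not taken from the thesis.
import subprocess

HCOPY_CONFIG = """\
SOURCEFORMAT = WAV
TARGETKIND = MFCC_0_D_A
TARGETRATE = 100000.0
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = F
"""

def extract_mfcc(wav_path, mfc_path, config_path="hcopy.cfg"):
    """Write an HCopy configuration file, then convert one WAV file to HTK MFCCs."""
    with open(config_path, "w") as f:
        f.write(HCOPY_CONFIG)
    # TARGETRATE/WINDOWSIZE are in 100 ns units: 10 ms frame shift, 25 ms window.
    subprocess.run(["HCopy", "-C", config_path, wav_path, mfc_path], check=True)

if __name__ == "__main__":
    extract_mfcc("song01.wav", "song01.mfc")  # file names are placeholders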

To obtain the initial models, we use the anchor reporters' speech in the Mandarin Chinese Broadcast News Corpus (MATBN), collected from November 2001 to December 2002, to train 151 HMMs: 112 models for Mandarin syllable initials, 38 for syllable finals, and one silence model. The 112 initial and 38 final HMMs together are called the speech model, and the combination of the speech and silence models is called the spoken voice model (SpoModel).

Given the phone sequence of the lyrics, the MFCCs of the audio signal, and the initial set of HMMs, HTK can perform the forced alignment. To make the models more robust against background music, we further apply MAP adaptation to the initial models using training songs of two types, Chinese pop songs and Chinese rap songs, which yields two adapted model sets: the pop song model (PopModel) and the rap song model (RapModel). We run forced-alignment experiments with both adapted model sets on test songs of both genres; the results show that genre has a large impact on the accuracy of automatic lyrics-to-audio alignment.
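Forced alignment itself can be carried out with HTK's HVite in alignment mode (the -a option), which time-aligns a word-level transcription against the feature files using the current HMM set and a pronunciation dictionary. The sketch below follows standard HTK usage; every file name, the boundary-word name, and the pruning threshold are placeholders or assumptions, not values from the thesis.

# Sketch of forced alignment by driving HTK's HVite (alignment mode) from Python.
# File names, the boundary word 'sil', and the pruning beam are assumptions.
import subprocess

def forced_align(scp_file, word_mlf, macros, hmmdefs, dictionary, hmm_list,
                 out_mlf="aligned.mlf"):
    """Time-align the word-level lyrics transcription against the MFCC files
    listed in scp_file, writing a phone-level label file with time codes."""
    subprocess.run([
        "HVite",
        "-a",               # alignment mode: expand the word-level MLF instead of recognising
        "-m",               # output model (phone) level boundaries as well
        "-b", "sil",        # boundary/silence word at utterance edges (name assumed)
        "-t", "250.0",      # beam pruning threshold (assumed value)
        "-H", macros,
        "-H", hmmdefs,
        "-I", word_mlf,     # word-level transcription of the lyrics
        "-S", scp_file,     # list of MFCC feature files, one song per line
        "-i", out_mlf,      # output: time-aligned master label file
        dictionary,         # pronunciation lexicon (word -> initial/final phones)
        hmm_list,           # list of HMM names (initials, finals, silence)
    ], check=True)

if __name__ == "__main__":
    forced_align("songs.scp", "lyrics_words.mlf", "macros", "hmmdefs",
                 "lexicon.dic", "phonelist.txt")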
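For reference, the MAP adaptation of a Gaussian mean, as used when adapting speech-trained HMMs with singing data, is usually written in the following standard textbook form (the abstract does not give the weighting factor or say whether variances and mixture weights are adapted as well):

\hat{\mu}_m = \frac{\tau\,\mu_m^{\mathrm{prior}} + \sum_{t=1}^{T} \gamma_m(t)\,\mathbf{o}_t}{\tau + \sum_{t=1}^{T} \gamma_m(t)}

Here \mu_m^{\mathrm{prior}} is the mean of mixture component m in the speech-trained model, \mathbf{o}_t is the MFCC observation at frame t of the adaptation songs, \gamma_m(t) is the occupation probability of component m at that frame, and \tau controls how strongly the prior mean is weighted against the adaptation data (a large \tau keeps the adapted mean close to the speech model).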

Bibliographic Details
Main Authors: Lien-Chiao Lin, 林廉喬
Other Authors: Shyh-Kang Jeng (鄭士康)
Format: Others
Language: zh-TW
Published: 2016
Online Access: http://ndltd.ncl.edu.tw/handle/87482174912882882097