Lyrics-to-audio Alignment of Chinese Pop Songs and Rap Songs

Chinese title: 中文流行歌曲與饒舌歌曲之歌詞對位

Master's thesis === National Taiwan University === Graduate Institute of Electrical Engineering === Academic year 104 === The system developed in this thesis takes two inputs: lyric text files and song WAV files. The goal is to mark the lyrics automatically with time codes so that each line can be displayed at the moment it is sung during playback. The core of the system is forced alignment based on hidden Markov models (HMMs). The HMMs are first trained on speech data and then adapted with songs, so that the adapted models are better suited to singing voice mixed with music; we adopt the maximum a posteriori (MAP) adaptation strategy.

Forced alignment requires some preprocessing of both the lyrics and the audio, as well as an initial set of HMMs. For the lyrics, we first perform word segmentation and then look up each word's phone sequence in a lexicon, which yields the phone sequence of the whole song. For the audio, we use HTK to extract Mel-frequency cepstral coefficients (MFCCs) from the WAV files.
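The abstract does not say which word segmenter or pronunciation lexicon is used, so the following is only a minimal sketch of the lyrics preprocessing step: it assumes the jieba segmenter and a tiny hand-written lexicon mapping words to Mandarin initial/final phone symbols, and it simply concatenates the looked-up phones for one lyric line.

# Minimal sketch of lyrics preprocessing: word segmentation followed by
# lexicon lookup. The segmenter (jieba) and the toy lexicon are assumptions;
# the thesis does not specify the actual tools or phone inventory.
import jieba

# Hypothetical lexicon: word -> list of Mandarin initial/final phone symbols.
LEXICON = {
    "我": ["uo"],                     # zero-initial syllable: final only
    "愛": ["ai"],
    "唱歌": ["ch", "ang", "g", "e"],
}

def lyrics_to_phones(line):
    """Segment one lyric line into words, then look up each word's phones."""
    phones = []
    for word in jieba.cut(line):
        if not word.strip():
            continue
        if word in LEXICON:
            phones.extend(LEXICON[word])
        else:
            # Fall back to character-by-character lookup for unknown words.
            for ch in word:
                phones.extend(LEXICON.get(ch, []))
    return phones

if __name__ == "__main__":
    print(lyrics_to_phones("我愛唱歌"))  # -> ['uo', 'ai', 'ch', 'ang', 'g', 'e']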
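Feature extraction is done with HTK's HCopy tool. The abstract does not list the analysis parameters, so the configuration below uses common HTKBook-style settings (12 cepstra plus C0 with deltas and accelerations, 25 ms Hamming window, 10 ms frame shift); treat the exact values as assumptions rather than the thesis's own settings.

# Sketch of MFCC extraction by driving HTK's HCopy from Python.
# Parameter values are common HTKBook-style defaults, not taken from the thesis.
import subprocess

HCOPY_CONFIG = """\
SOURCEFORMAT = WAV
TARGETKIND = MFCC_0_D_A
TARGETRATE = 100000.0
WINDOWSIZE = 250000.0
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 26
CEPLIFTER = 22
NUMCEPS = 12
ENORMALISE = F
"""

def extract_mfcc(wav_path, mfc_path, config_path="hcopy.cfg"):
    """Write an HCopy configuration file, then convert one WAV file to HTK MFCCs."""
    with open(config_path, "w") as f:
        f.write(HCOPY_CONFIG)
    # TARGETRATE/WINDOWSIZE are in 100 ns units: 10 ms frame shift, 25 ms window.
    subprocess.run(["HCopy", "-C", config_path, wav_path, mfc_path], check=True)

if __name__ == "__main__":
    extract_mfcc("song01.wav", "song01.mfc")  # file names are placeholders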

To obtain the initial models, we use the anchor reporters' speech in the Mandarin Chinese Broadcast News Corpus (MATBN), collected from November 2001 to December 2002, to train 151 HMMs: 112 models for Mandarin syllable initials, 38 for syllable finals, and one silence model. The 112 initial and 38 final HMMs together are called the speech model, and the combination of the speech and silence models is called the spoken voice model (SpoModel).

Given the phone sequence of the lyrics, the MFCCs of the audio signal, and the initial set of HMMs, HTK can perform the forced alignment. To make the models more robust against background music, we further apply MAP adaptation to the initial models using training songs of two types, Chinese pop songs and Chinese rap songs, which yields two adapted model sets: the pop song model (PopModel) and the rap song model (RapModel). We run forced-alignment experiments with both adapted model sets on test songs of both genres; the results show that genre has a large impact on the accuracy of automatic lyrics-to-audio alignment.
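Forced alignment itself can be carried out with HTK's HVite in alignment mode (the -a option), which time-aligns a word-level transcription against the feature files using the current HMM set and a pronunciation dictionary. The sketch below follows standard HTK usage; every file name, the boundary-word name, and the pruning threshold are placeholders or assumptions, not values from the thesis.

# Sketch of forced alignment by driving HTK's HVite (alignment mode) from Python.
# File names, the boundary word 'sil', and the pruning beam are assumptions.
import subprocess

def forced_align(scp_file, word_mlf, macros, hmmdefs, dictionary, hmm_list,
                 out_mlf="aligned.mlf"):
    """Time-align the word-level lyrics transcription against the MFCC files
    listed in scp_file, writing a phone-level label file with time codes."""
    subprocess.run([
        "HVite",
        "-a",               # alignment mode: expand the word-level MLF instead of recognising
        "-m",               # output model (phone) level boundaries as well
        "-b", "sil",        # boundary/silence word at utterance edges (name assumed)
        "-t", "250.0",      # beam pruning threshold (assumed value)
        "-H", macros,
        "-H", hmmdefs,
        "-I", word_mlf,     # word-level transcription of the lyrics
        "-S", scp_file,     # list of MFCC feature files, one song per line
        "-i", out_mlf,      # output: time-aligned master label file
        dictionary,         # pronunciation lexicon (word -> initial/final phones)
        hmm_list,           # list of HMM names (initials, finals, silence)
    ], check=True)

if __name__ == "__main__":
    forced_align("songs.scp", "lyrics_words.mlf", "macros", "hmmdefs",
                 "lexicon.dic", "phonelist.txt")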
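For reference, the MAP adaptation of a Gaussian mean, as used when adapting speech-trained HMMs with singing data, is usually written in the following standard textbook form (the abstract does not give the weighting factor or say whether variances and mixture weights are adapted as well):

\hat{\mu}_m = \frac{\tau\,\mu_m^{\mathrm{prior}} + \sum_{t=1}^{T} \gamma_m(t)\,\mathbf{o}_t}{\tau + \sum_{t=1}^{T} \gamma_m(t)}

Here \mu_m^{\mathrm{prior}} is the mean of mixture component m in the speech-trained model, \mathbf{o}_t is the MFCC observation at frame t of the adaptation songs, \gamma_m(t) is the occupation probability of component m at that frame, and \tau controls how strongly the prior mean is weighted against the adaptation data (a large \tau keeps the adapted mean close to the speech model).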

Bibliographic Details
Main Authors: Lien-Chiao Lin, 林廉喬
Other Authors: Shyh-Kang Jeng (鄭士康)
Format: Others
Language: zh-TW
Published: 2016
Online Access: http://ndltd.ncl.edu.tw/handle/87482174912882882097