Exploiting Machine Learning Methods for Spoken Document Retrieval

碩士 === 國立臺灣師範大學 === 資訊工程研究所 === 97 === This thesis investigates the use of machine-learning approaches, namely learning-to-rank algorithms, for information retrieval (IR), with special emphasis on their theoretical foundations and the associated features that are used by them, such as the lexical fe...

Bibliographic Details
Main Author: 游斯涵
Other Authors: Berlin Chen
Format: Others
Language: zh-TW
Published: 2009
Online Access: http://ndltd.ncl.edu.tw/handle/z2j5u7
id ndltd-TW-097NTNU5392003
record_format oai_dc
spelling ndltd-TW-097NTNU5392003 2019-05-29T03:43:27Z http://ndltd.ncl.edu.tw/handle/z2j5u7 Exploiting Machine Learning Methods for Spoken Document Retrieval 使用機器學習方法於語音文件檢索之研究 游斯涵 碩士 國立臺灣師範大學 資訊工程研究所 97 Berlin Chen 陳柏琳 2009 學位論文 ; thesis 134 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description 碩士 === 國立臺灣師範大學 === 資訊工程研究所 === 97 === This thesis investigates machine-learning approaches, namely learning-to-rank algorithms, for information retrieval (IR), with special emphasis on their theoretical foundations and on the features they use, such as lexical, proximity, and probabilistic features. We also consider the application of these approaches to spoken document retrieval (SDR). All experiments were conducted on the Topic Detection and Tracking corpora (in particular TDT-2 and TDT-3), benchmark collections widely adopted for SDR evaluations because they contain tens of hours of mainland-accented Chinese broadcast news documents with topic labels and orthographic transcripts. To discover more useful speech-related features for SDR and to analyze the problems caused by speech recognition errors, we built a large vocabulary continuous speech recognition (LVCSR) system that outputs a word lattice of multiple recognition hypotheses for each broadcast news document. We also address the problem of training machine-learning retrieval models on unbalanced training data and propose a remedy for it. Finally, preliminary experimental results suggest that the RankNet-based retrieval model outperforms the support vector machine (SVM)-based retrieval model on the SDR task studied in this thesis.
author2 Berlin Chen
author_facet Berlin Chen
游斯涵
author 游斯涵
spellingShingle 游斯涵
Exploiting Machine Learning Methods for Spoken Document Retrieval
author_sort 游斯涵
title Exploiting Machine Learning Methods for Spoken Document Retrieval
publishDate 2009
url http://ndltd.ncl.edu.tw/handle/z2j5u7
work_keys_str_mv AT yóusīhán exploitingmachinelearningmethodsforspokendocumentretrieval
AT yóusīhán shǐyòngjīqìxuéxífāngfǎyúyǔyīnwénjiànjiǎnsuǒzhīyánjiū
_version_ 1719193681700847616