Summarizing Spoken Documents Through Utterance Selection


Bibliographic Details
Main Author: Zhu, Xiaodan
Other Authors: Penn, Gerald
Language: English (Canada)
Published: 2010
Online Access: http://hdl.handle.net/1807/24924
Description
Summary: The inherently linear and sequential nature of speech raises the need for better ways to navigate through spoken documents. The navigation strategy I focus on in this thesis is summarization, which aims to identify important excerpts in spoken documents. A basic characteristic that distinguishes speech summarization from traditional text summarization is the availability and utilization of speech-related features. Most previous research, however, has addressed this source from the perspective of descriptive linguistics, considering only those prosodic features that appear in that literature. The experiments in this dissertation suggest that incorporating prosody does help, but its usefulness is very limited, much less than some previous research has suggested. We reassess the role of prosodic features versus features arising from speech recognition transcripts, as well as baseline selection in error-prone and disfluency-filled spontaneous speech. These problems interact with each other, and isolated observations have hampered a comprehensive understanding to date. The effectiveness of these prosodic features is largely limited by their difficulty in predicting content relevance and redundancy. Nevertheless, untranscribed audio contains more information than just prosody. This dissertation shows that collecting statistics from far more complex acoustic patterns allows state-of-the-art summarization models to be estimated directly. To this end, we propose an acoustics-based summarization model that is estimated directly on acoustic patterns. We empirically determine the extent to which this acoustics-based model can effectively replace ASR-based models. The extent to which written sources can benefit speech summarization has also been limited, namely to noisy speech recognition transcripts. Predicting the salience of utterances can indeed benefit from more sources than raw audio alone.
Since speaking and writing are two basic modes of communication and are by nature closely related, speech is, in many situations, accompanied by relevant written text. The richer semantics conveyed in such written text provide additional information beyond the speech itself. This thesis utilizes this information in content selection to help identify salient utterances in the corresponding spoken documents. We also employ this richer content to find the structure of spoken documents, i.e., subtopic boundaries, which may in turn help summarization.
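The utterance-selection framing above can be illustrated with a minimal sketch: score each utterance with a weighted combination of lexical and prosodic features, keep the top-scoring ones, and emit them in their original temporal order. The feature names (`tfidf_mean`, `pitch_range`, `energy_mean`) and the weights are illustrative assumptions for this sketch, not the features or model estimated in the thesis.

```python
# Sketch of extractive speech summarization by utterance selection.
# Feature names and weights below are hypothetical, chosen only to
# illustrate the selection mechanism described in the abstract.

def score_utterance(features, weights):
    """Linear salience score over utterance-level features."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())

def select_utterances(utterances, weights, k=2):
    """Rank utterances by salience, keep the top k, restore speech order."""
    ranked = sorted(range(len(utterances)),
                    key=lambda i: score_utterance(utterances[i]["features"], weights),
                    reverse=True)
    chosen = sorted(ranked[:k])  # re-sort indices to preserve temporal order
    return [utterances[i]["text"] for i in chosen]

# Hypothetical feature weights: one lexical cue and two prosodic cues.
weights = {"tfidf_mean": 1.0, "pitch_range": 0.3, "energy_mean": 0.3}

utts = [
    {"text": "um, okay, so",
     "features": {"tfidf_mean": 0.1, "pitch_range": 0.2, "energy_mean": 0.3}},
    {"text": "the model is trained directly on acoustic patterns",
     "features": {"tfidf_mean": 0.9, "pitch_range": 0.6, "energy_mean": 0.7}},
    {"text": "we evaluate against ASR-transcript baselines",
     "features": {"tfidf_mean": 0.8, "pitch_range": 0.5, "energy_mean": 0.6}},
]

summary = select_utterances(utts, weights, k=2)
```

A thesis-scale system would of course replace the hand-set weights with a trained model and add redundancy handling; the sketch only shows the selection step itself.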