Summary: | The inherently linear and sequential property of speech raises the need for
ways to better navigate through spoken documents. The navigation strategy this
thesis focuses on is summarization, which aims to identify important excerpts
in spoken documents.
A basic characteristic that distinguishes speech summarization from traditional
text summarization is the availability and utilization of speech-related features.
Most previous research, however, has approached this source of information from the
perspective of descriptive linguistics, considering only those prosodic features that
appear in that literature. The experiments in this dissertation suggest that incorporating
prosody does help, but its usefulness is very limited, much less than has been suggested in
some previous research. We reassess the role of prosodic features vs. features arising
from speech recognition transcripts, as well as baseline selection in error-prone
and disfluency-filled spontaneous speech. These problems interact with each other,
and isolated observations have hampered a comprehensive understanding to date.
The effectiveness of these prosodic features is limited largely because they are
poor predictors of content relevance and redundancy. Nevertheless, untranscribed
audio does contain more information than just prosody. This dissertation
shows that collecting statistics from far more complex acoustic patterns does allow
for estimating state-of-the-art summarization models directly. To this end, we propose
an acoustics-based summarization model that is estimated directly on acoustic
patterns. We empirically determine the extent to which this acoustics-based model
can effectively replace ASR-based models.
The written sources used to benefit speech summarization have
also been limited, namely to noisy speech recognition transcripts. Predicting the
salience of utterances can indeed benefit from more sources than raw audio alone.
Since speaking and writing are two basic modes of communication and are by nature
closely related to each other, speech is in many situations accompanied by relevant
written text. The richer semantics conveyed in this written text provide
information beyond what the speech itself offers. This thesis utilizes such information
in content selection to help identify salient utterances in the corresponding speech
documents. We also employ such richer content to find the structure of spoken
documents—i.e., subtopic boundaries—which may in turn help summarization.
|