Text Segmentation: Methodology and Application

Bibliographic Details
Main Authors: Wu, Ji-Wei, 吳智瑋
Other Authors: Tsai, Wen-Nung
Format: Others
Language: en_US
Published: 2013
Online Access: http://ndltd.ncl.edu.tw/handle/05635453496814253110
Description
Summary: Doctoral === National Chiao Tung University === Institute of Computer Science and Engineering === 102 === The task of text segmentation is to divide a long text into several shorter segments, each of which covers a single topic. Text segmentation has been shown to benefit several natural language processing tasks, such as information retrieval and text summarization. Many algorithms have been proposed to improve segmentation performance. However, existing approaches often suffer from either low segmentation accuracy or high computational complexity. Parameter setting is also a critical problem in some algorithms. Parameters can be assigned manually, but this increases the user’s burden, and the values provided may not always reflect the real metadata of a text. To tackle these problems, this dissertation proposes three novel text segmentation algorithms. First, a text segmentation algorithm based on Discrete Particle Swarm Optimization (DPSO), called DPSOTS, is proposed. DPSOTS finds topical segments by using global information, a global measurement, and a global optimization algorithm (DPSO), which together improve segmentation accuracy and reduce computational complexity. Second, an efficient text segmentation algorithm based on Hierarchical Agglomerative Clustering, called TSHAC, is proposed. TSHAC requires neither parameter setting nor user involvement. Finally, a hybrid algorithm, TSHAC-DPSO, is proposed. Like TSHAC, TSHAC-DPSO requires no parameter setting. Moreover, TSHAC-DPSO combines the merits of both algorithms, which not only improves the accuracy of text segmentation but also makes execution more efficient and flexible. As examples, two applications of text segmentation, in knowledge management and in e-learning, are also introduced. The dissertation demonstrates that text segmentation can be applied successfully in both.
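
To make the task concrete, the sketch below shows one generic way to segment text by agglomerative clustering: start with one segment per sentence and repeatedly merge the two most similar adjacent segments. It is not the dissertation's TSHAC, DPSOTS, or TSHAC-DPSO, whose details this record does not give. The function names `cosine` and `segment`, the bag-of-words cosine similarity, and the `num_segments` stopping parameter are all assumptions made for illustration. In particular, TSHAC is described as parameter-free, whereas this sketch needs a target segment count.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(sentences, num_segments):
    """Merge the most similar adjacent segments until num_segments remain.

    Hypothetical helper: only neighbouring segments may merge, so every
    resulting segment is a contiguous run of the input sentences.
    """
    segments = [[s] for s in sentences]
    bags = [Counter(s.lower().split()) for s in sentences]
    while len(segments) > max(1, num_segments):
        # Similarity of each segment to its right-hand neighbour.
        sims = [cosine(bags[i], bags[i + 1]) for i in range(len(bags) - 1)]
        i = max(range(len(sims)), key=sims.__getitem__)
        # Replace the pair (i, i + 1) with their union.
        segments[i:i + 2] = [segments[i] + segments[i + 1]]
        bags[i:i + 2] = [bags[i] + bags[i + 1]]
    return segments


sentences = [
    "cats chase mice",
    "cats eat mice",
    "stocks fell today",
    "stocks rose today",
]
for seg in segment(sentences, 2):
    print(seg)
```

With the four toy sentences above, the two sentences about cats end up in one segment and the two about stocks in the other. Real systems would use better text representations and a principled stopping rule in place of a fixed segment count.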