Incorporating top-level and bottom-level information for Chinese word dependency analysis

Bibliographic Details
Main Authors: Yu-Chieh Wu, 吳毓傑
Other Authors: Yue-Shi Lee
Format: Others
Language: en_US
Published: 2007
Online Access: http://ndltd.ncl.edu.tw/handle/96981739977222106486
Description
Summary: Doctoral === National Central University === Institute of Computer Science and Information Engineering === 96 === This thesis proposes a unified Chinese dependency parsing framework that incorporates word segmentation and POS tagging. We first discuss the issues of Chinese word segmentation and part-of-speech tagging. We then recast Chinese POS tagging as a sequential chunk-labeling problem and treat it with conventional sequential chunk-labeling techniques. Several classification algorithms are investigated for training the sequential labeler. The best-observed method, CRF, is more accurate but much slower than the other approaches, which makes POS tagging intractable. To circumvent this, we propose a two-pass sequential chunk-labeling model that combines CRF with SVM. The experimental results show that the two-pass learner outperforms the single-pass methods (96.2 vs. 95.9). On the well-known SIGHAN benchmark corpus, our method also shows very competitive performance. By means of the two-pass Chinese POS tagging, words and their part-of-speech labels can be automatically segmented and tagged, and we then feed the auto-segmented words to the dependency parser. To enhance performance, our parser integrates both top-down and bottom-up syntactic information. We also compare it with current state-of-the-art dependency parsers; the experimental results show that our method is not only more accurate but also requires far less training and testing time than the other approaches. In addition, an approximate K-best reranking method is designed to improve both the overall dependency parse and the word segmentation results. Its advantage is that the modules can be trained independently while the global parse is still taken into account through K-best selection.
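The two-pass idea described in the abstract can be sketched as follows: a fast first-pass tagger produces provisional labels, which the second pass consumes as additional context features. This is only a minimal illustration of the pipeline structure; the classifiers here are simple frequency models standing in for the thesis's actual CRF and SVM components, and all names and the toy training data are hypothetical.

```python
# Hedged sketch of a two-pass sequential labeler. The stand-in models
# (unigram counts, token + previous-draft-label counts) are NOT the
# thesis's CRF/SVM learners; they only show how pass 1 feeds pass 2.
from collections import Counter, defaultdict


class UnigramTagger:
    """First pass: tag each token with its most frequent training label."""

    def __init__(self):
        self.counts = defaultdict(Counter)
        self.default = None

    def fit(self, sents):
        for sent in sents:
            for tok, tag in sent:
                self.counts[tok][tag] += 1
        # Fall back to the globally most frequent tag for unseen tokens.
        all_tags = Counter(t for s in sents for _, t in s)
        self.default = all_tags.most_common(1)[0][0]
        return self

    def tag(self, toks):
        return [
            self.counts[t].most_common(1)[0][0] if self.counts[t] else self.default
            for t in toks
        ]


class SecondPassTagger:
    """Second pass: condition on (token, previous first-pass label)."""

    def __init__(self, first_pass):
        self.first = first_pass
        self.counts = defaultdict(Counter)

    def fit(self, sents):
        for sent in sents:
            toks = [t for t, _ in sent]
            draft = self.first.tag(toks)  # first-pass guesses become features
            prev = ["<s>"] + draft[:-1]
            for (tok, gold), p in zip(sent, prev):
                self.counts[(tok, p)][gold] += 1
        return self

    def tag(self, toks):
        draft = self.first.tag(toks)
        prev = ["<s>"] + draft[:-1]
        out = []
        for tok, p in zip(toks, prev):
            c = self.counts[(tok, p)]
            # Keep the first-pass label when the refined context is unseen.
            out.append(c.most_common(1)[0][0] if c else draft[len(out)])
        return out


# Toy pre-segmented training sentences (hypothetical tags).
train = [
    [("我", "PN"), ("愛", "V"), ("你", "PN")],
    [("他", "PN"), ("愛", "V"), ("書", "N")],
]
tagger = SecondPassTagger(UnigramTagger().fit(train)).fit(train)
print(tagger.tag(["我", "愛", "書"]))  # → ['PN', 'V', 'N']
```

In the thesis's actual setting, the first pass would be a fast SVM-style labeler and the second a CRF (or vice versa), with the same principle: the slower, more accurate model only refines a cheap draft labeling rather than tagging from scratch.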