A Study of Chinese Linguistic Features Based on Part-of-Speech Combinations (以詞性組合為基礎之中文語言特徵研究)

Bibliographic Details
Main Authors: 江易倫, Jiang, Yi Lun
Language: Chinese
Published: National Chengchi University (國立政治大學)
Online Access: http://thesis.lib.nccu.edu.tw/cgi-bin/cdrfb3/gsweb.cgi?o=dstdcdr&i=sid=%22G0104753018%22.
Description
Summary: In authorship attribution research, the choice of linguistic features has always been an important step, because it directly affects prediction performance. Most commonly used features, such as high-frequency words, n-grams, and punctuation, perform well in classification, but the phrases within these features cannot explain the causal relationships or the differences between classes. To address this problem, this thesis proposes three linguistically meaningful feature types as auxiliary verification: part-of-speech combinations, negation-degree combinations, and modal-word combinations. Using texts by the author Lei Chen (雷震) as the baseline, it examines whether these features are applicable in two research settings: "same topic, different authors" and "same author, different topics". The thesis uses the random forest algorithm to build the classification models, evaluates their classification performance with the out-of-bag (OOB) error rate, and uses feature importance values to determine the weight of each phrase as a decision point. Finally, it aims to identify, from the classification rules, the distinctive phrases that characterize different authors and different genres, and to interpret them.
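
The summary does not define how a part-of-speech combination is computed, so the following is only a minimal sketch under one assumption: that a combination is a run of adjacent POS tags, counted as a relative frequency over a document. The function names, the tag labels (PRON, ADV, AUX, VERB), and the two sample sentences are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical input format: a sentence is a list of (token, POS tag) pairs.
# The tag names used below are illustrative; the thesis's tag set is not
# given in this record.
TaggedSentence = List[Tuple[str, str]]


def pos_combinations(sentence: TaggedSentence, n: int = 2) -> List[Tuple[str, ...]]:
    """Return every run of n adjacent POS tags in the sentence."""
    tags = [tag for _, tag in sentence]
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def pos_combination_profile(
    sentences: Iterable[TaggedSentence], n: int = 2
) -> Dict[Tuple[str, ...], float]:
    """Relative frequency of each POS combination across a document."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(pos_combinations(sentence, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {combo: count / total for combo, count in counts.items()}


# Two made-up, pre-tagged sentences standing in for one document.
document = [
    [("我", "PRON"), ("不", "ADV"), ("能", "AUX"), ("同意", "VERB")],
    [("他", "PRON"), ("不", "ADV"), ("會", "AUX"), ("來", "VERB")],
]
profile = pos_combination_profile(document)
```

A profile like this, computed per document, could serve as one row of the feature matrix passed to a random forest classifier. Because each feature is a named tag sequence, a high importance score points back to a pattern that can be interpreted linguistically, which is the motivation the summary gives for these features.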