An enhanced writer language model for Chinese historical corpora
Master's thesis === National Chengchi University (國立政治大學) === Department of Computer Science (資訊科學學系) === 105 === In recent years, digital archiving has grown steadily, and more and more valuable Chinese historical corpora are being selected for preservation. Preserved corpora often have lost or missing author information...
Main Authors: | Liang, Shao Zhong 梁韶中 |
---|---|
Other Authors: | Tsai, Ming Feng 蔡銘峰 |
Format: | Others |
Language: | zh-TW |
Online Access: | http://ndltd.ncl.edu.tw/handle/c47ph3 |
id |
ndltd-TW-105NCCU5394032 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-105NCCU5394032 2019-05-15T23:39:15Z http://ndltd.ncl.edu.tw/handle/c47ph3 An enhanced writer language model for Chinese historical corpora 適用於中文史料文本之作者語言模型分析方法研究 Liang, Shao Zhong 梁韶中 Master's thesis, National Chengchi University (國立政治大學), Department of Computer Science (資訊科學學系), 105 Tsai, Ming Feng 蔡銘峰 學位論文 ; thesis 35 zh-TW |
collection |
NDLTD |
language |
zh-TW |
format |
Others |
sources |
NDLTD |
description |
Master's thesis === National Chengchi University (國立政治大學) === Department of Computer Science (資訊科學學系) === 105 === In recent years, digital archiving has grown steadily, and more and more valuable Chinese historical corpora are being selected for preservation. Preserved corpora often have lost or missing author information, which compromises the integrity of the corpora. Authorship of Chinese historical texts is analyzed mainly by building language models: a separate language model is trained for each candidate author, and a smoothing method is applied to avoid zero word probabilities and the calculation errors they cause. This thesis mainly adopts Interpolated Modified Kneser-Ney smoothing, which takes into account the frequencies of both higher-order and lower-order n-grams and has become a very popular choice for building general-purpose language models.
Merging all of a candidate author's articles into a single language model ignores many features. In addition to the text of the historical corpora, this thesis therefore integrates metadata into the analysis, including statistics on the manually annotated subject classifications, so that the constructed language model better fits the text under test and prediction accuracy improves. Custom words are also added so that proper nouns in the texts are matched correctly. Furthermore, a weight for long words is added to the general language model in order to examine how word length relates to prediction accuracy. Finally, recurrent neural network language models are also used to predict authors and are compared with the traditional language model analysis.
|
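The approach summarized in the description (one smoothed language model per candidate author, with the text assigned to the author whose model scores it highest) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the thesis's implementation: it uses character bigrams and single-discount interpolated Kneser-Ney rather than the three-discount modified variant named in the abstract, and the class `KneserNeyBigram`, the function `attribute`, the parameters `discount`, `vocab_size`, and `floor`, the author labels, and the toy strings are all hypothetical. The record does not state the n-gram order or tokenization actually used.

```python
import math
from collections import Counter, defaultdict


class KneserNeyBigram:
    """Character bigram model with single-discount interpolated Kneser-Ney.

    Assumption: a small uniform floor over an assumed vocabulary size keeps
    characters never seen in training from receiving zero probability.
    """

    def __init__(self, text, discount=0.75, vocab_size=10000, floor=1e-3):
        self.d = discount
        self.vocab_size = vocab_size
        self.floor = floor
        chars = ["<s>"] + list(text)
        self.bigrams = Counter(zip(chars, chars[1:]))
        self.context_count = Counter()      # c(v): how often v occurs as a context
        self.followers = defaultdict(set)   # distinct characters seen after v
        preceders = defaultdict(set)        # distinct characters seen before w
        for (v, w), c in self.bigrams.items():
            self.context_count[v] += c
            self.followers[v].add(w)
            preceders[w].add(v)
        self.n_preceders = {w: len(s) for w, s in preceders.items()}
        self.n_types = len(self.bigrams)    # number of distinct bigram types

    def continuation(self, w):
        """Lower-order (continuation) probability of w, mixed with the floor."""
        p = self.n_preceders.get(w, 0) / self.n_types if self.n_types else 0.0
        return (1 - self.floor) * p + self.floor / self.vocab_size

    def prob(self, v, w):
        """P(w | v): discounted bigram estimate plus weighted continuation."""
        c_v = self.context_count.get(v, 0)
        if c_v == 0:
            # Unseen context: fall back entirely to the lower-order estimate.
            return self.continuation(w)
        discounted = max(self.bigrams.get((v, w), 0) - self.d, 0) / c_v
        backoff_weight = self.d * len(self.followers[v]) / c_v
        return discounted + backoff_weight * self.continuation(w)

    def log_prob(self, text):
        chars = ["<s>"] + list(text)
        return sum(math.log(self.prob(v, w)) for v, w in zip(chars, chars[1:]))


def attribute(text, models):
    """Return the best-scoring author and the per-character log-probabilities."""
    scores = {a: m.log_prob(text) / max(len(text), 1) for a, m in models.items()}
    return max(scores, key=scores.get), scores


# Toy corpora for two hypothetical authors (not data from the thesis).
models = {
    "author_a": KneserNeyBigram("春風春雨春風春雨"),
    "author_b": KneserNeyBigram("秋月秋霜秋月秋霜"),
}
best_author, scores = attribute("春風", models)
print(best_author, scores)
```

Scores are normalized by text length so that passages of different lengths remain comparable. A production setup would more likely rely on an existing toolkit that implements modified Kneser-Ney than on a hand-written model like this one.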
author2 |
Tsai, Ming Feng |
author_facet |
Tsai, Ming Feng Liang, Shao Zhong 梁韶中 |
author |
Liang, Shao Zhong 梁韶中 |
title |
An enhanced writer language model for Chinese historical corpora |
url |
http://ndltd.ncl.edu.tw/handle/c47ph3 |