An enhanced writer language model for Chinese historical corpora

Master's thesis === National Chengchi University (國立政治大學) === Department of Computer Science (資訊科學學系) === 105 === In recent years, digital archiving has grown steadily, and more and more valuable Chinese historical corpora have been selected for preservation. Preserving these corpora often means facing lost or missing author information...


Bibliographic Details
Main Authors: Liang, Shao Zhong, 梁韶中
Other Authors: Tsai, Ming Feng
Format: Others
Language: zh-TW
Online Access: http://ndltd.ncl.edu.tw/handle/c47ph3
id ndltd-TW-105NCCU5394032
record_format oai_dc
spelling ndltd-TW-105NCCU5394032 2019-05-15T23:39:15Z http://ndltd.ncl.edu.tw/handle/c47ph3
An enhanced writer language model for Chinese historical corpora (適用於中文史料文本之作者語言模型分析方法研究)
Liang, Shao Zhong (梁韶中). Master's thesis, National Chengchi University (國立政治大學), Department of Computer Science (資訊科學學系), 105.
In recent years, digital archiving has grown steadily, and more and more valuable Chinese historical corpora have been selected for preservation. Preserved corpora often have lost or missing author information, which compromises their integrity. Authorship analysis of Chinese historical texts is mainly carried out by building language models: a separate language model is trained for each candidate author, and a smoothing method is applied so that unseen words do not receive zero probability and cause calculation errors. This thesis mainly adopts Interpolated Modified Kneser-Ney smoothing, which takes into account the frequencies of both higher-order and lower-order n-grams and has become a popular general-purpose choice for building language models. Merging all articles by a candidate author into a single language model ignores many features. In addition to the text of the historical corpora, this thesis therefore integrates metadata into the analysis, including statistics on the manually annotated subject classifications, so that the constructed language model better fits the text under test and prediction accuracy improves. Custom words are also added to the lexicon so that proper nouns are matched. Furthermore, a weight for long words is added on top of the general language model to examine how word length relates to prediction accuracy. Finally, recurrent neural network (RNN) language models are also used to predict authors, and their results are compared with those of the traditional language model analysis.
Tsai, Ming Feng (蔡銘峰). Thesis (學位論文). 35. zh-TW
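The per-author language model approach described in the abstract can be illustrated with a small sketch. This is not the thesis's implementation: it uses a character-bigram model with plain interpolated Kneser-Ney smoothing (a single fixed discount) rather than the Modified variant, and it leaves out the metadata, custom-word, and long-word weighting extensions. The class `KNBigramLM`, the helpers `train_author_models` and `attribute`, the discount of 0.75, and the toy corpus with its `author_a` and `author_b` labels are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

BOS, EOS, UNK = "<s>", "</s>", "<unk>"


class KNBigramLM:
    """Character-bigram LM with interpolated Kneser-Ney smoothing (one discount)."""

    def __init__(self, texts, vocab, discount=0.75):
        self.d = discount
        self.vocab = vocab                 # shared vocabulary, includes EOS and UNK
        self.bigram = Counter()            # c(u, w)
        self.context = Counter()           # c(u, *)
        followers = defaultdict(set)       # distinct w seen after u
        preceders = defaultdict(set)       # distinct u seen before w
        for text in texts:
            seq = [BOS] + list(text) + [EOS]
            for u, w in zip(seq, seq[1:]):
                self.bigram[(u, w)] += 1
                self.context[u] += 1
                followers[u].add(w)
                preceders[w].add(u)
        self.n_followers = {u: len(ws) for u, ws in followers.items()}
        self.n_preceders = {w: len(us) for w, us in preceders.items()}
        self.n_bigram_types = len(self.bigram)

    def p_cont(self, w):
        """Continuation probability, interpolated with a uniform distribution
        over the vocabulary so that unseen characters keep non-zero mass."""
        t = self.n_bigram_types
        discounted = max(self.n_preceders.get(w, 0) - self.d, 0.0) / t
        backoff = self.d * len(self.n_preceders) / t
        return discounted + backoff / len(self.vocab)

    def prob(self, w, u):
        """P(w | u): discounted bigram estimate plus weighted continuation term."""
        c_u = self.context.get(u, 0)
        if c_u == 0:                       # unseen context: lower-order model only
            return self.p_cont(w)
        discounted = max(self.bigram.get((u, w), 0) - self.d, 0.0) / c_u
        backoff = self.d * self.n_followers[u] / c_u
        return discounted + backoff * self.p_cont(w)

    def log_prob(self, text):
        seq = [BOS] + [c if c in self.vocab else UNK for c in text] + [EOS]
        return sum(math.log(self.prob(w, u)) for u, w in zip(seq, seq[1:]))


def train_author_models(corpus, discount=0.75):
    """Train one model per candidate author over a shared vocabulary."""
    vocab = {c for texts in corpus.values() for t in texts for c in t} | {EOS, UNK}
    return {a: KNBigramLM(texts, vocab, discount) for a, texts in corpus.items()}


def attribute(text, models):
    """Return the author whose model assigns the text the highest log-probability."""
    return max(models, key=lambda a: models[a].log_prob(text))


# Toy data only; the author labels are hypothetical placeholders.
toy_corpus = {
    "author_a": ["天地玄黃宇宙洪荒", "日月盈昃辰宿列張"],
    "author_b": ["人之初性本善", "性相近習相遠"],
}
models = train_author_models(toy_corpus)
print(attribute("天地洪荒", models))  # author_a
print(attribute("性本善", models))    # author_b
```

Working at the character level avoids the need for a Chinese word segmenter in the sketch. A word-level model, such as the one the abstract implies with its custom-word lexicon, would tokenize the text first and otherwise follow the same scoring scheme.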
collection NDLTD
language zh-TW
format Others
sources NDLTD
author2 Tsai, Ming Feng
author_facet Tsai, Ming Feng
Liang, Shao Zhong
梁韶中
author Liang, Shao Zhong
梁韶中
title An enhanced writer language model for Chinese historical corpora
url http://ndltd.ncl.edu.tw/handle/c47ph3