CWSXLNet: A Sentiment Analysis Model Based on Chinese Word Segmentation Information Enhancement

This paper proposes a method for improving the XLNet model to address the shortcomings of its segmentation algorithm when processing Chinese, such as long sub-word lengths, long word lists and incomplete word list coverage. To address these issues, we propose the CWSXLNet (Chinese Word Segmenta...

Bibliographic Details
Published in: Applied Sciences
Main Authors: Shiqian Guo, Yansun Huang, Baohua Huang, Linda Yang, Cong Zhou
Format: Article
Language: English
Published: MDPI AG 2023-03-01
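Subjects: sentiment analysis; Chinese word segmentation; XLNet; attention mask; machine learning; natural language processing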
Online Access: https://www.mdpi.com/2076-3417/13/6/4056
_version_ 1850093280337854464
author Shiqian Guo
Yansun Huang
Baohua Huang
Linda Yang
Cong Zhou
collection DOAJ
container_title Applied Sciences
description This paper proposes a method for improving the XLNet model to address the shortcomings of its segmentation algorithm when processing Chinese, such as long sub-word lengths, long word lists and incomplete word list coverage. To address these issues, we propose the CWSXLNet (Chinese Word Segmentation XLNet) model, which is enhanced with Chinese word segmentation information. The model first pre-processes the Chinese pretraining text with a Chinese word segmentation tool, and then introduces a Chinese word segmentation attention mask mechanism that combines the PLM (Permuted Language Model) objective with the two-stream self-attention mechanism of XLNet. While performing natural language processing at word granularity, this mechanism reduces the degree of masking between a masked and a non-masked token when the two tokens belong to the same word. For the Chinese sentiment analysis task, we propose the CWSXLNet-BiGRU-Attention model, which adds a bi-directional GRU and a self-attention mechanism in the downstream task. Experiments on the ChnSentiCorp dataset show that CWSXLNet achieves 89.91% precision, 91.53% recall and a 90.71% F1-score, and that CWSXLNet-BiGRU-Attention achieves 92.61% precision, 93.19% recall and a 92.90% F1-score, which indicates that CWSXLNet performs better than other models in Chinese sentiment analysis.
format Article
id doaj-art-1a1d42e2eee94fddba2ba3955e45e17e
institution Directory of Open Access Journals
issn 2076-3417
language English
publishDate 2023-03-01
publisher MDPI AG
record_format Article
spelling doaj-art-1a1d42e2eee94fddba2ba3955e45e17e
2025-08-20T00:08:02Z
eng
MDPI AG
Applied Sciences
2076-3417
2023-03-01
Volume 13, Issue 6, Article 4056
10.3390/app13064056
CWSXLNet: A Sentiment Analysis Model Based on Chinese Word Segmentation Information Enhancement
Shiqian Guo (School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China)
Yansun Huang (Auditing Bureau of Xixiangtang, Nanning 530001, China)
Baohua Huang (School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China)
Linda Yang (School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China)
Cong Zhou (School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China)
title CWSXLNet: A Sentiment Analysis Model Based on Chinese Word Segmentation Information Enhancement
topic sentiment analysis
Chinese word segmentation
XLNet
attention mask
machine learning
natural language processing
url https://www.mdpi.com/2076-3417/13/6/4056
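
The description field above reports precision, recall and F1-score for two models on ChnSentiCorp. As a small consistency check, the sketch below recomputes F1 as the harmonic mean of the reported precision and recall. It assumes the standard definition F1 = 2PR / (P + R); the record does not state how the paper computed or averaged its scores. The function name `f1_score` and the `reported` table are illustrative names, not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) as quoted in the description field.
reported = {
    "CWSXLNet": (89.91, 91.53, 90.71),
    "CWSXLNet-BiGRU-Attention": (92.61, 93.19, 92.90),
}

for model, (p, r, f1) in reported.items():
    computed = f1_score(p, r)
    print(f"{model}: computed F1 = {computed:.2f}, reported F1 = {f1:.2f}")
```

Under that assumption, both recomputed values agree with the reported F1-scores to two decimal places.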