A Study of the Effects of Stemming Strategies on Arabic Document Classification

Stemming is an effective technique that has been adopted in many applications, such as machine learning, machine translation, document classification (DC), information retrieval, and natural language processing. The stemming technique is meant to be applied during the classification...

Bibliographic Details
Main Authors: Yousif A. Alhaj, Jianwen Xiang, Dongdong Zhao, Mohammed A. A. Al-Qaness, Mohamed Abd Elaziz, Abdelghani Dahou
Format: Article
Language: English
Published: IEEE 2019-01-01
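Series: IEEE Access
Subjects: Arabic text classification; text preprocessing; stemming techniques; feature extraction; feature selection
Online Access: https://ieeexplore.ieee.org/document/8664087/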
id doaj-8f5b5fcc48d7472c85b88b503f66e019
record_format Article
spelling doaj-8f5b5fcc48d7472c85b88b503f66e019
Record timestamp: 2021-03-29T22:50:45Z
Language: eng
Publisher: IEEE
Series: IEEE Access
ISSN: 2169-3536
Published: 2019-01-01
Volume 7, pages 32664-32671
DOI: 10.1109/ACCESS.2019.2903331
Document number: 8664087
Title: A Study of the Effects of Stemming Strategies on Arabic Document Classification
Authors and affiliations:
Yousif A. Alhaj (https://orcid.org/0000-0002-8770-1584), Hubei Key Laboratory of Transportation of Internet of Things, School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
Jianwen Xiang, Hubei Key Laboratory of Transportation of Internet of Things, School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
Dongdong Zhao (https://orcid.org/0000-0002-4697-6901), Hubei Key Laboratory of Transportation of Internet of Things, School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
Mohammed A. A. Al-Qaness (https://orcid.org/0000-0002-6956-7641), School of Computer Science, Wuhan University, Wuhan, China
Mohamed Abd Elaziz (https://orcid.org/0000-0002-7682-6269), Department of Mathematics, Faculty of Science, Zagazig University, Zagazig, Egypt
Abdelghani Dahou, Hubei Key Laboratory of Transportation of Internet of Things, School of Computer Science and Technology, Wuhan University of Technology, Wuhan, China
URL: https://ieeexplore.ieee.org/document/8664087/
Keywords: Arabic text classification; text preprocessing; stemming techniques; feature extraction; feature selection
collection DOAJ
language English
format Article
sources DOAJ
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2019-01-01
description Stemming is an effective technique that has been adopted in many applications, such as machine learning, machine translation, document classification (DC), information retrieval, and natural language processing. Applied during document classification, stemming reduces the high dimensionality of the feature space, which in turn improves the performance of the classification system, particularly for highly inflected languages such as Arabic. This paper studies the impact of three stemmers, namely Information Science Research Institute (ISRI), Tashaphyne, and ARLStem, on Arabic DC. Three classification algorithms are used: naïve Bayes (NB), support vector machine (SVM), and K-nearest neighbors (KNN). In addition, chi-square feature selection is used to select the most relevant features. Experiments are conducted on the CNN Arabic corpus, which was collected from Arabic websites, to assess the performance of the classification system. The classifiers are evaluated using K-fold cross-validation and Micro-F1. The findings indicate that ARLStem outperforms the ISRI and Tashaphyne stemmers, and that SVM is more effective than the KNN and NB classifiers, achieving a Micro-F1 value of 94.64% with the ARLStem stemmer.
topic Arabic text classification
text preprocessing
stemming techniques
feature extraction
feature selection
url https://ieeexplore.ieee.org/document/8664087/
_version_ 1724190687262408704
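
The abstract names chi-square feature selection and Micro-F1 but this record gives no formulas or code. The following is a minimal sketch of the textbook forms of both, not the authors' implementation: the 2x2 contingency chi-square between a term's presence and a class, and Micro-F1 pooled over all classes. The function names, the toy English tokens (stand-ins for stemmed Arabic tokens), and the labels are all hypothetical.

```python
def chi_square(docs, labels, term, cls):
    """Textbook 2x2 chi-square between a term's presence and one class.

    docs   : list of token sets (assumed to be already stemmed)
    labels : list of class labels, one per document
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cls = label == cls
        if has_term and in_cls:
            a += 1          # term present, document in class
        elif has_term:
            b += 1          # term present, document in another class
        elif in_cls:
            c += 1          # term absent, document in class
        else:
            d += 1          # term absent, document in another class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def micro_f1(y_true, y_pred):
    """Micro-F1: pool TP, FP and FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for cls in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == cls and t == cls:
                tp += 1
            elif p == cls:
                fp += 1
            elif t == cls:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: four tiny "documents" in two classes.
docs = [{"match", "goal"}, {"goal", "team"}, {"market", "stock"}, {"stock", "bank"}]
labels = ["sport", "sport", "business", "business"]

# Rank terms for the "sport" class by chi-square score, highest first.
vocab = sorted(set().union(*docs))
ranked = sorted(vocab, key=lambda t: chi_square(docs, labels, t, "sport"), reverse=True)

y_true = ["sport", "sport", "business", "business"]
y_pred = ["sport", "business", "business", "business"]
score = micro_f1(y_true, y_pred)
```

In a single-label setting such as this one, Micro-F1 equals plain accuracy, because every misclassified document counts once as a false positive and once as a false negative. In a full pipeline, the top-ranked terms would be kept as features before training a classifier on each cross-validation fold.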