An Analysis of Sindhi Annotated Corpus using Supervised Machine Learning Methods

A linguistic corpus of the Sindhi language is significant for computational linguistics, machine learning, language feature identification and analysis, semantic and sentiment analysis, information retrieval, and so on. Little computational linguistics work has been done on Sindhi text...


Bibliographic Details
Main Authors:	Mazhar Ali, Asim Imdad Wagan
Format:	Article
Language:	English
Published:	Mehran University of Engineering and Technology 2019-01-01
Series:	Mehran University Research Journal of Engineering and Technology
Online Access:	http://publications.muet.edu.pk/index.php/muetrj/article/view/754

id	doaj-6247e74f4cf247ce86ffba8fbb8738f6
record_format	Article
spelling	doaj-6247e74f4cf247ce86ffba8fbb8738f6; 2020-11-24T21:22:38Z; eng; Mehran University of Engineering and Technology; Mehran University Research Journal of Engineering and Technology; 0254-7821; 2413-7219; 2019-01-01; 38; 1; 185; 196; 10.22581/muet1982.1901.15; 754; An Analysis of Sindhi Annotated Corpus using Supervised Machine Learning Methods; Mazhar Ali 0; Asim Imdad Wagan 1; Benazir Bhutto Shaheed University, Lyari, Karachi, Pakistan; Mohammad Ali Jinnah University, Karachi, Pakistan; A linguistic corpus of the Sindhi language is significant for computational linguistics, machine learning, language feature identification and analysis, semantic and sentiment analysis, information retrieval, and so on. Little computational linguistics work has been done on Sindhi text, whereas English, Arabic, Urdu and some other languages are fully resourced computationally. The grammar and morphemes of these languages have been analyzed properly using different machine learning methods. Development and research work on computational linguistics for the Sindhi language is currently in progress. This study develops a Sindhi annotated corpus using the universal POS (Part of Speech) tag set and a Sindhi POS tag set for the purpose of language feature and variation analysis. The features are extracted using the TF-IDF (Term Frequency-Inverse Document Frequency) technique. A supervised machine learning model is developed to assess the annotated corpus and the grammatical annotation of the Sindhi language. The model is trained on 80% of the annotated corpus and tested on the remaining 20%. Cross-validation with 10 folds is used to evaluate and validate the model. The results show that the model performs well and confirm that the Sindhi corpus is properly annotated. The study also describes a number of research gaps for further work on topic modeling, language variation, and sentiment and semantic analysis of the Sindhi language.; http://publications.muet.edu.pk/index.php/muetrj/article/view/754
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Mazhar Ali Asim Imdad Wagan
spellingShingle	Mazhar Ali Asim Imdad Wagan An Analysis of Sindhi Annotated Corpus using Supervised Machine Learning Methods Mehran University Research Journal of Engineering and Technology
author_facet	Mazhar Ali Asim Imdad Wagan
author_sort	Mazhar Ali
title	An Analysis of Sindhi Annotated Corpus using Supervised Machine Learning Methods
title_short	An Analysis of Sindhi Annotated Corpus using Supervised Machine Learning Methods
title_full	An Analysis of Sindhi Annotated Corpus using Supervised Machine Learning Methods
title_fullStr	An Analysis of Sindhi Annotated Corpus using Supervised Machine Learning Methods
title_full_unstemmed	An Analysis of Sindhi Annotated Corpus using Supervised Machine Learning Methods
title_sort	analysis of sindhi annotated corpus using supervised machine learning methods
publisher	Mehran University of Engineering and Technology
series	Mehran University Research Journal of Engineering and Technology
issn	0254-7821 2413-7219
publishDate	2019-01-01
description	A linguistic corpus of the Sindhi language is significant for computational linguistics, machine learning, language feature identification and analysis, semantic and sentiment analysis, information retrieval, and so on. Little computational linguistics work has been done on Sindhi text, whereas English, Arabic, Urdu and some other languages are fully resourced computationally. The grammar and morphemes of these languages have been analyzed properly using different machine learning methods. Development and research work on computational linguistics for the Sindhi language is currently in progress. This study develops a Sindhi annotated corpus using the universal POS (Part of Speech) tag set and a Sindhi POS tag set for the purpose of language feature and variation analysis. The features are extracted using the TF-IDF (Term Frequency-Inverse Document Frequency) technique. A supervised machine learning model is developed to assess the annotated corpus and the grammatical annotation of the Sindhi language. The model is trained on 80% of the annotated corpus and tested on the remaining 20%. Cross-validation with 10 folds is used to evaluate and validate the model. The results show that the model performs well and confirm that the Sindhi corpus is properly annotated. The study also describes a number of research gaps for further work on topic modeling, language variation, and sentiment and semantic analysis of the Sindhi language.
url	http://publications.muet.edu.pk/index.php/muetrj/article/view/754
work_keys_str_mv	AT mazharali ananalysisofsindhiannotatedcorpususingsupervisedmachinelearningmethods AT asimimdadwagan ananalysisofsindhiannotatedcorpususingsupervisedmachinelearningmethods AT mazharali analysisofsindhiannotatedcorpususingsupervisedmachinelearningmethods AT asimimdadwagan analysisofsindhiannotatedcorpususingsupervisedmachinelearningmethods
_version_	1725994956340854784
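
The evaluation setup named in the description (TF-IDF features, an 80/20 train/test split, and 10-fold cross-validation) can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the record does not state which TF-IDF variant, classifier, or tooling the authors used, the function names are hypothetical, and the token lists are placeholders rather than data from the paper.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per non-empty tokenized document.

    Assumed variant: relative term frequency times log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def train_test_split(n_items, train_fraction=0.8):
    """Split indices 0..n_items-1 into a leading train part and a trailing test part."""
    cut = int(n_items * train_fraction)
    indices = list(range(n_items))
    return indices[:cut], indices[cut:]


def k_fold(n_items, k=10):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_items))
    for fold in range(k):
        # Every k-th item, starting at position `fold`, is held out.
        validation = indices[fold::k]
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation


# Hypothetical placeholder tokens, not data from the paper.
docs = [["w1", "w2", "w1"], ["w2", "w3"], ["w3", "w4", "w4", "w1"]]
weights = tf_idf(docs)
train_idx, test_idx = train_test_split(20)
folds = list(k_fold(20, k=10))
```

In practice the items would normally be shuffled before splitting, and a classifier would be fitted on each training part and scored on the held-out part; both steps are omitted here because the record does not specify them.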


Similar Items