Improving Semantic Similarity with Cross-Lingual Resources: A Study in Bangla—A Low Resourced Language

Semantic similarity is a long-standing problem in natural language processing (NLP). It is a topic of great interest as its understanding can provide a look into how human beings comprehend meaning and make associations between words. However, when this problem is looked at from the viewpoint of mac...


Bibliographic Details
Main Authors: Rajat Pandit, Saptarshi Sengupta, Sudip Kumar Naskar, Niladri Sekhar Dash, Mohini Mohan Sardar
Format: Article
Language: English
Published: MDPI AG 2019-05-01
Series: Informatics
Subjects: semantic similarity; Word2Vec; translation; low-resource languages; WordNet
Online Access: https://www.mdpi.com/2227-9709/6/2/19
id doaj-fd5d0d71e318499f8f14b8de527467e2
record_format Article
spelling doaj-fd5d0d71e318499f8f14b8de527467e2
2020-11-25T02:11:58Z
eng
MDPI AG
Informatics
2227-9709
2019-05-01
vol. 6, no. 2, art. 19
10.3390/informatics6020019
Improving Semantic Similarity with Cross-Lingual Resources: A Study in Bangla—A Low Resourced Language
Rajat Pandit (Department of Computer Science, West Bengal State University, Kolkata 700126, India)
Saptarshi Sengupta (Department of Computer Science, University of Minnesota Duluth, Duluth, MN 55812, USA)
Sudip Kumar Naskar (Department of Computer Science & Engineering, Jadavpur University, Kolkata 700032, India)
Niladri Sekhar Dash (Linguistic Research Unit, Indian Statistical Institute, Kolkata 700108, India)
Mohini Mohan Sardar (Department of Bengali, West Bengal State University, Kolkata 700126, India)
https://www.mdpi.com/2227-9709/6/2/19
semantic similarity
Word2Vec
translation
low-resource languages
WordNet
collection DOAJ
language English
format Article
sources DOAJ
author Rajat Pandit
Saptarshi Sengupta
Sudip Kumar Naskar
Niladri Sekhar Dash
Mohini Mohan Sardar
title Improving Semantic Similarity with Cross-Lingual Resources: A Study in Bangla—A Low Resourced Language
publisher MDPI AG
series Informatics
issn 2227-9709
publishDate 2019-05-01
description Semantic similarity is a long-standing problem in natural language processing (NLP). It is of great interest because understanding it offers insight into how human beings comprehend meaning and form associations between words. Viewed from the standpoint of machine understanding, however, and particularly for under-resourced languages, it poses a different problem altogether. This paper explores semantic similarity in Bangla, a low-resource language. To improve the situation for such languages, the most rudimentary method (path-based) and the latest state-of-the-art method (Word2Vec) for semantic similarity calculation were augmented with cross-lingual resources in English. Two approaches were explored in Bangla, the path-based model and the distributional model, and their cross-lingual counterparts were built using the English WordNet and English corpora. The proposed methods were evaluated on a dataset of 162 Bangla word pairs annotated by five expert raters. The correlation scores between the four metrics and the human evaluation scores show that the cross-lingual approach markedly improves semantic similarity calculation for Bangla.
topic semantic similarity
Word2Vec
translation
low-resource languages
WordNet
url https://www.mdpi.com/2227-9709/6/2/19