Statistical analysis of the Indus script using n-grams.

The Indus script is one of the major undeciphered scripts of the ancient world. The small size of the corpus, the absence of bilingual texts, and the lack of definite knowledge of the underlying language have frustrated efforts at decipherment since the discovery of the remains of the Indus civilizat...

Bibliographic Details
Main Authors:	Nisha Yadav, Hrishikesh Joglekar, Rajesh P N Rao, Mayank N Vahia, Ronojoy Adhikari, Iravatham Mahadevan
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2010-01-01
Series:	PLoS ONE
Online Access:	http://europepmc.org/articles/PMC2841631?pdf=render

id	doaj-e3f78267d9d04ad7ae5230aa2d864abf
record_format	Article
spelling	doaj-e3f78267d9d04ad7ae5230aa2d864abf 2020-11-25T01:08:23Z eng Public Library of Science (PLoS) PLoS ONE 1932-6203 2010-01-01 5 3 e9506 10.1371/journal.pone.0009506 Statistical analysis of the Indus script using n-grams. Nisha Yadav Hrishikesh Joglekar Rajesh P N Rao Mayank N Vahia Ronojoy Adhikari Iravatham Mahadevan The Indus script is one of the major undeciphered scripts of the ancient world. The small size of the corpus, the absence of bilingual texts, and the lack of definite knowledge of the underlying language have frustrated efforts at decipherment since the discovery of the remains of the Indus civilization. Building on previous statistical approaches, we apply the tools of statistical language processing, specifically n-gram Markov chains, to analyze the syntax of the Indus script. We find that unigrams follow a Zipf-Mandelbrot distribution. Text beginner and ender distributions are unequal, providing internal evidence for syntax. We see clear evidence of strong bigram correlations and extract significant pairs and triplets using a log-likelihood measure of association. Highly frequent pairs and triplets are not always highly significant. The model performance is evaluated using information-theoretic measures and cross-validation. The model can restore doubtfully read texts with an accuracy of about 75%. We find that a quadrigram Markov chain saturates information-theoretic measures against a held-out corpus. Our work forms the basis for the development of a stochastic grammar which may be used to explore the syntax of the Indus script in greater detail. http://europepmc.org/articles/PMC2841631?pdf=render
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Nisha Yadav Hrishikesh Joglekar Rajesh P N Rao Mayank N Vahia Ronojoy Adhikari Iravatham Mahadevan
spellingShingle	Nisha Yadav Hrishikesh Joglekar Rajesh P N Rao Mayank N Vahia Ronojoy Adhikari Iravatham Mahadevan Statistical analysis of the Indus script using n-grams. PLoS ONE
author_facet	Nisha Yadav Hrishikesh Joglekar Rajesh P N Rao Mayank N Vahia Ronojoy Adhikari Iravatham Mahadevan
author_sort	Nisha Yadav
title	Statistical analysis of the Indus script using n-grams.
title_short	Statistical analysis of the Indus script using n-grams.
title_full	Statistical analysis of the Indus script using n-grams.
title_fullStr	Statistical analysis of the Indus script using n-grams.
title_full_unstemmed	Statistical analysis of the Indus script using n-grams.
title_sort	statistical analysis of the indus script using n-grams.
publisher	Public Library of Science (PLoS)
series	PLoS ONE
issn	1932-6203
publishDate	2010-01-01
description	The Indus script is one of the major undeciphered scripts of the ancient world. The small size of the corpus, the absence of bilingual texts, and the lack of definite knowledge of the underlying language have frustrated efforts at decipherment since the discovery of the remains of the Indus civilization. Building on previous statistical approaches, we apply the tools of statistical language processing, specifically n-gram Markov chains, to analyze the syntax of the Indus script. We find that unigrams follow a Zipf-Mandelbrot distribution. Text beginner and ender distributions are unequal, providing internal evidence for syntax. We see clear evidence of strong bigram correlations and extract significant pairs and triplets using a log-likelihood measure of association. Highly frequent pairs and triplets are not always highly significant. The model performance is evaluated using information-theoretic measures and cross-validation. The model can restore doubtfully read texts with an accuracy of about 75%. We find that a quadrigram Markov chain saturates information-theoretic measures against a held-out corpus. Our work forms the basis for the development of a stochastic grammar which may be used to explore the syntax of the Indus script in greater detail.
url	http://europepmc.org/articles/PMC2841631?pdf=render
work_keys_str_mv	AT nishayadav statisticalanalysisoftheindusscriptusingngrams AT hrishikeshjoglekar statisticalanalysisoftheindusscriptusingngrams AT rajeshpnrao statisticalanalysisoftheindusscriptusingngrams AT mayanknvahia statisticalanalysisoftheindusscriptusingngrams AT ronojoyadhikari statisticalanalysisoftheindusscriptusingngrams AT iravathammahadevan statisticalanalysisoftheindusscriptusingngrams
_version_	1725182829777649664

Similar Items