Large-Scale Analysis of Zipf's Law in English Texts.

Despite being a paradigm of quantitative linguistics, Zipf's law for words suffers from three main problems: its formulation is ambiguous, its validity has not been tested rigorously from a statistical point of view, and it has not been confronted with a representatively large number of texts. So...


Bibliographic Details
Main Authors: Isabel Moreno-Sánchez, Francesc Font-Clos, Álvaro Corral
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2016-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC4723055?pdf=render
id doaj-59786414741147efa73d30d8f8b3a9f6
record_format Article
spelling doaj-59786414741147efa73d30d8f8b3a9f6 2020-11-24T21:52:03Z eng Public Library of Science (PLoS) PLoS ONE 1932-6203 2016-01-01 11 1 e0147073 10.1371/journal.pone.0147073
collection DOAJ
language English
format Article
sources DOAJ
publisher Public Library of Science (PLoS)
series PLoS ONE
issn 1932-6203
publishDate 2016-01-01
description Despite being a paradigm of quantitative linguistics, Zipf's law for words suffers from three main problems: its formulation is ambiguous, its validity has not been tested rigorously from a statistical point of view, and it has not been confronted with a representatively large number of texts. So, we can summarize the current support of Zipf's law in texts as anecdotal. We try to solve these issues by studying three different versions of Zipf's law and fitting them to all available English texts in the Project Gutenberg database (consisting of more than 30 000 texts). To do so, we use state-of-the-art tools in fitting and goodness-of-fit tests, carefully tailored to the peculiarities of text statistics. Remarkably, one of the three versions of Zipf's law, consisting of a pure power-law form in the complementary cumulative distribution function of word frequencies, is able to fit more than 40% of the texts in the database (at the 0.05 significance level), for the whole domain of frequencies (from 1 to the maximum value), and with only one free parameter (the exponent).
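
The description above mentions a one-parameter power law in the complementary cumulative distribution function (CCDF) of word frequencies. The sketch below illustrates what such a fit could look like; it is not the authors' procedure. It assumes the CCDF has the form S(n) = n^(-beta) for integer frequencies n >= 1, estimates beta by maximum likelihood with a simple golden-section search, and omits the goodness-of-fit test entirely. The function names, the tokenizing regular expression, and the search bounds are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def word_frequencies(text):
    """Return the absolute frequency of each distinct word (one entry per word type)."""
    # Assumption: a word is a run of lowercase letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return list(Counter(words).values())


def empirical_ccdf(freqs):
    """Map each observed frequency n to the fraction of word types with frequency >= n."""
    total = len(freqs)
    counts = Counter(freqs)
    ccdf, remaining = {}, total
    for n in sorted(counts):
        ccdf[n] = remaining / total
        remaining -= counts[n]
    return ccdf


def log_likelihood(beta, freqs):
    """Log-likelihood assuming S(n) = n**(-beta), so f(n) = n**(-beta) - (n+1)**(-beta)."""
    ll = 0.0
    for n in freqs:
        # log f(n) = -beta*log(n) + log(1 - (1 + 1/n)**(-beta)), written stably.
        ll += -beta * math.log(n) + math.log(-math.expm1(-beta * math.log1p(1.0 / n)))
    return ll


def fit_beta(freqs, lo=0.05, hi=5.0, tol=1e-6):
    """Golden-section search for the beta in [lo, hi] that maximizes the log-likelihood."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if log_likelihood(c, freqs) > log_likelihood(d, freqs):
            b = d  # maximum lies in [a, d]
        else:
            a = c  # maximum lies in [c, b]
    return (a + b) / 2


if __name__ == "__main__":
    sample = "the cat and the dog and the bird saw a fish"
    freqs = word_frequencies(sample)
    print("empirical CCDF:", empirical_ccdf(freqs))
    print("estimated beta:", fit_beta(freqs))
```

A sample this small only exercises the code; a meaningful estimate would need a full text, and deciding whether the power law is an acceptable fit would additionally require a goodness-of-fit test such as the one the description refers to.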