Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks.

Automatic identification of authorship in disputed documents has benefited from complex network theory as this approach does not require human expertise or detailed semantic knowledge. Networks modeling entire books can be used to discriminate texts from different sources and understand network grow...

Bibliographic Details
Main Authors:	Camilo Akimushkin, Diego Raphael Amancio, Osvaldo Novais Oliveira
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2017-01-01
Series:	PLoS ONE
Online Access:	http://europepmc.org/articles/PMC5268788?pdf=render

id	doaj-943e621cf5b84cc39bc7120b02a73519
record_format	Article
spelling	doaj-943e621cf5b84cc39bc7120b02a73519 | 2020-11-24T20:45:07Z | eng | Public Library of Science (PLoS) | PLoS ONE | 1932-6203 | 2017-01-01 | 12 | 1 | e0170527 | 10.1371/journal.pone.0170527 | Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks. | Camilo Akimushkin | Diego Raphael Amancio | Osvaldo Novais Oliveira | Automatic identification of authorship in disputed documents has benefited from complex network theory as this approach does not require human expertise or detailed semantic knowledge. Networks modeling entire books can be used to discriminate texts from different sources and understand network growth mechanisms, but only a few studies have probed the suitability of networks in modeling small chunks of text to grasp stylistic features. In this study, we introduce a methodology based on the dynamics of word co-occurrence networks representing written texts to classify a corpus of 80 texts by 8 authors. The texts were divided into sections with equal number of linguistic tokens, from which time series were created for 12 topological metrics. Since 73% of all series were stationary (ARIMA(p, 0, q)) and the remaining were integrable of first order (ARIMA(p, 1, q)), probability distributions could be obtained for the global network metrics. The metrics exhibit bell-shaped non-Gaussian distributions, and therefore distribution moments were used as learning attributes. With an optimized supervised learning procedure based on a nonlinear transformation performed by Isomap, 71 out of 80 texts were correctly classified using the K-nearest neighbors algorithm, i.e. a remarkable 88.75% author matching success rate was achieved. Hence, purely dynamic fluctuations in network metrics can characterize authorship, thus paving the way for a robust description of large texts in terms of small evolving networks. | http://europepmc.org/articles/PMC5268788?pdf=render
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Camilo Akimushkin Diego Raphael Amancio Osvaldo Novais Oliveira
spellingShingle	Camilo Akimushkin Diego Raphael Amancio Osvaldo Novais Oliveira Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks. PLoS ONE
author_facet	Camilo Akimushkin Diego Raphael Amancio Osvaldo Novais Oliveira
author_sort	Camilo Akimushkin
title	Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks.
title_short	Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks.
title_full	Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks.
title_fullStr	Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks.
title_full_unstemmed	Text Authorship Identified Using the Dynamics of Word Co-Occurrence Networks.
title_sort	text authorship identified using the dynamics of word co-occurrence networks.
publisher	Public Library of Science (PLoS)
series	PLoS ONE
issn	1932-6203
publishDate	2017-01-01
description	Automatic identification of authorship in disputed documents has benefited from complex network theory as this approach does not require human expertise or detailed semantic knowledge. Networks modeling entire books can be used to discriminate texts from different sources and understand network growth mechanisms, but only a few studies have probed the suitability of networks in modeling small chunks of text to grasp stylistic features. In this study, we introduce a methodology based on the dynamics of word co-occurrence networks representing written texts to classify a corpus of 80 texts by 8 authors. The texts were divided into sections with an equal number of linguistic tokens, from which time series were created for 12 topological metrics. Since 73% of all series were stationary (ARIMA(p, 0, q)) and the remainder were integrable of first order (ARIMA(p, 1, q)), probability distributions could be obtained for the global network metrics. The metrics exhibit bell-shaped non-Gaussian distributions, and therefore distribution moments were used as learning attributes. With an optimized supervised learning procedure based on a nonlinear transformation performed by Isomap, 71 out of 80 texts were correctly classified using the K-nearest neighbors algorithm, i.e. a remarkable 88.75% author matching success rate was achieved. Hence, purely dynamic fluctuations in network metrics can characterize authorship, thus paving the way for a robust description of large texts in terms of small evolving networks.
url	http://europepmc.org/articles/PMC5268788?pdf=render
work_keys_str_mv	AT camiloakimushkin textauthorshipidentifiedusingthedynamicsofwordcooccurrencenetworks AT diegoraphaelamancio textauthorshipidentifiedusingthedynamicsofwordcooccurrencenetworks AT osvaldonovaisoliveira textauthorshipidentifiedusingthedynamicsofwordcooccurrencenetworks
_version_	1716815377543987200
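
The feature-extraction steps named in the description (equal-sized token sections, one network per section, a time series per metric, distribution moments as attributes) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: linking adjacent words is an assumed definition of co-occurrence, average degree stands in for the 12 topological metrics (which the record does not list), all function names are hypothetical, and the Isomap and K-nearest neighbors classification step is left out.

```python
from collections import defaultdict
from statistics import fmean


def cooccurrence_network(tokens):
    """Undirected graph linking each pair of adjacent, distinct tokens.

    Assumption: "co-occurrence" is taken to mean adjacency in the text.
    """
    adjacency = defaultdict(set)
    for left, right in zip(tokens, tokens[1:]):
        if left != right:
            adjacency[left].add(right)
            adjacency[right].add(left)
    return adjacency


def average_degree(adjacency):
    """Mean number of neighbours per node (0.0 for an empty graph)."""
    if not adjacency:
        return 0.0
    return fmean(len(neighbours) for neighbours in adjacency.values())


def metric_series(tokens, section_size):
    """One metric value per consecutive section of `section_size` tokens.

    Trailing tokens that do not fill a whole section are ignored.
    """
    full_sections = len(tokens) // section_size
    series = []
    for i in range(full_sections):
        section = tokens[i * section_size:(i + 1) * section_size]
        series.append(average_degree(cooccurrence_network(section)))
    return series


def moments(series):
    """Mean, variance, skewness and kurtosis (population form).

    Expects a non-empty series; a constant series yields zero
    skewness and kurtosis by convention here.
    """
    mean = fmean(series)

    def central(k):
        return fmean((x - mean) ** k for x in series)

    variance = central(2)
    if variance == 0:
        return mean, 0.0, 0.0, 0.0
    return mean, variance, central(3) / variance ** 1.5, central(4) / variance ** 2
```

Under these assumptions, `moments(metric_series(tokens, section_size))` would give one small attribute vector per text and per metric; vectors of this kind could then be passed to a dimensionality-reduction and classification stage such as the one the description mentions.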

Similar Items