Practical Text Phylogeny for Real-World Settings

The ease with which one can edit and redistribute digital documents on the Internet is one of modernity's great achievements, but it also leads to some vexing problems. With growing academic interest in the study of the evolution of digital writing on the one hand and the rise of disinformation...

Bibliographic Details
Main Authors: Bingyu Shen, Christopher W. Forstall, Anderson De Rezende Rocha, Walter J. Scheirer
Format: Article
Language: English
Published: IEEE 2018-01-01
Series: IEEE Access
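Subjects: Text phylogeny; word embeddings; machine learning; natural language processing; forensics; digital humanities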
Online Access: https://ieeexplore.ieee.org/document/8412174/
id doaj-a86d24e7a9314693b87aca97a60a21b0
record_format Article
spelling doaj-a86d24e7a9314693b87aca97a60a21b0
2021-03-29T21:17:35Z
eng
IEEE
IEEE Access
2169-3536
2018-01-01
Volume 6, pages 41002-41012
DOI 10.1109/ACCESS.2018.2856865
IEEE document 8412174
Practical Text Phylogeny for Real-World Settings
Bingyu Shen (https://orcid.org/0000-0002-0792-7904), Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA
Christopher W. Forstall, Department of Classics, Mount Allison University, Sackville, Canada
Anderson De Rezende Rocha, Institute of Computing, University of Campinas, Campinas, Brazil
Walter J. Scheirer, Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN, USA
https://ieeexplore.ieee.org/document/8412174/
collection DOAJ
language English
format Article
sources DOAJ
author Bingyu Shen
Christopher W. Forstall
Anderson De Rezende Rocha
Walter J. Scheirer
title Practical Text Phylogeny for Real-World Settings
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2018-01-01
description The ease with which one can edit and redistribute digital documents on the Internet is one of modernity's great achievements, but it also leads to some vexing problems. With growing academic interest in the study of the evolution of digital writing on the one hand and the rise of disinformation on the other, the problem of identifying the relationship between texts with similar content is becoming more important. Traditional vector space representations of texts have made progress in solving this problem when it is cast as a reconstruction task that organizes related texts into a tree expressing relationships; this is dubbed text phylogeny in the information forensics literature. However, as new text representation methods have been successfully applied to many other text analysis problems, it is worth investigating if they too can be used in text phylogeny tree reconstruction. In this paper, we explore the use of word embeddings as a text representation method, with the aim of improving the accuracy of reconstructed phylogeny trees for real-world data, and compare it with other widely used text representation methods. We evaluate the performance on established benchmarks for this task: a synthetic data set and data collected from Wikipedia. We also apply our framework to a new data set of fan fiction based on some famous fairy tales. Experimental results show that word embeddings are competitive with other feature sets for the published benchmarks, and are highly effective for creative writing.
topic Text phylogeny
word embeddings
machine learning
natural language processing
forensics
digital humanities
url https://ieeexplore.ieee.org/document/8412174/