A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts.

Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 mi...

Bibliographic Details
Main Authors: David Westergaard, Hans-Henrik Stærfeldt, Christian Tønsberg, Lars Juhl Jensen, Søren Brunak
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2018-02-01
Series: PLoS Computational Biology
Online Access: http://europepmc.org/articles/PMC5831415?pdf=render
id doaj-4aedde1c9d994b6bacc3882e0b9a9e00
record_format Article
spelling doaj-4aedde1c9d994b6bacc3882e0b9a9e00 | 2020-11-25T01:32:26Z | eng | Public Library of Science (PLoS) | PLoS Computational Biology | ISSN 1553-734X, 1553-7358 | 2018-02-01 | vol. 14, no. 2, e1005962 | doi:10.1371/journal.pcbi.1005962
collection DOAJ
language English
format Article
sources DOAJ
author David Westergaard
Hans-Henrik Stærfeldt
Christian Tønsberg
Lars Juhl Jensen
Søren Brunak
title A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts.
publisher Public Library of Science (PLoS)
series PLoS Computational Biology
issn 1553-734X
1553-7358
publishDate 2018-02-01
description Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823-2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.
url http://europepmc.org/articles/PMC5831415?pdf=render