A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts.
Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 mi...
Main Authors: | David Westergaard, Hans-Henrik Stærfeldt, Christian Tønsberg, Lars Juhl Jensen, Søren Brunak |
---|---|
Format: | Article |
Language: | English |
Published: | Public Library of Science (PLoS), 2018-02-01 |
Series: | PLoS Computational Biology |
Online Access: | http://europepmc.org/articles/PMC5831415?pdf=render |
id |
doaj-4aedde1c9d994b6bacc3882e0b9a9e00 |
---|---|
record_format |
Article |
spelling |
doaj-4aedde1c9d994b6bacc3882e0b9a9e00 2020-11-25T01:32:26Z eng Public Library of Science (PLoS) PLoS Computational Biology 1553-734X 1553-7358 2018-02-01 14 2 e1005962 10.1371/journal.pcbi.1005962 A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. David Westergaard Hans-Henrik Stærfeldt Christian Tønsberg Lars Juhl Jensen Søren Brunak Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823-2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only. http://europepmc.org/articles/PMC5831415?pdf=render |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
David Westergaard Hans-Henrik Stærfeldt Christian Tønsberg Lars Juhl Jensen Søren Brunak |
spellingShingle |
David Westergaard Hans-Henrik Stærfeldt Christian Tønsberg Lars Juhl Jensen Søren Brunak A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. PLoS Computational Biology |
author_facet |
David Westergaard Hans-Henrik Stærfeldt Christian Tønsberg Lars Juhl Jensen Søren Brunak |
author_sort |
David Westergaard |
title |
A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. |
title_short |
A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. |
title_full |
A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. |
title_fullStr |
A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. |
title_full_unstemmed |
A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. |
title_sort |
comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. |
publisher |
Public Library of Science (PLoS) |
series |
PLoS Computational Biology |
issn |
1553-734X 1553-7358 |
publishDate |
2018-02-01 |
description |
Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823-2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only. |
url |
http://europepmc.org/articles/PMC5831415?pdf=render |
work_keys_str_mv |
AT davidwestergaard acomprehensiveandquantitativecomparisonoftextminingin15millionfulltextarticlesversustheircorrespondingabstracts AT hanshenrikstærfeldt acomprehensiveandquantitativecomparisonoftextminingin15millionfulltextarticlesversustheircorrespondingabstracts AT christiantønsberg acomprehensiveandquantitativecomparisonoftextminingin15millionfulltextarticlesversustheircorrespondingabstracts AT larsjuhljensen acomprehensiveandquantitativecomparisonoftextminingin15millionfulltextarticlesversustheircorrespondingabstracts AT sørenbrunak acomprehensiveandquantitativecomparisonoftextminingin15millionfulltextarticlesversustheircorrespondingabstracts AT davidwestergaard comprehensiveandquantitativecomparisonoftextminingin15millionfulltextarticlesversustheircorrespondingabstracts AT hanshenrikstærfeldt comprehensiveandquantitativecomparisonoftextminingin15millionfulltextarticlesversustheircorrespondingabstracts AT christiantønsberg comprehensiveandquantitativecomparisonoftextminingin15millionfulltextarticlesversustheircorrespondingabstracts AT larsjuhljensen comprehensiveandquantitativecomparisonoftextminingin15millionfulltextarticlesversustheircorrespondingabstracts AT sørenbrunak comprehensiveandquantitativecomparisonoftextminingin15millionfulltextarticlesversustheircorrespondingabstracts |
_version_ |
1725082156424757248 |