Screening PubMed abstracts: is class imbalance always a challenge to machine learning?

Abstract. Background: The growing volume of medical literature and textual data in online repositories has led to an exponential increase in the workload of researchers involved in citation screening for systematic reviews. This work aims to combine machine learning techniques and data preprocessing for c...

Bibliographic Details
Main Authors:	Corrado Lanera, Paola Berchialla, Abhinav Sharma, Clara Minto, Dario Gregori, Ileana Baldi
Format:	Article
Language:	English
Published:	BMC 2019-12-01
Series:	Systematic Reviews
Subjects:	Classification; Indexed search engine; Machine learning; Text mining; Unbalanced data, systematic review
Online Access:	https://doi.org/10.1186/s13643-019-1245-8

id	doaj-67d00573bf8348d2ace7c4cd8b230e1d
record_format	Article
spelling	doaj-67d00573bf8348d2ace7c4cd8b230e1d; 2020-12-06T12:10:04Z; eng; BMC; Systematic Reviews; 2046-4053; 2019-12-01; 8119; 10.1186/s13643-019-1245-8; Screening PubMed abstracts: is class imbalance always a challenge to machine learning?; Corrado Lanera (Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac Thoracic Vascular Sciences and Public Health, University of Padova); Paola Berchialla (Department of Clinical and Biological Sciences, University of Torino); Abhinav Sharma (Department of Biological Sciences and Bioengineering, Indian Institute of Technology Kanpur); Clara Minto (Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac Thoracic Vascular Sciences and Public Health, University of Padova); Dario Gregori (Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac Thoracic Vascular Sciences and Public Health, University of Padova); Ileana Baldi (Unit of Biostatistics, Epidemiology and Public Health, Department of Cardiac Thoracic Vascular Sciences and Public Health, University of Padova)
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Corrado Lanera, Paola Berchialla, Abhinav Sharma, Clara Minto, Dario Gregori, Ileana Baldi
author_sort	Corrado Lanera
title	Screening PubMed abstracts: is class imbalance always a challenge to machine learning?
publisher	BMC
series	Systematic Reviews
issn	2046-4053
publishDate	2019-12-01
description	Abstract. Background: The growing volume of medical literature and textual data in online repositories has led to an exponential increase in the workload of researchers involved in citation screening for systematic reviews. This work aims to combine machine learning techniques and data preprocessing for class imbalance to identify the best-performing strategy to screen articles in PubMed for inclusion in systematic reviews. Methods: We trained four binary text classifiers (support vector machines, k-nearest neighbor, random forest, and elastic-net regularized generalized linear models) in combination with four techniques for class imbalance (random undersampling and random oversampling, each with 50:50 and 35:65 positive-to-negative class ratios), with no resampling as a benchmark. We used textual data from 14 systematic reviews as case studies. The difference between the cross-validated area under the receiver operating characteristic curve (AUC-ROC) for machine learning techniques with and without preprocessing (delta AUC) was estimated within each systematic review, separately for each classifier. Meta-analytic fixed-effect models were used to pool delta AUCs separately by classifier and strategy. Results: Cross-validated AUC-ROC for machine learning techniques (excluding k-nearest neighbor) without preprocessing was predominantly above 90%. Except for k-nearest neighbor, machine learning techniques achieved the best improvement in conjunction with random oversampling 50:50 and random undersampling 35:65. Conclusions: Resampling techniques slightly improved the performance of the investigated machine learning techniques. From a computational perspective, random undersampling 35:65 may be preferred.
topic	Classification; Indexed search engine; Machine learning; Text mining; Unbalanced data, systematic review
url	https://doi.org/10.1186/s13643-019-1245-8
_version_	1724399194562625536
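
The resampling strategies and the fixed-effect pooling named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `resample_to_ratio` and `pool_fixed_effect`, the 0/1 label encoding, and the toy labels are assumptions made here for illustration. The sketch also assumes that resampling would be applied to the training portion of each cross-validation fold only, which the record does not state.

```python
import random


def resample_to_ratio(labels, positive_share, method, seed=0):
    """Return shuffled indices of a resampled set with the requested positive share.

    labels: sequence of 0/1 values (1 = relevant citation); hypothetical encoding.
    positive_share: target fraction of positives, e.g. 0.5 (50:50) or 0.35 (35:65).
    method: "under" randomly drops negatives; "over" randomly duplicates positives.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if method == "under":
        # Keep every positive; keep only as many negatives as the target share allows.
        n_neg = round(len(pos) * (1 - positive_share) / positive_share)
        neg = rng.sample(neg, min(n_neg, len(neg)))
    elif method == "over":
        # Keep every negative; draw extra positives with replacement.
        n_pos = round(len(neg) * positive_share / (1 - positive_share))
        extra = max(n_pos - len(pos), 0)
        pos = pos + rng.choices(pos, k=extra)
    else:
        raise ValueError("method must be 'under' or 'over'")
    idx = pos + neg
    rng.shuffle(idx)
    return idx


def pool_fixed_effect(deltas, variances):
    """Inverse-variance (fixed-effect) pooled estimate and its variance."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, deltas)) / sum(weights)
    return pooled, 1.0 / sum(weights)


if __name__ == "__main__":
    toy_labels = [1] * 10 + [0] * 90  # toy data: 10% positives
    for method in ("under", "over"):
        for share in (0.5, 0.35):
            idx = resample_to_ratio(toy_labels, share, method)
            n_pos = sum(toy_labels[i] for i in idx)
            print(method, share, len(idx), n_pos)
```

Undersampling shrinks the training set while oversampling enlarges it, which is consistent with the abstract's remark that random undersampling 35:65 may be preferred from a computational perspective. In `pool_fixed_effect`, each delta AUC would be weighted by the inverse of its estimated variance; how those variances were obtained in the study is not described in this record.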

Similar Items