Models and Processes to Extract Drug-like Molecules From Natural Language Text

Researchers worldwide are seeking to repurpose existing drugs or discover new drugs to counter the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). A promising source of candidates for such studies is molecules that have been reported in the scientific literature to be...


Bibliographic Details
Main Authors: Zhi Hong, J. Gregory Pauloski, Logan Ward, Kyle Chard, Ben Blaiszik, Ian Foster
Format: Article
Language: English
Published: Frontiers Media S.A. 2021-08-01
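Series: Frontiers in Molecular Biosciences
Subjects: coronavirus disease-19; coronavirus disease-19 open research dataset challenge; named entity recognition; long short-term memory; data mining
Online Access: https://www.frontiersin.org/articles/10.3389/fmolb.2021.636077/full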
id doaj-25dc6e150f3044b7a3c02a54d6b2bba8
record_format Article
spelling doaj-25dc6e150f3044b7a3c02a54d6b2bba8
2021-09-03T15:34:52Z
eng
Frontiers Media S.A.
Frontiers in Molecular Biosciences
2296-889X
2021-08-01
8
10.3389/fmolb.2021.636077
636077
Models and Processes to Extract Drug-like Molecules From Natural Language Text
Zhi Hong (Department of Computer Science, University of Chicago, Chicago, IL, United States)
J. Gregory Pauloski (Department of Computer Science, University of Chicago, Chicago, IL, United States)
Logan Ward (Data Science and Learning Division, Argonne National Laboratory, Lemont, IL, United States)
Kyle Chard (Department of Computer Science, University of Chicago, Chicago, IL, United States; Data Science and Learning Division, Argonne National Laboratory, Lemont, IL, United States; Globus, University of Chicago, Chicago, IL, United States)
Ben Blaiszik (Data Science and Learning Division, Argonne National Laboratory, Lemont, IL, United States; Globus, University of Chicago, Chicago, IL, United States)
Ian Foster (Department of Computer Science, University of Chicago, Chicago, IL, United States; Data Science and Learning Division, Argonne National Laboratory, Lemont, IL, United States; Globus, University of Chicago, Chicago, IL, United States)
https://www.frontiersin.org/articles/10.3389/fmolb.2021.636077/full
coronavirus disease-19
coronavirus disease-19 open research dataset challenge
named entity recognition
long short-term memory
data mining
collection DOAJ
language English
format Article
sources DOAJ
author Zhi Hong
J. Gregory Pauloski
Logan Ward
Kyle Chard
Ben Blaiszik
Ian Foster
author_sort Zhi Hong
title Models and Processes to Extract Drug-like Molecules From Natural Language Text
publisher Frontiers Media S.A.
series Frontiers in Molecular Biosciences
issn 2296-889X
publishDate 2021-08-01
description Researchers worldwide are seeking to repurpose existing drugs or discover new drugs to counter the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). A promising source of candidates for such studies is molecules that have been reported in the scientific literature to be drug-like in the context of viral research. However, this literature is too large for human review and features unusual vocabularies for which existing named entity recognition (NER) models are ineffective. We report here on a project that leverages both human and artificial intelligence to detect references to such molecules in free text. We present 1) an iterative model-in-the-loop method that makes judicious use of scarce human expertise in generating training data for an NER model, and 2) the application and evaluation of this method to the problem of identifying drug-like molecules in the COVID-19 Open Research Dataset Challenge (CORD-19) corpus of 198,875 papers. We show that by repeatedly presenting human labelers only with samples for which an evolving NER model is uncertain, our human-machine hybrid pipeline requires only modest amounts of non-expert human labeling time (tens of hours to label 1,778 samples) to generate an NER model with an F1 score of 80.5%, on par with that of non-expert humans. When applied to CORD-19, the model identifies 10,912 putative drug-like molecules. This enriched the computational screening team's targets by 3,591 molecules, of which 18 ranked in the top 0.1% of all 6.6 million molecules screened for docking against the 3CLPro protein.
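The model-in-the-loop idea in the description (show human labelers only the samples on which the evolving NER model is uncertain, then retrain) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not say how uncertainty is measured, so mean per-token binary entropy is an assumption here, and the names `sentence_uncertainty`, `select_for_labeling`, `model_in_the_loop`, `label`, and `train` are all hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Sentence = Tuple[str, ...]
# Hypothetical model interface: one probability per token that the
# token belongs to a drug-like molecule mention.
Model = Callable[[Sentence], List[float]]


def sentence_uncertainty(probs: Sequence[float]) -> float:
    """Mean binary entropy (bits) of per-token probabilities (assumed measure)."""
    def entropy(p: float) -> float:
        if p <= 0.0 or p >= 1.0:
            return 0.0  # a fully confident prediction carries no uncertainty
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return sum(entropy(p) for p in probs) / len(probs) if probs else 0.0


def select_for_labeling(pool: List[Sentence], model: Model, k: int) -> List[Sentence]:
    """Return the k unlabeled sentences the current model is least sure about."""
    ranked = sorted(pool, key=lambda s: sentence_uncertainty(model(s)), reverse=True)
    return ranked[:k]


def model_in_the_loop(pool, label, train, rounds: int, k: int):
    """Alternate between human labeling of uncertain samples and retraining.

    `label` stands in for the human annotator and `train` for model
    fitting; both are caller-supplied placeholders.
    """
    labeled = []
    model = train(labeled)
    for _ in range(rounds):
        batch = select_for_labeling(pool, model, k)
        if not batch:
            break  # nothing left to label
        labeled.extend((s, label(s)) for s in batch)
        pool = [s for s in pool if s not in batch]
        model = train(labeled)
    return model, labeled
```

Under these assumptions, each round spends labeling effort only on the `k` sentences with the highest entropy, which is one common way to realize the "present only uncertain samples" strategy the description reports.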
topic coronavirus disease-19
coronavirus disease-19 open research dataset challenge
named entity recognition
long short-term memory
data mining
url https://www.frontiersin.org/articles/10.3389/fmolb.2021.636077/full