ChEMU 2020: Natural Language Processing Methods Are Effective for Information Extraction From Chemical Patents

Chemical patents represent a valuable source of information about new chemical compounds, which is critical to the drug discovery process. Automated information extraction over chemical patents is, however, a challenging task due to the large volume of existing patents and the complex linguistic pro...

Bibliographic Details
Main Authors: Jiayuan He, Dat Quoc Nguyen, Saber A. Akhondi, Christian Druckenbrodt, Camilo Thorne, Ralph Hoessel, Zubair Afzal, Zenan Zhai, Biaoyan Fang, Hiyori Yoshikawa, Ameer Albahem, Lawrence Cavedon, Trevor Cohn, Timothy Baldwin, Karin Verspoor
Format: Article
Language: English
Published: Frontiers Media S.A. 2021-03-01
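Series: Frontiers in Research Metrics and Analytics
Subjects: named entity recognition; event extraction; information extraction; chemical reactions; patent text mining; cheminformatics
Online Access: https://www.frontiersin.org/articles/10.3389/frma.2021.654438/full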
id doaj-0be7504740b24a0eb8b2400b266f3147
record_format Article
spelling doaj-0be7504740b24a0eb8b2400b266f3147
2021-06-02T20:32:16Z
eng
Frontiers Media S.A.
Frontiers in Research Metrics and Analytics
2504-0537
2021-03-01
6
10.3389/frma.2021.654438
654438
ChEMU 2020: Natural Language Processing Methods Are Effective for Information Extraction From Chemical Patents
Jiayuan He (The University of Melbourne, Parkville, VIC, Australia; RMIT University, Melbourne, VIC, Australia)
Dat Quoc Nguyen (The University of Melbourne, Parkville, VIC, Australia; VinAI Research, Hanoi, Vietnam)
Saber A. Akhondi (Elsevier BV, Amsterdam, Netherlands)
Christian Druckenbrodt (Elsevier Information Systems GmbH, Frankfurt, Germany)
Camilo Thorne (Elsevier Information Systems GmbH, Frankfurt, Germany)
Ralph Hoessel (Elsevier Information Systems GmbH, Frankfurt, Germany)
Zubair Afzal (Elsevier BV, Amsterdam, Netherlands)
Zenan Zhai (The University of Melbourne, Parkville, VIC, Australia)
Biaoyan Fang (The University of Melbourne, Parkville, VIC, Australia)
Hiyori Yoshikawa (The University of Melbourne, Parkville, VIC, Australia; Fujitsu Laboratories Ltd., Tokyo, Japan)
Ameer Albahem (The University of Melbourne, Parkville, VIC, Australia; RMIT University, Melbourne, VIC, Australia)
Lawrence Cavedon (RMIT University, Melbourne, VIC, Australia)
Trevor Cohn (The University of Melbourne, Parkville, VIC, Australia)
Timothy Baldwin (The University of Melbourne, Parkville, VIC, Australia)
Karin Verspoor (The University of Melbourne, Parkville, VIC, Australia; RMIT University, Melbourne, VIC, Australia)
collection DOAJ
language English
format Article
sources DOAJ
author Jiayuan He
Dat Quoc Nguyen
Saber A. Akhondi
Christian Druckenbrodt
Camilo Thorne
Ralph Hoessel
Zubair Afzal
Zenan Zhai
Biaoyan Fang
Hiyori Yoshikawa
Ameer Albahem
Lawrence Cavedon
Trevor Cohn
Timothy Baldwin
Karin Verspoor
author_sort Jiayuan He
title ChEMU 2020: Natural Language Processing Methods Are Effective for Information Extraction From Chemical Patents
publisher Frontiers Media S.A.
series Frontiers in Research Metrics and Analytics
issn 2504-0537
publishDate 2021-03-01
description Chemical patents represent a valuable source of information about new chemical compounds, which is critical to the drug discovery process. Automated information extraction over chemical patents is, however, a challenging task due to the large volume of existing patents and the complex linguistic properties of chemical patents. The Cheminformatics Elsevier Melbourne University (ChEMU) evaluation lab 2020, part of the Conference and Labs of the Evaluation Forum 2020 (CLEF2020), was introduced to support the development of advanced text mining techniques for chemical patents. The ChEMU 2020 lab proposed two fundamental information extraction tasks focusing on chemical reaction processes described in chemical patents: (1) chemical named entity recognition, requiring identification of essential chemical entities and their roles in chemical reactions, as well as reaction conditions; and (2) event extraction, which aims at identification of event steps relating the entities involved in chemical reactions. The ChEMU 2020 lab received 37 team registrations and 46 runs. Overall, the performance of submissions for these tasks exceeded our expectations, with the top systems outperforming strong baselines. We further show the methods to be robust to variations in sampling of the test data. We provide a detailed overview of the ChEMU 2020 corpus and its annotation, showing that inter-annotator agreement is very strong. We also present the methods adopted by participants, provide a detailed analysis of their performance, and carefully consider the potential impact of data leakage on interpretation of the results. The ChEMU 2020 Lab has shown the viability of automated methods to support information extraction of key information in chemical patents.
topic named entity recognition
event extraction
information extraction
chemical reactions
patent text mining
cheminformatics
url https://www.frontiersin.org/articles/10.3389/frma.2021.654438/full