A new ChEMBL dataset for the similarity-based target fishing engine FastTargetPred: Annotation of an exhaustive list of linear tetrapeptides

Drug discovery often requires the identification of off-targets as the binding of a compound to targets other than the intended target(s) can be beneficial in some cases or detrimental in other situations (e.g., binding to anti-targets). Such investigations are also of importance during the early st...


Bibliographic Details
Container / Database: Data in Brief
Main Authors: Shivalika Tanwar, Patrick Auberger, Germain Gillet, Mario DiPaola, Katya Tsaioun, Bruno O. Villoutreix
Format: Article
Language: English
Published: Elsevier 2022-06-01
Subjects: Peptide, Virtual screening, Drug discovery, Target prediction
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340922003638
_version_ 1851931723333369856
author Shivalika Tanwar
Patrick Auberger
Germain Gillet
Mario DiPaola
Katya Tsaioun
Bruno O. Villoutreix
author_sort Shivalika Tanwar
collection DOAJ
container_title Data in Brief
description Drug discovery often requires the identification of off-targets, because the binding of a compound to targets other than the intended one(s) can be beneficial in some cases and detrimental in others (e.g., binding to anti-targets). Such investigations are also important during the early stage of a project, for example when the target is not known (e.g., phenotypic screening). Target identification can be performed in vitro, but various in silico methods have also been developed in recent years to facilitate target identification and help generate ideas. FastTargetPred is one such approach: it is a freely available Python/C program that attempts to predict putative macromolecular targets (i.e., target fishing) for a single input small-molecule query or an entire compound collection using established chemical similarity search approaches. The putative macromolecular target(s) of a small chemical compound can be predicted by identifying ligands that are known experimentally to bind to some targets and that are structurally similar to the input query compound. This type of target fishing approach therefore relies on a large collection of experimentally validated macromolecule-compound binding data. The small chemical compounds can be described by molecular fingerprints that encode their structural characteristics as a vector. The published version of FastTargetPred used ligand-target binding data extracted from release 25 (2019) of the ChEMBL database. Here we provide a new dataset for FastTargetPred extracted from the latest ChEMBL release at the time of writing, ChEMBL29 (2021). Four fingerprints (ECFP4, ECFP6, MACCS and PL) were computed for the extracted compound dataset (714,780 unique ChEMBL29 compounds, whereas the entire ChEMBL29 database contained about 2.1 million compounds). However, fingerprints could not be computed for 19 molecules because of their unusual chemistry (complex macrocycles).
The data files were then prepared to be compatible with FastTargetPred requirements. The 714,761 ChEMBL compounds with computed fingerprints hit 6,477 macromolecular targets based on the selected criteria. For these compounds a ChEMBL target ID is reported, and these target IDs were matched with the corresponding UniProt IDs. Thus, when available, the UniProt ID is provided together with the UniProt protein name, the gene name, the organism, annotated involvement in diseases, gene ontology data, and cross-references to the Reactome pathway database. As short peptides can be of interest for drug discovery and chemical biology endeavours, we attempted to predict putative macromolecular targets for a previously reported exhaustive combination of peptides containing four natural amino acids (i.e., 20 × 20 × 20 × 20 = 160,000 linear tetrapeptides) using FastTargetPred and the newly generated ChEMBL29 dataset. With the parameters used, putative targets are reported for 63,944 unique query peptides. These target predictions are provided in two different searchable files with hyperlinks to the ChEMBL, UniProt and Reactome databases.
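The two ideas in the description above, exhaustive enumeration of the 160,000 linear tetrapeptides and similarity-based target fishing, can be illustrated with a minimal sketch. It is not FastTargetPred's code: the fingerprints are toy sets of on-bit indices rather than ECFP4, ECFP6, MACCS or PL fingerprints, the reference ligands and target labels are hypothetical, and the Tanimoto coefficient with a 0.5 cutoff is an assumed choice of similarity measure and threshold.

```python
from itertools import product

# One-letter codes of the 20 natural amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def enumerate_tetrapeptides():
    """Yield every linear tetrapeptide sequence (20**4 in total)."""
    for residues in product(AMINO_ACIDS, repeat=4):
        yield "".join(residues)


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def fish_targets(query_fp, reference, threshold=0.5):
    """Return (target, best score) pairs for reference ligands whose
    similarity to the query meets the threshold, best score first."""
    hits = {}
    for ligand_fp, target_id in reference:
        score = tanimoto(query_fp, ligand_fp)
        if score >= threshold and score > hits.get(target_id, 0.0):
            hits[target_id] = score
    return sorted(hits.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy reference set: bit indices and target labels are made up.
REFERENCE = [
    (frozenset({1, 2, 3, 4}), "TARGET_A"),
    (frozenset({1, 2, 3, 9}), "TARGET_B"),
    (frozenset({7, 8}), "TARGET_C"),
]

if __name__ == "__main__":
    print(sum(1 for _ in enumerate_tetrapeptides()))
    print(fish_targets(frozenset({1, 2, 3, 4}), REFERENCE))
```

In a real workflow the query fingerprints would be computed from the peptide structures by a cheminformatics toolkit, and the reference set would be the ChEMBL29-derived ligand-target data described in the record.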
format Article
id doaj-art-6e535296d326447eb4efea2ac8a595bb
institution Directory of Open Access Journals
issn 2352-3409
language English
publishDate 2022-06-01
publisher Elsevier
record_format Article
spelling doaj-art-6e535296d326447eb4efea2ac8a595bb
2025-08-19T21:54:11Z
eng
Elsevier
Data in Brief
2352-3409
2022-06-01
42
108159
10.1016/j.dib.2022.108159
A new ChEMBL dataset for the similarity-based target fishing engine FastTargetPred: Annotation of an exhaustive list of linear tetrapeptides
Shivalika Tanwar (0), Patrick Auberger (1), Germain Gillet (2), Mario DiPaola (3), Katya Tsaioun (4), Bruno O. Villoutreix (5)
(0) Inserm UMR 1141 NeuroDiderot, Robert-Debré Hospital, Université de Paris, Paris 75019, France
(1) Université Côte d'azur, Nice, France; Inserm U1065, C3M, Team 2, Nice, France
(2) Center de Recherche en Cancérologie de Lyon, U1052 INSERM, UMR CNRS 5286, Université de Lyon, Université Lyon I, Center Léon Bérard, 28 rue Laennec, Lyon 69008, France
(3) Akttyva Therapeutics, Inc., MA, Mansfield, USA
(4) Akttyva Therapeutics, Inc., MA, Mansfield, USA; Johns Hopkins Bloomberg School of Public Health, MD, Baltimore, USA
(5) Inserm UMR 1141 NeuroDiderot, Robert-Debré Hospital, Université de Paris, Paris 75019, France; Center de Recherche en Cancérologie de Lyon, U1052 INSERM, UMR CNRS 5286, Université de Lyon, Université Lyon I, Center Léon Bérard, 28 rue Laennec, Lyon 69008, France; Corresponding author at: Inserm UMR 1141 NeuroDiderot, Robert-Debré Hospital, Université Paris Cité, Paris 75019, France.
http://www.sciencedirect.com/science/article/pii/S2352340922003638
Peptide
Virtual screening
Drug discovery
Target prediction
title A new ChEMBL dataset for the similarity-based target fishing engine FastTargetPred: Annotation of an exhaustive list of linear tetrapeptides
topic Peptide
Virtual screening
Drug discovery
Target prediction
url http://www.sciencedirect.com/science/article/pii/S2352340922003638