Genomic benchmarks: a collection of datasets for genomic sequence classification

Abstract. Background: Recently, deep neural networks have been successfully applied in many biological fields. In 2020, the deep learning model AlphaFold won the protein folding competition with predicted structures within the error tolerance of experimental methods. However, this solution to the most p...

Bibliographic Details
Published in:	BMC Genomic Data
Main Authors:	Katarína Grešová, Vlastimil Martinek, David Čechák, Petr Šimeček, Panagiotis Alexiou
Format:	Article
Language:	English
Published:	BMC 2023-05-01
Subjects:	Genomics, Dataset, Benchmark, Deep learning, Convolutional neural network
Online Access:	https://doi.org/10.1186/s12863-023-01123-8

_version_	1852677678457421824
author	Katarína Grešová Vlastimil Martinek David Čechák Petr Šimeček Panagiotis Alexiou
author_facet	Katarína Grešová Vlastimil Martinek David Čechák Petr Šimeček Panagiotis Alexiou
author_sort	Katarína Grešová
collection	DOAJ
container_title	BMC Genomic Data
description	Abstract. Background: Recently, deep neural networks have been successfully applied in many biological fields. In 2020, the deep learning model AlphaFold won the protein folding competition with predicted structures within the error tolerance of experimental methods. However, this solution to the most prominent bioinformatic challenge of the past 50 years was possible only thanks to a carefully curated benchmark of experimentally determined protein structures. In genomics, we face similar challenges (annotation of genomes and identification of functional elements), but we currently lack benchmarks comparable to the protein folding competition. Results: Here we present a collection of curated and easily accessible sequence classification datasets in the field of genomics. The collection combines novel datasets constructed by mining publicly available databases with existing datasets obtained from published articles. It currently contains nine datasets that focus on regulatory elements (promoters, enhancers, open chromatin regions) from three model organisms: human, mouse, and roundworm. A simple convolutional neural network is also included in the repository and can be used as a baseline model. The benchmarks and the baseline model are distributed as the Python package ‘genomic-benchmarks’, and the code is available at https://github.com/ML-Bioinfo-CEITEC/genomic_benchmarks . Conclusions: Deep learning techniques have revolutionized many biological fields, but mainly thanks to carefully curated benchmarks. For the field of genomics, we propose a collection of benchmark datasets for the classification of genomic sequences with an interface for the most commonly used deep learning libraries, an implementation of a simple neural network, and a training framework that can be used as a starting point for future research. The main aim of this effort is to create a repository of shared datasets that will make machine learning for genomics more comparable and reproducible while reducing the overhead for researchers who want to enter the field, leading to healthy competition and new discoveries.
format	Article
id	doaj-art-cf6dd7e2cf5c4e5da2445ff3263c0ebf
institution	Directory of Open Access Journals
issn	2730-6844
language	English
publishDate	2023-05-01
publisher	BMC
record_format	Article
spelling	doaj-art-cf6dd7e2cf5c4e5da2445ff3263c0ebf | 2025-08-19T21:29:57Z | eng | BMC | BMC Genomic Data | 2730-6844 | 2023-05-01 | 24 | 1 | 1 | 9 | 10.1186/s12863-023-01123-8 | Genomic benchmarks: a collection of datasets for genomic sequence classification | Katarína Grešová (0), Vlastimil Martinek (1), David Čechák (2), Petr Šimeček (3), Panagiotis Alexiou (4) | Centre for Molecular Medicine, Central European Institute of Technology (CEITEC), Masaryk University (affiliation of all five authors) | https://doi.org/10.1186/s12863-023-01123-8 | Genomics; Dataset; Benchmark; Deep learning; Convolutional neural network
spellingShingle	Katarína Grešová Vlastimil Martinek David Čechák Petr Šimeček Panagiotis Alexiou Genomic benchmarks: a collection of datasets for genomic sequence classification Genomics Dataset Benchmark Deep learning Convolutional neural network
title	Genomic benchmarks: a collection of datasets for genomic sequence classification
title_full	Genomic benchmarks: a collection of datasets for genomic sequence classification
title_fullStr	Genomic benchmarks: a collection of datasets for genomic sequence classification
title_full_unstemmed	Genomic benchmarks: a collection of datasets for genomic sequence classification
title_short	Genomic benchmarks: a collection of datasets for genomic sequence classification
title_sort	genomic benchmarks a collection of datasets for genomic sequence classification
topic	Genomics Dataset Benchmark Deep learning Convolutional neural network
url	https://doi.org/10.1186/s12863-023-01123-8
work_keys_str_mv	AT katarinagresova genomicbenchmarksacollectionofdatasetsforgenomicsequenceclassification AT vlastimilmartinek genomicbenchmarksacollectionofdatasetsforgenomicsequenceclassification AT davidcechak genomicbenchmarksacollectionofdatasetsforgenomicsequenceclassification AT petrsimecek genomicbenchmarksacollectionofdatasetsforgenomicsequenceclassification AT panagiotisalexiou genomicbenchmarksacollectionofdatasetsforgenomicsequenceclassification
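
The description in this record mentions a simple convolutional neural network used as a baseline for classifying genomic sequences. As a rough illustration of the idea only, the sketch below one-hot encodes a DNA string and scans it with a single convolutional filter, followed by max pooling and a threshold. It is not the authors' model and does not use the API of the 'genomic-benchmarks' package; the function names, the `ACGT` ordering, the `TATA` filter, and the threshold are all assumptions chosen for the example.

```python
ALPHABET = "ACGT"  # assumed nucleotide order; any fixed order works


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows.

    Bases outside ALPHABET (e.g. N) become all-zero rows.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    rows = []
    for base in seq.upper():
        row = [0.0] * len(ALPHABET)
        if base in index:
            row[index[base]] = 1.0
        rows.append(row)
    return rows


def conv1d_scan(encoded, kernel):
    """Slide a (width x 4) kernel along the sequence without padding.

    Returns one score per window: the element-wise product summed
    over the window, which is what a single 1D convolutional filter
    computes before any bias or nonlinearity.
    """
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        score = sum(
            x * w
            for row, kernel_row in zip(window, kernel)
            for x, w in zip(row, kernel_row)
        )
        scores.append(score)
    return scores


def classify(seq, kernel, threshold):
    """Global max pooling over the scan, then a threshold.

    Returns 1 if any window scores at least `threshold`, else 0.
    Sequences shorter than the kernel yield no windows and return 0.
    """
    scores = conv1d_scan(one_hot(seq), kernel)
    return int(bool(scores) and max(scores) >= threshold)


# Hypothetical hand-written filter that matches the motif "TATA" exactly.
# In a trained network these weights would be learned from data.
tata_kernel = one_hot("TATA")

print(classify("GGCTATAAGC", tata_kernel, threshold=4.0))
print(classify("GGCGCGCGCC", tata_kernel, threshold=4.0))
```

With the hand-written filter, the first call returns 1 because the sequence contains `TATA`, and the second returns 0. A real baseline would stack several learned filters and train them on labeled sequences; this sketch only shows the encoding and scanning step.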