Natural family-free genomic distance

Abstract. Background: A classical problem in comparative genomics is to compute the rearrangement distance, that is, the minimum number of large-scale rearrangements required to transform a given genome into another given genome. The traditional approaches in this area are family-based, i.e., require t...

Bibliographic Details
Main Authors:	Diego P. Rubert, Fábio V. Martinez, Marília D. V. Braga
Format:	Article
Language:	English
Published:	BMC 2021-05-01
Series:	Algorithms for Molecular Biology
Subjects:	Comparative genomics; Genome rearrangement; DCJ-indel distance
Online Access:	https://doi.org/10.1186/s13015-021-00183-8

id	doaj-d286cf411eee4526b09bcaca19587e86
citation	Algorithms for Molecular Biology 16(1):1-16 (2021)
affiliation	Diego P. Rubert: Faculdade de Computação, Universidade Federal de Mato Grosso do Sul
affiliation	Fábio V. Martinez: Faculdade de Computação, Universidade Federal de Mato Grosso do Sul
affiliation	Marília D. V. Braga: Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University
collection	DOAJ
issn	1748-7188
description	Abstract. Background: A classical problem in comparative genomics is to compute the rearrangement distance, that is, the minimum number of large-scale rearrangements required to transform a given genome into another given genome. The traditional approaches in this area are family-based, i.e., they require the classification of DNA fragments of both genomes into families. Furthermore, the most elementary family-based models, which are able to compute distances in polynomial time, restrict the families to occur at most once in each genome. In contrast, the distance computation in models that allow multifamilies (i.e., families with multiple occurrences) is NP-hard. Very recently, Bohnenkämper et al. (J Comput Biol 28:410–431, 2021) proposed an ILP formulation for computing the genomic distance of genomes with multifamilies, allowing structural rearrangements, represented by the generic double cut and join (DCJ) operation, and content-modifying insertions and deletions of DNA segments. This ILP is very efficient, but must maximize a matching of the genes in each multifamily in order to prevent the free lunch artifact that would otherwise let empty or almost empty matchings give smaller distances. Results: In this paper, we adopt the alternative family-free setting that, instead of family classification, simply uses the pairwise similarities between DNA fragments of both genomes to compute their rearrangement distance. We adapted the ILP mentioned above and developed a model in which pairwise similarities are used to assign weights to both matched and unmatched genes, so that an optimal solution does not necessarily maximize the matching. Our model then results in a natural family-free genomic distance that takes into consideration all given genes, without prior classification into families, and has a search space composed of matchings of any size. In spite of its bigger search space, our ILP seems to be boosted by a reduction of the number of co-optimal solutions due to the weights. Indeed, it converged faster than the original one by Bohnenkämper et al. for instances with the same number of multiple connections. We can handle not only bacterial genomes, but also fungi and insects, or sets of chromosomes of mammals and plants. In a comparison study of six fruit fly genomes, we obtained accurate results.

Similar Items