An ILP solution for the gene duplication problem

Abstract. Background: The gene duplication (GD) problem seeks a species tree that implies the fewest gene duplication events across a given collection of gene trees. Solving this problem makes it possible to use large gene families with complex histories...

Bibliographic Details
Main Authors: Fernández-Baca David F, Burleigh Gordon J, Chang Wen-Chieh, Eulenstein Oliver
Format: Article
Language: English
Published: BMC 2011-02-01
Series: BMC Bioinformatics
id doaj-a005ce8123aa4abd8bc6eebb3c6cdf0d
record_format Article
spelling doaj-a005ce8123aa4abd8bc6eebb3c6cdf0d 2020-11-25T02:42:07Z eng BMC BMC Bioinformatics 1471-2105 2011-02-01 12 Suppl 1 S14 10.1186/1471-2105-12-S1-S14 An ILP solution for the gene duplication problem Fernández-Baca David F Burleigh Gordon J Chang Wen-Chieh Eulenstein Oliver <p>Abstract</p> <p>Background</p> <p>The gene duplication (GD) problem seeks a species tree that implies the fewest gene duplication events across a given collection of gene trees. Solving this problem makes it possible to use large gene families with complex histories of duplication and loss to infer phylogenetic trees. However, the GD problem is NP-hard, and therefore most analyses use heuristics that lack any performance guarantee.</p> <p>Results</p> <p>We describe the first integer linear programming (ILP) formulation to solve instances of the gene duplication problem exactly. With simulations, we demonstrate that the ILP solution can solve problem instances with up to 14 taxa. Furthermore, we apply the new ILP solution to solve the gene duplication problem for the seed plant phylogeny using a 12-taxon, 6,084-gene data set. The unique, optimal solution, which places Gnetales sister to the conifers, represents a new, large-scale genomic perspective on one of the most puzzling questions in plant systematics.</p> <p>Conclusions</p> <p>Although the GD problem is NP-hard, our novel ILP solution for it can solve instances with data sets consisting of as many as 14 taxa and 1,000 genes in a few hours. These are the largest instances that have been solved to optimality to date. Thus, this work can provide large-scale genomic perspectives on phylogenetic questions that previously could only be addressed by heuristic estimates.</p>
collection DOAJ
language English
format Article
sources DOAJ
author Fernández-Baca David F
Burleigh Gordon J
Chang Wen-Chieh
Eulenstein Oliver
spellingShingle Fernández-Baca David F
Burleigh Gordon J
Chang Wen-Chieh
Eulenstein Oliver
An ILP solution for the gene duplication problem
BMC Bioinformatics
author_facet Fernández-Baca David F
Burleigh Gordon J
Chang Wen-Chieh
Eulenstein Oliver
author_sort Fernández-Baca David F
title An ILP solution for the gene duplication problem
title_sort ilp solution for the gene duplication problem
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2011-02-01
description <p>Abstract</p> <p>Background</p> <p>The gene duplication (GD) problem seeks a species tree that implies the fewest gene duplication events across a given collection of gene trees. Solving this problem makes it possible to use large gene families with complex histories of duplication and loss to infer phylogenetic trees. However, the GD problem is NP-hard, and therefore most analyses use heuristics that lack any performance guarantee.</p> <p>Results</p> <p>We describe the first integer linear programming (ILP) formulation to solve instances of the gene duplication problem exactly. With simulations, we demonstrate that the ILP solution can solve problem instances with up to 14 taxa. Furthermore, we apply the new ILP solution to solve the gene duplication problem for the seed plant phylogeny using a 12-taxon, 6,084-gene data set. The unique, optimal solution, which places Gnetales sister to the conifers, represents a new, large-scale genomic perspective on one of the most puzzling questions in plant systematics.</p> <p>Conclusions</p> <p>Although the GD problem is NP-hard, our novel ILP solution for it can solve instances with data sets consisting of as many as 14 taxa and 1,000 genes in a few hours. These are the largest instances that have been solved to optimality to date. Thus, this work can provide large-scale genomic perspectives on phylogenetic questions that previously could only be addressed by heuristic estimates.</p>
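Illustrative note: the objective described in the abstract can be made concrete with a small sketch. The code below is not the ILP formulation from the article, which this record does not reproduce. It only scores candidate species trees by the standard LCA-mapping duplication count and picks the cheapest by brute force, which is feasible only for a handful of taxa. The nested-tuple tree encoding and every function name (leaf_set, species_clusters, lca_cluster, count_duplications, total_duplications) are assumptions made for this illustration.

```python
# Sketch only: duplication cost via LCA mapping, NOT the article's ILP.
# Trees are nested tuples; leaves are species names. Each gene-tree leaf
# is labeled with the species the gene was sampled from.

def leaf_set(tree):
    """Return the set of leaf labels below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def species_clusters(species_tree):
    """Collect the leaf set (cluster) of every node in the species tree."""
    found = []
    def walk(node):
        found.append(leaf_set(node))
        if not isinstance(node, str):
            for child in node:
                walk(child)
    walk(species_tree)
    return found

def lca_cluster(all_clusters, leaves):
    """LCA mapping: the smallest species-tree cluster containing `leaves`."""
    return min((c for c in all_clusters if leaves <= c), key=len)

def count_duplications(gene_tree, species_tree):
    """Count gene-tree nodes that map to the same species node as a child."""
    all_clusters = species_clusters(species_tree)
    duplications = 0
    def walk(node):
        nonlocal duplications
        here = lca_cluster(all_clusters, leaf_set(node))
        if not isinstance(node, str):
            child_maps = [walk(child) for child in node]
            if here in child_maps:
                duplications += 1
        return here
    walk(gene_tree)
    return duplications

def total_duplications(gene_trees, species_tree):
    """GD objective: duplications summed over all gene trees."""
    return sum(count_duplications(g, species_tree) for g in gene_trees)

# Toy instance (hypothetical data): three gene trees over taxa A, B, C.
gene_trees = [(("A", "B"), "C"), (("A", "B"), "C"), (("A", "C"), "B")]
candidates = [(("A", "B"), "C"), (("A", "C"), "B"), (("B", "C"), "A")]
best = min(candidates, key=lambda s: total_duplications(gene_trees, s))
```

On this toy input the three candidate species trees cost 1, 2 and 3 duplications respectively, so `best` is `(("A", "B"), "C")`. Enumerating candidates like this grows super-exponentially with the number of taxa, which is the reason exact methods such as the ILP described in the abstract are of interest.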
_version_ 1724775279419719680