Towards realistic benchmarks for multiple alignments of non-coding sequences

Abstract. Background: With the continued development of new computational tools for multiple sequence alignment, it is necessary today to develop benchmarks that aid the selection of the most effective tools. Simulation-based benchmarks...

Bibliographic Details
Main Authors: Sinha Saurabh, Kim Jaebum
Format: Article
Language: English
Published: BMC 2010-01-01
Series: BMC Bioinformatics
Online Access: http://www.biomedcentral.com/1471-2105/11/54
id doaj-7f6af02c24ac4f6ab85e0d6cd0952eb4
record_format Article
spelling doaj-7f6af02c24ac4f6ab85e0d6cd0952eb4 2020-11-25T00:04:46Z eng BMC BMC Bioinformatics 1471-2105 2010-01-01 11 1 54 10.1186/1471-2105-11-54 Towards realistic benchmarks for multiple alignments of non-coding sequences Sinha Saurabh Kim Jaebum http://www.biomedcentral.com/1471-2105/11/54
collection DOAJ
language English
format Article
sources DOAJ
author Sinha Saurabh
Kim Jaebum
title Towards realistic benchmarks for multiple alignments of non-coding sequences
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2010-01-01
description Abstract
Background: With the continued development of new computational tools for multiple sequence alignment, it is necessary today to develop benchmarks that aid the selection of the most effective tools. Simulation-based benchmarks have been proposed to meet this necessity, especially for non-coding sequences. However, it is not clear if such benchmarks truly represent real sequence data from any given group of species, in terms of the difficulty of alignment tasks.
Results: We find that the conventional simulation approach, which relies on empirically estimated values for various parameters such as substitution rate or insertion/deletion rates, is unable to generate synthetic sequences reflecting the broad genomic variation in conservation levels. We tackle this problem with a new method for simulating non-coding sequence evolution, by relying on genome-wide distributions of evolutionary parameters rather than their averages. We then generate synthetic data sets to mimic orthologous sequences from the Drosophila group of species, and show that these data sets truly represent the variability observed in genomic data in terms of the difficulty of the alignment task. This allows us to make significant progress towards estimating the alignment accuracy of current tools in an absolute sense, going beyond only a relative assessment of different tools. We evaluate six widely used multiple alignment tools in the context of Drosophila non-coding sequences, and find the accuracy to be significantly different from previously reported values. Interestingly, the performance of most tools degrades more rapidly when there are more insertions than deletions in the data set, suggesting an asymmetric handling of insertions and deletions, even though none of the evaluated tools explicitly distinguishes these two types of events. We also examine the accuracy of two existing tools for annotating insertions versus deletions, and find their performance to be close to optimal in Drosophila non-coding sequences if provided with the true alignments.
Conclusion: We have developed a method to generate benchmarks for multiple alignments of Drosophila non-coding sequences, and shown it to be more realistic than traditional benchmarks. Apart from helping to select the most effective tools, these benchmarks will help practitioners of comparative genomics deal with the effects of alignment errors, by providing accurate estimates of the extent of these errors.
url http://www.biomedcentral.com/1471-2105/11/54