PPalign: optimal alignment of Potts models representing proteins with direct coupling information
Abstract Background To assign structural and functional annotations to the ever increasing amount of sequenced proteins, the main approach relies on sequence-based homology search methods, e.g. BLAST or the current state-of-the-art methods based on profile Hidden Markov Models, which rely on signifi...
Main Authors: | Hugo Talibart, François Coste |
---|---|
Format: | Article |
Language: | English |
Published: | BMC 2021-06-01 |
Series: | BMC Bioinformatics |
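Subjects: | Direct coupling analysis; Potts model; Integer linear programming; Proteins; Sequence alignment; Homology |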
Online Access: | https://doi.org/10.1186/s12859-021-04222-4 |
id |
doaj-b17c29587eab41fa9034677d17e64402 |
---|---|
record_format |
Article |
spelling |
doaj-b17c29587eab41fa9034677d17e64402; 2021-06-13T11:57:25Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2021-06-01; 22; 1; 1; 22; 10.1186/s12859-021-04222-4; PPalign: optimal alignment of Potts models representing proteins with direct coupling information; Hugo Talibart 0; François Coste 1; Univ Rennes, Inria, CNRS, IRISA; Univ Rennes, Inria, CNRS, IRISA; Abstract Background: To assign structural and functional annotations to the ever-increasing number of sequenced proteins, the main approach relies on sequence-based homology search methods, e.g. BLAST or the current state-of-the-art methods based on profile Hidden Markov Models, which rely on significant alignments of query sequences to annotated proteins or protein families. While powerful, these approaches do not take coevolution between residues into account. Taking advantage of recent advances in the field of contact prediction, we propose here to represent proteins by Potts models, which model direct couplings between positions in addition to positional composition, and to compare proteins by aligning these models. Due to non-local dependencies, the problem of aligning Potts models is hard and remains the main computational bottleneck for their use. Methods: We introduce here an Integer Linear Programming formulation of the problem and PPalign, a program based on this formulation, to compute the optimal pairwise alignment of Potts models representing proteins in tractable time. The approach is assessed on a non-redundant set of reference pairwise sequence alignments from the SISYPHUS benchmark that have the lowest sequence identity (between 3% and 20%) and for which reliable Potts models can be built for each sequence to be aligned. These experiments confirm that Potts models can be aligned in reasonable time (1 min 37 s on average on these alignments). The contribution of couplings is evaluated in comparison with HHalign and independent-site PPalign.
Although the Potts models were not fully optimized for alignment purposes and simple gap scores were used, PPalign yields a better mean F1 score and finds significantly better alignments than HHalign and PPalign without couplings in some cases. Conclusions: These results show that pairwise couplings from protein Potts models can be used to improve the alignment of remotely related protein sequences in tractable time. Our experiments nevertheless suggest that new research on the inference of Potts models is now needed to make them more comparable and suitable for homology search. We think that PPalign’s guaranteed optimality will be a powerful asset for unbiased investigations in this direction.; https://doi.org/10.1186/s12859-021-04222-4; Direct coupling analysis; Potts model; Integer linear programming; Proteins; Sequence alignment; Homology |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Hugo Talibart François Coste |
spellingShingle |
Hugo Talibart François Coste PPalign: optimal alignment of Potts models representing proteins with direct coupling information BMC Bioinformatics Direct coupling analysis Potts model Integer linear programming Proteins Sequence alignment Homology |
author_facet |
Hugo Talibart François Coste |
author_sort |
Hugo Talibart |
title |
PPalign: optimal alignment of Potts models representing proteins with direct coupling information |
title_sort |
ppalign: optimal alignment of potts models representing proteins with direct coupling information |
publisher |
BMC |
series |
BMC Bioinformatics |
issn |
1471-2105 |
publishDate |
2021-06-01 |
description |
Abstract Background: To assign structural and functional annotations to the ever-increasing number of sequenced proteins, the main approach relies on sequence-based homology search methods, e.g. BLAST or the current state-of-the-art methods based on profile Hidden Markov Models, which rely on significant alignments of query sequences to annotated proteins or protein families. While powerful, these approaches do not take coevolution between residues into account. Taking advantage of recent advances in the field of contact prediction, we propose here to represent proteins by Potts models, which model direct couplings between positions in addition to positional composition, and to compare proteins by aligning these models. Due to non-local dependencies, the problem of aligning Potts models is hard and remains the main computational bottleneck for their use. Methods: We introduce here an Integer Linear Programming formulation of the problem and PPalign, a program based on this formulation, to compute the optimal pairwise alignment of Potts models representing proteins in tractable time. The approach is assessed on a non-redundant set of reference pairwise sequence alignments from the SISYPHUS benchmark that have the lowest sequence identity (between 3% and 20%) and for which reliable Potts models can be built for each sequence to be aligned. These experiments confirm that Potts models can be aligned in reasonable time (1 min 37 s on average on these alignments). The contribution of couplings is evaluated in comparison with HHalign and independent-site PPalign. Although the Potts models were not fully optimized for alignment purposes and simple gap scores were used, PPalign yields a better mean F1 score and finds significantly better alignments than HHalign and PPalign without couplings in some cases.
Conclusions: These results show that pairwise couplings from protein Potts models can be used to improve the alignment of remotely related protein sequences in tractable time. Our experiments nevertheless suggest that new research on the inference of Potts models is now needed to make them more comparable and suitable for homology search. We think that PPalign’s guaranteed optimality will be a powerful asset for unbiased investigations in this direction. |
topic |
Direct coupling analysis Potts model Integer linear programming Proteins Sequence alignment Homology |
url |
https://doi.org/10.1186/s12859-021-04222-4 |