A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen

Abstract Background Sequence similarity networks are useful for classifying and characterizing biologically important proteins. Threshold-based approaches to similarity network construction using exact distance measures are prohibitively slow to compute and rely on the difficult task of selecting an...

Bibliographic Details
Main Authors: Helen N. Catanese, Kelly A. Brayton, Assefaw H. Gebremedhin
Format: Article
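Language: English
Published: BMC 2018-12-01
Series: BMC Bioinformatics
Subjects: Sequence similarity network; Network analysis; Centrality; Clustering; Anaplasma marginale Msp1a; GroEL
Online Access: http://link.springer.com/article/10.1186/s12859-018-2453-2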
id doaj-a20336cb93514cbda4ec29eda41dd191
record_format Article
spelling doaj-a20336cb93514cbda4ec29eda41dd191 2020-11-25T01:43:43Z eng BMC BMC Bioinformatics 1471-2105 2018-12-01 191118 10.1186/s12859-018-2453-2
A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen
Helen N. Catanese [0]
Kelly A. Brayton [1]
Assefaw H. Gebremedhin [2]
[0] School of Electrical Engineering and Computer Science, Washington State University
[1] School of Electrical Engineering and Computer Science, Washington State University
[2] School of Electrical Engineering and Computer Science, Washington State University
http://link.springer.com/article/10.1186/s12859-018-2453-2
Sequence similarity network
Network analysis
Centrality
Clustering
Anaplasma marginale Msp1a
GroEL
collection DOAJ
language English
format Article
sources DOAJ
author Helen N. Catanese
Kelly A. Brayton
Assefaw H. Gebremedhin
title A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2018-12-01
description Abstract
Background: Sequence similarity networks are useful for classifying and characterizing biologically important proteins. Threshold-based approaches to similarity network construction using exact distance measures are prohibitively slow to compute and rely on the difficult task of selecting an appropriate threshold, while similarity networks based on approximate distance calculations compromise useful structural information.
Results: We present an alternative network representation for a set of sequence data that overcomes these drawbacks. In our model, called the Directed Weighted All Nearest Neighbors (DiWANN) network, each sequence is represented by a node and is connected via a directed edge to only the closest sequence, or sequences in the case of ties, in the dataset. Our contributions span several aspects. Specifically, we: (i) apply an all nearest neighbors network model to protein sequence data from three different applications and examine the structural properties of the networks; (ii) compare the model against threshold-based networks to validate their semantic equivalence, and demonstrate the relative advantages the model offers; (iii) demonstrate the model’s resilience to missing sequences; and (iv) develop an efficient algorithm for constructing a DiWANN network from a set of sequences. We find that the DiWANN network representation attains similar semantic properties to threshold-based graphs, while avoiding weaknesses of both high and low threshold graphs. Additionally, we find that approximate distance networks, using BLAST bitscores in place of exact edit distances, can cause significant loss of structural information. We show that the proposed DiWANN network construction algorithm provides a fourfold speedup over a standard threshold-based approach to network construction. We also identify a relationship between the centrality of a sequence in a similarity network of an Anaplasma marginale short sequence repeat dataset and how broadly that sequence is dispersed geographically.
Conclusion: We demonstrate that using approximate distance measures to rapidly construct similarity networks may lead to significant deficiencies in the structure of that network in terms of centrality and clustering analyses. We present a new network representation that maintains the structural semantics of threshold-based networks while increasing connectedness, and an algorithm for constructing the network using exact distance measures in a fraction of the time it would take to build a threshold-based equivalent.
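The DiWANN definition in the abstract (each sequence gets a directed, weighted edge to its closest sequence, with all ties kept) can be illustrated with a short sketch. This is a naive all-pairs construction for illustration only, not the efficient algorithm the authors developed. It assumes plain Levenshtein edit distance as the exact distance measure, and the function names `edit_distance`, `diwann_edges`, and `threshold_edges` are hypothetical, chosen here for readability.

```python
from typing import List, Tuple

Edge = Tuple[int, int, int]  # (source index, target index, distance)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def diwann_edges(seqs: List[str]) -> List[Edge]:
    """Naive sketch: point each sequence at its nearest neighbor(s).

    Every tie for the minimum distance yields its own directed edge,
    and the edge weight is the distance itself.
    """
    edges: List[Edge] = []
    for i, s in enumerate(seqs):
        dists = [(edit_distance(s, t), j)
                 for j, t in enumerate(seqs) if j != i]
        if not dists:
            continue  # a single sequence has no neighbors
        best = min(d for d, _ in dists)
        edges.extend((i, j, d) for d, j in dists if d == best)
    return edges


def threshold_edges(seqs: List[str], threshold: int) -> List[Edge]:
    """Undirected threshold network for comparison: keep pairs i < j
    whose distance is at most the chosen threshold."""
    edges: List[Edge] = []
    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            d = edit_distance(seqs[i], seqs[j])
            if d <= threshold:
                edges.append((i, j, d))
    return edges
```

On the toy input `["ACGT", "ACGA", "ACCA", "TTTT"]`, the outlier `TTTT` still receives an outgoing edge to `ACGT` in the nearest-neighbors sketch, while a threshold network with threshold 1 leaves it isolated. That is one way to picture the increased connectedness the abstract describes, though real datasets and the authors' construction may behave differently.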
topic Sequence similarity network
Network analysis
Centrality
Clustering
Anaplasma marginale Msp1a
GroEL
url http://link.springer.com/article/10.1186/s12859-018-2453-2