A systematic bioinformatics approach for large-scale identification and characterization of host-pathogen shared sequences

Abstract. Background: Biology has entered the era of big data with the advent of high-throughput omics technologies. Biological databases provide public access to petabytes of data and information facilitating knowledge discovery. Over the years, sequence data of pathogens has seen a large increase in...


Bibliographic Details
Main Authors: Stephen Among James, Hui San Ong, Ranjeev Hari, Asif M. Khan
Format: Article
Language: English
Published: BMC 2021-09-01
Series: BMC Genomics
Subjects: Shared sequences; Share-ome; Host-pathogen; Bioinformatics; Large-scale; Methodology
Online Access: https://doi.org/10.1186/s12864-021-07657-4
id doaj-919abc51af294c06b1bfeff4373e6d1e
record_format Article
spelling doaj-919abc51af294c06b1bfeff4373e6d1e
2021-10-03T11:38:11Z
eng
BMC
BMC Genomics
1471-2164
2021-09-01
22 S3 118
10.1186/s12864-021-07657-4
A systematic bioinformatics approach for large-scale identification and characterization of host-pathogen shared sequences
Stephen Among James, Centre for Bioinformatics, School of Data Sciences, Perdana University
Hui San Ong, Centre for Bioinformatics, School of Data Sciences, Perdana University
Ranjeev Hari, Centre for Bioinformatics, School of Data Sciences, Perdana University
Asif M. Khan, Centre for Bioinformatics, School of Data Sciences, Perdana University
https://doi.org/10.1186/s12864-021-07657-4
collection DOAJ
language English
format Article
sources DOAJ
author Stephen Among James
Hui San Ong
Ranjeev Hari
Asif M. Khan
title A systematic bioinformatics approach for large-scale identification and characterization of host-pathogen shared sequences
publisher BMC
series BMC Genomics
issn 1471-2164
publishDate 2021-09-01
description Abstract
Background: Biology has entered the era of big data with the advent of high-throughput omics technologies. Biological databases provide public access to petabytes of data and information, facilitating knowledge discovery. Over the years, sequence data of pathogens has seen a large increase in the number of records, given their relatively small genome size and their important role as infectious and symbiotic agents. Humans are host to numerous pathogenic diseases, such as those caused by viruses, many of which are responsible for high mortality and morbidity. The interaction between pathogens and humans over evolutionary history has resulted in sharing of sequences, with important biological and evolutionary implications.
Results: This study describes a large-scale, systematic bioinformatics approach for identification and characterization of shared sequences between the host and pathogen. An application of the approach is demonstrated through identification and characterization of the Flaviviridae-human share-ome. A total of 2430 nonamers represented the Flaviviridae-human share-ome with 100% identity. Although the share-ome represented a small fraction of the repertoire of Flaviviridae (~0.12%) and human (~0.013%) non-redundant nonamers, the 2430 shared nonamers mapped to 16,946 Flaviviridae and 7506 human non-redundant protein sequences. The shared nonamer sequences mapped to 125 species of Flaviviridae, including several with unclassified genus. The majority (~68%) of the shared sequences mapped to Hepacivirus C species; West Nile, dengue and Zika viruses of the Flavivirus genus accounted for ~11%, ~7%, and ~3%, respectively, of the Flaviviridae protein sequences (16,946) mapped by the share-ome. Further characterization of the share-ome provided important structural-functional insights into Flaviviridae-human interactions.
Conclusion: Mapping of the host-pathogen share-ome has important implications for the design of vaccines and drugs, diagnostics, disease surveillance and the discovery of unknown, potential host-pathogen interactions. The generic workflow presented herein is potentially applicable to a variety of pathogens, such as those of viral, bacterial or parasitic origin.
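The abstract's core idea, nonamers shared between pathogen and host proteins at 100% identity and then mapped back to the sequences that contain them, can be illustrated with a small sketch. This is not the authors' pipeline: the function names `index_kmers` and `share_ome`, the toy sequences, and the sequence IDs are all hypothetical, and exact string matching of overlapping 9-mers is assumed as the simplest reading of "100% identity".

```python
from collections import defaultdict

K = 9  # nonamer length, as stated in the abstract


def index_kmers(proteins, k=K):
    """Map each overlapping k-mer to the set of sequence IDs containing it."""
    index = defaultdict(set)
    for seq_id, seq in proteins.items():
        seq = seq.upper()
        # Sequences shorter than k yield no k-mers (empty range).
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index


def share_ome(pathogen, host, k=K):
    """Return k-mers present in both sets, with the sequences they map to.

    Hypothetical helper: exact (100% identity) matching via set intersection.
    """
    p_idx = index_kmers(pathogen, k)
    h_idx = index_kmers(host, k)
    shared = p_idx.keys() & h_idx.keys()
    return {kmer: (p_idx[kmer], h_idx[kmer]) for kmer in shared}


# Toy, made-up sequences for illustration only (not real proteins).
pathogen = {"virus_1": "MKTAYIAKQRQISFVK", "virus_2": "GGGAYIAKQRQIPPP"}
host = {"human_1": "LLAYIAKQRQISTT"}

result = share_ome(pathogen, host)
for kmer in sorted(result):
    pathogen_ids, host_ids = result[kmer]
    print(kmer, sorted(pathogen_ids), sorted(host_ids))
```

Indexing each side once and intersecting the key sets keeps the comparison linear in total sequence length; a real analysis at the scale reported would likely also need redundancy removal and a more memory-conscious index, which this sketch leaves out.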
topic Shared sequences
Share-ome
Host-pathogen
Bioinformatics
Large-scale
Methodology
url https://doi.org/10.1186/s12864-021-07657-4