Short Read Alignment Based on Maximal Approximate Match Seeds
Sequence alignment is a critical step in many genomic studies, such as variant calling, quantitative transcriptome analysis (RNA-seq), and metagenomic sequence classification. However, alignment performance is strongly affected by repetitive sequences in the reference genome, which exten...
Main Authors: | Wei Quan, Dengfeng Guan, Guangri Quan, Bo Liu, Yadong Wang |
---|---|
Format: | Article |
Language: | English |
Published: | Frontiers Media S.A., 2020-11-01 |
Series: | Frontiers in Molecular Biosciences |
Subjects: | whole-genome resequencing; next-generation sequencing; repeats; sequence alignment; FM-index |
Online Access: | https://www.frontiersin.org/articles/10.3389/fmolb.2020.572934/full |
id |
doaj-84ab495b487744a2a1e144415115d845 |
---|---|
record_format |
Article |
spelling |
doaj-84ab495b487744a2a1e144415115d845; 2020-11-25T04:08:24Z; eng; Frontiers Media S.A.; Frontiers in Molecular Biosciences; 2296-889X; 2020-11-01; 7; 10.3389/fmolb.2020.572934; 572934; Short Read Alignment Based on Maximal Approximate Match Seeds; Wei Quan (School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China); Dengfeng Guan (School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China, and Institute of Zoology, Chinese Academy of Sciences, Beijing, China); Guangri Quan (School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China); Bo Liu (School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China); Yadong Wang (School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China); https://www.frontiersin.org/articles/10.3389/fmolb.2020.572934/full; whole-genome resequencing; next-generation sequencing; repeats; sequence alignment; FM-index |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Wei Quan, Dengfeng Guan, Guangri Quan, Bo Liu, Yadong Wang |
author_sort |
Wei Quan |
title |
Short Read Alignment Based on Maximal Approximate Match Seeds |
publisher |
Frontiers Media S.A. |
series |
Frontiers in Molecular Biosciences |
issn |
2296-889X |
publishDate |
2020-11-01 |
description |
Sequence alignment is a critical step in many genomic studies, such as variant calling, quantitative transcriptome analysis (RNA-seq), and metagenomic sequence classification. However, alignment performance is strongly affected by repetitive sequences in the reference genome, which are widespread in species from bacteria to mammals. Aligning repetitive sequences can produce a very large number of candidate locations, which imposes a heavy computational burden. Most alignment tools therefore simply discard highly repetitive seeds, but this may cause the true alignment to be missed. Using maximal exact matches (MEMs) as seeds is an option, but MEM seeds may fail when sequencing errors or genomic variations fall within them. Here, we propose a novel sequence alignment algorithm, named MAM, which can efficiently align short DNA sequences. MAM first builds a modified Burrows-Wheeler transform (BWT) structure of the reference genome to accelerate approximate seed matching. Then, MAM uses maximal approximate match (MAM) seeds to reduce the number of candidate locations. Finally, MAM applies affine-gap-penalty dynamic programming to extend the MAM seeds. Experimental results on simulated and real sequencing datasets show that MAM achieves better speed than other state-of-the-art alignment tools. The source code is available at https://github.com/weiquan/mam. |
topic |
whole-genome resequencing; next-generation sequencing; repeats; sequence alignment; FM-index |
url |
https://www.frontiersin.org/articles/10.3389/fmolb.2020.572934/full |
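
The description above names affine-gap-penalty dynamic programming as the final extension step. As a rough illustration of that general technique only, the sketch below scores a global alignment with the textbook three-matrix (Gotoh) recurrence. It is not taken from the MAM source code: the function name `affine_gap_score` and all scoring parameters are assumptions chosen for the example, and a real aligner would extend outward from a seed rather than align two whole strings.

```python
NEG_INF = float("-inf")


def affine_gap_score(a, b, match=1, mismatch=-4, gap_open=-6, gap_extend=-1):
    """Best global alignment score of a and b under affine gap penalties.

    A gap of length k costs gap_open + k * gap_extend.
    All parameter values are illustrative assumptions, not MAM's defaults.
    """
    n, m = len(a), len(b)
    # M: alignment ends with a[i-1] paired to b[j-1]
    # X: alignment ends with a[i-1] paired to a gap
    # Y: alignment ends with b[j-1] paired to a gap
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + i * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + j * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Opening a gap pays gap_open once; continuing one pays only gap_extend.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          Y[i - 1][j] + gap_open,
                          X[i - 1][j]) + gap_extend
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          X[i][j - 1] + gap_open,
                          Y[i][j - 1]) + gap_extend
    return max(M[n][m], X[n][m], Y[n][m])
```

With these assumed parameters, `affine_gap_score("ACGTACGT", "ACGT")` is -6: four matches plus one gap of length four (-6 - 4). Splitting the same four gap positions into two separate gaps would pay the opening penalty twice, which is why affine penalties favor a single long indel over several short ones.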