Short Read Alignment Based on Maximal Approximate Match Seeds

Sequence alignment is a critical step in many genomic studies, such as variant calling, quantitative transcriptome analysis (RNA-seq), and metagenomic sequence classification. However, alignment performance is strongly affected by repetitive sequences in the reference genome, which exten...


Bibliographic Details
Main Authors: Wei Quan, Dengfeng Guan, Guangri Quan, Bo Liu, Yadong Wang
Format: Article
Language: English
Published: Frontiers Media S.A. 2020-11-01
Series: Frontiers in Molecular Biosciences
Subjects: whole-genome resequencing, next-generation sequencing, repeats, sequence alignment, FM-index
Online Access: https://www.frontiersin.org/articles/10.3389/fmolb.2020.572934/full
id doaj-84ab495b487744a2a1e144415115d845
record_format Article
spelling doaj-84ab495b487744a2a1e144415115d845
2020-11-25T04:08:24Z
eng
Frontiers Media S.A.
Frontiers in Molecular Biosciences
2296-889X
2020-11-01
7
10.3389/fmolb.2020.572934
572934
Short Read Alignment Based on Maximal Approximate Match Seeds
Wei Quan (0)
Dengfeng Guan (1)
Dengfeng Guan (2)
Guangri Quan (3)
Bo Liu (4)
Yadong Wang (5)
(0) School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
(1) School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
(2) Institute of Zoology, Chinese Academy of Sciences, Beijing, China
(3) School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
(4) School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
(5) School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
Sequence alignment is a critical step in many genomic studies, such as variant calling, quantitative transcriptome analysis (RNA-seq), and metagenomic sequence classification. However, alignment performance is strongly affected by repetitive sequences in the reference genome, which are widespread in species from bacteria to mammals. Aligning repetitive sequences can produce a very large number of candidate locations, imposing a heavy computational burden. Thus, most alignment tools simply discard highly repetitive seeds, but this may cause the true alignment to be missed. Using maximal exact matches (MEMs) as seeds is an option, but MEM seeds may fail when sequencing errors or genomic variations fall within them. Here, we propose a novel sequence alignment algorithm, named MAM, which can efficiently align short DNA sequences. MAM first builds a modified Burrows-Wheeler transform (BWT) structure of the reference genome to accelerate approximate seed matching. Then, MAM uses maximal approximate match (MAM) seeds to reduce the number of candidate locations. Finally, MAM applies affine-gap-penalty dynamic programming to extend the MAM seeds. Experimental results on simulated and real sequencing datasets show that MAM is faster than other state-of-the-art alignment tools. The source code is available at https://github.com/weiquan/mam.
https://www.frontiersin.org/articles/10.3389/fmolb.2020.572934/full
whole-genome resequencing
next-generation sequencing
repeats
sequence alignment
FM-index
collection DOAJ
language English
format Article
sources DOAJ
author Wei Quan
Dengfeng Guan
Guangri Quan
Bo Liu
Yadong Wang
title Short Read Alignment Based on Maximal Approximate Match Seeds
publisher Frontiers Media S.A.
series Frontiers in Molecular Biosciences
issn 2296-889X
publishDate 2020-11-01
description Sequence alignment is a critical step in many genomic studies, such as variant calling, quantitative transcriptome analysis (RNA-seq), and metagenomic sequence classification. However, alignment performance is strongly affected by repetitive sequences in the reference genome, which are widespread in species from bacteria to mammals. Aligning repetitive sequences can produce a very large number of candidate locations, imposing a heavy computational burden. Thus, most alignment tools simply discard highly repetitive seeds, but this may cause the true alignment to be missed. Using maximal exact matches (MEMs) as seeds is an option, but MEM seeds may fail when sequencing errors or genomic variations fall within them. Here, we propose a novel sequence alignment algorithm, named MAM, which can efficiently align short DNA sequences. MAM first builds a modified Burrows-Wheeler transform (BWT) structure of the reference genome to accelerate approximate seed matching. Then, MAM uses maximal approximate match (MAM) seeds to reduce the number of candidate locations. Finally, MAM applies affine-gap-penalty dynamic programming to extend the MAM seeds. Experimental results on simulated and real sequencing datasets show that MAM is faster than other state-of-the-art alignment tools. The source code is available at https://github.com/weiquan/mam.
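The description mentions BWT-based seed matching but gives no detail of the modified structure. As a rough illustration only, the following Python sketch shows textbook FM-index backward search for exact seed lookup. It is not the MAM implementation: the function names `build_fm_index` and `backward_search` are hypothetical, the index is built naively for small inputs, and the approximate matching that MAM adds is not reproduced here.

```python
def build_fm_index(text):
    """Build a toy FM-index: suffix array, C table, and Occ table.

    Naive construction (sorting whole suffixes), suitable only for
    short example strings, not for real genomes.
    """
    text += "$"  # sentinel; sorts before A, C, G, T in ASCII
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to "$" for i == 0)
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # c_table[ch]: number of characters in text strictly smaller than ch
    c_table, total = {}, 0
    for ch in alphabet:
        c_table[ch] = total
        total += text.count(ch)
    # occ[ch][i]: number of occurrences of ch in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in alphabet}
    for i, b in enumerate(bwt):
        for ch in alphabet:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return sa, c_table, occ


def backward_search(pattern, sa, c_table, occ):
    """Return sorted reference positions where pattern occurs exactly."""
    lo, hi = 0, len(sa)  # half-open suffix-array interval [lo, hi)
    for ch in reversed(pattern):  # extend the match one base to the left
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:  # interval is empty: no occurrence
            return []
    return sorted(sa[lo:hi])


# Usage: a repeated seed maps to several candidate locations.
sa, c_table, occ = build_fm_index("GATTACAGATTACA")
print(backward_search("ATTA", sa, c_table, occ))  # [1, 8]
```

In this toy reference the seed `ATTA` occurs twice, which is the small-scale version of the repeat problem the description raises: each repeated seed multiplies the candidate locations that the extension step must examine.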
topic whole-genome resequencing
next-generation sequencing
repeats
sequence alignment
FM-index
url https://www.frontiersin.org/articles/10.3389/fmolb.2020.572934/full