START: a system for flexible analysis of hundreds of genomic signal tracks in few lines of SQL-like queries

Abstract Background A genomic signal track is a set of genomic intervals associated with values of various types, such as measurements from high-throughput experiments. Analysis of signal tracks requires complex computational methods, which often make the analysts focus too much on the detailed comp...

Bibliographic Details
Main Authors: Xinjie Zhu, Qiang Zhang, Eric Dun Ho, Ken Hung-On Yu, Chris Liu, Tim H. Huang, Alfred Sze-Lok Cheng, Ben Kao, Eric Lo, Kevin Y. Yip
Format: Article
Language: English
Published: BMC 2017-09-01
Series: BMC Genomics
Subjects: Human genomics, Signal tracks, Data analysis
Online Access: http://link.springer.com/article/10.1186/s12864-017-4071-1
id doaj-57b609a486fd4677a99168b5154b364e
record_format Article
spelling doaj-57b609a486fd4677a99168b5154b364e
2020-11-25T02:34:42Z
eng
BMC
BMC Genomics
1471-2164
2017-09-01
181118
10.1186/s12864-017-4071-1
START: a system for flexible analysis of hundreds of genomic signal tracks in few lines of SQL-like queries
Xinjie Zhu 0
Qiang Zhang 1
Eric Dun Ho 2
Ken Hung-On Yu 3
Chris Liu 4
Tim H. Huang 5
Alfred Sze-Lok Cheng 6
Ben Kao 7
Eric Lo 8
Kevin Y. Yip 9
Department of Computer Science, The University of Hong Kong
School of Computing, Hong Kong Polytechnic University
Department of Computer Science and Engineering, The Chinese University of Hong Kong
Department of Computer Science and Engineering, The Chinese University of Hong Kong
Department of Computer Science and Engineering, The Chinese University of Hong Kong
Department of Molecular Medicine, University of Texas Health Science Center at San Antonio
School of Biomedical Sciences, The Chinese University of Hong Kong
Department of Computer Science, The University of Hong Kong
Department of Computer Science and Engineering, The Chinese University of Hong Kong
Department of Computer Science and Engineering, The Chinese University of Hong Kong
Abstract. Background: A genomic signal track is a set of genomic intervals associated with values of various types, such as measurements from high-throughput experiments. Analysis of signal tracks requires complex computational methods, which often make the analysts focus too much on the detailed computational steps rather than on their biological questions. Results: Here we propose Signal Track Query Language (STQL) for simple analysis of signal tracks. It is a Structured Query Language (SQL)-like declarative language, which means one only specifies what computations need to be done but not how these computations are to be carried out. STQL provides a rich set of constructs for manipulating genomic intervals and their values. To run STQL queries, we have developed the Signal Track Analytical Research Tool (START, http://yiplab.cse.cuhk.edu.hk/start/ ), a system that includes a Web-based user interface and a back-end execution system. The user interface helps users select data from our database of around 10,000 commonly-used public signal tracks, manage their own tracks, and construct, store and share STQL queries. The back-end system automatically translates STQL queries into optimized low-level programs and runs them on a computer cluster in parallel. We use STQL to perform 14 representative analytical tasks. By repeating these analyses using bedtools, Galaxy and custom Python scripts, we show that the STQL solution is usually the simplest, and the parallel execution achieves significant speed-up with large data files. Finally, we describe how a biologist with minimal formal training in computer programming self-learned STQL to analyze DNA methylation data we produced from 60 pairs of hepatocellular carcinoma (HCC) samples. Conclusions: Overall, STQL and START provide a generic way for analyzing a large number of genomic signal tracks in parallel easily.
http://link.springer.com/article/10.1186/s12864-017-4071-1
Human genomics
Signal tracks
Data analysis
collection DOAJ
language English
format Article
sources DOAJ
author Xinjie Zhu
Qiang Zhang
Eric Dun Ho
Ken Hung-On Yu
Chris Liu
Tim H. Huang
Alfred Sze-Lok Cheng
Ben Kao
Eric Lo
Kevin Y. Yip
author_sort Xinjie Zhu
title START: a system for flexible analysis of hundreds of genomic signal tracks in few lines of SQL-like queries
title_sort start: a system for flexible analysis of hundreds of genomic signal tracks in few lines of sql-like queries
publisher BMC
series BMC Genomics
issn 1471-2164
publishDate 2017-09-01
description Abstract. Background: A genomic signal track is a set of genomic intervals associated with values of various types, such as measurements from high-throughput experiments. Analysis of signal tracks requires complex computational methods, which often make the analysts focus too much on the detailed computational steps rather than on their biological questions. Results: Here we propose Signal Track Query Language (STQL) for simple analysis of signal tracks. It is a Structured Query Language (SQL)-like declarative language, which means one only specifies what computations need to be done but not how these computations are to be carried out. STQL provides a rich set of constructs for manipulating genomic intervals and their values. To run STQL queries, we have developed the Signal Track Analytical Research Tool (START, http://yiplab.cse.cuhk.edu.hk/start/ ), a system that includes a Web-based user interface and a back-end execution system. The user interface helps users select data from our database of around 10,000 commonly-used public signal tracks, manage their own tracks, and construct, store and share STQL queries. The back-end system automatically translates STQL queries into optimized low-level programs and runs them on a computer cluster in parallel. We use STQL to perform 14 representative analytical tasks. By repeating these analyses using bedtools, Galaxy and custom Python scripts, we show that the STQL solution is usually the simplest, and the parallel execution achieves significant speed-up with large data files. Finally, we describe how a biologist with minimal formal training in computer programming self-learned STQL to analyze DNA methylation data we produced from 60 pairs of hepatocellular carcinoma (HCC) samples. Conclusions: Overall, STQL and START provide a generic way for analyzing a large number of genomic signal tracks in parallel easily.
topic Human genomics
Signal tracks
Data analysis
url http://link.springer.com/article/10.1186/s12864-017-4071-1
_version_ 1724807162111197184