Using machine learning and systems-biology approaches to analyse next-generation sequence data in cancers

The availability of exome sequence data for thousands of cancer samples has enabled the investigation of the sequence-level mutations that contribute to cancer. There is a need for strategies to analyse sequence data to gain new biological and clinical insights. This thesis investigates the use of m...

Bibliographic Details
Main Author: Sutherland, Russel David
Other Authors: Lewis, Cathryn Mair ; Dobson, Richard James Butler
Published: King's College London (University of London) 2016
Subjects: 616.99
Online Access: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.700764
id ndltd-bl.uk-oai-ethos.bl.uk-700764
record_format oai_dc
spelling ndltd-bl.uk-oai-ethos.bl.uk-700764 | 2018-06-06T15:32:52Z | Using machine learning and systems-biology approaches to analyse next-generation sequence data in cancers | Sutherland, Russel David | Lewis, Cathryn Mair ; Dobson, Richard James Butler | 2016 | 616.99 | King's College London (University of London) | http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.700764 | https://kclpure.kcl.ac.uk/portal/en/theses/using-machine-learning-and-systemsbiology-approaches-to-analyse-nextgeneration-sequence-data-in-cancers(44ff20d1-dbf0-43f7-a5ad-18759598ec6b).html | Electronic Thesis or Dissertation
collection NDLTD
sources NDLTD
topic 616.99
description The availability of exome sequence data for thousands of cancer samples has enabled the investigation of the sequence-level mutations that contribute to cancer. There is a need for strategies to analyse sequence data to gain new biological and clinical insights. This thesis investigates the use of machine learning and network-based methods to identify the mutated genes associated with important clinical features and cancer types, and to aid candidate gene prioritisation in colorectal cancer and rheumatoid arthritis. Firstly, tumour/normal exome sequence data were analysed to identify the mutated genes associated with cancer grade and cancer stage across and within three adenocarcinomas. Tumour grading is an important prognostic indicator that is based upon subjective assessment by pathologists and is not standardised across cancer types. Despite this, the study found that protein-coding mutations within TP53 were indicative of high-grade status across three adenocarcinomas once adjusted for age, gender, stage, and tumour type. Secondly, Random Forest models were used to identify the mutations that discriminate each of five high-order cancer types. Based on this work, a Random Forest approach was used to investigate whether exome sequence data could be used to assign cancers to their tissue of origin without prior knowledge, for future use as a classifier for cancers of unknown primary origin. Finally, a network-based method for candidate disease gene prioritisation, called ‘k-pseudo cliques analysis’, was developed. The method identifies sets of highly interacting proteins that are enriched for low gene-level p-values. In tests, the identified gene sets outperformed a univariate test for general cancer gene enrichment. As part of the final chapter, a network-based method called ‘Region Growing Analysis’ was used to perform candidate disease gene prioritisation of rheumatoid arthritis genome-wide association study data. The findings and methods developed in this thesis can provide insights into the genetic correlates of cancer phenotypes and suggest new candidate disease genes.
author2 Lewis, Cathryn Mair ; Dobson, Richard James Butler
author Sutherland, Russel David
title Using machine learning and systems-biology approaches to analyse next-generation sequence data in cancers
publisher King's College London (University of London)
publishDate 2016
url http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.700764