Using machine learning and systems-biology approaches to analyse next-generation sequence data in cancers
The availability of exome sequence data for thousands of cancer samples has enabled the investigation of the sequence-level mutations that contribute to cancer. There is a need for strategies to analyse sequence data to gain new biological and clinical insights. This thesis investigates the use of m...
Main Author: | Sutherland, Russel David |
---|---|
Other Authors: | Lewis, Cathryn Mair ; Dobson, Richard James Butler |
Published: | King's College London (University of London) 2016 |
Subjects: | 616.99 |
Online Access: | http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.700764 |
id | ndltd-bl.uk-oai-ethos.bl.uk-700764 |
---|---|
record_format | oai_dc |
spelling | ndltd-bl.uk-oai-ethos.bl.uk-700764; 2018-06-06T15:32:52Z; Using machine learning and systems-biology approaches to analyse next-generation sequence data in cancers; Sutherland, Russel David; Lewis, Cathryn Mair ; Dobson, Richard James Butler; 2016; 616.99; King's College London (University of London); http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.700764; https://kclpure.kcl.ac.uk/portal/en/theses/using-machine-learning-and-systemsbiology-approaches-to-analyse-nextgeneration-sequence-data-in-cancers(44ff20d1-dbf0-43f7-a5ad-18759598ec6b).html; Electronic Thesis or Dissertation |
collection | NDLTD |
sources | NDLTD |
topic | 616.99 |
description | The availability of exome sequence data for thousands of cancer samples has enabled the investigation of the sequence-level mutations that contribute to cancer. There is a need for strategies to analyse sequence data to gain new biological and clinical insights. This thesis investigates the use of machine learning and network-based methods to identify the mutated genes associated with important clinical features and cancer types, and to aid candidate gene prioritisation in colorectal cancer and rheumatoid arthritis. Firstly, tumour/normal exome sequence data was analysed to identify the mutated genes associated with cancer grade and cancer stage across and within three adenocarcinomas. Tumour grading is an important prognostic indicator that is based upon subjective assessment by pathologists and is not standardised across cancer types. Despite this, the study found that protein-coding mutations within TP53 were indicative of high-grade status across three adenocarcinomas once adjusted for age, gender, stage, and tumour type. Secondly, Random Forest models were used to identify the mutations that discriminate each of five high-order cancer types. Based on this work, a Random Forest approach was used to investigate whether exome sequence data could be used to assign cancers to their tissue of origin without prior knowledge, for future use as a classifier for cancers of unknown primary origin. Finally, a network-based method for candidate disease gene prioritisation, called ‘k-pseudo cliques analysis’, was developed. The method identifies sets of highly interacting proteins that are enriched for low gene-level p-values. In tests, the identified gene sets outperformed a univariate test for general cancer gene enrichment. As part of the final chapter, a network-based method called ‘Region Growing Analysis’ was used to perform candidate disease gene prioritisation of rheumatoid arthritis genome-wide association study data. The findings and methods developed in this thesis can provide insights into the genetic correlates of cancer phenotypes and suggest new candidate disease genes. |
author2 | Lewis, Cathryn Mair ; Dobson, Richard James Butler |
author | Sutherland, Russel David |
title | Using machine learning and systems-biology approaches to analyse next-generation sequence data in cancers |
publisher | King's College London (University of London) |
publishDate | 2016 |
url | http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.700764 |