Computational Prediction of Gene Function From High-throughput Data Sources

A large number and variety of genome-wide genomics and proteomics datasets are now available for model organisms. Each dataset on its own presents a distinct but noisy view of cellular state. However, collectively, these datasets embody a more comprehensive view of cell function. This motivates the...

Bibliographic Details
Main Author: Mostafavi, Sara
Other Authors: Morris, Quaid
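Language: en_ca
Published: 2011
Subjects: Computational Biology; Machine Learning; Predicting Gene Function; Biological Networks; Combining High-Throughput Data Sources
Online Access: http://hdl.handle.net/1807/29820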
id ndltd-TORONTO-oai-tspace.library.utoronto.ca-1807-29820
record_format oai_dc
spelling ndltd-TORONTO-oai-tspace.library.utoronto.ca-1807-29820; 2013-04-19T19:56:12Z; Morris, Quaid; 2011-06; 2011-08-31T17:45:37Z; NO_RESTRICTION; 2011-08-31T17:45:37Z; 2011-08-31; Thesis; http://hdl.handle.net/1807/29820; en_ca
collection NDLTD
language en_ca
sources NDLTD
topic Computational Biology
Machine Learning
Predicting Gene Function
Biological Networks
Combining High-Throughput Data Sources
0984
0800
0715
description A large number and variety of genome-wide genomics and proteomics datasets are now available for model organisms. Each dataset on its own presents a distinct but noisy view of cellular state; collectively, however, these datasets embody a more comprehensive view of cell function. This motivates predicting the function of uncharacterized genes by combining multiple datasets, exploiting the associations between such genes and genes of known function, all in a query-specific fashion. Heterogeneous datasets are commonly represented as networks to facilitate their combination. Here, I show that it is possible to accurately predict gene function in seconds by combining multiple large-scale networks. This enables on-demand function prediction, allowing users to take advantage of the continuing improvement and proliferation of genomics and proteomics datasets and to keep predictions up to date for large genomes such as the human genome. The algorithm, GeneMANIA, uses constrained linear regression to combine multiple association networks and label propagation to make predictions from the combined network. I introduce extensions that improve predictions when the number of labeled training examples is limited, or when an ontological structure describing a hierarchical gene function categorization scheme is available. Further, motivated by empirical observations on predicting node labels in general networks, I propose a new label propagation algorithm that exploits common properties of real-world networks to increase both the speed and accuracy of predictions.
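The two steps named in the description (combining association networks with non-negative weights, then propagating labels over the combined network) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `combine_networks` and `propagate_labels` are hypothetical, the network weights are supplied by hand rather than fitted by constrained linear regression, and the propagation step assumes one common formulation, solving (I + D - W) f = y by Jacobi iteration, where W is the combined adjacency matrix, D its degree matrix, and y holds +1 or -1 for labeled genes and 0 for unlabeled ones.

```python
def combine_networks(networks, weights):
    """Return the weighted sum of adjacency matrices (lists of lists).

    Assumption: weights are non-negative and already chosen; the regression
    step that would fit them is not reproduced here.
    """
    if any(w < 0 for w in weights):
        raise ValueError("network weights must be non-negative")
    n = len(networks[0])
    combined = [[0.0] * n for _ in range(n)]
    for net, w in zip(networks, weights):
        for i in range(n):
            for j in range(n):
                combined[i][j] += w * net[i][j]
    return combined


def propagate_labels(adjacency, labels, iterations=200):
    """Approximate the solution of (I + D - W) f = y by Jacobi iteration.

    Each gene's score is repeatedly reset to its own label plus the weighted
    scores of its neighbours, divided by (1 + its degree).
    """
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    scores = [float(y) for y in labels]
    for _ in range(iterations):
        scores = [
            (labels[i] + sum(adjacency[i][j] * scores[j] for j in range(n)))
            / (1.0 + degree[i])
            for i in range(n)
        ]
    return scores


# Toy example with four genes: gene 0 is a known positive, gene 3 a known
# negative, genes 1 and 2 are unlabeled.
network_a = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
network_b = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
combined = combine_networks([network_a, network_b], [0.5, 0.5])
scores = propagate_labels(combined, [1, 0, 0, -1])
print(scores)
```

In this toy setting, gene 1 is linked only to the positive gene and ends up with a positive score, while gene 2 is linked only to the negative gene and ends up with a negative score; ranking unlabeled genes by score is how such a sketch would yield predictions.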
author2 Morris, Quaid
author Mostafavi, Sara
title Computational Prediction of Gene Function From High-throughput Data Sources
publishDate 2011
url http://hdl.handle.net/1807/29820