Statistical models for cancer gene expression data and visualization of biological networks

Bibliographic Details
Main Author: Tripathi, Shailesh
Published: Queen's University Belfast 2013
Subjects:
Online Access: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.603429
Description
Summary: Gene expression data form a rich source of information for elucidating the biological function of cellular systems at the pathway level. For this reason, various pathway-based methods have been developed for analyzing gene expression data from high-throughput experiments. However, in order to utilize the full potential the data can offer, e.g., for cancer research, a more thorough understanding of such methods is required. This thesis consists of two major parts, which contain the results. In the first results part of the thesis we investigate the statistical characteristics of five competitive gene set methods. One major finding shows that three of these five methods, namely GSEA, GSEArot and GAGE, are negatively influenced by the number of background genes and, hence, by the filtering of the data, in the sense that these methods become more sensitive to expression changes despite the fact that the number of samples remains constant. This counterintuitive behavior leads to fundamental problems for the application of these methods to biological data: their results are no longer reconcilable with the principles of statistical inference, which in the worst case renders the obtained results uninformative. In order to avoid these problems, we suggest an experimental design that helps prevent such issues. Further, we present a new assessment method that allows a power analysis of competitive as well as self-contained gene set methods. More precisely, due to the general lack of a sufficient sample size in real data sets, simulated expression data are required in order to investigate statistical methods thoroughly. However, simulating data for pathway-based methods is challenging because the simulations need to account for the nontrivial correlation structures present within pathways. For this reason, we investigated new simulation methods in order to identify commonalities and differences with respect to their biological characteristics.
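
The idea of a simulation-based power analysis under within-pathway correlation can be illustrated with a minimal sketch. It is not the assessment method of the thesis. It assumes an equal pairwise correlation rho between all genes of a pathway, a constant mean shift in the treated group, and a simple self-contained test that permutes sample labels on the mean pathway score. All function and parameter names are hypothetical.

```python
import random
from statistics import fmean


def simulate_pathway(n_genes, n_samples, rho, shift, rng):
    """Return a genes x samples matrix with pairwise gene correlation rho.

    Assumption: one shared latent factor per sample, mixed with independent
    noise, which yields unit-variance genes with equal correlation rho.
    """
    shared = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    a, b = rho ** 0.5, (1.0 - rho) ** 0.5
    return [
        [shift + a * shared[j] + b * rng.gauss(0.0, 1.0) for j in range(n_samples)]
        for _ in range(n_genes)
    ]


def pathway_scores(matrix):
    """Average expression over the genes of the set, one value per sample."""
    return [fmean(column) for column in zip(*matrix)]


def permutation_p_value(scores_a, scores_b, n_perm, rng):
    """Two-sided sample-label permutation test on the difference of means."""
    observed = abs(fmean(scores_a) - fmean(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:])) >= observed:
            hits += 1
    # Adding one to both counts keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def estimate_power(shift, rho=0.3, n_genes=20, n_per_group=10,
                   n_sim=200, n_perm=200, alpha=0.05, seed=1):
    """Fraction of simulated data sets in which the test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        control = pathway_scores(simulate_pathway(n_genes, n_per_group, rho, 0.0, rng))
        treated = pathway_scores(simulate_pathway(n_genes, n_per_group, rho, shift, rng))
        if permutation_p_value(control, treated, n_perm, rng) <= alpha:
            rejections += 1
    return rejections / n_sim
```

Calling estimate_power(shift=0.0) estimates the false positive rate, and calling it with a nonzero shift estimates the power. Varying rho shows how the correlation between pathway genes changes the power at a fixed sample size.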
In the second results part we present an R software package we developed, called NetBioV. NetBioV enables the visualization of large biological networks and the highlighting of structural features that are of biological relevance, e.g., the modularity of these networks.
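
As a rough illustration of what modularity measures, the following sketch computes the standard Newman modularity Q of a given partition of an undirected, unweighted network. It is written in Python and does not use or reproduce the API of the R package. The function name and the toy network are hypothetical, and the sketch assumes a network with at least one edge, no self-loops and no duplicate edges.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q for an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 m))^2,
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    internal = defaultdict(int)
    degree_sum = defaultdict(int)
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Toy network: two triangles joined by a single edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_modules = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_module = {node: "all" for node in range(6)}
```

For the toy network, splitting the nodes into the two triangles gives Q = 5/14, while placing all nodes in a single module gives Q = 0, so the modular partition scores higher.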