Summary: | DNA copy number alterations (CNAs) are genetic changes that can produce
adverse effects in numerous human diseases, including cancer. CNAs are
segments of DNA that have been deleted or amplified and can range in size
from one kilobase to whole chromosome arms. Development of array
comparative genomic hybridization (aCGH) technology enables CNAs to be
measured at sub-megabase resolution using tens of thousands of probes.
However, aCGH data are noisy and yield continuous-valued measurements of
the discrete CNAs. Consequently, the data must be processed through
algorithmic and statistical techniques in order to derive meaningful
biological insights. We introduce model-based approaches to analysis of aCGH
data and develop state-of-the-art solutions to three distinct analytical
problems.
In the simplest scenario, the task is to infer CNAs from a single aCGH
experiment. We apply a hidden Markov model (HMM) to accurately identify
CNAs from aCGH data. We show that borrowing statistical strength across
chromosomes and explicitly modeling outliers in the data improve on
baseline models.
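To make the single-experiment task concrete, the following is a minimal sketch of segmenting one chromosome's log2 ratios with a plain three-state HMM decoded by the Viterbi algorithm. It is not the model described here: it has no outlier component and shares nothing across chromosomes. The state means, noise level, self-transition probability, and all function names are assumptions chosen for illustration only.

```python
import math

# Hypothetical parameters, chosen only for illustration.
STATES = ("loss", "neutral", "gain")
MEANS = {"loss": -0.5, "neutral": 0.0, "gain": 0.5}  # assumed log2-ratio means
SIGMA = 0.2   # assumed standard deviation of probe noise
STAY = 0.98   # assumed probability of remaining in the same state


def log_gauss(x, mean, sigma):
    """Log density of a Gaussian emission."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mean) ** 2 / (2 * sigma ** 2)


def log_trans(prev, cur):
    """Log transition probability with a strong preference for staying put."""
    if prev == cur:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(log_ratios):
    """Return the most probable state label for each probe, in genomic order."""
    if not log_ratios:
        return []
    # Uniform prior over states for the first probe.
    score = {s: math.log(1 / len(STATES)) + log_gauss(log_ratios[0], MEANS[s], SIGMA)
             for s in STATES}
    back = []
    for x in log_ratios[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_trans(p, cur))
            new_score[cur] = (score[best_prev] + log_trans(best_prev, cur)
                              + log_gauss(x, MEANS[cur], SIGMA))
            pointers[cur] = best_prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    ratios = [0.02, -0.05, 0.01, 0.55, 0.48, 0.52, 0.60,
              0.03, -0.02, 0.01, -0.45, -0.50, -0.55]
    for value, label in zip(ratios, viterbi(ratios)):
        print(f"{value:+.2f}  {label}")
```

Because the transition matrix strongly favors staying in the current state, a single aberrant probe inside a neutral region is absorbed rather than called as a one-probe gain. This is only a crude stand-in for explicit outlier modeling.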
In the second scenario, we wish to identify recurrent CNAs in a set of aCGH
data derived from a patient cohort. These are locations in the genome
altered in many patients, providing evidence for CNAs that may be playing
important molecular roles in the disease. We develop a novel hierarchical
HMM profiling method that explicitly models both statistical and biological
noise in the data and is capable of producing a representative profile for a
set of aCGH experiments. We demonstrate that our method is more accurate
than simpler baselines on synthetic data, and show our model produces output
that is more interpretable than other methods.
Finally, we develop a model-based clustering framework to stratify a patient
cohort, expected to be composed of a fixed set of molecular subtypes. We
introduce a model that jointly infers CNAs, assigns patients to subgroups
and infers the profiles that represent each subgroup. We show our model to
be more accurate on synthetic data, and show in two patient cohorts how the
model discovers putative novel subtypes and clinically relevant subgroups.
Affiliation: Science, Faculty of; Computer Science, Department of; Graduate
|