Statistical Methods for High Throughput Screening Drug Discovery Data

High Throughput Screening (HTS) is used in drug discovery to screen large numbers of compounds against a biological target. Data on activity against the target are collected for a representative sample of compounds selected from a large library. The goal of drug discovery is to relate the act...

Bibliographic Details
Main Author: Wang, Yuanyuan (Marcia)
Format: Others
Language: en
Published: University of Waterloo 2006
Subjects:
SAR
HTS
KNN
Online Access: http://hdl.handle.net/10012/1204
id ndltd-WATERLOO-oai-uwspace.uwaterloo.ca-10012-1204
record_format oai_dc
collection NDLTD
language en
format Others
sources NDLTD
topic Statistics
SAR
HTS
averaging models
CART
KNN
description High Throughput Screening (HTS) is used in drug discovery to screen large numbers of compounds against a biological target. Data on activity against the target are collected for a representative sample of compounds selected from a large library. The goal of drug discovery is to relate the activity of a compound to its chemical structure, which is quantified by various explanatory variables, and hence to identify further active compounds. Often, this application has a very unbalanced class distribution, with a rare active class.

Classification methods are commonly proposed as solutions to this problem. However, regarding drug discovery, researchers are more interested in ranking compounds by predicted activity than in the classification itself. This feature makes my approach distinct from common classification techniques.

In this thesis, two AIDS data sets from the National Cancer Institute (NCI) are mainly used. Local methods, namely K-nearest neighbours (KNN) and classification and regression trees (CART), perform very well on these data in comparison with linear/logistic regression, neural networks, and Multivariate Adaptive Regression Splines (MARS) models, which assume more smoothness. One reason for the superiority of local methods is the local behaviour of the data. Indeed, I argue that conventional classification criteria such as misclassification rate or deviance tend to select too small a tree or too large a value of k (the number of nearest neighbours). A more local model (bigger tree or smaller k) gives a better performance in terms of drug discovery.

Because off-the-shelf KNN works relatively well, this thesis takes this promising method and makes several novel modifications, which further improve its performance. The choice of k is optimized for each test point to be predicted. The empirically observed superiority of allowing k to vary is investigated. The nature of the problem, ranking of objects rather than estimating the probability of activity, enables the k-varying algorithm to stand out. Similarly, KNN combined with a kernel weight function (weighted KNN) is proposed and demonstrated to be superior to the regular KNN method.

High dimensionality of the explanatory variables is known to cause problems for KNN and many other classifiers. I propose a novel method (subset KNN) of averaging across multiple classifiers based on building classifiers on subspaces (subsets of variables). It improves the performance of KNN for HTS data. When applied to CART, it also performs as well as or even better than the popular methods of bagging and boosting. Part of this improvement is due to the discovery that classifiers based on irrelevant subspaces (unimportant explanatory variables) do little damage when averaged with good classifiers based on relevant subspaces (important variables). This result is particular to the ranking of objects rather than estimating the probability of activity. A theoretical justification is proposed. The thesis also suggests diagnostics for identifying important subsets of variables and hence further reducing the impact of the curse of dimensionality.

In order to have a broader evaluation of these methods, subset KNN and weighted KNN are applied to three other data sets: the NCI AIDS data with Constitutional descriptors, Mutagenicity data with BCUT descriptors and Mutagenicity data with Constitutional descriptors. The k-varying algorithm as a method for unbalanced data is also applied to NCI AIDS data with Constitutional descriptors. As a baseline, the performance of KNN on such data sets is reported. Although different methods are best for the different data sets, some of the proposed methods are always amongst the best.

Finally, methods are described for estimating activity rates and error rates in HTS data. By combining auxiliary information about repeat tests of the same compound, likelihood methods can extract interesting information about the magnitudes of the measurement errors made in the assay process. These estimates can be used to assess model performance, which sheds new light on how various models handle the large random or systematic assay errors often present in HTS data.
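
The abstract names weighted KNN and subset KNN but gives no implementation details, so the following is only a minimal sketch of the general idea rather than the thesis's actual method. It assumes Euclidean distance, a Gaussian kernel for the weights, and randomly drawn variable subsets; all function and parameter names (knn_scores, subset_knn_scores, random_subsets, rank_by_score, bandwidth) are hypothetical and chosen for illustration.

```python
import math
import random


def knn_scores(train_x, train_y, test_x, k, cols, bandwidth=None):
    """Score each test point by the share of active (y = 1) compounds among
    its k nearest training points, using only the variables in cols.
    If bandwidth is given, neighbours are weighted by a Gaussian kernel
    (an assumed choice; the abstract does not name the kernel)."""
    scores = []
    for q in test_x:
        # Euclidean distance restricted to the chosen subspace (assumption).
        neighbours = sorted(
            (math.sqrt(sum((q[j] - x[j]) ** 2 for j in cols)), y)
            for x, y in zip(train_x, train_y)
        )[:k]
        if bandwidth is None:
            weights = [1.0] * len(neighbours)  # regular KNN
        else:
            weights = [math.exp(-((d / bandwidth) ** 2)) for d, _ in neighbours]
        total = sum(weights)
        if total == 0.0:
            # All kernel weights underflowed: fall back to equal weights.
            weights, total = [1.0] * len(neighbours), float(len(neighbours))
        scores.append(sum(w * y for w, (_, y) in zip(weights, neighbours)) / total)
    return scores


def random_subsets(n_vars, n_subsets, subset_size, seed=0):
    """Draw variable subsets at random (a hypothetical selection scheme)."""
    rng = random.Random(seed)
    return [rng.sample(range(n_vars), subset_size) for _ in range(n_subsets)]


def subset_knn_scores(train_x, train_y, test_x, k, subsets, bandwidth=None):
    """Average the KNN scores of one classifier per variable subset."""
    totals = [0.0] * len(test_x)
    for cols in subsets:
        for i, s in enumerate(knn_scores(train_x, train_y, test_x, k, cols, bandwidth)):
            totals[i] += s
    return [t / len(subsets) for t in totals]


def rank_by_score(scores):
    """Indices of test compounds, most likely active first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Because the output is a ranking rather than a calibrated probability, a classifier built on an uninformative subset tends to add a similar amount to every compound's averaged score, which leaves the order largely unchanged. This is consistent with the abstract's remark that irrelevant subspaces do little damage when averaged with relevant ones.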
author Wang, Yuanyuan (Marcia)
title Statistical Methods for High Throughput Screening Drug Discovery Data
publisher University of Waterloo
publishDate 2006
url http://hdl.handle.net/10012/1204
spelling ndltd-WATERLOO-oai-uwspace.uwaterloo.ca-10012-1204
2013-01-08T18:49:25Z
2006-08-22T14:28:13Z
2005
application/pdf
1284426 bytes
Copyright: 2005, Wang, Yuanyuan (Marcia). All rights reserved.
Thesis or Dissertation
Statistics and Actuarial Science (Statistics)
Doctor of Philosophy