Assessment of Random Assignment in Training and Test Sets using Generalized Cluster Analysis Technique

Aim: The appropriateness of the random assignment of compounds to training and validation sets was assessed using generalized cluster analysis. Material and Method: A quantitative structure-activity relationship (qSAR) model based on the Molecular Descriptors Family on Vertices (MDFV) was evaluated in terms of assignment of...

Bibliographic Details
Main Author: Sorana D. BOLBOACĂ
Format: Article
Language: English
Published: Iuliu Hatieganu University of Medicine and Pharmacy, Cluj-Napoca 2011-06-01
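Series: Applied Medical Informatics
Subjects: quantitative Structure-Activity Relationship (qSAR); Molecular Descriptors Family on Vertices (MDFV); Anticancer drug; Generalized Cluster Analysis
Online Access: http://ami.info.umfcluj.ro/index.php/AMI/article/view/225/pdf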
id doaj-10d9d5dc98f546508026416482e14b46
record_format Article
spelling doaj-10d9d5dc98f546508026416482e14b46
2020-11-25T01:26:51Z
eng
Iuliu Hatieganu University of Medicine and Pharmacy, Cluj-Napoca
Applied Medical Informatics
1224-5593
2011-06-01
282914
Assessment of Random Assignment in Training and Test Sets using Generalized Cluster Analysis Technique
Sorana D. BOLBOACĂ
http://ami.info.umfcluj.ro/index.php/AMI/article/view/225/pdf
quantitative Structure-Activity Relationship (qSAR)
Molecular Descriptors Family on Vertices (MDFV)
Anticancer drug
Generalized Cluster Analysis.
collection DOAJ
language English
format Article
sources DOAJ
author Sorana D. BOLBOACĂ
title Assessment of Random Assignment in Training and Test Sets using Generalized Cluster Analysis Technique
publisher Iuliu Hatieganu University of Medicine and Pharmacy, Cluj-Napoca
series Applied Medical Informatics
issn 1224-5593
publishDate 2011-06-01
description Aim: The appropriateness of the random assignment of compounds to training and validation sets was assessed using generalized cluster analysis. Material and Method: A quantitative structure-activity relationship (qSAR) model based on the Molecular Descriptors Family on Vertices (MDFV) was evaluated in terms of the assignment of carboquinone derivatives to training and test sets during leave-many-out analysis. The assignment was investigated using five variables: the observed anticancer activity and four structure descriptors. Generalized cluster analysis with the K-means algorithm was applied to determine whether the assignment of compounds was proper. Euclidean distance and maximization of the initial distance were used, with v-fold cross-validation (v = 10). Results: All five variables made a statistically significant contribution to the identification of clusters. Three clusters were identified, each containing carboquinone derivatives from both the training and the test sets. The observed activity of the carboquinone derivatives proved to be normally distributed in every cluster. The presence of training and test compounds in all clusters identified by generalized cluster analysis with the K-means algorithm, together with the distribution of observed activity within clusters, supports a proper assignment of compounds to the training and test sets. Conclusion: Generalized cluster analysis using the K-means algorithm proved to be a valid method for assessing the random assignment of carboquinone derivatives to training and test sets.
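The check described in the abstract (cluster the compounds, then see whether every cluster holds both training and test members) can be illustrated with a minimal Python sketch. This is not the author's implementation, and the record does not name the software used. The data below are synthetic, the training/test split is a hypothetical fixed pattern, and farthest-point seeding is only an assumed stand-in for "maximization of the initial distance". The number of clusters is fixed at three to match the reported result. The v-fold cross-validation and the significance and normality tests are omitted.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def init_centers(points, k):
    """Farthest-point seeding (assumed stand-in for 'maximize initial distance')."""
    centers = [points[0]]
    while len(centers) < k:
        # Pick the point whose nearest already-chosen center is farthest away.
        nxt = max(points, key=lambda p: min(euclidean(p, c) for c in centers))
        centers.append(nxt)
    return centers


def kmeans(points, k, max_iter=100):
    """Plain K-means: returns one cluster label per point and the final centers."""
    centers = init_centers(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


def sets_per_cluster(labels, split, k):
    """Count, for each cluster, how many members come from each set."""
    counts = [{"training": 0, "test": 0} for _ in range(k)]
    for lab, s in zip(labels, split):
        counts[lab][s] += 1
    return counts


def assignment_looks_mixed(counts):
    """True when every cluster contains both training and test compounds."""
    return all(c["training"] > 0 and c["test"] > 0 for c in counts)


# Synthetic stand-in data: 60 "compounds", 5 variables each (one activity
# value plus four descriptors in the abstract; here just random numbers
# drawn around three well-separated centers).
rng = random.Random(42)
points = [[rng.gauss(10.0 * (i // 20), 1.0) for _ in range(5)] for i in range(60)]

# Hypothetical split: every third compound goes to the test set.
split = ["test" if i % 3 == 0 else "training" for i in range(60)]

labels, _ = kmeans(points, k=3)
counts = sets_per_cluster(labels, split, k=3)
print(counts)
print("every cluster holds both sets:", assignment_looks_mixed(counts))
```

A cluster made up only of training compounds, or only of test compounds, would make `assignment_looks_mixed` return `False`, which is the situation the abstract's check is meant to detect.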
topic quantitative Structure-Activity Relationship (qSAR)
Molecular Descriptors Family on Vertices (MDFV)
Anticancer drug
Generalized Cluster Analysis.
url http://ami.info.umfcluj.ro/index.php/AMI/article/view/225/pdf