Open-source QSAR models for pKa prediction using multiple machine learning approaches

Abstract Background The logarithmic acid dissociation constant pKa reflects the ionization of a chemical, which affects lipophilicity, solubility, protein binding, and ability to pass through the plasma membrane. Thus, pKa affects chemical absorption, distribution, metabolism, excretion, and toxicit...

Bibliographic Details
Main Authors: Kamel Mansouri, Neal F. Cariello, Alexandru Korotcov, Valery Tkachenko, Chris M. Grulke, Catherine S. Sprankle, David Allen, Warren M. Casey, Nicole C. Kleinstreuer, Antony J. Williams
Format: Article
Language: English
Published: BMC 2019-09-01
Series: Journal of Cheminformatics
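Subjects: pKa prediction; QSAR; DataWarrior; Machine learning; Chemical 2D descriptors; Chemical fingerprints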
Online Access: http://link.springer.com/article/10.1186/s13321-019-0384-1
id doaj-4f55baf346f34fe88b0e95561bea6d46
record_format Article
spelling doaj-4f55baf346f34fe88b0e95561bea6d46
2020-11-25T03:24:05Z
eng
BMC
Journal of Cheminformatics
1758-2946
2019-09-01
Volume 11, Issue 1, Pages 1-20
10.1186/s13321-019-0384-1
Open-source QSAR models for pKa prediction using multiple machine learning approaches
Kamel Mansouri (Integrated Laboratory Systems, Inc.)
Neal F. Cariello (Integrated Laboratory Systems, Inc.)
Alexandru Korotcov (Science Data Software LLC)
Valery Tkachenko (Science Data Software LLC)
Chris M. Grulke (National Center for Computational Toxicology, U.S. Environmental Protection Agency)
Catherine S. Sprankle (Integrated Laboratory Systems, Inc.)
David Allen (Integrated Laboratory Systems, Inc.)
Warren M. Casey (National Institute of Environmental Health Sciences)
Nicole C. Kleinstreuer (National Institute of Environmental Health Sciences)
Antony J. Williams (National Center for Computational Toxicology, U.S. Environmental Protection Agency)
collection DOAJ
language English
format Article
sources DOAJ
author Kamel Mansouri
Neal F. Cariello
Alexandru Korotcov
Valery Tkachenko
Chris M. Grulke
Catherine S. Sprankle
David Allen
Warren M. Casey
Nicole C. Kleinstreuer
Antony J. Williams
title Open-source QSAR models for pKa prediction using multiple machine learning approaches
publisher BMC
series Journal of Cheminformatics
issn 1758-2946
publishDate 2019-09-01
description Abstract
Background: The logarithmic acid dissociation constant pKa reflects the ionization of a chemical, which affects lipophilicity, solubility, protein binding, and ability to pass through the plasma membrane. Thus, pKa affects chemical absorption, distribution, metabolism, excretion, and toxicity properties. Multiple proprietary software packages exist for the prediction of pKa, but to the best of our knowledge no free and open-source programs exist for this purpose. Using a freely available data set and three machine learning approaches, we developed open-source models for pKa prediction.
Methods: The experimental strongest acidic and strongest basic pKa values in water for 7912 chemicals were obtained from DataWarrior, a freely available software package. Chemical structures were curated and standardized for quantitative structure–activity relationship (QSAR) modeling using KNIME, and a subset comprising 79% of the initial set was used for modeling. To evaluate different approaches to modeling, several datasets were constructed based on different processing of chemical structures with acidic and/or basic pKas. Continuous molecular descriptors, binary fingerprints, and fragment counts were generated using PaDEL, and pKa prediction models were created using three machine learning methods: (1) support vector machines (SVM) combined with k-nearest neighbors (kNN), (2) extreme gradient boosting (XGB), and (3) deep neural networks (DNN).
Results: The three methods delivered comparable performances on the training and test sets, with a root-mean-squared error (RMSE) around 1.5 and a coefficient of determination (R²) around 0.80. Two commercial pKa predictors from ACD/Labs and ChemAxon were used to benchmark the three best models developed in this work, and performance of our models compared favorably to the commercial products.
Conclusions: This work provides multiple QSAR models to predict the strongest acidic and strongest basic pKas of chemicals, built using publicly available data, and provided as free and open-source software on GitHub.
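The two performance figures quoted in the abstract, RMSE and R², can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `rmse` and `r_squared` are hypothetical, the pKa values are made up for illustration only, and R² is computed with the common definition 1 - SS_res / SS_tot, which is assumed here rather than confirmed by the record.

```python
import math


def rmse(observed, predicted):
    """Root-mean-squared error between two equal-length sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


# Made-up illustrative values, NOT data from the article.
observed_pka = [4.8, 9.2, 3.1, 10.5]
predicted_pka = [5.3, 8.1, 4.0, 9.9]

print("RMSE:", round(rmse(observed_pka, predicted_pka), 3))
print("R2:  ", round(r_squared(observed_pka, predicted_pka), 3))
```

Read this way, an RMSE around 1.5 means predictions are typically off by about 1.5 pKa units, while an R² around 0.80 means the models account for roughly 80% of the variance in the experimental values.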
topic pKa prediction
QSAR
DataWarrior
Machine learning
Chemical 2D descriptors
Chemical fingerprints
url http://link.springer.com/article/10.1186/s13321-019-0384-1