Clustering a database of optically absorbing organic molecules via a hierarchical fingerprint scheme that categorizes similar functional molecular fragments

A measure of chemical similarity is only useful if it implies similarity in some relevant property space. Typically, similarity calculations operate by assigning each molecule a chemical fingerprint: a fixed-length vector of bits where the on-bits signify the presence of a certain feature. Common fi...

Bibliographic Details
Main Authors: Cole, J.M. (Author), Flanagan, P.J. (Author)
Format: Article
Language: English
Published: NLM (Medline) 2022
Subjects: Algorithms; Cluster Analysis; Databases, Factual
Online Access: https://doi.org/10.1063/5.0087603
LEADER 02748nam a2200217Ia 4500
001 10.1063-5.0087603
008 220510s2022 CNT 000 0 und d
020 |a 10897690 (ISSN) 
245 1 0 |a Clustering a database of optically absorbing organic molecules via a hierarchical fingerprint scheme that categorizes similar functional molecular fragments 
260 0 |b NLM (Medline)  |c 2022 
856 |z View Fulltext in Publisher  |u https://doi.org/10.1063/5.0087603 
520 3 |a A measure of chemical similarity is only useful if it implies similarity in some relevant property space. Typically, similarity calculations operate by assigning each molecule a chemical fingerprint: a fixed-length vector of bits where the on-bits signify the presence of a certain feature. Common fingerprinting schemes, such as extended-connectivity fingerprints, are by definition general and fail to capture much of the domain-specific theory that underpins similarity in a specific domain. In this work, a hierarchical fingerprinting scheme is developed that is bespoke to a database of ∼4500 organic molecules and their cognate optical absorption spectral properties. Our fingerprinting scheme incorporates molecular fragmentation and domain-specific chemical intuition into an algorithm that categorizes each fragment as being one of a core chemical group, a substituent, or a bridge. The algorithm is applied to every molecule in the database to generate a pool of chemically relevant fragments that are labeled according to their structural category. The fingerprint of each molecule is then composed of a nested Python dictionary specifying the unique identifiers of its constituent fragment entities and the structural links between them to give a hierarchical molecular encoding scheme. Four case studies show the application of our fingerprinting scheme to the subject database. In each case, the clustered molecules display a host of interesting chemical trends. The application that was used to develop and implement this bespoke fingerprinting scheme, referred to as ChemCluster, also exposes a host of other cheminformatics tools pertaining to this database, a selection of which is demonstrated in this work. 
The enhanced similarity comparisons afforded by our fingerprinting scheme, as well as the large repository of categorized fragments generated during its development, constitute the first step toward using this database in a data-driven materials discovery workflow. 
650 0 4 |a algorithm 
650 0 4 |a Algorithms 
650 0 4 |a cluster analysis 
650 0 4 |a Cluster Analysis 
650 0 4 |a Databases, Factual 
650 0 4 |a factual database 
700 1 |a Cole, J.M.  |e author 
700 1 |a Flanagan, P.J.  |e author 
773 |t The Journal of chemical physics
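
The abstract describes each molecule's fingerprint as a nested Python dictionary that records the identifiers of its fragments (core groups, substituents, bridges) and the links between them. The following is a minimal sketch of what such an encoding and a comparison over it might look like. The dictionary layout, the fragment identifiers (`core_012`, `sub_003`, and so on), and the per-category Jaccard comparison are all assumptions made for illustration; the record does not specify the schema or the similarity measure used by ChemCluster.

```python
from typing import Any, Dict, Set

Fingerprint = Dict[str, Dict[str, Dict[str, Any]]]

# Hypothetical fingerprints: each core fragment ID maps to the substituent
# and bridge fragment IDs linked to it. All IDs are invented for illustration.
fp_a: Fingerprint = {
    "cores": {
        "core_012": {"substituents": ["sub_003", "sub_117"], "bridges": ["bridge_005"]},
        "core_044": {"substituents": ["sub_003"], "bridges": ["bridge_005"]},
    }
}
fp_b: Fingerprint = {
    "cores": {
        "core_012": {"substituents": ["sub_003"], "bridges": []},
    }
}


def fragment_ids(fp: Fingerprint, category: str) -> Set[str]:
    """Collect the fragment IDs of one category from a nested fingerprint."""
    cores = fp["cores"]
    if category == "cores":
        return set(cores)
    ids: Set[str] = set()
    for links in cores.values():
        ids.update(links[category])
    return ids


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two sets; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def category_similarity(fp1: Fingerprint, fp2: Fingerprint) -> Dict[str, float]:
    """Assumed comparison: one Jaccard score per structural category."""
    return {
        cat: jaccard(fragment_ids(fp1, cat), fragment_ids(fp2, cat))
        for cat in ("cores", "substituents", "bridges")
    }


scores = category_similarity(fp_a, fp_b)
print(scores)
```

Scoring each category separately keeps the structural roles distinct, so two molecules that share a core but differ in substituents can be told apart from two that share only substituents. This sketch ignores which substituent is attached to which core; a comparison that respects the links would need to walk the nesting rather than flatten it.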