A Literature-Derived Knowledge Graph Augments the Interpretation of Single Cell RNA-seq Datasets

Technology to generate single cell RNA-sequencing (scRNA-seq) datasets and tools to annotate them have advanced rapidly in the past several years. Such tools generally rely on existing transcriptomic datasets or curated databases of cell type defining genes, while the application of scalable natural...

Bibliographic Details
Main Authors: Deeksha Doddahonnaiah, Patrick J. Lenehan, Travis K. Hughes, David Zemmour, Enrique Garcia-Rivera, A. J. Venkatakrishnan, Ramakrishna Chilaka, Apoorv Khare, Akhil Kasaraneni, Abhinav Garg, Akash Anand, Rakesh Barve, Viswanathan Thiagarajan, Venky Soundararajan
Format: Article
Language: English
Published: MDPI AG 2021-06-01
Series: Genes
Subjects: single cell genomics, natural language processing
Online Access: https://www.mdpi.com/2073-4425/12/6/898
id doaj-32f7c50f8d8749be9e4c9d58d25d809d
record_format Article
collection DOAJ
language English
format Article
sources DOAJ
author Deeksha Doddahonnaiah
Patrick J. Lenehan
Travis K. Hughes
David Zemmour
Enrique Garcia-Rivera
A. J. Venkatakrishnan
Ramakrishna Chilaka
Apoorv Khare
Akhil Kasaraneni
Abhinav Garg
Akash Anand
Rakesh Barve
Viswanathan Thiagarajan
Venky Soundararajan
title A Literature-Derived Knowledge Graph Augments the Interpretation of Single Cell RNA-seq Datasets
publisher MDPI AG
series Genes
issn 2073-4425
publishDate 2021-06-01
description Technology to generate single cell RNA-sequencing (scRNA-seq) datasets and tools to annotate them have advanced rapidly in the past several years. Such tools generally rely on existing transcriptomic datasets or curated databases of cell type-defining genes, while the application of scalable natural language processing (NLP) methods to enhance analysis workflows has not been adequately explored. Here we deployed an NLP framework to objectively quantify associations between a comprehensive set of over 20,000 human protein-coding genes and over 500 cell type terms across over 26 million biomedical documents. The resultant gene-cell type associations (GCAs) are significantly stronger between a curated set of matched cell type-marker pairs than the complementary set of mismatched pairs (Mann-Whitney p = 6.15 × 10⁻⁷⁶, r = 0.24; Cohen's d = 2.6). Building on this, we developed an augmented annotation algorithm (single cell Annotation via Literature Encoding, or scALE) that leverages GCAs to categorize cell clusters identified in scRNA-seq datasets, and we tested its ability to predict the cellular identity of 133 clusters from nine datasets of human breast, colon, heart, joint, ovary, prostate, skin, and small intestine tissues. With the optimized settings, the true cellular identity matched the top prediction in 59% of tested clusters and was present among the top five predictions for 91% of clusters. scALE slightly outperformed an existing method for reference data-driven automated cluster annotation, and we demonstrate that integration of scALE can meaningfully improve the annotations derived from such methods. Further, contextualization of differential expression analyses with these GCAs highlights poorly characterized markers of well-studied cell types, such as CLIC6 and DNASE1L3 in retinal pigment epithelial cells and endothelial cells, respectively. Taken together, this study illustrates for the first time how the systematic application of a literature-derived knowledge graph can expedite and enhance the annotation and interpretation of scRNA-seq data.
topic single cell genomics
natural language processing
url https://www.mdpi.com/2073-4425/12/6/898
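The description reports how often the true cell type was the top prediction versus among the top five. The sketch below illustrates that kind of evaluation on a toy scale. It is not the published scALE algorithm: the association scores, the mean-score ranking rule, and the names GCA, rank_cell_types, and top_k_accuracy are all assumptions made up for illustration.

```python
from statistics import mean

# Hypothetical (gene, cell type) -> literature association strength.
# These values are invented for illustration; they are not from the paper.
GCA = {
    ("CD3E", "T cell"): 0.9,
    ("CD3E", "B cell"): 0.2,
    ("CD3E", "endothelial cell"): 0.05,
    ("CD79A", "T cell"): 0.1,
    ("CD79A", "B cell"): 0.95,
    ("CD79A", "endothelial cell"): 0.05,
    ("PECAM1", "T cell"): 0.15,
    ("PECAM1", "B cell"): 0.1,
    ("PECAM1", "endothelial cell"): 0.9,
}
CELL_TYPES = sorted({cell_type for _, cell_type in GCA})


def rank_cell_types(marker_genes):
    """Rank cell types by mean association score over a cluster's marker genes.

    Assumed scoring rule: a plain average, with missing pairs counted as 0.
    """
    scores = {
        cell_type: mean(GCA.get((gene, cell_type), 0.0) for gene in marker_genes)
        for cell_type in CELL_TYPES
    }
    return sorted(scores, key=scores.get, reverse=True)


def top_k_accuracy(clusters, k):
    """Fraction of clusters whose true label is among the top k predictions."""
    hits = sum(true_label in rank_cell_types(genes)[:k] for genes, true_label in clusters)
    return hits / len(clusters)


# Toy clusters: (marker genes, manually assigned cell type).
clusters = [
    (["CD3E"], "T cell"),
    (["CD79A"], "B cell"),
    (["PECAM1", "CD3E"], "endothelial cell"),
]

print(top_k_accuracy(clusters, k=1))
print(top_k_accuracy(clusters, k=2))
```

In this toy data the third cluster carries a marker that is also strongly associated with T cells, so its true label ranks second rather than first. This is the situation in which a top-k figure exceeds the top-prediction figure.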