A Literature-Derived Knowledge Graph Augments the Interpretation of Single Cell RNA-seq Datasets
Technology to generate single cell RNA-sequencing (scRNA-seq) datasets and tools to annotate them have advanced rapidly in the past several years. Such tools generally rely on existing transcriptomic datasets or curated databases of cell type defining genes, while the application of scalable natural...
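Main Authors: | Deeksha Doddahonnaiah, Patrick J. Lenehan, Travis K. Hughes, David Zemmour, Enrique Garcia-Rivera, A. J. Venkatakrishnan, Ramakrishna Chilaka, Apoorv Khare, Akhil Kasaraneni, Abhinav Garg, Akash Anand, Rakesh Barve, Viswanathan Thiagarajan, Venky Soundararajan |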
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-06-01 |
Series: | Genes |
Subjects: | single cell genomics, natural language processing |
Online Access: | https://www.mdpi.com/2073-4425/12/6/898 |
id |
doaj-32f7c50f8d8749be9e4c9d58d25d809d |
---|---|
record_format |
Article |
spelling |
doaj-32f7c50f8d8749be9e4c9d58d25d809d; 2021-06-30T23:49:25Z; eng; MDPI AG; Genes; 2073-4425; 2021-06-01; 12; 898; 898; 10.3390/genes12060898; A Literature-Derived Knowledge Graph Augments the Interpretation of Single Cell RNA-seq Datasets; Deeksha Doddahonnaiah (0); Patrick J. Lenehan (1); Travis K. Hughes (2); David Zemmour (3); Enrique Garcia-Rivera (4); A. J. Venkatakrishnan (5); Ramakrishna Chilaka (6); Apoorv Khare (7); Akhil Kasaraneni (8); Abhinav Garg (9); Akash Anand (10); Rakesh Barve (11); Viswanathan Thiagarajan (12); Venky Soundararajan (13); nference, One Main Street, Cambridge, MA 02142, USA; nference, One Main Street, Cambridge, MA 02142, USA; nference, One Main Street, Cambridge, MA 02142, USA; nference, One Main Street, Cambridge, MA 02142, USA; nference, One Main Street, Cambridge, MA 02142, USA; nference, One Main Street, Cambridge, MA 02142, USA; nference Labs, Bengaluru, Karnataka 560017, India; nference Labs, Bengaluru, Karnataka 560017, India; nference Labs, Bengaluru, Karnataka 560017, India; nference Labs, Bengaluru, Karnataka 560017, India; nference Labs, Bengaluru, Karnataka 560017, India; nference Labs, Bengaluru, Karnataka 560017, India; nference Labs, Bengaluru, Karnataka 560017, India; nference, One Main Street, Cambridge, MA 02142, USA; https://www.mdpi.com/2073-4425/12/6/898; single cell genomics; natural language processing |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Deeksha Doddahonnaiah, Patrick J. Lenehan, Travis K. Hughes, David Zemmour, Enrique Garcia-Rivera, A. J. Venkatakrishnan, Ramakrishna Chilaka, Apoorv Khare, Akhil Kasaraneni, Abhinav Garg, Akash Anand, Rakesh Barve, Viswanathan Thiagarajan, Venky Soundararajan |
author_sort |
Deeksha Doddahonnaiah |
title |
A Literature-Derived Knowledge Graph Augments the Interpretation of Single Cell RNA-seq Datasets |
publisher |
MDPI AG |
series |
Genes |
issn |
2073-4425 |
publishDate |
2021-06-01 |
description |
Technology to generate single cell RNA-sequencing (scRNA-seq) datasets and tools to annotate them have advanced rapidly in the past several years. Such tools generally rely on existing transcriptomic datasets or curated databases of cell type-defining genes, while the application of scalable natural language processing (NLP) methods to enhance analysis workflows has not been adequately explored. Here we deployed an NLP framework to objectively quantify associations between a comprehensive set of more than 20,000 human protein-coding genes and more than 500 cell type terms across more than 26 million biomedical documents. The resultant gene-cell type associations (GCAs) are significantly stronger between a curated set of matched cell type-marker pairs than the complementary set of mismatched pairs (Mann-Whitney p = 6.15 × 10^−76, r = 0.24; Cohen's d = 2.6). Building on this, we developed an augmented annotation algorithm (single cell Annotation via Literature Encoding, or scALE) that leverages GCAs to categorize cell clusters identified in scRNA-seq datasets, and we tested its ability to predict the cellular identity of 133 clusters from nine datasets of human breast, colon, heart, joint, ovary, prostate, skin, and small intestine tissues. With the optimized settings, the true cellular identity matched the top prediction in 59% of tested clusters and was present among the top five predictions for 91% of clusters. scALE slightly outperformed an existing method for reference-data-driven automated cluster annotation, and we demonstrate that integration of scALE can meaningfully improve the annotations derived from such methods. Further, contextualization of differential expression analyses with these GCAs highlights poorly characterized markers of well-studied cell types, such as CLIC6 in retinal pigment epithelial cells and DNASE1L3 in endothelial cells. Taken together, this study illustrates for the first time how the systematic application of a literature-derived knowledge graph can expedite and enhance the annotation and interpretation of scRNA-seq data. |
topic |
single cell genomics; natural language processing |
url |
https://www.mdpi.com/2073-4425/12/6/898 |
_version_ |
1721350302244798464 |