Latent Dirichlet Allocation in predicting clinical trial terminations

Abstract Background This study used natural language processing (NLP) and machine learning (ML) techniques to identify reliable patterns in research narrative documents that distinguish studies that complete successfully from those that terminate. Recent research findings have reported th...

Bibliographic Details
Main Authors: Simon Geletta, Lendie Follett, Marcia Laugerman
Format: Article
Language: English
Published: BMC 2019-11-01
Series: BMC Medical Informatics and Decision Making
Subjects: Clinical trials; Structured data; Unstructured data; Latent Dirichlet allocation; Prediction
Online Access: http://link.springer.com/article/10.1186/s12911-019-0973-y
id doaj-531e691fd8184ff9b810a41f4517b25a
record_format Article
spelling doaj-531e691fd8184ff9b810a41f4517b25a; 2020-11-25T01:08:43Z; eng; BMC; BMC Medical Informatics and Decision Making; 1472-6947; 2019-11-01; 19; 1; 1; 12; 10.1186/s12911-019-0973-y; Latent Dirichlet Allocation in predicting clinical trial terminations; Simon Geletta (Department of Public Health, Des Moines University); Lendie Follett (Department of Data Analytics, College of Business and Public Administration, Drake University); Marcia Laugerman (Department of Data Analytics, College of Business and Public Administration, Drake University); http://link.springer.com/article/10.1186/s12911-019-0973-y; Clinical trials; Structured data; Unstructured data; Latent Dirichlet allocation; Prediction
collection DOAJ
language English
format Article
sources DOAJ
author Simon Geletta
Lendie Follett
Marcia Laugerman
title Latent Dirichlet Allocation in predicting clinical trial terminations
publisher BMC
series BMC Medical Informatics and Decision Making
issn 1472-6947
publishDate 2019-11-01
description Abstract
Background: This study used natural language processing (NLP) and machine learning (ML) techniques to identify reliable patterns in research narrative documents that distinguish studies that complete successfully from those that terminate. Recent research findings have reported that at least 10% of all studies funded by major research funding agencies terminate without yielding useful results. Because studies that receive funding from major agencies are carefully planned and rigorously vetted through the peer-review process, it was somewhat daunting to us that study terminations are this prevalent. Moreover, our review of the literature about study terminations suggested that the reasons for them are not well understood. We therefore aimed to address that knowledge gap by seeking to identify the factors that contribute to study failures.
Method: We used data from the ClinicalTrials.gov repository, from which we extracted both structured data (study characteristics) and unstructured data (the narrative description of the studies). We applied natural language processing techniques to the unstructured data to quantify the risk of termination by identifying distinctive topics that are more frequently associated with either terminated or completed trials. We used the Latent Dirichlet Allocation (LDA) technique to derive 25 "topics" with corresponding sets of probabilities, which we then used to predict study termination with random forest modeling. We fit two distinct models: one using only structured data as predictors, and another using both the structured data and the 25 text topics derived from the unstructured data.
Results: In this paper, we demonstrate the interpretive and predictive value of LDA as it relates to predicting clinical trial failure. The results also demonstrate that the combined modeling approach yields robust predictive probabilities in terms of both sensitivity and specificity, relative to a model that uses the structured data alone.
Conclusions: Our study demonstrated that topic modeling with LDA significantly raises the utility of unstructured data for predicting the completion vs. termination of studies. This study sets the direction for future research to evaluate the viability of the designs of health studies.
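The Method step in the abstract (derive per-document topic probabilities with LDA, then use them as predictors alongside structured data) can be illustrated with a minimal sketch. The code below is not the authors' implementation: it fits LDA by collapsed Gibbs sampling, one standard estimation approach, on a tiny invented corpus with 2 topics rather than 25. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the toy documents, and the two structured columns are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return per-document topic proportions."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]            # tokens in doc d assigned to topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # times word v is assigned to topic k
    topic_total = [0] * n_topics                          # total tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed topic proportions: one probability vector per document.
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha) for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]


# Invented narrative snippets standing in for trial descriptions (assumption).
docs = [
    "slow enrollment recruitment sponsor funding withdrawn".split(),
    "recruitment slow enrollment funding sponsor decision".split(),
    "randomized placebo dose efficacy safety endpoint".split(),
    "dose efficacy randomized safety placebo endpoint".split(),
]
theta = lda_gibbs(docs, n_topics=2)

# Hypothetical structured predictors: [number of sites, planned enrollment].
structured = [[1, 40], [2, 60], [12, 400], [9, 350]]

# Combined feature rows: structured columns followed by topic probabilities.
features = [s + t for s, t in zip(structured, theta)]
```

Each row of `features` could then be passed to a classifier such as a random forest, and the structured-only and combined models compared on sensitivity and specificity, as the abstract describes.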
topic Clinical trials
Structured data
Unstructured data
Latent Dirichlet allocation
Prediction
url http://link.springer.com/article/10.1186/s12911-019-0973-y