Pathology report data extraction from relational database using R, with extraction from reports on melanoma of skin as an example

Background: Different methods have been described for data extraction from pathology reports, with varying degrees of success. Here, a technique for directly extracting data from a relational database is described. Methods: Our department uses synoptic reports modified from College of American Pathologi...

Bibliographic Details
Main Author: Jay J Ye
Format: Article
Language: English
Published: Wolters Kluwer Medknow Publications 2016-01-01
Series: Journal of Pathology Informatics
Subjects:
Pathology report data extraction
R
SQL database
Online Access: http://www.jpathinformatics.org/article.asp?issn=2153-3539;year=2016;volume=7;issue=1;spage=44;epage=44;aulast=Ye
id doaj-18941f7ca1e34d488736419c4d8ac93e
record_format Article
spelling doaj-18941f7ca1e34d488736419c4d8ac93e; 2020-11-24T23:13:52Z; eng; Wolters Kluwer Medknow Publications; Journal of Pathology Informatics; ISSN 2153-3539; 2016-01-01; volume 7, issue 1, pages 44-44; DOI 10.4103/2153-3539.192822; keywords: Pathology report data extraction, R, SQL database
collection DOAJ
sources DOAJ
issn 2153-3539
description Background: Different methods have been described for data extraction from pathology reports, with varying degrees of success. Here, a technique for directly extracting data from a relational database is described. Methods: Our department uses synoptic reports modified from the College of American Pathologists (CAP) Cancer Protocol Templates to report most of our cancer diagnoses. Taking the melanoma of skin synoptic report as an example, the R scripting language, extended with the RODBC package, was used to query the pathology information system database. Reports from the past four and a half years containing the melanoma of skin synoptic report were retrieved, and individual data elements were extracted. Using the retrieved list of cases, the database was queried a second time to retrieve and extract the lymph node staging information in subsequent reports from the same patients. Results: A total of 426 synoptic reports corresponding to unique lesions of melanoma of skin were retrieved, and the data elements of interest were extracted into an R data frame. The distribution of Breslow depth of melanomas grouped by year is used as an example of intra-report data extraction and analysis. When new pN staging information was present in the subsequent reports, 82% (77/94) was retrieved precisely (pN0, pN1, pN2, or pN3). An additional 15% (14/94) was retrieved with some ambiguity (identified only as positive, or only as having an update). The specificity was 100% for both. The relationship between Breslow depth and lymph node status was graphed as an example of lesion-specific multi-report data extraction and analysis. Conclusions: R extended with the RODBC package is a simple and versatile approach well suited to the above tasks. The success or failure of the retrieval and extraction depended largely on whether the reports were formatted and whether the contents of the elements were consistently phrased. This approach can be easily modified and adopted for other pathology information systems that use a relational database for data management.
topic Pathology report data extraction
R
SQL database