Comparing distance measures on assessed medical device incident data using Average Silhouette Width
Many machine learning algorithms depend on the choice of an appropriate similarity or distance measure. Comparing such measures in different domains and on diversely structured data is common, but it is often performed with regard to an algorithm that clusters or classifies the data. In this study, data assesse...
Main Authors: | Bayer Christian, Seidel Robin |
---|---|
Format: | Article |
Language: | English |
Published: | De Gruyter, 2018-09-01 |
Series: | Current Directions in Biomedical Engineering |
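Subjects: | average silhouette width; machine learning; distance measures; regulatory affairs; text categorization |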
Online Access: | https://doi.org/10.1515/cdbme-2018-0126 |
id |
doaj-844bca90ff1849b1be830b13308210de |
---|---|
record_format |
Article |
spelling |
doaj-844bca90ff1849b1be830b13308210de; 2021-09-06T19:19:26Z; eng; De Gruyter; Current Directions in Biomedical Engineering; ISSN 2364-5504; 2018-09-01; vol. 4, no. 1, pp. 525-528; 10.1515/cdbme-2018-0126; cdbme-2018-0126; Comparing distance measures on assessed medical device incident data using Average Silhouette Width; Bayer Christian, Seidel Robin; Institute for Drugs and Medical Devices, Bonn, Germany; https://doi.org/10.1515/cdbme-2018-0126; average silhouette width, machine learning, distance measures, regulatory affairs, text categorization |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Bayer Christian Seidel Robin |
title |
Comparing distance measures on assessed medical device incident data using Average Silhouette Width |
publisher |
De Gruyter |
series |
Current Directions in Biomedical Engineering |
issn |
2364-5504 |
publishDate |
2018-09-01 |
description |
Many machine learning algorithms depend on the choice of an appropriate similarity or distance measure. Such measures are commonly compared across different domains and on diversely structured data, but usually with respect to how well an algorithm clusters or classifies the data. In this study, data assessed by experts is analyzed instead. The data is taken from the database of the Federal Institute for Drugs and Medical Devices (BfArM) and consists of free-text incident reports. The Average Silhouette Width, a cluster density measure, is used to compare the distance measures’ ability to discriminate the data according to the experts’ assessments. The Euclidean distance and four distance measures derived from the Jaccard similarity, the Simple Matching similarity, the Cosine similarity and the Yule similarity are compared on four subsets of this database. The results show that better data preprocessing is necessary, possibly because boilerplate texts are used to write incident reports. These results will also provide a baseline for comparing improvements from different data preprocessing methods in the future. |
topic |
average silhouette width machine learning distance measures regulatory affairs text categorization |
url |
https://doi.org/10.1515/cdbme-2018-0126 |
_version_ |
1717778615594123264 |
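
The comparison described in the abstract can be illustrated with a minimal sketch: compute the Average Silhouette Width of expert-assigned labels under each distance measure and compare the scores. This is not the authors' implementation. The toy binary term vectors in `docs`, the `labels`, and all function names are hypothetical, and the similarity-to-distance conversions (one minus the similarity for Jaccard, Simple Matching and Cosine; `bc / (ad + bc)` for Yule) are assumptions, since the record does not state how the distances were derived. The silhouette function expects at least two distinct labels.

```python
import math

def counts(x, y):
    """Return (a, b, c, d): both 1, only x, only y, both 0 for binary vectors."""
    a = sum(1 for p, q in zip(x, y) if p and q)
    b = sum(1 for p, q in zip(x, y) if p and not q)
    c = sum(1 for p, q in zip(x, y) if not p and q)
    return a, b, c, len(x) - a - b - c

def euclidean(x, y):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(x, y)))

def jaccard(x, y):
    a, b, c, _ = counts(x, y)
    return 0.0 if a + b + c == 0 else 1.0 - a / (a + b + c)

def simple_matching(x, y):
    a, _, _, d = counts(x, y)
    return 1.0 - (a + d) / len(x)

def cosine(x, y):
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return 1.0 if norm == 0 else 1.0 - dot / norm

def yule(x, y):
    # Assumed conversion: (1 - Q) / 2 with Q = (ad - bc) / (ad + bc).
    a, b, c, d = counts(x, y)
    if b * c == 0:
        return 0.0  # convention chosen here, also covers the undefined 0/0 case
    return b * c / (a * d + b * c)

def average_silhouette_width(points, labels, dist):
    """Mean of s(i) = (b_i - a_i) / max(a_i, b_i) over all points."""
    widths = []
    for i, x in enumerate(points):
        sums, sizes = {}, {}
        for j, y in enumerate(points):
            if i != j:
                sums[labels[j]] = sums.get(labels[j], 0.0) + dist(x, y)
                sizes[labels[j]] = sizes.get(labels[j], 0) + 1
        own = labels[i]
        if own not in sizes:  # singleton group: s(i) is defined as 0
            widths.append(0.0)
            continue
        a_i = sums[own] / sizes[own]  # mean distance within the own group
        b_i = min(sums[k] / sizes[k] for k in sums if k != own)  # nearest other group
        top = max(a_i, b_i)
        widths.append(0.0 if top == 0 else (b_i - a_i) / top)
    return sum(widths) / len(widths)

# Hypothetical toy data: rows are reports, columns are term occurrences.
docs = [
    [1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 1],
]
labels = ["A", "A", "A", "B", "B", "B"]  # stand-in for expert assessments

measures = {
    "euclidean": euclidean,
    "jaccard": jaccard,
    "simple_matching": simple_matching,
    "cosine": cosine,
    "yule": yule,
}
results = {name: average_silhouette_width(docs, labels, fn)
           for name, fn in measures.items()}

for name, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name:16s} {score:.3f}")
```

Scores closer to 1 indicate that a measure places reports with the same label near each other and far from the other group. Scores near 0 or below suggest the measure does not separate the labels, which is the kind of outcome the abstract attributes to insufficient preprocessing.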