Effects of Label Noise on Deep Learning-Based Skin Cancer Classification

Recent studies have shown that deep learning can classify dermatoscopic images at least as well as dermatologists. However, many studies in skin cancer classification use training images that are not biopsy-verified. This imperfect ground truth introduces a systematic error, but its effects on classifier performance are currently unknown. Here, we systematically examine the effects of label noise by training and evaluating convolutional neural networks (CNNs) on 804 images of melanoma and nevi labeled either by dermatologists or by biopsy. The CNNs are evaluated on a test set of 384 images by means of 4-fold cross-validation, comparing their outputs with either the corresponding dermatological or the biopsy-verified diagnosis. With identical ground truths for training and test labels, high accuracies of 75.03% (95% CI: 74.39–75.66%) for dermatological and 73.80% (95% CI: 73.10–74.51%) for biopsy-verified labels are achieved. However, if the CNN is trained and tested with different ground truths, accuracy drops significantly to 64.53% (95% CI: 63.12–65.94%, p < 0.01) on a non-biopsy-verified and to 64.24% (95% CI: 62.66–65.83%, p < 0.01) on a biopsy-verified test set. In conclusion, deep learning methods for skin cancer classification are highly sensitive to label noise, and future work should use biopsy-verified training images to mitigate this problem.


Bibliographic Details
Main Authors: Achim Hekler, Jakob N. Kather, Eva Krieghoff-Henning, Jochen S. Utikal, Friedegund Meier, Frank F. Gellrich, Julius Upmeier zu Belzen, Lars French, Justin G. Schlager, Kamran Ghoreschi, Tabea Wilhelm, Heinz Kutzner, Carola Berking, Markus V. Heppt, Sebastian Haferkamp, Wiebke Sondermann, Dirk Schadendorf, Bastian Schilling, Benjamin Izar, Roman Maron, Max Schmitt, Stefan Fröhling, Daniel B. Lipka, Titus J. Brinker
Format: Article
Language: English
Published: Frontiers Media S.A. 2020-05-01
Series: Frontiers in Medicine
Subjects: dermatology, artificial intelligence, label noise, skin cancer, melanoma, nevi
Online Access: https://www.frontiersin.org/article/10.3389/fmed.2020.00177/full
id doaj-7c8a3d919b4d4e7da94887fa5381fe8d
record_format Article
collection DOAJ
language English
format Article
sources DOAJ
author Achim Hekler
Jakob N. Kather
Eva Krieghoff-Henning
Jochen S. Utikal
Friedegund Meier
Frank F. Gellrich
Julius Upmeier zu Belzen
Lars French
Justin G. Schlager
Kamran Ghoreschi
Tabea Wilhelm
Heinz Kutzner
Carola Berking
Markus V. Heppt
Sebastian Haferkamp
Wiebke Sondermann
Dirk Schadendorf
Bastian Schilling
Benjamin Izar
Roman Maron
Max Schmitt
Stefan Fröhling
Daniel B. Lipka
Titus J. Brinker
author_sort Achim Hekler
title Effects of Label Noise on Deep Learning-Based Skin Cancer Classification
publisher Frontiers Media S.A.
series Frontiers in Medicine
issn 2296-858X
publishDate 2020-05-01
description Recent studies have shown that deep learning can classify dermatoscopic images at least as well as dermatologists. However, many studies in skin cancer classification use training images that are not biopsy-verified. This imperfect ground truth introduces a systematic error, but its effects on classifier performance are currently unknown. Here, we systematically examine the effects of label noise by training and evaluating convolutional neural networks (CNNs) on 804 images of melanoma and nevi labeled either by dermatologists or by biopsy. The CNNs are evaluated on a test set of 384 images by means of 4-fold cross-validation, comparing their outputs with either the corresponding dermatological or the biopsy-verified diagnosis. With identical ground truths for training and test labels, high accuracies of 75.03% (95% CI: 74.39–75.66%) for dermatological and 73.80% (95% CI: 73.10–74.51%) for biopsy-verified labels are achieved. However, if the CNN is trained and tested with different ground truths, accuracy drops significantly to 64.53% (95% CI: 63.12–65.94%, p < 0.01) on a non-biopsy-verified and to 64.24% (95% CI: 62.66–65.83%, p < 0.01) on a biopsy-verified test set. In conclusion, deep learning methods for skin cancer classification are highly sensitive to label noise, and future work should use biopsy-verified training images to mitigate this problem.
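The evaluation protocol described in the abstract (train on one label source, score against either label source, averaged over 4 cross-validation folds) can be sketched as follows. This is a minimal illustration, not the authors' code: a scikit-learn logistic regression on synthetic features stands in for the CNN, random label flips stand in for dermatologist/biopsy disagreement, and only the counts (804 training images, 384 test images, 4 folds) are taken from the abstract; all other names and values are assumptions.

# Minimal sketch (not the authors' code) of the label-noise evaluation protocol:
# train a classifier on one label source, score it against either label source,
# averaged over 4 cross-validation folds. A logistic regression on synthetic
# features stands in for the CNN; random label flips stand in for
# dermatologist/biopsy disagreement. Only the counts (804 training images,
# 384 test images, 4 folds) follow the abstract; everything else is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

n_train, n_test, n_features = 804, 384, 32
X = rng.normal(size=(n_train + n_test, n_features))
# "Biopsy-verified" labels: a simple linear ground truth on the features.
y_biopsy = (X[:, 0] + 0.5 * rng.normal(size=len(X)) > 0).astype(int)
# "Dermatologist" labels: biopsy labels with ~20% random flips (label noise).
flip = rng.random(len(X)) < 0.2
y_derm = np.where(flip, 1 - y_biopsy, y_biopsy)

X_train, X_test = X[:n_train], X[n_train:]
train_labels = {"derm": y_derm[:n_train], "biopsy": y_biopsy[:n_train]}
test_labels = {"derm": y_derm[n_train:], "biopsy": y_biopsy[n_train:]}

def cv_accuracy(train_src, test_src, n_splits=4):
    """Train on one label source, evaluate on the fixed test set against another."""
    accs = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for fold_idx, _ in kf.split(X_train):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_train[fold_idx], train_labels[train_src][fold_idx])
        accs.append((clf.predict(X_test) == test_labels[test_src]).mean())
    return float(np.mean(accs)), float(np.std(accs))

for train_src in ("derm", "biopsy"):
    for test_src in ("derm", "biopsy"):
        mean, std = cv_accuracy(train_src, test_src)
        print(f"train={train_src:6s} test={test_src:6s} accuracy={mean:.3f} +/- {std:.3f}")

Note that the random flips only illustrate the bookkeeping of the four train/test label combinations; the systematic disagreement between dermatologist and biopsy labels that drives the reported accuracy drop is not modeled here.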
topic dermatology
artificial intelligence
label noise
skin cancer
melanoma
nevi
url https://www.frontiersin.org/article/10.3389/fmed.2020.00177/full
spelling doaj-7c8a3d919b4d4e7da94887fa5381fe8d (record updated 2020-11-25T02:07:00Z)
eng | Frontiers Media S.A. | Frontiers in Medicine | ISSN 2296-858X | Vol. 7 | 2020-05-01 | doi:10.3389/fmed.2020.00177 | Article ID 536659
Effects of Label Noise on Deep Learning-Based Skin Cancer Classification
Author affiliations:
Achim Hekler: National Center for Tumor Diseases, German Cancer Research Center, Heidelberg, Germany
Jakob N. Kather: National Center for Tumor Diseases, German Cancer Research Center, Heidelberg, Germany; Department of Medicine III, RWTH University Hospital Aachen, Aachen, Germany
Eva Krieghoff-Henning: National Center for Tumor Diseases, German Cancer Research Center, Heidelberg, Germany
Jochen S. Utikal: Department of Dermatology, Heidelberg University, Mannheim, Germany; Skin Cancer Unit, German Cancer Research Center, Heidelberg, Germany
Friedegund Meier: Skin Cancer Center at the University Cancer Centre and National Center for Tumor Diseases Dresden, Dresden, Germany; Department of Dermatology, University Hospital Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
Frank F. Gellrich: Skin Cancer Center at the University Cancer Centre and National Center for Tumor Diseases Dresden, Dresden, Germany; Department of Dermatology, University Hospital Carl Gustav Carus, Technische Universität Dresden, Dresden, Germany
Julius Upmeier zu Belzen: Berlin Institute of Health (BIH), Charité, Berlin, Germany
Lars French: Department of Dermatology and Allergology, Ludwig Maximilian University of Munich, Munich, Germany
Justin G. Schlager: Department of Dermatology and Allergology, Ludwig Maximilian University of Munich, Munich, Germany
Kamran Ghoreschi: Department of Dermatology, Venereology and Allergology, Charité–Universitätsmedizin Berlin, Berlin, Germany
Tabea Wilhelm: Department of Dermatology, Venereology and Allergology, Charité–Universitätsmedizin Berlin, Berlin, Germany
Heinz Kutzner: Dermatopathology Laboratory, Friedrichshafen, Germany
Carola Berking: Department of Dermatology, University Hospital Erlangen, Erlangen, Germany
Markus V. Heppt: Department of Dermatology, University Hospital Erlangen, Erlangen, Germany
Sebastian Haferkamp: Department of Dermatology, University Hospital Regensburg, Regensburg, Germany
Wiebke Sondermann: Department of Dermatology, University Hospital Essen, Essen, Germany
Dirk Schadendorf: Department of Dermatology, University Hospital Essen, Essen, Germany
Bastian Schilling: Department of Dermatology, University Hospital Würzburg, Würzburg, Germany
Benjamin Izar: Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, United States
Roman Maron: National Center for Tumor Diseases, German Cancer Research Center, Heidelberg, Germany
Max Schmitt: National Center for Tumor Diseases, German Cancer Research Center, Heidelberg, Germany
Stefan Fröhling: National Center for Tumor Diseases, German Cancer Research Center, Heidelberg, Germany; Translational Cancer Epigenomics, Division of Translational Medical Oncology, German Cancer Research Center (DKFZ), Heidelberg, Germany
Daniel B. Lipka: National Center for Tumor Diseases, German Cancer Research Center, Heidelberg, Germany; Translational Cancer Epigenomics, Division of Translational Medical Oncology, German Cancer Research Center (DKFZ), Heidelberg, Germany; Faculty of Medicine, Medical Center, Otto-von-Guericke-University, Magdeburg, Germany
Titus J. Brinker: National Center for Tumor Diseases, German Cancer Research Center, Heidelberg, Germany