Prediction of secondary testosterone deficiency using machine learning: A comparative analysis of ensemble and base classifiers, probability calibration, and sampling strategies in a slightly imbalanced dataset

Testosterone is the most important male sex hormone, and its deficiency causes considerable physical and mental harm. Identifying individuals with low testosterone efficiently is crucial before starting appropriate treatment. However, routine monitoring of testosterone levels can be costly in many regions, re...


Bibliographic Details
Main Authors: Monique Tonani Novaes, Osmar Luiz Ferreira de Carvalho, Pedro Henrique Guimarães Ferreira, Taciana Leonel Nunes Tiraboschi, Caroline Santos Silva, Jean Carlos Zambrano, Cristiano Mendes Gomes, Eduardo de Paula Miranda, Osmar Abílio de Carvalho Júnior, José de Bessa Júnior
Format: Article
Language: English
Published: Elsevier 2021-01-01
Series: Informatics in Medicine Unlocked
Subjects: Machine learning; Imbalanced data; Testosterone deficiency; Ensemble classifier
Online Access: http://www.sciencedirect.com/science/article/pii/S2352914821000289
id doaj-7ca6f7985c494a439acfafd410365d4a
record_format Article
collection DOAJ
language English
format Article
sources DOAJ
author Monique Tonani Novaes
Osmar Luiz Ferreira de Carvalho
Pedro Henrique Guimarães Ferreira
Taciana Leonel Nunes Tiraboschi
Caroline Santos Silva
Jean Carlos Zambrano
Cristiano Mendes Gomes
Eduardo de Paula Miranda
Osmar Abílio de Carvalho Júnior
José de Bessa Júnior
title Prediction of secondary testosterone deficiency using machine learning: A comparative analysis of ensemble and base classifiers, probability calibration, and sampling strategies in a slightly imbalanced dataset
publisher Elsevier
series Informatics in Medicine Unlocked
issn 2352-9148
publishDate 2021-01-01
description Testosterone is the most important male sex hormone, and its deficiency causes considerable physical and mental harm. Identifying individuals with low testosterone efficiently is crucial before starting appropriate treatment. However, routine monitoring of testosterone levels can be costly in many regions, which leads to underreporting of cases, especially in developing countries. Moreover, few studies employ machine learning (ML) to predict testosterone deficiency. This research therefore aims to offer a coherent comparative analysis of machine learning methods that can predict testosterone deficiency without requiring patients to undergo costly medical tests. We also provide the urological community with a publicly available dataset (https://github.com/osmarluiz/Testosterone-Deficiency-Dataset) to encourage research in this largely untapped field. For this analysis, we used ten base classifiers (optimized with grid search and stratified K-fold cross-validation), three ensemble methods, and eight sampling strategies on data from a total of 3397 patients. The analysis was based on six features (age, abdominal circumference, triglycerides, high-density lipoprotein, diabetes, and hypertension), all of which were obtained by low-cost exams. We compared the sampling strategies and the classifiers' performance on an independent test set using ranking (PR-AUC), probabilistic (Brier score), and threshold metrics.
We found that: (1) on the ranking metrics, sampling strategies did not improve results in this slightly imbalanced (4:1 ratio) dataset; (2) the weighted-average ensemble classifier performed best; (3) the best base classifier was XGBoost; (4) calibration yielded significant improvement for the sampling strategies and slight improvement when no sampling was used; (5) McNemar's test showed statistically similar results among all classifiers; and (6) abdominal circumference (AC) had by far the highest feature importance, followed by triglycerides (TG). Age showed very little significance in predicting testosterone deficiency.
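The evaluation terms named in the description (PR-AUC, Brier score, a weighted-average ensemble, and McNemar's test) can be illustrated with a minimal standard-library sketch. This is not the authors' code: the function names, the toy labels and probabilities, and the ensemble weights below are all hypothetical, and average precision is used here as one common step-wise estimate of PR-AUC, which may differ from the estimator used in the article.

```python
import math


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def average_precision(y_true, y_prob):
    """Step-wise area under the precision-recall curve (average precision).

    Assumes at least one positive label; tied scores keep their input order.
    """
    order = sorted(range(len(y_true)), key=lambda i: y_prob[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / sum(y_true)


def weighted_average(prob_lists, weights):
    """Weighted soft vote: combine per-model probabilities sample by sample."""
    w_sum = sum(weights)
    return [sum(w * p for w, p in zip(weights, probs)) / w_sum
            for probs in zip(*prob_lists)]


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact (binomial) McNemar p-value from the discordant pairs."""
    only_a = sum(1 for y, a, b in zip(y_true, pred_a, pred_b) if a == y and b != y)
    only_b = sum(1 for y, a, b in zip(y_true, pred_a, pred_b) if a != y and b == y)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the two classifiers never disagree on correctness
    k = min(only_a, only_b)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy data with a 4:1 negative-to-positive ratio (not the study data).
y_true = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
model_a = [0.9, 0.2, 0.1, 0.3, 0.4, 0.6, 0.2, 0.1, 0.3, 0.2]
model_b = [0.7, 0.1, 0.3, 0.2, 0.1, 0.4, 0.5, 0.2, 0.1, 0.3]
ensemble = weighted_average([model_a, model_b], weights=[2, 1])  # assumed weights

for name, probs in [("model_a", model_a), ("model_b", model_b), ("ensemble", ensemble)]:
    print(name, "AP:", average_precision(y_true, probs), "Brier:", brier_score(y_true, probs))

# Threshold both models at 0.5 (an assumed cut-off) and compare their errors.
pred_a = [int(p >= 0.5) for p in model_a]
pred_b = [int(p >= 0.5) for p in model_b]
print("McNemar p-value:", mcnemar_exact(y_true, pred_a, pred_b))
```

A lower Brier score indicates better-calibrated probabilities, while a higher average precision indicates better ranking of positive cases. A large McNemar p-value means the two classifiers' error patterns are not distinguishable at the chosen significance level.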
topic Machine learning
Imbalanced data
Testosterone deficiency
Ensemble classifier
url http://www.sciencedirect.com/science/article/pii/S2352914821000289
spelling doaj-7ca6f7985c494a439acfafd410365d4a
2021-04-18T06:28:05Z
eng
Elsevier
Informatics in Medicine Unlocked
2352-9148
2021-01-01
Volume 23, article 100538
Author affiliations:
0 Monique Tonani Novaes: Department of Public Health and Epidemiology, Universidade Estadual de Feira de Santana, Avenida Transnordestina, S/n - Novo Horizonte, 44036-900, Feira de Santana, Bahia, Brazil
1 Osmar Luiz Ferreira de Carvalho: Department of Electrical Engineering, University of Brasília, University Campus Darcy Ribeiro, Asa Norte, University of Brasília, DF, 70910-900 Brasília, Brazil
2 Pedro Henrique Guimarães Ferreira: Department of Electrical Engineering, University of Brasília, University Campus Darcy Ribeiro, Asa Norte, University of Brasília, DF, 70910-900 Brasília, Brazil
3 Taciana Leonel Nunes Tiraboschi: Department of Public Health and Epidemiology, Universidade Estadual de Feira de Santana, Avenida Transnordestina, S/n - Novo Horizonte, 44036-900, Feira de Santana, Bahia, Brazil
4 Caroline Santos Silva: Department of Public Health and Epidemiology, Universidade Estadual de Feira de Santana, Avenida Transnordestina, S/n - Novo Horizonte, 44036-900, Feira de Santana, Bahia, Brazil
5 Jean Carlos Zambrano: Department of Public Health and Epidemiology, Universidade Estadual de Feira de Santana, Avenida Transnordestina, S/n - Novo Horizonte, 44036-900, Feira de Santana, Bahia, Brazil
6 Cristiano Mendes Gomes: Division of Urology, Universidade de São Paulo, São Paulo, São Paulo, Brazil
7 Eduardo de Paula Miranda: Division of Urology, Universidade Federal Do Ceara, Fortaleza, Ceara, Brazil
8 Osmar Abílio de Carvalho Júnior: Department of Geography, University of Brasília, University Campus Darcy Ribeiro, Asa Norte, University of Brasília, DF, 70910-900 Brasília, Brazil; Corresponding author.
9 José de Bessa Júnior: Division of Urology, Universidade Estadual de Feira de Santana, Feira de Santana, Bahia, Brazil; Corresponding author.