A refined approach for evaluating small datasets via binary classification using machine learning.

Bibliographic Details
Published in: PLoS ONE
Main Authors: Steffen Steinert, Verena Ruf, David Dzsotjan, Nicolas Großmann, Albrecht Schmidt, Jochen Kuhn, Stefan Küchemann
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2024-01-01
Online Access: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0301276&type=printable
Collection: DOAJ
Description: Classical statistical analysis of data can be complemented or replaced with data analysis based on machine learning. However, in certain disciplines, such as education research, studies are frequently limited to small datasets, which raises several questions regarding biases and coincidentally positive results. In this study, we present a refined approach for evaluating the performance of a binary classification based on machine learning for small datasets. The approach includes a non-parametric permutation test as a method to quantify the probability of the results generalising to new data. Furthermore, we found that a repeated nested cross-validation is almost free of biases and yields reliable results that are only slightly dependent on chance. Considering the advantages of several evaluation metrics, we suggest a combination of more than one metric to train and evaluate machine learning classifiers. In the specific case that both classes are equally important, the Matthews correlation coefficient exhibits the lowest bias and chance for coincidentally good results. The results indicate that it is essential to avoid several biases when analysing small datasets using machine learning.
Record ID: doaj-art-9d8d4aa6455d44e5ad35bb3fcfc1b3ec
Institution: Directory of Open Access Journals
ISSN: 1932-6203
Citation: PLoS ONE 19(5): e0301276 (2024-01-01)
DOI: 10.1371/journal.pone.0301276
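
Illustration (not part of the catalogued record): the description mentions the Matthews correlation coefficient and a non-parametric permutation test. The sketch below shows one simple way these two ideas could fit together, using only the Python standard library. It is not the authors' implementation. The function names `mcc` and `permutation_p_value`, the example labels, and the choice to permute labels against fixed predictions (rather than retraining a classifier on each permuted dataset, or using repeated nested cross-validation) are all assumptions made for brevity.

```python
import math
import random


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def permutation_p_value(y_true, y_pred, n_permutations=1000, seed=0):
    """Fraction of label shufflings whose MCC is at least the observed MCC.

    Simplifying assumption: predictions stay fixed and only the true labels
    are shuffled; no model is retrained on the permuted labels.
    """
    rng = random.Random(seed)
    observed = mcc(y_true, y_pred)
    shuffled = list(y_true)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if mcc(shuffled, y_pred) >= observed:
            hits += 1
    # The +1 terms count the observed labelling itself, so p is never 0.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical labels for a small dataset of eight samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
score = mcc(y_true, y_pred)
p_value = permutation_p_value(y_true, y_pred)
```

A small p-value suggests the observed score is unlikely to arise from a chance pairing of labels and predictions. With so few samples the p-value is coarse, which is in line with the concern about coincidentally positive results that the description raises.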