A refined approach for evaluating small datasets via binary classification using machine learning.
Classical statistical analysis of data can be complemented or replaced with data analysis based on machine learning. However, in certain disciplines, such as education research, studies are frequently limited to small datasets, which raises several questions regarding biases and coincidentally posit...
| Published in: | PLoS ONE |
|---|---|
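| Main Authors: | Steffen Steinert, Verena Ruf, David Dzsotjan, Nicolas Großmann, Albrecht Schmidt, Jochen Kuhn, Stefan Küchemann |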
| Format: | Article |
| Language: | English |
| Published: | Public Library of Science (PLoS), 2024-01-01 |
| Online Access: | https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0301276&type=printable |
| _version_ | 1850759001520734208 |
|---|---|
| author | Steffen Steinert Verena Ruf David Dzsotjan Nicolas Großmann Albrecht Schmidt Jochen Kuhn Stefan Küchemann |
| author_sort | Steffen Steinert |
| collection | DOAJ |
| container_title | PLoS ONE |
| description | Classical statistical analysis of data can be complemented or replaced with data analysis based on machine learning. However, in certain disciplines, such as education research, studies are frequently limited to small datasets, which raises several questions regarding biases and coincidentally positive results. In this study, we present a refined approach for evaluating the performance of a binary classification based on machine learning for small datasets. The approach includes a non-parametric permutation test as a method to quantify the probability of the results generalising to new data. Furthermore, we found that a repeated nested cross-validation is almost free of biases and yields reliable results that are only slightly dependent on chance. Considering the advantages of several evaluation metrics, we suggest a combination of more than one metric to train and evaluate machine learning classifiers. In the specific case that both classes are equally important, the Matthews correlation coefficient exhibits the lowest bias and chance for coincidentally good results. The results indicate that it is essential to avoid several biases when analysing small datasets using machine learning. |
| format | Article |
| id | doaj-art-9d8d4aa6455d44e5ad35bb3fcfc1b3ec |
| institution | Directory of Open Access Journals |
| issn | 1932-6203 |
| language | English |
| publishDate | 2024-01-01 |
| publisher | Public Library of Science (PLoS) |
| record_format | Article |
| spelling | doaj-art-9d8d4aa6455d44e5ad35bb3fcfc1b3ec; 2025-08-19T22:34:28Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2024-01-01; 19; 5; e0301276; 10.1371/journal.pone.0301276 |
| title | A refined approach for evaluating small datasets via binary classification using machine learning. |
| url | https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0301276&type=printable |
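
The description names two building blocks, the Matthews correlation coefficient (MCC) and a non-parametric permutation test. The sketch below illustrates both on hypothetical toy data, using only the Python standard library. It is not the authors' pipeline: it shuffles the labels against a fixed set of predictions instead of retraining a classifier for each permutation, and it omits the repeated nested cross-validation entirely. The function names, the toy labels, and the number of permutations are assumptions made for illustration.

```python
import math
import random


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: report 0.0 when the denominator vanishes.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def permutation_p_value(y_true, y_pred, n_permutations=1000, seed=0):
    """Return the observed MCC and the share of label shuffles that match or beat it."""
    rng = random.Random(seed)
    observed = matthews_corrcoef(y_true, y_pred)
    shuffled = list(y_true)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if matthews_corrcoef(shuffled, y_pred) >= observed:
            hits += 1
    # The +1 terms count the observed labelling itself, so p is never exactly 0.
    return observed, (hits + 1) / (n_permutations + 1)


# Hypothetical toy data: 12 samples, 10 of them classified correctly.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
mcc, p_value = permutation_p_value(y_true, y_pred)
print(f"MCC = {mcc:.3f}, permutation p-value = {p_value:.3f}")
```

A small p-value here means that few random relabelings reach the observed MCC, which is the sense in which a permutation test guards against coincidentally good results on a small dataset.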
