A new parallel data geometry analysis algorithm to select training data for support vector machine

Support vector machine (SVM) is one of the most powerful machine learning techniques and has attracted wide attention because of its remarkable performance. However, on classification problems with large-scale datasets, the high complexity of the SVM model leads to low efficiency an...


Bibliographic Details
Main Authors: Yunfeng Shi, Shu Lv, Kaibo Shi
Format: Article
Language: English
Published: AIMS Press 2021-09-01
Series: AIMS Mathematics
Subjects: support vector machine; sample reduction; geometry analysis; Mahalanobis distance; parallel
Online Access: https://www.aimspress.com/article/doi/10.3934/math.2021806?viewType=HTML
id doaj-215dcf1a2db24681bfe4731e4eb1f342
record_format Article
spelling doaj-215dcf1a2db24681bfe4731e4eb1f342; 2021-10-11T01:27:32Z; eng; AIMS Press; AIMS Mathematics; ISSN 2473-6988; 2021-09-01; vol. 6, no. 12, pp. 13931-13953; doi 10.3934/math.2021806
Yunfeng Shi (0): 1. School of Mathematical Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China
Shu Lv (1): 1. School of Mathematical Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China
Kaibo Shi (2): 2. Yangtze Delta Region Institute (Huzhou), University of Electronic Science and Technology of China, Huzhou 313001, China; 3. School of Electronic Information and Electrical Engineering, Chengdu University, Sichuan Chengdu 610106, China
collection DOAJ
language English
format Article
sources DOAJ
author Yunfeng Shi
Shu Lv
Kaibo Shi
title A new parallel data geometry analysis algorithm to select training data for support vector machine
publisher AIMS Press
series AIMS Mathematics
issn 2473-6988
publishDate 2021-09-01
description Support vector machine (SVM) is one of the most powerful machine learning techniques and has attracted wide attention because of its remarkable performance. However, on classification problems with large-scale datasets, the high complexity of the SVM model makes training inefficient and impractical. Exploiting the sparsity of SVM in the sample space, this paper presents a new parallel data geometry analysis (PDGA) algorithm that reduces the SVM training set and thereby improves the efficiency of SVM training. PDGA introduces the Mahalanobis distance to measure the distance from each sample to its centroid. Based on this distance, it proposes a method that identifies non-support vectors and outliers at the same time, which helps remove redundant data. When the training set is reduced further, a cosine angle distance analysis method is used to determine whether samples are redundant, ensuring that valuable data are not removed. Unlike previous data geometry analysis methods, the PDGA algorithm is implemented in parallel, which greatly reduces the computational cost. Experimental results on an artificial dataset and 6 real datasets show that the algorithm adapts to different sample distributions, significantly reduces training time and memory requirements without sacrificing classification accuracy, and clearly outperforms the five competing algorithms.
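The two geometric quantities named in the description can be illustrated with a minimal Python sketch. This is not the authors' PDGA implementation. The function names, the quantile band used to discard inner points and outliers, and the choice to parallelise over class labels with a thread pool are all assumptions made only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def mahalanobis_to_centroid(X):
    """Mahalanobis distance from every row of X to the centroid of X."""
    X = np.asarray(X, dtype=float)
    centroid = X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # The pseudo-inverse keeps the sketch usable when the covariance is singular.
    cov_inv = np.linalg.pinv(cov)
    diff = X - centroid
    squared = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(squared, 0.0))


def cosine_angle(u, v):
    """Cosine of the angle between two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def select_band(X, low_q=0.5, high_q=0.95):
    """Hypothetical rule: keep indices whose distance lies between two quantiles.

    Points close to the centroid are treated as likely non-support vectors and
    the farthest points as likely outliers. The thresholds are assumptions.
    """
    d = mahalanobis_to_centroid(X)
    lo, hi = np.quantile(d, [low_q, high_q])
    return np.flatnonzero((d >= lo) & (d <= hi))


def per_class_distances(X, y):
    """Compute the distances for each class independently in a thread pool."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    labels = np.unique(y)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(lambda c: mahalanobis_to_centroid(X[y == c]), labels))
    return dict(zip(labels.tolist(), results))
```

Each class is processed independently here, so the per-class work can be distributed across workers. How the paper itself partitions the computation is not stated in this record.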
topic support vector machine
sample reduction
geometry analysis
Mahalanobis distance
parallel
url https://www.aimspress.com/article/doi/10.3934/math.2021806?viewType=HTML