Discrimination of <i>Gentiana</i> and Its Related Species Using IR Spectroscopy Combined with Feature Selection and Stacked Generalization

<i>Gentiana</i> is one of the largest genera of Gentianoideae; most of its species have potential pharmaceutical value and are used in local traditional medicine. Because phytochemical composition and bioactive compounds differ among species, it is crucial t...

Bibliographic Details
Main Authors: Tao Shen, Hong Yu, Yuan-Zhong Wang
Format: Article
Language: English
Published: MDPI AG 2020-03-01
Series: Molecules
Subjects:
nir
Online Access: https://www.mdpi.com/1420-3049/25/6/1442
id doaj-2832225839554379aa1eb7352f437101
record_format Article
spelling doaj-2832225839554379aa1eb7352f437101; 2020-11-25T03:50:59Z; eng; MDPI AG; Molecules; ISSN 1420-3049; 2020-03-01; vol. 25, no. 6, article 1442; DOI 10.3390/molecules25061442
Tao Shen: Yunnan Herbal Laboratory, Institute of Herb Biotic Resources, School of Life and Sciences, Yunnan University, Kunming 650091, China
Hong Yu: Yunnan Herbal Laboratory, Institute of Herb Biotic Resources, School of Life and Sciences, Yunnan University, Kunming 650091, China
Yuan-Zhong Wang: Medicinal Plants Research Institute, Yunnan Academy of Agricultural Sciences, Kunming 650200, China
collection DOAJ
language English
format Article
sources DOAJ
author Tao Shen
Hong Yu
Yuan-Zhong Wang
title Discrimination of <i>Gentiana</i> and Its Related Species Using IR Spectroscopy Combined with Feature Selection and Stacked Generalization
publisher MDPI AG
series Molecules
issn 1420-3049
publishDate 2020-03-01
description <i>Gentiana</i> is one of the largest genera of Gentianoideae; most of its species have potential pharmaceutical value and are used in local traditional medicine. Because phytochemical composition and bioactive compounds differ among species, it is crucial to accurately identify authentic <i>Gentiana</i> species. This paper studies the feasibility of using infrared spectroscopy combined with chemometric analysis to identify <i>Gentiana</i> and its related species. A total of 180 batches of raw spectral fingerprints were obtained from 18 species of <i>Gentiana</i> and <i>Tripterospermum</i> by near-infrared (NIR: 10,000−4000 cm<sup>−1</sup>) and Fourier transform mid-infrared (MIR: 4000−600 cm<sup>−1</sup>) spectroscopy. First, principal component analysis (PCA) was used to explore the natural grouping of the 180 samples. Second, random forest (RF), support vector machine (SVM), and K-nearest neighbors (KNN) models were built on the full spectra (1487 NIR variables and 1214 FT-MIR variables, respectively). Based on the results for the calibration and prediction sets, the MIR-SVM model had a higher classification accuracy than the other models. Five feature selection strategies, VIP (variable importance in the projection), Boruta, GARF (genetic algorithm combined with random forest), GASVM (genetic algorithm combined with support vector machine), and Venn diagram calculation, were then used to reduce the number of variables for modeling. In the end, 101 NIR and 73 FT-MIR bands were selected as the feature variables. Third, stacking models were built on the optimal spectral dataset. Most of the stacking models performed better than the full-spectra models. RF and SVM as base learners, combined with an SVM meta-classifier, formed the optimal stacked generalization strategy. For the SG-Ven-MIR-SVM model, the accuracy (ACC) on both the calibration set and the validation set was 100%. Sensitivity (SE), specificity (SP), efficiency (EFF), Matthews correlation coefficient (MCC), and Cohen’s kappa coefficient (K) were all 1, showing that this model had the best authenticity identification performance. These results indicate that stacked generalization combined with feature selection is probably an important technique for improving the predictive accuracy of classification models and avoiding overfitting. The results can provide a valuable reference for the safe and effective clinical application of medicinal <i>Gentiana</i>.
topic nir
ft-mir
species identification
<i>gentiana</i>
chemometrics
feature selection
stacked generalization
url https://www.mdpi.com/1420-3049/25/6/1442
_version_ 1724489365953970176
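
The description above reports accuracy (ACC), Matthews correlation coefficient (MCC), and Cohen's kappa coefficient (K) for an 18-class problem. As a rough illustration of how such scores can be computed from true and predicted labels, here is a minimal Python sketch. It is not the authors' code: the function names and the toy labels are hypothetical, and the MCC uses the standard multiclass generalization, which is assumed here rather than confirmed by the record.

```python
from collections import Counter
from math import sqrt


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    # Chance agreement from the marginal label frequencies.
    p_e = sum(true_counts[k] * pred_counts[k] for k in true_counts) / n ** 2
    if p_e == 1:  # degenerate case: a single label everywhere
        return 0.0
    return (p_o - p_e) / (1 - p_e)


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient (assumed generalization)."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    t_k, p_k = Counter(y_true), Counter(y_pred)
    labels = set(t_k) | set(p_k)
    numerator = correct * n - sum(t_k[k] * p_k[k] for k in labels)
    denominator = sqrt(
        (n ** 2 - sum(p_k[k] ** 2 for k in labels))
        * (n ** 2 - sum(t_k[k] ** 2 for k in labels))
    )
    return numerator / denominator if denominator else 0.0


# Toy labels (made up, NOT the study's data): 18 classes, 3 samples each,
# all predicted correctly, which mirrors the "all scores equal 1" situation.
toy_true = [species for species in range(18) for _ in range(3)]
toy_pred = list(toy_true)
print(accuracy(toy_true, toy_pred),
      cohen_kappa(toy_true, toy_pred),
      multiclass_mcc(toy_true, toy_pred))
```

With a perfect prediction all three scores reach their maximum of 1. A single misclassification lowers MCC and kappa more than it lowers raw accuracy, which is why these scores are often reported alongside ACC.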