Sliced-based sufficient dimension reduction for binary imbalanced data


Bibliographic Details
Main Authors: HSU, WEI-TSE, 徐維澤
Other Authors: WU, HAN-MING
Format: Others
Language: zh-TW
Published: 2019
Online Access: http://ndltd.ncl.edu.tw/handle/97bqn3
Description
Summary: Master's thesis === National Taipei University === Department of Statistics === 107 === Applying sufficient dimension reduction (SDR) methods to high-dimensional data to find the effective dimension reduction (DR) directions helps users explore the intrinsic structure of the data in a low-dimensional subspace. The dimension-reduced data can be regarded as features of the raw data and can further be employed in classification and/or clustering problems. It has been shown that applying binary classification rules to imbalanced data causes prediction bias. Imbalanced data here means a dataset in which the numbers of observations in the two categories of the response variable differ substantially. Although many studies have examined the effects of imbalanced data on classification rules, very little has been reported in the literature on applying SDR to imbalanced data. This study therefore investigates the effects of binary imbalanced data on four SDR methods: Sliced Inverse Regression (SIR), Sliced Average Variance Estimation (SAVE), Difference of Covariance (DOC), and principal Hessian directions (pHd). The performance of these methods is evaluated through simulation studies and a real data analysis, with and without a pre-balancing process. The numerical experiments show that pre-balancing is needed for SIR when the imbalanced data consist of two groups with similar means and a smaller variance in the positive class. For SAVE, pre-balancing is optional, even though the bias of the DR estimates of SAVE would be larger than that of SIR. The performance of DOC and pHd is worse than that of SIR and SAVE, which suggests that DOC and pHd are not suitable for imbalanced data.
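
The summary compares SIR with and without a pre-balancing step. The following is a minimal sketch of that comparison, not the thesis's own procedure. It assumes a binary response treated as two slices, in which case the SIR estimate reduces to a single direction proportional to the inverse covariance times the difference of the class means. It also assumes random undersampling of the majority class as the pre-balancing method, which the record does not specify. The function names `sir_binary` and `undersample` and the simulated data are hypothetical.

```python
import numpy as np

def sir_binary(X, y):
    """Two-slice SIR direction for a binary response (illustrative sketch).

    With two slices the between-slice matrix has rank one, so the leading
    direction is proportional to cov(X)^{-1} (mean_1 - mean_0).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    cov = np.cov(X, rowvar=False)                      # p x p sample covariance
    diff = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    b = np.linalg.solve(cov, diff)                     # cov^{-1} (mean_1 - mean_0)
    return b / np.linalg.norm(b)                       # unit-length direction

def undersample(X, y, rng):
    """Assumed pre-balancing: randomly undersample the larger class."""
    idx0 = np.flatnonzero(y == 0)
    idx1 = np.flatnonzero(y == 1)
    m = min(len(idx0), len(idx1))
    keep = np.concatenate([
        rng.choice(idx0, size=m, replace=False),
        rng.choice(idx1, size=m, replace=False),
    ])
    return X[keep], y[keep]

# Hypothetical simulated data: only the first coordinate drives the response,
# and the positive class is rare (roughly 10% of observations).
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))
y = (X[:, 0] > 1.3).astype(int)

b_raw = sir_binary(X, y)            # estimate on the imbalanced data
Xb, yb = undersample(X, y, rng)
b_bal = sir_binary(Xb, yb)          # estimate after pre-balancing

print("positive rate:", y.mean())
print("raw direction:     ", np.round(b_raw, 3))
print("balanced direction:", np.round(b_bal, 3))
```

In this toy setting the class means differ clearly, so both estimates should point mostly along the first coordinate. The case the summary highlights, similar group means with a smaller variance in the positive class, would need a different simulation design.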