Comparison of Regularization Methods for Variable Selection in Highly Correlated Data

Bibliographic Details
Main Authors: Ching-Hsuan Chang, 張靜萱
Other Authors: 周呈霙
Format: Others
Language: en_US
Published: 2019
Online Access: http://ndltd.ncl.edu.tw/handle/5b3838
Description
Summary: Master's === National Taiwan University === Master's Program in Statistics === 107 === Regularization methods can perform variable selection on collinear data. For example, the elastic net has been proven to have a grouping effect, selecting all or none of a group of highly correlated variables. The objective of this study is to compare the selection behavior of four regularization methods, namely LASSO, elastic net, empirical Bayesian LASSO (EBLASSO), and empirical Bayesian elastic net (EBENet), under different levels of correlation. Through simulation studies at a fixed sample size and number of variables, I found that: (1) For data in which high correlation exists only between true variables, elastic net should be chosen for its better variable selection, coefficient estimation, and prediction. (2) For data in which correlations exist between true variables and irrelevant variables, EBLASSO and EBENet are good choices regardless of the level of correlation, because of their performance in variable selection, coefficient estimation, and prediction. (3) In general, as the correlations decrease, all four regularization methods improve in variable selection and coefficient estimation. Finally, EBLASSO was found to outperform the other three regularization methods on the healthcare dataset. Since correlation may exist between true variables and irrelevant variables in that dataset, this result is consistent with the second conclusion above. In addition, for several groups of highly correlated variables in the real data, such as height, weight, and lean body mass, the elastic net selected all variables in each group into the model, which matches the grouping-effect theorem. Because the simulation results are consistent with the outcomes of the real data analysis, the simulation findings may serve as a reference when choosing a variable selection method for real data.
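
The grouping effect mentioned in the summary can be illustrated with a small, self-contained sketch. This is not the thesis's code or simulation design: the toy data, the penalty values, the function names (`soft_threshold`, `elastic_net_cd`), and the objective scaling (1/(2n))·||y − Xb||² + lam1·||b||₁ + (lam2/2)·||b||² are all assumptions chosen for illustration, and the empirical Bayesian methods are not covered. Two predictors are exact copies of each other, the most extreme case of correlation.

```python
import random

def soft_threshold(z, g):
    """Soft-thresholding operator used by L1-penalised coordinate descent."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0

def elastic_net_cd(X, y, lam1, lam2, n_iter=200):
    """Cyclic coordinate descent for
    (1/(2n))*||y - Xb||^2 + lam1*||b||_1 + (lam2/2)*||b||^2.
    Setting lam2 = 0 gives the LASSO. Illustrative only, no convergence check."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb, with b = 0 initially
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of column j with the partial residual (column j added back).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam1) / (col_sq[j] + lam2)
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

# Toy data (assumed): column 1 is an exact copy of column 0.
random.seed(0)
n = 50
x = [random.gauss(0.0, 1.0) for _ in range(n)]
X = [[xi, xi] for xi in x]
y = [3.0 * xi + random.gauss(0.0, 0.1) for xi in x]

lasso = elastic_net_cd(X, y, lam1=0.1, lam2=0.0)  # pure L1 penalty
enet = elastic_net_cd(X, y, lam1=0.1, lam2=0.5)   # L1 plus ridge penalty

print("LASSO coefficients:      ", lasso)
print("Elastic net coefficients:", enet)
```

With identical columns the LASSO solution is not unique, and this coordinate-descent run places essentially all of the weight on the first copy. The ridge term makes the elastic net objective strictly convex, so the two copies receive equal coefficients and enter the model together.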