Improved two-stage model averaging for high-dimensional linear regression, with application to Riboflavin data analysis

Bibliographic Details
Main Author: Pan, J. (Author)
Format: Article
Language: English
Published: BioMed Central Ltd 2021
Online Access: View Fulltext in Publisher
LEADER 03334nam a2200445Ia 4500
001 10.1186-s12859-021-04053-3
008 220427s2021 CNT 000 0 und d
020 |a 14712105 (ISSN) 
245 1 0 |a Improved two-stage model averaging for high-dimensional linear regression, with application to Riboflavin data analysis 
260 0 |b BioMed Central Ltd  |c 2021 
856 |z View Fulltext in Publisher  |u https://doi.org/10.1186/s12859-021-04053-3 
520 3 |a Background: Model averaging has attracted increasing attention in recent years for the analysis of high-dimensional data. By suitably weighting several competing statistical models, model averaging attempts to achieve stable and improved prediction. In this paper, we develop a two-stage model averaging procedure to enhance accuracy and stability in prediction for high-dimensional linear regression. First, we employ a high-dimensional variable selection method such as LASSO to screen out redundant predictors and construct a class of candidate models; then we apply jackknife cross-validation to optimize the model weights for averaging. Results: In simulation studies, the proposed technique outperforms commonly used alternative methods in the high-dimensional regression setting in terms of mean squared prediction error. We apply the proposed method to a riboflavin data set; the results show that the method is quite efficient in forecasting the riboflavin production rate when there are thousands of genes and only tens of subjects. Conclusions: Compared with a recent high-dimensional model averaging procedure (Ando and Li in J Am Stat Assoc 109:254–65, 2014), the proposed approach has three appealing features and thus better predictive performance: (1) More suitable methods are applied for model construction and weighting. (2) Computational flexibility is retained, since each candidate model and its corresponding weight are determined in a low-dimensional setting and quadratic programming is used in the cross-validation. (3) Model selection and averaging are combined in the procedure, so it makes full use of the strengths of both techniques. As a consequence, the proposed method can achieve stable and accurate predictions in high-dimensional linear models and can greatly help practical researchers analyze genetic data in medical research. © 2021, The Author(s). 
650 0 4 |a Accurate prediction 
650 0 4 |a Clustering algorithms 
650 0 4 |a computer simulation 
650 0 4 |a Computer Simulation 
650 0 4 |a Corresponding weights 
650 0 4 |a Cross-validation 
650 0 4 |a data analysis 
650 0 4 |a Data Analysis 
650 0 4 |a Forecasting 
650 0 4 |a High dimensional data 
650 0 4 |a High-dimensional models 
650 0 4 |a High-dimensional regression 
650 0 4 |a High-dimensional regressions 
650 0 4 |a Jackknife 
650 0 4 |a Linear Models 
650 0 4 |a Model averaging 
650 0 4 |a Models, Statistical 
650 0 4 |a Predictive performance 
650 0 4 |a Quadratic programming 
650 0 4 |a Regression analysis 
650 0 4 |a riboflavin 
650 0 4 |a Riboflavin 
650 0 4 |a Squared prediction errors 
650 0 4 |a statistical model 
650 0 4 |a Variable selection 
650 0 4 |a Variable selection methods 
700 1 |a Pan, J.  |e author 
773 |t BMC Bioinformatics
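
Illustrative note: the second stage summarized in the abstract (jackknife cross-validation to choose model-averaging weights) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the author's implementation. The candidate predictor sets are hand-picked placeholders standing in for the output of a variable selection step such as LASSO, the data are synthetic, and all function names are hypothetical. The abstract mentions quadratic programming for the weights; to stay dependency-free, this sketch swaps in a plain Frank-Wolfe iteration on the same least-squares objective over the weight simplex.

```python
import random

def ols_fit(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(x) for x in X]
    p = len(rows[0])
    # Augmented matrix [X'X | X'y], reduced by Gauss-Jordan elimination.
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda k: abs(A[k][c]))  # partial pivoting
        A[c], A[piv] = A[piv], A[c]
        for k in range(p):
            if k != c:
                f = A[k][c] / A[c][c]
                A[k] = [a - f * b for a, b in zip(A[k], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def jackknife_predictions(X, y, cols):
    """Leave-one-out predictions of the OLS model restricted to columns `cols`."""
    Xs = [[row[c] for c in cols] for row in X]
    preds = []
    for i in range(len(y)):
        beta = ols_fit(Xs[:i] + Xs[i + 1:], y[:i] + y[i + 1:])
        preds.append(beta[0] + sum(b * v for b, v in zip(beta[1:], Xs[i])))
    return preds

def cv_error(P, y, w):
    """Sum of squared jackknife errors of the weighted average of the models."""
    return sum((yi - sum(wm * pm[i] for wm, pm in zip(w, P))) ** 2
               for i, yi in enumerate(y))

def jackknife_weights(P, y, iters=500):
    """Minimise cv_error over weights w >= 0 with sum(w) == 1 (Frank-Wolfe)."""
    M, n = len(P), len(y)
    w = [1.0 / M] * M  # start from equal weights
    for _ in range(iters):
        fit = [sum(w[m] * P[m][i] for m in range(M)) for i in range(n)]
        r = [yi - fi for yi, fi in zip(y, fit)]
        # Vertex with the steepest descent: the model most aligned with the residual.
        j = max(range(M), key=lambda m: sum(P[m][i] * r[i] for i in range(n)))
        d = [P[j][i] - fit[i] for i in range(n)]
        dd = sum(v * v for v in d)
        if dd == 0.0:
            break
        # Exact line search for the quadratic objective, clipped to [0, 1].
        gamma = min(1.0, max(0.0, sum(a * b for a, b in zip(r, d)) / dd))
        if gamma == 0.0:
            break
        w = [(1.0 - gamma) * wm for wm in w]
        w[j] += gamma
    return w

# Synthetic data (assumption): y depends on columns 0 and 1 only.
random.seed(0)
n = 30
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(n)]
y = [2.0 * x[0] - 1.0 * x[1] + random.gauss(0, 0.5) for x in X]

# Hypothetical candidate models, standing in for a stage-one selection step.
candidates = [[0], [0, 1], [2, 3]]
P = [jackknife_predictions(X, y, cols) for cols in candidates]
weights = jackknife_weights(P, y)
print([round(w, 3) for w in weights])
```

Each candidate model is fitted on only a few predictors, so every leave-one-out refit is a small low-dimensional problem, which is in the spirit of the computational point made in the abstract.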