Variable selection in social-environmental data: sparse regression and tree ensemble machine learning approaches

Abstract Background Social-environmental data obtained from the US Census is an important resource for understanding health disparities, but rarely is the full dataset utilized for analysis. A barrier to incorporating the full data is a lack of solid recommendations for variable selection, with rese...

Bibliographic Details
Main Authors: Elizabeth Handorf, Yinuo Yin, Michael Slifker, Shannon Lynch
Format: Article
Language: English
Published: BMC 2020-12-01
Series: BMC Medical Research Methodology
Subjects: Variable selection; Social environment
Online Access: https://doi.org/10.1186/s12874-020-01183-9
id doaj-d872c8d3d9694be78d4f1deefed4b04a
record_format Article
collection DOAJ
sources DOAJ
publisher BMC
series BMC Medical Research Methodology
issn 1471-2288
publishDate 2020-12-01
description Abstract
Background: Social-environmental data obtained from the US Census is an important resource for understanding health disparities, but the full dataset is rarely used for analysis. A barrier to incorporating the full data is a lack of solid recommendations for variable selection, with researchers often hand-selecting a few variables. Thus, we evaluated the ability of empirical machine learning approaches to identify social-environmental factors having a true association with a health outcome.
Methods: We compared several popular machine learning methods, including penalized regressions (e.g., lasso, elastic net) and tree ensemble methods. Via simulation, we assessed the methods' ability to identify census variables truly associated with binary and continuous outcomes while minimizing false positive results (10 true associations, 1000 total variables). We applied the most promising method to the full census data (p = 14,663 variables) linked to prostate cancer registry data (n = 76,186 cases) to identify social-environmental factors associated with advanced prostate cancer.
Results: In simulations, we found that elastic net identified many true-positive variables, while lasso provided good control of false positives. Using a combined measure of accuracy, hierarchical clustering based on Spearman's correlation, combined with sparse group lasso regression, performed best overall. Bayesian Additive Regression Trees outperformed other tree ensemble methods, but not the sparse group lasso. In the full dataset, the sparse group lasso successfully identified a subset of variables, three of which replicated earlier findings.
Conclusions: This analysis demonstrated the potential of empirical machine learning approaches to identify a small subset of census variables that have a true association with the outcome and that replicate across empirical methods. Sparse clustered regression models performed best, as they identified many true positive variables while controlling false positive discoveries.
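The comparison described in the abstract can be illustrated with a small sketch. The Python code below is not the authors' code; it is a scaled-down illustration under stated assumptions: 60 simulated predictors in correlated blocks with 5 true associations (rather than 10 of 1000), a continuous outcome only, arbitrary penalty values, and hypothetical function names (`simulate`, `selection_counts`, `spearman_groups`). It counts true and false positives for lasso and elastic net, and derives variable groups by hierarchical clustering on Spearman's correlation. Fitting the sparse group lasso itself is left out, because scikit-learn does not provide one; the group labels would be passed to a separate implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr
from sklearn.linear_model import ElasticNet, Lasso


def simulate(n=400, p=60, n_true=5, block=6, rho=0.7, seed=0):
    """Simulate block-correlated predictors and a continuous outcome.

    Assumption for illustration: the first variable of each of the
    first n_true blocks has a true effect of 1.0; all others have none.
    """
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((n, p // block))
    noise = rng.standard_normal((n, p))
    # Variables in the same block share a latent factor (correlation ~ rho).
    X = np.sqrt(rho) * np.repeat(shared, block, axis=1) + np.sqrt(1 - rho) * noise
    true_idx = np.arange(n_true) * block
    beta = np.zeros(p)
    beta[true_idx] = 1.0
    y = X @ beta + rng.standard_normal(n)
    return X, y, true_idx


def selection_counts(model, X, y, true_idx):
    """Fit a penalized model; return (true positives, false positives)."""
    model.fit(X, y)
    selected = np.flatnonzero(model.coef_)
    tp = np.intersect1d(selected, true_idx).size
    return tp, selected.size - tp


def spearman_groups(X, threshold=0.5):
    """Group variables by average-linkage clustering on 1 - |Spearman rho|."""
    corr, _ = spearmanr(X)
    dist = 1.0 - np.abs(corr)
    dist = (dist + dist.T) / 2.0  # enforce exact symmetry
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=threshold, criterion="distance")


if __name__ == "__main__":
    X, y, true_idx = simulate()
    for name, model in [("lasso", Lasso(alpha=0.1)),
                        ("elastic net", ElasticNet(alpha=0.1, l1_ratio=0.5))]:
        tp, fp = selection_counts(model, X, y, true_idx)
        print(f"{name}: true positives={tp}, false positives={fp}")
    groups = spearman_groups(X)
    print("number of variable groups:", np.unique(groups).size)
```

In a fuller replication, the penalty strengths would be tuned by cross-validation and the simulation repeated over many seeds before comparing methods; a single run like this one only shows the mechanics.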
topic Variable selection
Social environment
url https://doi.org/10.1186/s12874-020-01183-9