Comparison of Machine Learning and Land Use Regression for fine scale spatiotemporal estimation of ambient air pollution: Modeling ozone concentrations across the contiguous United States

Background: Spatial linear Land-Use Regression (LUR) is commonly used for long-term modeling of air pollution in support of exposure and epidemiological assessments. Machine Learning (ML) methods in conjunction with spatiotemporal modeling can provide more flexible exposure-relevant metrics and have...

Bibliographic Details
Main Authors:	Xiang Ren, Zhongyuan Mi, Panos G. Georgopoulos
Format:	Article
Language:	English
Published:	Elsevier 2020-09-01
Series:	Environment International
Subjects:	Machine learning Land use regression Ozone Spatiotemporal modeling Black-box model interpretation
Online Access:	http://www.sciencedirect.com/science/article/pii/S0160412020317827

id	doaj-6453560655fc486aa9eb81dd61ce754c
record_format	Article
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Xiang Ren Zhongyuan Mi Panos G. Georgopoulos
title	Comparison of Machine Learning and Land Use Regression for fine scale spatiotemporal estimation of ambient air pollution: Modeling ozone concentrations across the contiguous United States
publisher	Elsevier
series	Environment International
issn	0160-4120
publishDate	2020-09-01
description	Background: Spatial linear Land-Use Regression (LUR) is commonly used for long-term modeling of air pollution in support of exposure and epidemiological assessments. Machine Learning (ML) methods in conjunction with spatiotemporal modeling can provide more flexible exposure-relevant metrics and have been studied using different model structures. There is, however, a lack of comparisons of the methods available within these two modeling frameworks that can guide model/algorithm selection in air quality epidemiology. Objective: The present study compares thirteen algorithms for spatial/spatiotemporal modeling applied to daily maxima of 8-hour running averages of ambient ozone concentrations at spatial resolutions corresponding to census tracts, to support estimation of annual ozone design values across the contiguous US. These algorithms were selected from nine representative categories and trained using predictors that included chemistry-transport model predictions, meteorological factors, land use and land cover, and stationary and mobile emissions. Methods: To obtain the best predictive performance, model structures were optimized through a repeated coarse/fine grid search with expert knowledge. Six target-oriented validation strategies were used to prevent overfitting and avoid over-optimistic model evaluation results. In order to take full advantage of the power of different algorithms, we introduced tuning sample weights in spatiotemporal modeling to ensure predictive accuracy of peak concentrations, which is crucial for exposure assessments. In spatial modeling, four interpretation and visualization tools were introduced to explain predictions from different algorithms. Results: Nonlinear ML methods achieved higher prediction accuracy than linear LUR, and the improvements were more significant for spatiotemporal modeling (nearly 10%-40% decrease in predicted RMSE). By tuning the sample weights, spatiotemporal models can predict concentrations used to calculate ozone design values that are comparable to, or even better than, those from spatial models (nearly 30% decrease in cross-validated RMSE). We visualized the underlying nonlinear relationships, heterogeneous associations and complex interactions from the two best performing ML algorithms, i.e., Random Forest and Extreme Gradient Boosting, and found that the complex patterns were relatively less significant with respect to model accuracy for spatial modeling. Conclusion: Machine Learning can provide estimates that are actually more interpretable and practical than linear regression to improve accuracy in modeling human exposures. A careful design of hyperparameter tuning and flexible data splitting and validations is crucial to obtain reliable and stable results. Desirable/successful nonlinear models are expected to capture similar nonlinear patterns and interactions using different ML algorithms.
topic	Machine learning Land use regression Ozone Spatiotemporal modeling Black-box model interpretation
url	http://www.sciencedirect.com/science/article/pii/S0160412020317827
spelling	doaj-6453560655fc486aa9eb81dd61ce754c; 2020-11-25T02:52:19Z; eng; Elsevier; Environment International; 0160-4120; 2020-09-01; 142; 105827
	Xiang Ren (0); Zhongyuan Mi (1); Panos G. Georgopoulos (2)
	0: Environmental and Occupational Health Sciences Institute (EOHSI), Rutgers University, Piscataway, NJ 08854, USA; Department of Chemical and Biochemical Engineering, Rutgers University, Piscataway, NJ 08854, USA
	1: Environmental and Occupational Health Sciences Institute (EOHSI), Rutgers University, Piscataway, NJ 08854, USA; Department of Environmental Sciences, Rutgers University, New Brunswick, NJ 08901, USA
	2: Environmental and Occupational Health Sciences Institute (EOHSI), Rutgers University, Piscataway, NJ 08854, USA; Department of Chemical and Biochemical Engineering, Rutgers University, Piscataway, NJ 08854, USA; Department of Environmental Sciences, Rutgers University, New Brunswick, NJ 08901, USA; Department of Environmental and Occupational Health, Rutgers School of Public Health, Piscataway, NJ 08854, USA; Corresponding author at: Environmental and Occupational Health Sciences Institute (EOHSI), Rutgers University, Piscataway, NJ 08854, USA.

Similar Items