Comparison of Machine Learning and Land Use Regression for fine scale spatiotemporal estimation of ambient air pollution: Modeling ozone concentrations across the contiguous United States

Background: Spatial linear Land-Use Regression (LUR) is commonly used for long-term modeling of air pollution in support of exposure and epidemiological assessments. Machine Learning (ML) methods in conjunction with spatiotemporal modeling can provide more flexible exposure-relevant metrics and have...


Bibliographic Details
Main Authors: Xiang Ren, Zhongyuan Mi, Panos G. Georgopoulos
Format: Article
Language: English
Published: Elsevier 2020-09-01
Series: Environment International
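Subjects: Machine learning; Land use regression; Ozone; Spatiotemporal modeling; Black-box model interpretation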
Online Access: http://www.sciencedirect.com/science/article/pii/S0160412020317827
id doaj-6453560655fc486aa9eb81dd61ce754c
record_format Article
collection DOAJ
language English
format Article
sources DOAJ
author Xiang Ren
Zhongyuan Mi
Panos G. Georgopoulos
title Comparison of Machine Learning and Land Use Regression for fine scale spatiotemporal estimation of ambient air pollution: Modeling ozone concentrations across the contiguous United States
publisher Elsevier
series Environment International
issn 0160-4120
publishDate 2020-09-01
description Background: Spatial linear Land-Use Regression (LUR) is commonly used for long-term modeling of air pollution in support of exposure and epidemiological assessments. Machine Learning (ML) methods in conjunction with spatiotemporal modeling can provide more flexible exposure-relevant metrics and have been studied using different model structures. There is, however, a lack of comparisons of the methods available within these two modeling frameworks that could guide model/algorithm selection in air quality epidemiology.
Objective: The present study compares thirteen algorithms for spatial/spatiotemporal modeling applied to daily maxima of 8-hour running averages of ambient ozone concentrations, at spatial resolutions corresponding to census tracts, to support estimation of annual ozone design values across the contiguous US. These algorithms were selected from nine representative categories and trained using predictors that included chemistry-transport model predictions, meteorological factors, land use and land cover, and stationary and mobile emissions.
Methods: To obtain the best predictive performance, model structures were optimized through a repeated coarse/fine grid search guided by expert knowledge. Six target-oriented validation strategies were used to prevent overfitting and avoid over-optimistic model evaluation results. To take full advantage of the power of the different algorithms, we introduced tuning of sample weights in spatiotemporal modeling to ensure predictive accuracy for peak concentrations, which is crucial for exposure assessments. In spatial modeling, four interpretation and visualization tools were introduced to explain predictions from different algorithms.
Results: Nonlinear ML methods achieved higher prediction accuracy than linear LUR, and the improvements were more significant for spatiotemporal modeling (a nearly 10%-40% decrease in predicted RMSE). With tuned sample weights, spatiotemporal models can predict the concentrations used to calculate ozone design values with accuracy comparable to or even better than that of spatial models (a nearly 30% decrease in cross-validated RMSE). We visualized the underlying nonlinear relationships, heterogeneous associations and complex interactions from the two best-performing ML algorithms, i.e., Random Forest and Extreme Gradient Boosting, and found that the complex patterns were relatively less significant with respect to model accuracy for spatial modeling.
Conclusion: Machine Learning can provide estimates that are actually more interpretable and practical than linear regression to improve accuracy in modeling human exposures. Careful design of hyperparameter tuning and of flexible data splitting and validation is crucial to obtaining reliable and stable results. Desirable/successful nonlinear models are expected to capture similar nonlinear patterns and interactions using different ML algorithms.
topic Machine learning
Land use regression
Ozone
Spatiotemporal modeling
Black-box model interpretation
url http://www.sciencedirect.com/science/article/pii/S0160412020317827
spelling doaj-6453560655fc486aa9eb81dd61ce754c
2020-11-25T02:52:19Z
eng
Elsevier
Environment International
0160-4120
2020-09-01
Volume 142, article 105827
Comparison of Machine Learning and Land Use Regression for fine scale spatiotemporal estimation of ambient air pollution: Modeling ozone concentrations across the contiguous United States
Xiang Ren: Environmental and Occupational Health Sciences Institute (EOHSI), Rutgers University, Piscataway, NJ 08854, USA; Department of Chemical and Biochemical Engineering, Rutgers University, Piscataway, NJ 08854, USA
Zhongyuan Mi: Environmental and Occupational Health Sciences Institute (EOHSI), Rutgers University, Piscataway, NJ 08854, USA; Department of Environmental Sciences, Rutgers University, New Brunswick, NJ 08901, USA
Panos G. Georgopoulos (corresponding author): Environmental and Occupational Health Sciences Institute (EOHSI), Rutgers University, Piscataway, NJ 08854, USA; Department of Chemical and Biochemical Engineering, Rutgers University, Piscataway, NJ 08854, USA; Department of Environmental Sciences, Rutgers University, New Brunswick, NJ 08901, USA; Department of Environmental and Occupational Health, Rutgers School of Public Health, Piscataway, NJ 08854, USA
http://www.sciencedirect.com/science/article/pii/S0160412020317827
Machine learning; Land use regression; Ozone; Spatiotemporal modeling; Black-box model interpretation