Significance of changes in medium-range forecast scores

The impact of developments in weather forecasting is measured using forecast verification, but many developments, though useful, have impacts of less than 0.5 % on medium-range forecast scores. Chaotic variability in the quality of individual forecasts is so large that it can be hard to achieve statistical significance when comparing these ‘smaller’ developments to a control. For example, with 60 separate forecasts and requiring a 95 % confidence level, a change in quality of the day-5 forecast needs to be larger than 1 % to be statistically significant using a Student's t-test. The first aim of this study is simply to illustrate the importance of significance testing in forecast verification, and to point out the surprisingly large sample sizes that are required to attain significance. The second aim is to see how reliable current approaches to significance testing are, following suspicion that apparently significant results may actually have been generated by chaotic variability. An independent realisation of the null hypothesis can be created using a forecast experiment containing a purely numerical perturbation, and comparing it to a control. With 1885 paired differences from about 2.5 yr of testing, an alternative significance test can be constructed that makes no statistical assumptions about the data. This is used to experimentally test the validity of the normal statistical framework for forecast scores, and it shows that the naive application of Student's t-test does generate too many false positives (i.e. false rejections of the null hypothesis). A known issue is temporal autocorrelation in forecast scores, which can be corrected by an inflation in the size of the confidence range, but typical inflation factors, such as those based on an AR(1) model, are not big enough and they are affected by sampling uncertainty. Further, the importance of statistical multiplicity has not been appreciated, and this becomes particularly dangerous when many experiments are compared together. For example, across three forecast experiments, there could be roughly a 1 in 2 chance of getting a false positive. However, if correctly adjusted for autocorrelation, and when the effects of multiplicity are properly treated using a Šidák correction, the t-test is a reliable way of finding the significance of changes in forecast scores.
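The testing recipe the abstract describes — a paired Student's t-test on score differences, widened for AR(1) temporal autocorrelation and tightened with a Šidák correction when several experiments are compared — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`paired_score_test`, `lag1_autocorr`, `sidak_alpha`) are invented here, and a normal approximation stands in for the exact Student's t critical value, which is adequate for the large samples the abstract says are needed.

```python
import math
import statistics

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a sequence of score differences."""
    n = len(x)
    m = statistics.fmean(x)
    num = sum((x[i] - m) * (x[i + 1] - m) for i in range(n - 1))
    den = sum((xi - m) ** 2 for xi in x)
    return num / den

def sidak_alpha(alpha, n_tests):
    """Per-test significance level under a Sidak correction for multiplicity."""
    return 1 - (1 - alpha) ** (1 / n_tests)

def paired_score_test(control, experiment, alpha=0.05, n_tests=1):
    """Paired test on forecast-score differences, inflated for autocorrelation.

    Returns (t_statistic, critical_value, significant).
    """
    d = [e - c for e, c in zip(experiment, control)]
    n = len(d)
    mean = statistics.fmean(d)
    se = statistics.stdev(d) / math.sqrt(n)

    # AR(1) inflation: with lag-1 autocorrelation r, the effective sample
    # size shrinks to n * (1 - r) / (1 + r), so the standard error grows
    # by sqrt((1 + r) / (1 - r)); negative r is clamped to zero here.
    r = max(0.0, lag1_autocorr(d))
    se *= math.sqrt((1 + r) / (1 - r))

    # Multiplicity: tighten the per-test level when several experiments
    # are compared together, to keep the family-wise error rate at alpha.
    a = sidak_alpha(alpha, n_tests)

    # Two-sided critical value from a normal approximation to Student's t.
    crit = statistics.NormalDist().inv_cdf(1 - a / 2)

    t = mean / se
    return t, crit, abs(t) > crit
```

For example, a clear 0.5-point improvement over 60 paired forecasts comes out significant even at the Šidák-tightened level for three experiments, while a mean-zero alternation of the same size does not.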

Bibliographic Details
Main Author: Alan J. Geer
Format: Article
Language: English
Published: Taylor & Francis Group, 2016-09-01
Series: Tellus: Series A, Dynamic Meteorology and Oceanography
ISSN: 1600-0870
Subjects: statistical significance; forecast verification; Student's t; temporal autocorrelation; paired differences; multiple comparisons
Online Access: http://www.tellusa.net/index.php/tellusa/article/view/30229/48654