Significance of changes in medium-range forecast scores

The impact of developments in weather forecasting is measured using forecast verification, but many developments, though useful, have impacts of less than 0.5 % on medium-range forecast scores. Chaotic variability in the quality of individual forecasts is so large that it can be hard to achieve statistical significance when comparing these ‘smaller’ developments to a control. For example, with 60 separate forecasts and requiring a 95 % confidence level, a change in quality of the day-5 forecast needs to be larger than 1 % to be statistically significant using a Student's t-test. The first aim of this study is simply to illustrate the importance of significance testing in forecast verification, and to point out the surprisingly large sample sizes that are required to attain significance. The second aim is to see how reliable current approaches to significance testing are, following suspicion that apparently significant results may actually have been generated by chaotic variability. An independent realisation of the null hypothesis can be created using a forecast experiment containing a purely numerical perturbation, and comparing it to a control. With 1885 paired differences from about 2.5 yr of testing, an alternative significance test can be constructed that makes no statistical assumptions about the data. This is used to experimentally test the validity of the normal statistical framework for forecast scores, and it shows that the naive application of Student's t-test does generate too many false positives (i.e. false rejections of the null hypothesis). A known issue is temporal autocorrelation in forecast scores, which can be corrected by an inflation in the size of the confidence range, but typical inflation factors, such as those based on an AR(1) model, are not big enough and they are affected by sampling uncertainty. Further, the importance of statistical multiplicity has not been appreciated, and this becomes particularly dangerous when many experiments are compared together. For example, across three forecast experiments, there could be roughly a 1 in 2 chance of getting a false positive. However, if correctly adjusted for autocorrelation, and when the effects of multiplicity are properly treated using a Šidák correction, the t-test is a reliable way of finding the significance of changes in forecast scores.
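The testing recipe the abstract describes — a paired Student's t-test on score differences, widened for AR(1) temporal autocorrelation and tightened with a Šidák correction when several experiments are compared — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`paired_score_test`, `lag1_autocorr`, `sidak_alpha`) are invented here, and a normal approximation stands in for the exact Student's t critical value, which is adequate for the large samples the abstract says are needed.

```python
import math
import statistics

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a sequence of score differences."""
    n = len(x)
    m = statistics.fmean(x)
    num = sum((x[i] - m) * (x[i + 1] - m) for i in range(n - 1))
    den = sum((xi - m) ** 2 for xi in x)
    return num / den

def sidak_alpha(alpha, n_tests):
    """Per-test significance level under a Sidak correction for multiplicity."""
    return 1 - (1 - alpha) ** (1 / n_tests)

def paired_score_test(control, experiment, alpha=0.05, n_tests=1):
    """Paired test on forecast-score differences, inflated for autocorrelation.

    Returns (t_statistic, critical_value, significant).
    """
    d = [e - c for e, c in zip(experiment, control)]
    n = len(d)
    mean = statistics.fmean(d)
    se = statistics.stdev(d) / math.sqrt(n)

    # AR(1) inflation: with lag-1 autocorrelation r, the effective sample
    # size shrinks to n * (1 - r) / (1 + r), so the standard error grows
    # by sqrt((1 + r) / (1 - r)); negative r is clamped to zero here.
    r = max(0.0, lag1_autocorr(d))
    se *= math.sqrt((1 + r) / (1 - r))

    # Multiplicity: tighten the per-test level when several experiments
    # are compared together, to keep the family-wise error rate at alpha.
    a = sidak_alpha(alpha, n_tests)

    # Two-sided critical value from a normal approximation to Student's t.
    crit = statistics.NormalDist().inv_cdf(1 - a / 2)

    t = mean / se
    return t, crit, abs(t) > crit
```

For example, a clear 0.5-point improvement over 60 paired forecasts comes out significant even at the Šidák-tightened level for three experiments, while a mean-zero alternation of the same size does not.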

Bibliographic Details
Main Author: Alan J. Geer
Format: Article
Language: English
Published: Taylor & Francis Group, 2016-09-01
Series: Tellus: Series A, Dynamic Meteorology and Oceanography
ISSN: 1600-0870
Subjects: statistical significance; forecast verification; Student's t; temporal autocorrelation; paired differences; multiple comparisons
Online Access: http://www.tellusa.net/index.php/tellusa/article/view/30229/48654