Checking the verification notebook again, we noticed that you first compute the RPS, then compute the RPSS for each forecast, and only afterwards take the mean over forecast_time.
Shouldn't we first take the mean of the RPS over forecast_time and then compute the RPSS?
```python
# penalize
penalize = obs_p.where(fct_p != 1, other=-10).mean('category')
rpss = rpss.where(penalize != 0, other=-10)
# clip
rpss = rpss.clip(-10, 1)
# average over all forecasts
rpss = rpss.mean('forecast_time')
```
Hi @aaron.spring, sorry for raising this issue so late, but we only noticed it today.
From Weigel et al. 2017 you can see that the RPS has to be averaged over a number of samples before computing the RPSS:
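In our notation (a paraphrase of the standard definition, not a verbatim quote from the paper), the skill score is formed only after averaging the RPS over the available samples:

```latex
\mathrm{RPSS}
  = 1 - \frac{\langle \mathrm{RPS}_{\mathrm{fc}} \rangle}{\langle \mathrm{RPS}_{\mathrm{clim}} \rangle},
\qquad
\langle \mathrm{RPS} \rangle = \frac{1}{N} \sum_{t=1}^{N} \mathrm{RPS}_t
```

where the angle brackets denote the average over the N available forecast dates (here, the 53 weeks of 2020). The notebook currently does the opposite: it forms 1 - RPS_fc(t)/RPS_clim(t) for every single forecast_time t and averages those ratios afterwards.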
We believe that the current RPSS computation makes no sense, and we will try to explain why.
A probabilistic score for a single forecast is useless; or, put differently, "a probabilistic forecast is never wrong". For instance, a high RPS for a forecast issued 01/01/2020 for weeks 3&4 may simply indicate that something the forecast considered unlikely actually happened, which does not necessarily mean the forecast is wrong. Even with good forecasts, if you collect a large number of samples (RPS values), you should expect some large values now and then, because unlikely events do occur occasionally. Taken to the limit (a very large number of samples), this means that the "single-sample RPSS" for a very unlikely event will be close to -infinity, so the average over all the scores will be biased by these few very negative values.
We think this should be fixed in the verification notebook.
To add a bit more on this: when verifying an S2S hindcast that spans many years, one can compute an RPSS for each week of the year by averaging the RPS over the years. This is useful to evaluate how skill evolves through the year (e.g. extratropical forecasts in winter tend to be more skillful). However, in this case we only have the 53 weeks of 2020 for the verification, so a single RPSS value summarizing the performance over those 53 weeks has to be computed. A "single-sample RPSS" value for a single week of a single year is meaningless.
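To illustrate the bias with a toy example (entirely synthetic numbers, not code or data from the notebook):

```python
import numpy as np

rng = np.random.default_rng(42)

n = 53                               # one RPS value per 2020 week
rps_clim = rng.uniform(0.2, 0.6, n)  # hypothetical climatology RPS values
rps_fc = 0.8 * rps_clim              # forecast slightly better on average ...
rps_fc[:3] = 1.0                     # ... except a few "surprised" forecasts
rps_clim[:3] = 0.02                  # ... on dates where climatology scored a tiny RPS

# current notebook approach: one skill score per forecast, then average
rpss_per_forecast = 1 - rps_fc / rps_clim
print(rpss_per_forecast.mean())                   # strongly negative (about -2.6)
print(np.clip(rpss_per_forecast, -10, 1).mean())  # still negative with clip(-10, 1) (about -0.4)

# proposed approach: average the RPS first, then a single skill score
print(1 - rps_fc.mean() / rps_clim.mean())        # slightly positive
```

The three outlier weeks dominate the average of the per-forecast skill scores, while the score computed from the averaged RPS reflects the forecast's overall (slight) advantage.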
Dear @sergi.bechsala,
thank you for reporting this.
Your explanations sound reasonable and you are referring to well-cited references.
Have you checked how much of a difference it makes (in terms of the numerical score) whether the mean is taken before or after applying the RPSS formula? EDIT: I did below.
If we change this, we need to decide how to penalise missing values. So far we have penalised them with -10 in rpss; for rps we need to find a positive penalty value.
Ah, but the RPS can be missing for certain 2020 weeks in the submissions, right? I see. I guess the idea is to penalize NAs? Maybe 5*<RPS_clim> would work?
Yes, the maximum RPS one can get for a single forecast occurs when you forecast 100% probability of above normal and below normal is observed, or vice versa, which produces an RPS of 2:
```python
import xarray as xr
import xskillscore as xs

# cumulative tercile probabilities; time 0: below normal observed, time 1: above normal observed
a = xr.DataArray([[1, 1, 1], [0, 0, 1]], dims=["time", "category_edge"])
# the forecasts put 100% probability in the opposite tercile at both times
b = xr.DataArray([[[0, 0, 1], [0, 0, 1]], [[1, 1, 1], [1, 1, 1]]],
                 dims=["time", "member", "category_edge"])
xs.rps(observations=a, forecasts=b, category_edges=None, dim=[])
```

```
Out[51]:
<xarray.DataArray (time: 2)>
array([2., 2.])
Dimensions without coordinates: time
```
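For completeness, the 2 follows directly from the RPS definition over the cumulative category probabilities (F_k for the forecast, O_k for the observation, with k running over the below/near/above-normal terciles); for a 100% above-normal forecast verified against a below-normal observation:

```latex
\mathrm{RPS}
  = \sum_{k=1}^{3} \left( F_k - O_k \right)^2
  = (0 - 1)^2 + (0 - 1)^2 + (1 - 1)^2
  = 2
```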
I agree with the proposal above: compute the RPSS from the mean of the RPS instead of averaging RPSS values, and penalize NaN values with an RPS of 2 (100% forecast in the wrong category).
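For what it's worth, a minimal sketch of what this change could look like, assuming rps_fc and rps_clim are the per-forecast RPS DataArrays with a forecast_time dimension (the variable names and the fillna step are our own illustration, not the notebook's final code):

```python
# penalize missing submissions with the worst possible RPS (2)
# instead of penalizing the skill score itself
rps_fc = rps_fc.fillna(2.0)

# average the RPS over all 2020 forecasts first ...
rps_fc_mean = rps_fc.mean('forecast_time')
rps_clim_mean = rps_clim.mean('forecast_time')

# ... and only then form a single skill score
rpss = 1 - rps_fc_mean / rps_clim_mean
```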