Extend scope of alignment="same_verifs" #699

Open
dougiesquire opened this issue Dec 3, 2021 · 6 comments

@dougiesquire
Collaborator

The "same_verifs" alignment generates a list of times from verif that are present in forecast at any init but all leads. This list will always be empty when the init frequency is lower than the lead frequency. Is there scope to extend "same_verifs" to instead deal appropriately with such cases? I'll try to give a concrete example of what I mean below.

Consider the following hindcasts:

import cftime
import climpred
import numpy as np
import xarray as xr

# Hindcasts initialised every year with monthly lead
init = xr.cftime_range(start="2000-01-01", end="2002-01-01", freq="AS")
lead = range(0, 24)
data = np.random.random((len(init), len(lead)))
hind = xr.DataArray(data, coords=[init, lead], dims=["init", "lead"], name="var")
hind["lead"].attrs["units"] = "months"
hind = climpred.utils.add_time_from_init_lead(hind)

I currently can't use "same_verifs" with this data because there are no common times available at all leads.
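
A minimal sketch of why the list is empty, using only the standard library rather than climpred internals. It assumes the rule described above (a date must be reachable at every lead) and ignores observations entirely; `add_months` and `valid_by_lead` are hypothetical helper names introduced for illustration.

```python
from datetime import date


def add_months(d, n):
    """Shift a first-of-month date by n months (hypothetical helper)."""
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, 1)


# Annual inits with monthly leads, mirroring the hindcasts above
inits = [date(2000, 1, 1), date(2001, 1, 1), date(2002, 1, 1)]
leads = range(24)

# Valid times that can be reached at each lead
valid_by_lead = {lead: {add_months(i, lead) for i in inits} for lead in leads}

# Dates reachable at every lead: lead 0 only ever lands on January,
# lead 1 only on February, and so on, so nothing is shared by all leads
common = set.intersection(*valid_by_lead.values())
print(sorted(common))

# Leads that differ by a multiple of 12 months do share dates
print(sorted(valid_by_lead[0] & valid_by_lead[12]))
```

The second print shows the dates shared by leads 0 and 12, which is the partial overlap the rest of this issue tries to exploit.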

But, users may still want to align based on a common verification period. I.e., in this example, "valid_time"s [2001-01-01 and 2002-01-01] are available at all possible leads for which they can occur (leads 0 and 12 months). Similarly,

  • [2001-02-01 and 2002-02-01] are available at leads 1 and 13 months,
  • [2001-03-01 and 2002-03-01] are available at leads 2 and 14 months,
    ...
  • [2001-12-01 and 2002-12-01] are available at leads 11 and 23 months.

That is, by performing verification over the period 2001-01-01 - 2002-12-01 one includes:

  • the same dates at each lead where possible, given the init/lead frequencies
  • the same number of samples at each lead

period = [cftime.DatetimeGregorian(2001, 1, 1), cftime.DatetimeGregorian(2002, 12, 1)]

hind.where(
    np.logical_and(hind["valid_time"] >= period[0], hind["valid_time"] <= period[1])
).plot()

[Screenshot: plot of hind masked to valid_time within the verification period]
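
The "same number of samples at each lead" claim can be checked with a small standard-library sketch. It does not use climpred; `add_months` and `counts` are hypothetical names, and the period bounds are the ones from the snippet above.

```python
from datetime import date


def add_months(d, n):
    """Shift a first-of-month date by n months (hypothetical helper)."""
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, 1)


inits = [date(2000, 1, 1), date(2001, 1, 1), date(2002, 1, 1)]
leads = range(24)
start, end = date(2001, 1, 1), date(2002, 12, 1)

# Number of inits whose valid time falls inside the period, per lead
counts = {
    lead: sum(start <= add_months(i, lead) <= end for i in inits)
    for lead in leads
}
print(set(counts.values()))  # {2}: every lead keeps two samples
```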

How do folks feel about trying to restructure climpred's _same_verifs_alignment() so that it uses the alignment dates shown in the example above? We would obviously do this such that the current behaviour is preserved for datasets that have common verification times across all leads.

@aaronspring
Collaborator

aaronspring commented Dec 4, 2021

#702 will help to visualize the discussion

@aaronspring
Collaborator

Thank you for this extension proposal issue @dougiesquire

In #702, I played around with your use case and indeed same_verifs doesn't work here:

init = xr.cftime_range(start="2000-01-01", end="2002-01-01", freq="AS")
lead = range(0, 24)
data = np.random.random((len(init), len(lead)))
hind = xr.DataArray(data, coords=[init, lead], dims=["init", "lead"], name="var")
hind["lead"].attrs["units"] = "months"

time = xr.cftime_range(
    start="2000-01-01", periods=len(init) * 12 + len(lead), freq="MS"
)
data = np.random.random(len(time))
obs = xr.DataArray(data, coords=dict(time=time), dims="time", name="var")

h = climpred.HindcastEnsemble(hind).add_observations(obs)
h.coords["valid_time"]

h.plot()

h.plot_alignment()

[Image: output of h.plot_alignment()]

Some comments:

> common verification period
> We would obviously do this such that the current behaviour is preserved for datasets that have common verification times across all leads.

What about a new alignment method, called same_period or some better name? This way same_verifs can stay as is. As long as it is clearly documented and distinguishable, everything works for me.

> the same number of samples at each lead

That's what same_inits and same_verifs enforce and what maximize ignores.


I still don't quite understand what this new alignment would look like. Would it essentially take 12 out of 24 lead months and slide from earlier leads at late inits to later leads at earlier inits? (12 depends on some other specifics, I guess, or is that because of the monthly frequency within a year?)
In plot_alignment, the new approach would result in white space (= no verification dates) in the lower-left corner (small inits, small leads) and the upper-right corner (large inits, large leads).
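
A rough text preview of those blank corners, as a sketch that assumes the verification period 2001-01-01 to 2002-12-01 from the first comment. It does not call plot_alignment; `add_months` and `mask` are hypothetical names, rows are inits and columns are leads 0 to 23.

```python
from datetime import date


def add_months(d, n):
    """Shift a first-of-month date by n months (hypothetical helper)."""
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, 1)


inits = [date(2000, 1, 1), date(2001, 1, 1), date(2002, 1, 1)]
leads = range(24)
start, end = date(2001, 1, 1), date(2002, 12, 1)

# True where the init/lead pair verifies inside the period
mask = [[start <= add_months(i, lead) <= end for lead in leads] for i in inits]

for i, row in zip(inits, mask):
    print(i.year, "".join("#" if keep else "." for keep in row))
# 2000 ............############
# 2001 ########################
# 2002 ############............
```

In this sketch the earliest init loses its small leads and the latest init loses its large leads, while the middle init keeps all 24 leads, so no lead is dropped entirely.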

@bradyrx thoughts (on a new alignment)?

@aaronspring
Collaborator

So this alignment would be the first where the number of leads gets reduced.
I am still unsure what this approach does to the interpretation of lead in the results.

@dougiesquire
Collaborator Author

dougiesquire commented Dec 6, 2021

Sorry, I think my description is unclear. And I'm not sure I've fully thought through my suggestion. I don't mean to suggest that the number of leads should be reduced.

I'm proposing an alignment that finds the maximum period that:

  • maintains equal numbers of samples at each lead
  • includes the same verification dates at each lead where possible

All valid_times that fall within this period would then be used. For hindcasts that have the same verification dates at every lead (e.g. where the lead is annual) this would be equivalent to "same_verifs". However, in cases like the one above (where the init frequency is lower than the lead frequency) a different set of verification dates may be used at one lead relative to another lead.

Consider the following examples with four hindcasts each:

  1. init freq: 3 month, lead freq: 3 month:

    lead 0   lead 1   lead 2
    2001-10  2002-01  2002-04
    2001-07  2001-10  2002-01
    2001-04  2001-07  2001-10
    2001-01  2001-04  2001-07

    Here the period that satisfies the above conditions is 2001-07 -> 2001-10. Keeping everything in this range is equivalent to what "same_verifs" currently does.

  2. init freq: 3 month, lead freq: 1 month:

    lead 0   lead 1   lead 2   lead 3
    2001-10  2001-11  2001-12  2002-01
    2001-07  2001-08  2001-09  2001-10
    2001-04  2001-05  2001-06  2001-07
    2001-01  2001-02  2001-03  2001-04

    Here the period that satisfies the above conditions is 2001-04 -> 2001-10. This case would currently fail with "same_verifs" because the combination of init and lead frequencies means that we can never get the same verification dates at all leads.

Does this make sense?

@aaronspring
Collaborator

aaronspring commented Dec 6, 2021

Thanks @dougiesquire. Now I get your approach. So valid_times do not need to match across leads but must lie between an upper and a lower bound.
It reminds me a bit of sel(method='nearest') but with an upper and a lower bound.

Note: For your example to work you definitely need a monthly observation.

So for your second example, the struck-through dates (those outside 2001-04 -> 2001-10) do not verify:

init freq: 3 month, lead freq: 1 month:

lead 0       lead 1       lead 2       lead 3
2001-10      ~~2001-11~~  ~~2001-12~~  ~~2002-01~~
2001-07      2001-08      2001-09      2001-10
2001-04      2001-05      2001-06      2001-07
~~2001-01~~  ~~2001-02~~  ~~2001-03~~  2001-04

The number of samples isn't equal, but it won't differ by more than +/- 1, IMO. Taking 2001-03 - 2001-11 gives three samples at each lead.
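
The two sample counts can be reproduced with a short standard-library sketch. It is independent of climpred; `add_months` and `samples_per_lead` are hypothetical helpers, and the inits and leads are those of the second example (quarterly inits, monthly leads 0 to 3).

```python
from datetime import date


def add_months(d, n):
    """Shift a first-of-month date by n months (hypothetical helper)."""
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, 1)


inits = [date(2001, m, 1) for m in (1, 4, 7, 10)]
leads = range(4)


def samples_per_lead(start, end):
    """Count inits whose valid time lies in [start, end], for each lead."""
    return [
        sum(start <= add_months(i, lead) <= end for i in inits)
        for lead in leads
    ]


print(samples_per_lead(date(2001, 4, 1), date(2001, 10, 1)))  # [3, 2, 2, 3]
print(samples_per_lead(date(2001, 3, 1), date(2001, 11, 1)))  # [3, 3, 3, 3]
```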

I'd still prefer to make a new alignment keyword. Maybe same_verifs_nearest or same_verifs_fill?


Would you lead a PR? Entrypoint is

def _same_verifs_alignment(init_lead_matrix, valid_inits, all_verifs, leads, n, freq):

I am happy to give feedback and test.

@dougiesquire
Collaborator Author

> Note: For your example to work you definitely need a monthly observation.

Yes exactly - sorry should've made that clearer

> The number of samples isn't equal but won't differ by more than +/- 1 IMO.

Good point. I messed that up, sorry. Now I realise there isn't a single solution to the constraints I've posed.

I think there'd be value in an alignment something like what I'm suggesting. But it seems like I still need to work out the best approach for climpred in my head. In the past for my own work I've just specified a period over which to verify and kept all dates within that period. I chose this period judiciously to make sure that there are equal numbers of samples at each lead.
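
One way to look for such a period is a brute-force search, sketched below for the second example (quarterly inits, monthly leads 0 to 3). This is only one possible reading of the two constraints, not an agreed design: "same dates where possible" is interpreted here as "any two leads that can share a valid time must select identical dates". All names (`add_months`, `selected`, `satisfies`) are hypothetical and nothing here is climpred API.

```python
from datetime import date
from itertools import combinations


def add_months(d, n):
    """Shift a first-of-month date by n months (hypothetical helper)."""
    total = d.year * 12 + (d.month - 1) + n
    return date(total // 12, total % 12 + 1, 1)


inits = [date(2001, m, 1) for m in (1, 4, 7, 10)]
leads = range(4)
valid = {lead: {add_months(i, lead) for i in inits} for lead in leads}
months = sorted(set.union(*valid.values()))


def selected(start, end):
    """Valid times inside [start, end], per lead."""
    return {lead: {t for t in valid[lead] if start <= t <= end} for lead in leads}


def satisfies(sel):
    """Equal sample counts, and identical dates for leads that can overlap."""
    equal_counts = len({len(s) for s in sel.values()}) == 1
    same_dates = all(
        sel[a] == sel[b]
        for a, b in combinations(leads, 2)
        if valid[a] & valid[b]  # only leads that share some valid time
    )
    return equal_counts and same_dates


candidates = []
for start, end in combinations(months, 2):  # months is sorted, so start < end
    sel = selected(start, end)
    if len(sel[0]) > 0 and satisfies(sel):
        candidates.append((len(sel[0]), start, end))

best = max(n for n, _, _ in candidates)
solutions = [(s, e) for n, s, e in candidates if n == best]
for s, e in solutions:
    print(best, s.isoformat(), e.isoformat())
```

Under this reading the search returns three equally good nine-month periods (starting 2001-02, 2001-03 and 2001-04) with three samples per lead, which illustrates why the constraints alone do not pick a single solution and some tie-break rule would be needed.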

Happy to open a PR where I can flesh this out a little better. But it might take me a little while to get to it sorry.

@aaronspring changed the title from 'Extend scope of "same_verifs" alignment' to 'Extend scope of alignment="same_verifs"' on Aug 23, 2022
2 participants