Fix smooth_data axis loo_pit #1720
base: main
Conversation
All these transposes are making me a bit nervous - it seems possible to modify the …
Also nervous about the transposes :-)
I think the root cause is in arviz/arviz/stats/stats_utils.py, lines 570 to 572 (at 57aeded), where the smoothing should be done along the data dim, but the usage in arviz/arviz/plots/backends/bokeh/bpvplot.py, lines 90 to 94 (at 57aeded), …
The data dimension is at the end, so the current behavior of …
@@ -561,14 +561,14 @@ def _circular_standard_deviation(samples, high=2 * np.pi, low=0, skipna=False, a
     return ((high - low) / 2.0 / np.pi) * np.sqrt(-2 * np.log(r_r))

-def smooth_data(obs_vals, pp_vals):
+def smooth_data(obs_vals, pp_vals, axis=1):
Suggested change:
-def smooth_data(obs_vals, pp_vals, axis=1):
+def smooth_data(obs_vals, pp_vals, axis=-1):
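To illustrate why axis=-1 generalizes better when the data dimension is last, here is a standalone sketch mirroring what smooth_data does with scipy's CubicSpline (the shapes and data are made up for the example):

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Toy integer posterior-predictive draws: (n_samples, n_obs), data dim last.
rng = np.random.default_rng(0)
pp_vals = rng.integers(0, 5, size=(10, 7)).astype(float)

# Smooth along the last axis, matching the axis=-1 suggestion: one cubic
# spline per sample row, evaluated slightly inside [0, 1] to de-discretize.
x = np.linspace(0, 1, pp_vals.shape[-1])
csi = CubicSpline(x, pp_vals, axis=-1)
smoothed = csi(np.linspace(0.01, 0.99, pp_vals.shape[-1]))

# Shape is preserved however many leading sample dimensions there are.
assert smoothed.shape == pp_vals.shape
```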
Codecov Report
@@ Coverage Diff @@
## main #1720 +/- ##
=======================================
Coverage 90.86% 90.86%
=======================================
Files 108 108
Lines 11821 11821
=======================================
Hits 10741 10741
Misses 1080 1080
Continue to review full report at Codecov.
@@ -1635,7 +1635,7 @@ def loo_pit(idata=None, *, y=None, y_hat=None, log_weights=None):
     ufunc_kwargs = {"n_dims": 1}

     if y.dtype.kind == "i" or y_hat.dtype.kind == "i":
-        y, y_hat = smooth_data(y, y_hat)
+        y, y_hat = smooth_data(y, y_hat, axis=0)
This will only work for 1-D observations. I don't think CubicSpline can work without flattening, so if the data is not 2-D at this point (i.e. obs dims plus the __sample__ dim only), we have to either flatten and issue a warning that labels and shape info will be lost, or raise an error that this is unsupported and users should manually flatten the data for discrete loo-pit to work.
I think for now it's best to raise an error and fix this in a follow-up PR.
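A minimal sketch of such a guard (the function name and message are hypothetical, not arviz's actual API):

```python
import numpy as np


def check_discrete_loo_pit_shapes(y, y_hat):
    # Hypothetical guard: smoothing discrete data currently assumes 1-D
    # observations, i.e. y is (n_obs,) and y_hat is (n_samples, n_obs).
    if y.ndim != 1 or y_hat.ndim != 2:
        raise ValueError(
            "loo_pit smoothing of discrete data needs 1-D observations; "
            "please flatten y and y_hat manually "
            f"(got y.ndim={y.ndim}, y_hat.ndim={y_hat.ndim})"
        )


check_discrete_loo_pit_shapes(np.zeros(4), np.zeros((3, 4)))  # passes silently
```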
The use of smooth_data directly, without wrapping it in xarray.apply_ufunc, will probably mean losing labels too (but not size) and will not be Dask compatible. It should be possible to do something like:
- Check the shape; if needed, stack all observed dimensions to effectively flatten them, and store whether stacking was needed in a variable. Otherwise do nothing.
- Call smooth_data wrapped by apply_ufunc.
- If the data was stacked, unstack right before returning the pit values.
Things to take into account that come to mind: check why the first example in the docstring at https://arviz-devs.github.io/arviz/api/generated/arviz.loo_pit.html returns an array; needing both dimensions of the flattened array at once means Dask can't be used unless rechunking. Is it worth considering rechunking within the function? If so, how? cc @MgeeeeK
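The stack / apply_ufunc / unstack steps above could look roughly like this. This is only a sketch: `_smooth_last_axis` is a hypothetical stand-in for arviz's smooth_data, and the `__sample__`/`__obs__` dim names are assumptions:

```python
import numpy as np
import xarray as xr


def _smooth_last_axis(obs_vals, pp_vals):
    # Hypothetical stand-in for smooth_data acting on the last axis;
    # here it just casts to float so shapes pass through unchanged.
    return obs_vals.astype(float), pp_vals.astype(float)


def smooth_with_labels(y, y_hat, sample_dim="__sample__"):
    """Sketch of the proposed stack -> apply_ufunc -> unstack pattern."""
    obs_dims = [d for d in y.dims if d != sample_dim]
    needs_stack = len(obs_dims) > 1
    if needs_stack:
        # Collapse all observed dims into one so smoothing sees 1-D data.
        y = y.stack(__obs__=obs_dims)
        y_hat = y_hat.stack(__obs__=obs_dims)
        core_dim = "__obs__"
    else:
        core_dim = obs_dims[0]
    y_s, y_hat_s = xr.apply_ufunc(
        _smooth_last_axis,
        y,
        y_hat,
        input_core_dims=[[core_dim], [sample_dim, core_dim]],
        output_core_dims=[[core_dim], [sample_dim, core_dim]],
    )
    if needs_stack:
        # Restore the original observed dims and their labels.
        y_s = y_s.unstack("__obs__")
        y_hat_s = y_hat_s.unstack("__obs__")
    return y_s, y_hat_s


y = xr.DataArray(np.arange(6, dtype=float).reshape(2, 3), dims=("g", "t"))
y_hat = xr.DataArray(
    np.arange(24, dtype=float).reshape(4, 2, 3), dims=("__sample__", "g", "t")
)
ys, yhs = smooth_with_labels(y, y_hat)
# dims survive the round trip: ys keeps ("g", "t"), yhs ("__sample__", "g", "t")
```

Routing the call through apply_ufunc is also what would let numpy or dask do the work under the hood, as discussed below.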
I think being a bit more explicit and always reshaping so that the observation is only 1-D might be good, like what we are doing in bpvplot:

arviz/arviz/plots/backends/bokeh/bpvplot.py, lines 90 to 94 (at 57aeded):

    obs_vals = obs_vals.flatten()
    pp_vals = pp_vals.reshape(total_pp_samples, -1)
    if obs_vals.dtype.kind == "i" or pp_vals.dtype.kind == "i":
        obs_vals, pp_vals = smooth_data(obs_vals, pp_vals)
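For reference, the flattening pattern quoted above, run standalone on toy arrays (the shapes are made up, and the smooth_data call itself is omitted):

```python
import numpy as np

# Toy data: observations with shape (3, 4) and posterior-predictive draws
# with shape (total_pp_samples, 3, 4).
obs_vals = np.arange(12).reshape(3, 4)
total_pp_samples = 5
pp_vals = np.arange(60).reshape(total_pp_samples, 3, 4)

# Flatten observations and collapse all obs dims of pp_vals into one,
# as bpvplot does before calling smooth_data.
obs_flat = obs_vals.flatten()
pp_flat = pp_vals.reshape(total_pp_samples, -1)

assert obs_flat.shape == (12,)
assert pp_flat.shape == (5, 12)
```

Note this is exactly where the shape and label information is discarded, which is the drawback discussed next.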
The approach in bpvplot has two main drawbacks, which is what I was trying to point out above. They are still relevant there, but quite negligible; imo here they matter more.
The first one is losing all the shape and label info when doing that. It's not clear if and/or how pointwise or per-group loo-pits can be useful, but it seems like they could be, so I would rather not lose them for good in all cases (float data already works with 4-D observations and keeps shape and labels!). E.g. a hierarchical model with 3 groups, each with multiple observations: the 1st group has positive bias, the 2nd group has a good loo-pit, and the 3rd group has negative bias, so the overall loo-pit looks good but the group-level ones do not.
The second one is that by doing the computation via xarray instead, it can easily be done under the hood by numpy or by dask, so everything will still work when there are so many observations that the posterior predictive array no longer fits in memory.
@aloctavodia what would you like to do next with this PR? If there are any new developments in this discussion, please share.
This PR computed loo-pit using a numpy vectorized version, instead of the wrapped _loo_pit function. I am not necessarily in favor of keeping this version, just pointing out that the current version may be failing for some corner cases. Below: blue is the current version, black the numpy version, gray az.plot_bpv().
For these two there is a large difference (both are logistic regressions).
For the rest, the blue curve is hidden behind the black one (the current and numpy versions agree).