
Fix smooth_data axis loo_pit #1720

Open · wants to merge 2 commits into base: main
Conversation

@aloctavodia (Contributor) commented Jun 8, 2021

This PR computes loo-pit using a numpy vectorized version instead of the wrapped _loo_pit function. I am not necessarily in favor of keeping this version; I am just pointing out that the current version may be failing for some corner cases. Below: blue is the current version, black the numpy version, gray az.plot_bpv().

For these two models there is a large difference (both are logistic regressions):

[figures: logistic, test_loo_pit]

For the rest, the blue curve is hidden behind the black one (the current and numpy versions agree):

[figures: miss_normal, radon, non_centered_eight]
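For context, a vectorized LOO-PIT along these lines can be sketched in plain numpy. This is a hypothetical sketch, not the actual PR code; the function name `loo_pit_numpy` and the assumed array shapes are illustrative. The PIT for each observation is the importance-weighted fraction of posterior predictive draws at or below the observed value.

```python
import numpy as np

def loo_pit_numpy(y, y_hat, log_weights):
    """Hypothetical sketch of a vectorized LOO-PIT.

    y           : (n_obs,) observed values
    y_hat       : (n_samples, n_obs) posterior predictive draws
    log_weights : (n_samples, n_obs) PSIS log weights, assumed normalized
                  so that logsumexp over samples is 0 for each observation
    """
    # Indicator of predictive draws not exceeding the observation
    indicator = y_hat <= y[np.newaxis, :]
    # Weighted fraction of draws below the observed value, per observation
    return np.sum(np.exp(log_weights) * indicator, axis=0)

y = np.array([0.1, 0.5])
y_hat = np.array([[0.0, 0.2], [0.3, 0.6], [0.05, 0.4]])
log_w = np.full((3, 2), np.log(1 / 3))  # uniform weights for illustration
pit = loo_pit_numpy(y, y_hat, log_w)    # → array([2/3, 2/3])
```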

[Resolved review thread on arviz/stats/stats.py (outdated)]
@junpenglao (Contributor):

All these transposes are making me a bit nervous. It seems possible to modify _loo_pit to include mlw; why would that not fix the edge case?

@aloctavodia (Contributor, Author) commented Jun 8, 2021

Also nervous about the transposes :-) _logsumexp, used in _loo_pit, is a stable version of logsumexp that already includes the mlw term.
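For reference, the mlw (max log weight) term refers to the standard max-shift trick that makes logsumexp numerically stable. A generic sketch of the idea (not ArviZ's actual _logsumexp, which has its own signature):

```python
import numpy as np

def stable_logsumexp(log_w, axis=0):
    """Numerically stable log-sum-exp via the max-shift trick.

    Subtracting the maximum log weight (mlw) before exponentiating keeps
    every exp() argument <= 0, avoiding overflow for large log values.
    """
    mlw = np.max(log_w, axis=axis, keepdims=True)
    out = np.log(np.sum(np.exp(log_w - mlw), axis=axis, keepdims=True)) + mlw
    return np.squeeze(out, axis=axis)

log_w = np.array([1000.0, 1000.0])  # naive exp() would overflow here
val = stable_logsumexp(log_w)       # log(2 * e^1000) = 1000 + log(2)
```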

@junpenglao (Contributor):

I think the root cause is in smooth_data:

```python
x = np.linspace(0, 1, pp_vals.shape[1])
csi = CubicSpline(x, pp_vals, axis=1)
pp_vals = csi(np.linspace(0.01, 0.99, pp_vals.shape[1]))
```

where smoothing should be done along the data dimension, but the usage in loo_pit is incorrect.

In bpvplot, we have:

```python
obs_vals = obs_vals.flatten()
pp_vals = pp_vals.reshape(total_pp_samples, -1)
if obs_vals.dtype.kind == "i" or pp_vals.dtype.kind == "i":
    obs_vals, pp_vals = smooth_data(obs_vals, pp_vals)
```

The data dimension is at the end, so the current behavior of smooth_data makes sense there (the smoothing is along axis=1).

In loo_pit, however, the data dimension is at axis=0, which is why your PR works by transposing. I think we should modify smooth_data instead and use the same reshaping as bpvplot.
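An axis-aware smooth_data along the lines discussed might look roughly like this. This is a hypothetical sketch, not the PR's actual implementation; it assumes obs_vals is already 1d (as after flattening) and that the axis argument selects where the data dimension sits in pp_vals:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def smooth_data(obs_vals, pp_vals, axis=-1):
    """Sketch: spline-smooth discrete data along the data dimension.

    obs_vals is assumed 1d; axis selects the data dimension of pp_vals.
    """
    x = np.linspace(0, 1, len(obs_vals))
    obs_vals = CubicSpline(x, obs_vals)(np.linspace(0.01, 0.99, len(obs_vals)))

    x = np.linspace(0, 1, pp_vals.shape[axis])
    pp_vals = CubicSpline(x, pp_vals, axis=axis)(
        np.linspace(0.01, 0.99, pp_vals.shape[axis])
    )
    return obs_vals, pp_vals

obs = np.array([0, 1, 1, 0, 1])
pp = np.random.default_rng(0).integers(0, 2, size=(100, 5))  # data dim last
obs_s, pp_s = smooth_data(obs, pp, axis=-1)     # bpvplot layout
obs_s2, pp_s2 = smooth_data(obs, pp.T, axis=0)  # loo_pit layout
```

With this signature, both call sites pass the position of their data dimension and get the same smoothing, which is the point of the PR's change.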

@aloctavodia aloctavodia changed the title numpy loo-pit computation Fix smooth_data axis loo_pit Jun 8, 2021
```diff
@@ -561,14 +561,14 @@ def _circular_standard_deviation(samples, high=2 * np.pi, low=0, skipna=False, a
     return ((high - low) / 2.0 / np.pi) * np.sqrt(-2 * np.log(r_r))


-def smooth_data(obs_vals, pp_vals):
+def smooth_data(obs_vals, pp_vals, axis=1):
```

Review comment (Contributor) — suggested change:

```diff
-def smooth_data(obs_vals, pp_vals, axis=1):
+def smooth_data(obs_vals, pp_vals, axis=-1):
```
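To see why axis=-1 is a safer default than axis=1: with the data dimension last (bpvplot's layout) the two agree for 2d arrays, but only axis=-1 keeps working if extra leading dimensions are ever present. A quick shape check, calling CubicSpline directly for illustration:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def smooth_last_axis(a):
    # Spline-smooth along the trailing (data) dimension, whatever the rank
    x = np.linspace(0, 1, a.shape[-1])
    return CubicSpline(x, a, axis=-1)(np.linspace(0.01, 0.99, a.shape[-1]))

rng = np.random.default_rng(0)
pp_2d = rng.normal(size=(100, 7))       # (samples, data) as in bpvplot
pp_4d = rng.normal(size=(4, 25, 3, 7))  # e.g. (chain, draw, group, data)

smoothed_2d = smooth_last_axis(pp_2d)   # same result axis=1 would give here
smoothed_4d = smooth_last_axis(pp_4d)   # axis=1 would pick the wrong dim
```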

@codecov bot commented Jun 8, 2021

Codecov Report

Merging #1720 (602da26) into main (075e07f) will not change coverage.
The diff coverage is 80.00%.


@@           Coverage Diff           @@
##             main    #1720   +/-   ##
=======================================
  Coverage   90.86%   90.86%           
=======================================
  Files         108      108           
  Lines       11821    11821           
=======================================
  Hits        10741    10741           
  Misses       1080     1080           
Impacted Files Coverage Δ
arviz/stats/stats.py 96.60% <0.00%> (ø)
arviz/stats/stats_utils.py 96.71% <100.00%> (ø)
arviz/data/io_numpyro.py 93.89% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 075e07f...602da26.

```diff
@@ -1635,7 +1635,7 @@ def loo_pit(idata=None, *, y=None, y_hat=None, log_weights=None):
     ufunc_kwargs = {"n_dims": 1}

     if y.dtype.kind == "i" or y_hat.dtype.kind == "i":
-        y, y_hat = smooth_data(y, y_hat)
+        y, y_hat = smooth_data(y, y_hat, axis=0)
```
Review comment (Member):
This will only work for 1d observations. I don't think CubicSpline can work without flattening, so if the data is not 2d at this point (i.e. obs dims plus the __sample__ dim only) we have to either flatten and issue a warning that labels and shape info will be lost, or raise an error saying this is unsupported and that users should manually flatten the data for discrete loo pit to work.

I think for now it's best to raise an error and fix this in a follow-up PR.

Using smooth_data directly, without wrapping it in xarray.apply_ufunc, will probably mean losing labels too (but not size) and will not be Dask compatible. It should be possible to do something like:

  • Check the shape; if needed, stack all observed dimensions to effectively flatten them, and record whether stacking was needed. Otherwise do nothing.
  • Call smooth_data wrapped by apply_ufunc.
  • If the data was stacked, unstack right before returning the pit values.

Things to take into account that come to mind: check why the first example in the docstring at https://arviz-devs.github.io/arviz/api/generated/arviz.loo_pit.html returns an array; needing both dimensions of the flattened array will mean dask can't be used except by rechunking. Is it worth considering rechunking within the function? If so, how? cc @MgeeeeK
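The stack → apply_ufunc → unstack recipe above can be sketched as follows. The helper names and the assumed sample dims ("chain", "draw") are hypothetical; the real wiring would live in loo_pit:

```python
import numpy as np
import xarray as xr
from scipy.interpolate import CubicSpline

def _smooth_core(vals):
    # Smooth along the last axis (apply_ufunc moves core dims there)
    x = np.linspace(0, 1, vals.shape[-1])
    return CubicSpline(x, vals, axis=-1)(np.linspace(0.01, 0.99, vals.shape[-1]))

def smooth_stacked(y_hat, sample_dims=("chain", "draw")):
    """Stack all observed dims into one, smooth via apply_ufunc, unstack."""
    obs_dims = [d for d in y_hat.dims if d not in sample_dims]
    needs_stack = len(obs_dims) > 1  # record whether stacking was needed
    if needs_stack:
        y_hat = y_hat.stack(__obs__=obs_dims)
        core_dim = "__obs__"
    else:
        core_dim = obs_dims[0]
    out = xr.apply_ufunc(
        _smooth_core,
        y_hat,
        input_core_dims=[[core_dim]],
        output_core_dims=[[core_dim]],
    )
    if needs_stack:
        out = out.unstack("__obs__")  # restore original obs dims and labels
    return out

y_hat = xr.DataArray(
    np.random.default_rng(0).normal(size=(2, 10, 3, 4)),
    dims=("chain", "draw", "group", "obs"),
)
smoothed = smooth_stacked(y_hat)
```

Because the smoothing runs through apply_ufunc on a single core dimension, shape and label information on the non-core dims survives, which is what the flatten-everything approach loses.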

Review comment (Contributor):
I think being a bit more explicit and always reshaping so that the observation dimension is only 1d might be good, like what we are doing in bpvplot:

```python
obs_vals = obs_vals.flatten()
pp_vals = pp_vals.reshape(total_pp_samples, -1)
if obs_vals.dtype.kind == "i" or pp_vals.dtype.kind == "i":
    obs_vals, pp_vals = smooth_data(obs_vals, pp_vals)
```

Review comment (Member):
The approach in bpvplot has two main drawbacks, which is what I was trying to point out above. They are still relevant there, but quite negligible; imo here they matter more.

The first is losing all the shape and label info. It's not clear if and/or how pointwise or group-level loo pits can be useful, but it seems like they could be, so I would rather not lose them for good in all cases (float data already works with 4d observations and keeps shapes and labels!). For example, a hierarchical model with 3 groups, each with multiple observations: the 1st group has positive bias, the 2nd group has good loo pit, and the 3rd group has negative bias, so the overall loo pit looks good but the group-level ones do not.

The second is that by doing the computation via xarray, it can be executed under the hood by either numpy or dask, so everything will still work when there are so many observations that the posterior predictive array no longer fits in memory.
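To illustrate that second point: routing a computation through xarray.apply_ufunc lets the identical call run eagerly on numpy or chunk-wise on dask-backed arrays via dask="parallelized". A generic illustration (not ArviZ code), using a weighted-free mean over the sample dimension:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("sample", "obs"))

# The same call works whether `da` wraps a numpy array (computed eagerly)
# or a dask array (computed lazily, chunk by chunk):
out = xr.apply_ufunc(
    np.mean,
    da,
    input_core_dims=[["sample"]],  # reduce over the sample dimension
    kwargs={"axis": -1},           # core dims are moved to the last axes
    dask="parallelized",           # no-op for numpy-backed arrays
    output_dtypes=[da.dtype],      # needed for the dask path
)
```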

@agustinaarroyuelo (Contributor):

@aloctavodia what would you like to do next with this PR? If there are any new developments in this discussion, please share.

Labels: none yet · Projects: none yet · Linked issues: none · 4 participants