
Fix smooth_data axis loo_pit #1720

Open · wants to merge 2 commits into base: main
Conversation

@aloctavodia (Contributor) commented Jun 8, 2021

This PR computes loo-pit using a numpy vectorized version instead of the wrapped _loo_pit function. I am not necessarily in favor of keeping this version; I am just pointing out that the current version may be failing for some corner cases. Below: blue is the current version, black the numpy version, gray az.plot_bpv().

For these two models there is a large difference (both are logistic regressions):

[figures: logistic, test_loo_pit]

For the rest, the blue curve is hidden behind the black one (the current and numpy versions agree):

[figures: miss_normal, radon, non_centered_eight]
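For context, a vectorized LOO-PIT along these lines can be sketched in plain numpy. This is a hypothetical sketch, not the actual PR code; the function name `loo_pit_numpy` and the assumed array shapes are illustrative. The PIT for each observation is the importance-weighted fraction of posterior predictive draws at or below the observed value.

```python
import numpy as np

def loo_pit_numpy(y, y_hat, log_weights):
    """Hypothetical sketch of a vectorized LOO-PIT.

    y           : (n_obs,) observed values
    y_hat       : (n_samples, n_obs) posterior predictive draws
    log_weights : (n_samples, n_obs) PSIS log weights, assumed normalized
                  so that logsumexp over samples is 0 for each observation
    """
    # Indicator of predictive draws not exceeding the observation
    indicator = y_hat <= y[np.newaxis, :]
    # Weighted fraction of draws below the observed value, per observation
    return np.sum(np.exp(log_weights) * indicator, axis=0)

y = np.array([0.1, 0.5])
y_hat = np.array([[0.0, 0.2], [0.3, 0.6], [0.05, 0.4]])
log_w = np.full((3, 2), np.log(1 / 3))  # uniform weights for illustration
pit = loo_pit_numpy(y, y_hat, log_w)    # → array([2/3, 2/3])
```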

[Resolved review thread on arviz/stats/stats.py (outdated)]
@junpenglao (Contributor):

All these transposes are making me a bit nervous. It seems possible to modify _loo_pit to include mlw; why would that not fix the edge case?

@aloctavodia (Contributor, Author) commented Jun 8, 2021

Also nervous about the transposes :-) _logsumexp, used in _loo_pit, is a stable version of logsumexp that already includes the mlw term.
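For reference, the mlw (max log weight) term refers to the standard max-shift trick that makes logsumexp numerically stable. A generic sketch of the idea (not ArviZ's actual _logsumexp, which has its own signature):

```python
import numpy as np

def stable_logsumexp(log_w, axis=0):
    """Numerically stable log-sum-exp via the max-shift trick.

    Subtracting the maximum log weight (mlw) before exponentiating keeps
    every exp() argument <= 0, avoiding overflow for large log values.
    """
    mlw = np.max(log_w, axis=axis, keepdims=True)
    out = np.log(np.sum(np.exp(log_w - mlw), axis=axis, keepdims=True)) + mlw
    return np.squeeze(out, axis=axis)

log_w = np.array([1000.0, 1000.0])  # naive exp() would overflow here
val = stable_logsumexp(log_w)       # log(2 * e^1000) = 1000 + log(2)
```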

@junpenglao (Contributor):

I think the root cause is in smooth_data:

```python
x = np.linspace(0, 1, pp_vals.shape[1])
csi = CubicSpline(x, pp_vals, axis=1)
pp_vals = csi(np.linspace(0.01, 0.99, pp_vals.shape[1]))
```

where smoothing should be done along the data dimension, but the usage in loo_pit is incorrect.

In bpvplot, we have:

```python
obs_vals = obs_vals.flatten()
pp_vals = pp_vals.reshape(total_pp_samples, -1)
if obs_vals.dtype.kind == "i" or pp_vals.dtype.kind == "i":
    obs_vals, pp_vals = smooth_data(obs_vals, pp_vals)
```

The data dimension is at the end, so the current behavior of smooth_data makes sense there (the smoothing is along axis=1).

In loo_pit, however, the data dimension is at axis=0, which is why your PR works by transposing. I think we should modify smooth_data instead and use the same reshaping as bpvplot.
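An axis-aware smooth_data along the lines discussed might look roughly like this. This is a hypothetical sketch, not the PR's actual implementation; it assumes obs_vals is already 1d (as after flattening) and that the axis argument selects where the data dimension sits in pp_vals:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def smooth_data(obs_vals, pp_vals, axis=-1):
    """Sketch: spline-smooth discrete data along the data dimension.

    obs_vals is assumed 1d; axis selects the data dimension of pp_vals.
    """
    x = np.linspace(0, 1, len(obs_vals))
    obs_vals = CubicSpline(x, obs_vals)(np.linspace(0.01, 0.99, len(obs_vals)))

    x = np.linspace(0, 1, pp_vals.shape[axis])
    pp_vals = CubicSpline(x, pp_vals, axis=axis)(
        np.linspace(0.01, 0.99, pp_vals.shape[axis])
    )
    return obs_vals, pp_vals

obs = np.array([0, 1, 1, 0, 1])
pp = np.random.default_rng(0).integers(0, 2, size=(100, 5))  # data dim last
obs_s, pp_s = smooth_data(obs, pp, axis=-1)     # bpvplot layout
obs_s2, pp_s2 = smooth_data(obs, pp.T, axis=0)  # loo_pit layout
```

With this signature, both call sites pass the position of their data dimension and get the same smoothing, which is the point of the PR's change.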

@aloctavodia aloctavodia changed the title numpy loo-pit computation Fix smooth_data axis loo_pit Jun 8, 2021
```diff
@@ -561,14 +561,14 @@ def _circular_standard_deviation(samples, high=2 * np.pi, low=0, skipna=False, a
     return ((high - low) / 2.0 / np.pi) * np.sqrt(-2 * np.log(r_r))


-def smooth_data(obs_vals, pp_vals):
+def smooth_data(obs_vals, pp_vals, axis=1):
```

Review comment (Contributor) — suggested change:

```diff
-def smooth_data(obs_vals, pp_vals, axis=1):
+def smooth_data(obs_vals, pp_vals, axis=-1):
```
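To see why axis=-1 is a safer default than axis=1: with the data dimension last (bpvplot's layout) the two agree for 2d arrays, but only axis=-1 keeps working if extra leading dimensions are ever present. A quick shape check, calling CubicSpline directly for illustration:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def smooth_last_axis(a):
    # Spline-smooth along the trailing (data) dimension, whatever the rank
    x = np.linspace(0, 1, a.shape[-1])
    return CubicSpline(x, a, axis=-1)(np.linspace(0.01, 0.99, a.shape[-1]))

rng = np.random.default_rng(0)
pp_2d = rng.normal(size=(100, 7))       # (samples, data) as in bpvplot
pp_4d = rng.normal(size=(4, 25, 3, 7))  # e.g. (chain, draw, group, data)

smoothed_2d = smooth_last_axis(pp_2d)   # same result axis=1 would give here
smoothed_4d = smooth_last_axis(pp_4d)   # axis=1 would pick the wrong dim
```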

@codecov bot commented Jun 8, 2021

Codecov Report

Merging #1720 (602da26) into main (075e07f) will not change coverage.
The diff coverage is 80.00%.


@@           Coverage Diff           @@
##             main    #1720   +/-   ##
=======================================
  Coverage   90.86%   90.86%           
=======================================
  Files         108      108           
  Lines       11821    11821           
=======================================
  Hits        10741    10741           
  Misses       1080     1080           
Impacted Files Coverage Δ
arviz/stats/stats.py 96.60% <0.00%> (ø)
arviz/stats/stats_utils.py 96.71% <100.00%> (ø)
arviz/data/io_numpyro.py 93.89% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 075e07f...602da26.

```diff
@@ -1635,7 +1635,7 @@ def loo_pit(idata=None, *, y=None, y_hat=None, log_weights=None):
     ufunc_kwargs = {"n_dims": 1}

     if y.dtype.kind == "i" or y_hat.dtype.kind == "i":
-        y, y_hat = smooth_data(y, y_hat)
+        y, y_hat = smooth_data(y, y_hat, axis=0)
```
Review comment (Member):
This will only work for 1d observations. I don't think CubicSpline can work without flattening, so if the data is not 2d at this point (i.e. obs dims plus the __sample__ dim only) we have to either flatten and issue a warning that labels and shape info will be lost, or raise an error saying this is unsupported and that users should manually flatten the data for discrete loo pit to work.

I think for now it's best to raise an error and fix this in a follow-up PR.

Using smooth_data directly, without wrapping it in xarray.apply_ufunc, will probably mean losing labels too (but not size) and will not be Dask compatible. It should be possible to do something like:

  • Check the shape; if needed, stack all observed dimensions to effectively flatten them, and record whether stacking was needed. Otherwise do nothing.
  • Call smooth_data wrapped by apply_ufunc.
  • If the data was stacked, unstack right before returning the pit values.

Things to take into account that come to mind: check why the first example in the docstring at https://arviz-devs.github.io/arviz/api/generated/arviz.loo_pit.html returns an array; needing both dimensions of the flattened array will mean dask can't be used except by rechunking. Is it worth considering rechunking within the function? If so, how? cc @MgeeeeK
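The stack → apply_ufunc → unstack recipe above can be sketched as follows. The helper names and the assumed sample dims ("chain", "draw") are hypothetical; the real wiring would live in loo_pit:

```python
import numpy as np
import xarray as xr
from scipy.interpolate import CubicSpline

def _smooth_core(vals):
    # Smooth along the last axis (apply_ufunc moves core dims there)
    x = np.linspace(0, 1, vals.shape[-1])
    return CubicSpline(x, vals, axis=-1)(np.linspace(0.01, 0.99, vals.shape[-1]))

def smooth_stacked(y_hat, sample_dims=("chain", "draw")):
    """Stack all observed dims into one, smooth via apply_ufunc, unstack."""
    obs_dims = [d for d in y_hat.dims if d not in sample_dims]
    needs_stack = len(obs_dims) > 1  # record whether stacking was needed
    if needs_stack:
        y_hat = y_hat.stack(__obs__=obs_dims)
        core_dim = "__obs__"
    else:
        core_dim = obs_dims[0]
    out = xr.apply_ufunc(
        _smooth_core,
        y_hat,
        input_core_dims=[[core_dim]],
        output_core_dims=[[core_dim]],
    )
    if needs_stack:
        out = out.unstack("__obs__")  # restore original obs dims and labels
    return out

y_hat = xr.DataArray(
    np.random.default_rng(0).normal(size=(2, 10, 3, 4)),
    dims=("chain", "draw", "group", "obs"),
)
smoothed = smooth_stacked(y_hat)
```

Because the smoothing runs through apply_ufunc on a single core dimension, shape and label information on the non-core dims survives, which is what the flatten-everything approach loses.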

Review comment (Contributor):
I think being a bit more explicit and always reshaping so that the observation dimension is only 1d might be good, like what we are doing in bpvplot:

```python
obs_vals = obs_vals.flatten()
pp_vals = pp_vals.reshape(total_pp_samples, -1)
if obs_vals.dtype.kind == "i" or pp_vals.dtype.kind == "i":
    obs_vals, pp_vals = smooth_data(obs_vals, pp_vals)
```

Review comment (Member):
The approach in bpvplot has two main drawbacks, which is what I was trying to point out above. They are still relevant there, but quite negligible; imo here they matter more.

The first is losing all the shape and label info. It's not clear if and/or how pointwise or group-level loo pits can be useful, but it seems like they could be, so I would rather not lose them for good in all cases (float data already works with 4d observations and keeps shapes and labels!). For example, a hierarchical model with 3 groups, each with multiple observations: the 1st group has positive bias, the 2nd group has good loo pit, and the 3rd group has negative bias, so the overall loo pit looks good but the group-level ones do not.

The second is that by doing the computation via xarray, it can be executed under the hood by either numpy or dask, so everything will still work when there are so many observations that the posterior predictive array no longer fits in memory.
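To illustrate that second point: routing a computation through xarray.apply_ufunc lets the identical call run eagerly on numpy or chunk-wise on dask-backed arrays via dask="parallelized". A generic illustration (not ArviZ code), using a weighted-free mean over the sample dimension:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(6.0).reshape(2, 3), dims=("sample", "obs"))

# The same call works whether `da` wraps a numpy array (computed eagerly)
# or a dask array (computed lazily, chunk by chunk):
out = xr.apply_ufunc(
    np.mean,
    da,
    input_core_dims=[["sample"]],  # reduce over the sample dimension
    kwargs={"axis": -1},           # core dims are moved to the last axes
    dask="parallelized",           # no-op for numpy-backed arrays
    output_dtypes=[da.dtype],      # needed for the dask path
)
```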

@agustinaarroyuelo (Contributor):

@aloctavodia what would you like to do next with this PR? If there are any new developments in this discussion, please share.

Labels: none yet · Projects: none yet · Linked issues: none · 4 participants