Unexpected save behaviour with cached files when underlying file is deleted #1061

Open
parneshraniga opened this issue Oct 1, 2021 · 15 comments

@parneshraniga commented Oct 1, 2021

If I read in a temp file and cache the data (using get_data() or get_fdata()), and then delete the underlying file, nibabel tries to read from the (now deleted) file when I try to save the image, and thus throws an error.

Example program is:

import sys
import nibabel
import os
import shutil

if __name__ == "__main__":

    fname = sys.argv[1]

    shutil.copy(fname,"/tmp")
    img = nibabel.load("/tmp/"+os.path.basename(fname),mmap=False)
    img.get_data()
    img.get_fdata()
    os.remove("/tmp/"+os.path.basename(fname))

    nibabel.save(img,"temp.nii.gz")

On running, this gives:

test_nibabel.py:12: DeprecationWarning: get_data() is deprecated in favor of get_fdata(), which has a more predictable return type. To obtain get_data() behavior going forward, use numpy.asanyarray(img.dataobj).

* deprecated from version: 3.0
* Will raise <class 'nibabel.deprecator.ExpiredDeprecationError'> as of version: 5.0
  img.get_data()
Traceback (most recent call last):
  File "test_nibabel.py", line 16, in <module>
    nibabel.save(img,"temp.nii.gz")
  File "/home/ran112/anaconda3/lib/python3.8/site-packages/nibabel/loadsave.py", line 99, in save
    img.to_filename(filename)
  File "/home/ran112/anaconda3/lib/python3.8/site-packages/nibabel/filebasedimages.py", line 333, in to_filename
    self.to_file_map()
  File "/home/ran112/anaconda3/lib/python3.8/site-packages/nibabel/analyze.py", line 1007, in to_file_map
    data = np.asanyarray(self.dataobj)
  File "/home/ran112/anaconda3/lib/python3.8/site-packages/numpy/core/_asarray.py", line 136, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
  File "/home/ran112/anaconda3/lib/python3.8/site-packages/nibabel/arrayproxy.py", line 391, in __array__
    arr = self._get_scaled(dtype=dtype, slicer=())
  File "/home/ran112/anaconda3/lib/python3.8/site-packages/nibabel/arrayproxy.py", line 358, in _get_scaled
    scaled = apply_read_scaling(self._get_unscaled(slicer=slicer), scl_slope, scl_inter)
  File "/home/ran112/anaconda3/lib/python3.8/site-packages/nibabel/arrayproxy.py", line 331, in _get_unscaled
    with self._get_fileobj() as fileobj, self._lock:
  File "/home/ran112/anaconda3/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/ran112/anaconda3/lib/python3.8/site-packages/nibabel/arrayproxy.py", line 324, in _get_fileobj
    with openers.ImageOpener(
  File "/home/ran112/anaconda3/lib/python3.8/site-packages/nibabel/openers.py", line 113, in __init__
    self.fobj = opener(fileish, *args, **kwargs)
  File "/home/ran112/anaconda3/lib/python3.8/site-packages/nibabel/openers.py", line 53, in _gzip_open
    gzip_file = gzip.GzipFile(filename, mode, compresslevel)
  File "/home/ran112/anaconda3/lib/python3.8/gzip.py", line 173, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/Example_nifti.nii.gz'

I believe this might be due to the fact that, when the image is saved, the data is grabbed from the proxy object (self.dataobj) without checking the cache (line 1003 in analyze.py of the GitHub version).

This is different behavior from (much) older versions of nibabel that were Python 2.7 compatible. I am not familiar enough with the new codebase to submit a pull request at the moment.

@effigies (Member) commented Nov 5, 2021

Hi, thanks for submitting this. So what's happened is that when writing out a file, we would use img.get_data(). With the deprecation of get_data(), we switched to np.asanyarray(img.dataobj), which is equivalent when reading from disk, but not if the data has been cached and modified.

While using the data cache to treat the loaded image like a modified image worked, I wouldn't consider it an intended use (if the docs suggest otherwise, please let me know...). The approach I always use is to create a new image and simply copy the header over:

import sys
import nibabel
import os
import shutil

if __name__ == "__main__":

    fname = sys.argv[1]

    shutil.copy(fname,"/tmp")
    img = nibabel.load("/tmp/"+os.path.basename(fname),mmap=False)
    data = img.get_fdata()  # Or `np.asanyarray(img.dataobj)`
    os.remove("/tmp/"+os.path.basename(fname))

    new_img = img.__class__(data, img.affine, img.header)
    nibabel.save(new_img,"temp.nii.gz")

I can't think of a good way to restore the old behavior without producing more surprises (unnecessary typecasts or a hidden cache), so my suggestion would be to move to this approach.

Happy to talk through alternatives, though.

@parneshraniga (Author)

Thanks for getting back to me. Your suggestion works for us, so we can close this issue. I don't think that the docs anywhere suggest that the old behavior is expected, apologies if I suggested they did. It's just that this is an old piece of code that has been working across several older versions of nibabel, so it was unexpected for us.

@matthew-brett (Member)

Sorry - can I reopen this one? Thinking about it, I think we used to imply, but perhaps not say, that the cached array becomes definitive after it has reached memory. For example, you can change the cached array and get the changed version back on repeated calls to get_fdata.

So, I think someone, probably me, made a mistake by not checking the cache before writing, and I agree that this is a bit surprising. Do you think we can fix that, at this stage? Or - do we need to fix it?

@effigies (Member)

So, I think someone, probably me, made a mistake by not checking the cache before writing, and I agree that this is a bit surprising.

This was me in c8d93e6, when we finally deprecated get_data() and removed its use internally.

Do you think we can fix that, at this stage?

I think we can do it, though it would be an API change (we're overdue for 4.0 anyway). We just need the logic to be clear and documented for when the saved data comes from each of:

  • _fdata_cache
  • asanyarray(dataobj)

We would also need to check that, in the case of cached fdata, if somebody opens an int16 image (with no scale factors) and adds an integer value, we don't end up rescaling just because the data array is float.
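
A rough sketch of the case to check (file names are placeholders, and writing the cache on save is the proposed behaviour, not what current nibabel does):

import numpy as np
import nibabel as nib

# Write a plain int16 image with no scale factors
nib.save(nib.Nifti1Image(np.arange(24, dtype=np.int16).reshape(2, 3, 4), np.eye(4)),
         'int16_noscale.nii')

img = nib.load('int16_noscale.nii')
arr = img.get_fdata()   # cached as float64
arr += 1                # values are still whole numbers, just held as floats

nib.save(img, 'roundtrip.nii')
# If the cache gets written out, we would want the output to stay int16 with no
# scl_slope / scl_inter introduced just because the cached array is float:
print(nib.load('roundtrip.nii').get_data_dtype())
print(nib.load('roundtrip.nii').header.get_slope_inter())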

Or - do we need to fix it?

I've never depended on the behavior, so I'm not sure I'm the right person to ask.

@matthew-brett (Member)

The reason we might fix it is to maintain the model, which is that the image appears to contain the array returned by get_fdata or get_data, and therefore, if you modify the returned array, the image appears to contain the modified array.

I mean that, here we see that modifying the returned array modifies the return value of get_fdata:

[ins] In [8]: img = nib.load('example.nii')                                                                                                

[ins] In [9]: arr = img.get_fdata()                                                                                                        

[ins] In [10]: arr                                                                                                                         
Out[10]: 
memmap([[[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.]],

        [[12., 13., 14., 15.],
         [16., 17., 18., 19.],
         [20., 21., 22., 23.]]])

[ins] In [11]: arr[0, 0, 0] = 99                                                                                                           

[nav] In [12]: img.get_fdata()                                                                                                             
Out[12]: 
memmap([[[99.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.]],

        [[12., 13., 14., 15.],
         [16., 17., 18., 19.],
         [20., 21., 22., 23.]]])

But then, it's a bit surprising that these changes have disappeared when we save the image and reload:

[ins] In [13]: nib.save(img, 'maybe_changed.nii')                                                                                          

[ins] In [14]: nib.load('maybe_changed.nii').get_fdata()                                                                                   
Out[14]: 
memmap([[[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.]],

        [[12., 13., 14., 15.],
         [16., 17., 18., 19.],
         [20., 21., 22., 23.]]])

Now - what to do? I think the right fix is changing:

data = np.asanyarray(self.dataobj)

to

if self.in_memory:
    data = self.get_fdata()
else:
    data = np.asanyarray(self.dataobj)

See the current code.

Does that look right to you?

We should definitely add some tests to assert the behavior we think is right.
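
For example, something along these lines (a sketch only - the test name is made up, and the asserted behaviour is the proposed one, not what nibabel currently does):

import numpy as np
import nibabel as nib


def test_save_writes_cached_data(tmp_path):
    # Write a small image to disk, then load it back as a proxy image
    fname = str(tmp_path / 'example.nii')
    nib.save(nib.Nifti1Image(np.arange(24, dtype=np.float32).reshape(2, 3, 4), np.eye(4)),
             fname)
    img = nib.load(fname)

    # Modify the cached array in place
    arr = img.get_fdata()
    arr[0, 0, 0] = 99

    # Under the proposed fix, the modification should survive save / reload
    out = str(tmp_path / 'maybe_changed.nii')
    nib.save(img, out)
    assert nib.load(out).get_fdata()[0, 0, 0] == 99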

@effigies (Member)

LGTM, Matthew.

@matthew-brett (Member)

I'm afraid I think we have to email the mailing list about this, to see what people think - because it's possible people's results will change without them realizing. We could certainly warn, as in:

if self.in_memory:
    warnings.warn('Using cached data to save image; if you have changed the image data '
                  'in memory, your results may differ from versions of Nibabel prior to 4.0')
    data = self.get_fdata()
else:
    data = np.asanyarray(self.dataobj)

@effigies (Member)

Agree on consulting the hive mind and warning. Two wrinkles, though:

  1. For a cached proxy image, we drop the cache and recompute if somebody calls get_fdata() with a different dtype. So if I call img.get_fdata(dtype=np.float32), then img.get_fdata() will discard my cache and load a float64 array.

  2. For an array image with integer data (e.g., a segmentation or mask), get_fdata() will needlessly coerce to float.

We can check for these cases and use img._fdata_cache and img.dataobj, respectively. Just wanted to make sure this isn't complicating things enough to justify leaving it alone.
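
Very roughly, inside to_file_map that might look like this (a sketch only; the attribute names follow the discussion above and may not match whatever gets implemented):

if not self.in_memory:
    # Pure proxy image, nothing cached: read from disk as now
    data = np.asanyarray(self.dataobj)
elif self._fdata_cache is not None:
    # Cached proxy image: write out exactly the array the user has been handed,
    # whatever dtype they requested
    data = self._fdata_cache
else:
    # Array image (e.g. an integer segmentation created in memory): avoid the
    # needless float coercion of get_fdata()
    data = np.asanyarray(self.dataobj)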

@matthew-brett (Member) commented Nov 15, 2021

Ouch - I'm glad you remembered these.

Am I right that get_fdata() will only needlessly coerce an int-backed image to float if the header has scaling factors? Otherwise, np.array(img.dataobj) will do that coercion - right?

The img.get_fdata(dtype=np.float32) case is a serious problem, because it means that it is possible to get a different output image by doing:

img.get_fdata(dtype=np.float32)
nib.save(img, 'array_from_float32.nii')

instead of:

img.get_fdata()   # Or no get_fdata call.
nib.save(img, 'array_from_float64.nii')

That would be surprising, and difficult to explain.

@effigies (Member)

Am I right that get_fdata() will only needlessly coerce an int-backed image to float if the header has scaling factors?

get_fdata() always coerces int-backed images to floats, regardless of scale factors. get_data() had the type + scale-factor dependent behavior.

Otherwise, np.array(img.dataobj) will do that coercion - right?

This one only coerces if there are scale factors.
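
For instance (illustrative only; the file name is a placeholder):

import numpy as np
import nibabel as nib

nib.save(nib.Nifti1Image(np.zeros((2, 3, 4), dtype=np.int16), np.eye(4)),
         'int16_noscale.nii')
loaded = nib.load('int16_noscale.nii')

print(loaded.get_fdata().dtype)             # float64 - always coerced
print(np.asanyarray(loaded.dataobj).dtype)  # int16 - no scale factors, no coercion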

The img.get_fdata(dtype=np.float32) case is a serious problem, because it means that it is possible to get a different output image by doing:

img.get_fdata(dtype=np.float32)
nib.save(img, 'array_from_float32.nii')

instead of:

img.get_fdata()   # Or no get_fdata call.
nib.save(img, 'array_from_float64.nii')

That would be surprising, and difficult to explain.

Unless img.get_data_dtype() == np.float64, there should be no real difference in precision loss. float32 -> float64 -> float32 (on-disk, get_fdata(), save) shouldn't be more (or less) lossy than float32 -> float32 -> float32. Similarly int16 -> float32 -> int16 and int16 -> float64 -> int16 should be within our usual np.allclose() constraints.

I might not be reasoning properly about floating point error here, but it seems to me that if you request get_fdata(dtype=np.float32) (or even np.float16), you're asserting that you are willing to accept the precision loss.
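
For what it's worth, the int16 -> float32 -> int16 leg can be checked directly, since every int16 value is exactly representable in float32 (a standalone check, no nibabel involved):

import numpy as np

ints = np.arange(-32768, 32768).astype(np.int16)
via32 = ints.astype(np.float32).astype(np.int16)
via64 = ints.astype(np.float64).astype(np.int16)
print(np.array_equal(ints, via32), np.array_equal(ints, via64))  # True True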

@matthew-brett (Member)

Sorry - yes - I meant that get_fdata will always coerce to float, but this will be needless only in the case where there are no scale factors.

I agree that the person doing arr = img.get_fdata(dtype=np.float16) is willing to accept the precision loss for their working array - but they probably don't realize they are also accepting precision loss for the array saved to disk.

And yes, for float32 or float64 the precision loss will be relatively small - but still, any difference would be surprising, surely.

@effigies (Member)

I think the use case you're imagining is loading an array at low precision, making some determination based on the data but not changing it, and then resaving the image to a new location. And we'd want it essentially unchanged in that case.

I was imagining loading the data, making modifications, and saving, in which case I'd expect the precision I chose to be relevant.

We could monkey-patch __setitem__ on the array to set a dirty bit in the image, to distinguish the cases. A bit magical for my taste. At this point, I feel like explicitly creating new images is less ambiguous.
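
Just to make the idea concrete, something like this (all names hypothetical, not a proposal; note that in-place ufuncs such as arr += 1 would bypass __setitem__, which is part of what makes it fragile):

import numpy as np

class TrackedArray(np.ndarray):
    """ndarray subclass that flags its owning image when written to."""
    _owner = None

    def __setitem__(self, key, value):
        if self._owner is not None:
            self._owner._data_dirty = True  # hypothetical dirty bit on the image
        super().__setitem__(key, value)

# get_fdata() could then return cache.view(TrackedArray) with _owner set to the
# image, and to_file_map() could consult _data_dirty to decide whether to write
# the cache or re-read from dataobj.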

Maybe instead we should just warn on saving a proxy image with cached data. Another approach is to warn when the on-disk precision is higher than the array precision.

@matthew-brett (Member)

Another option would be to add

out_arr.setflags(write=False)

on the way out of get_fdata. Yes, it would potentially be a breaking change, but at least it would mean that no one's data would change silently beneath them - and it's pretty easy to explain.
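
From the user's side, that would look something like this (a sketch using a plain array; this is not current nibabel behaviour):

import numpy as np

arr = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
arr.setflags(write=False)      # as get_fdata() would do on the way out

try:
    arr[0, 0, 0] = 99
except ValueError as exc:
    print(exc)                 # assignment destination is read-only

writable = arr.copy()          # an explicit copy opts back in to modification
writable[0, 0, 0] = 99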

@effigies (Member)

Could also make that dependent on whether the requested dtype is lower precision than get_data_dtype(), which would limit the changed behavior to the cases of actual concern.

@matthew-brett (Member)

Hmm - but wouldn't that be rather confusing too - that you could get a read-only array only sometimes? I mean, that would be rather hard to explain.

Honestly, I would be happy to make the output array read-only by default. What do you think? Obviously it would need some discussion on the mailing list ...
