Unexpected save behaviour with cached files when underlying file is deleted #1061

Open
parneshraniga opened this issue Oct 1, 2021 · 15 comments

@parneshraniga commented Oct 1, 2021

If I read in a temp file and cache the data (using get_data() or get_fdata()), and then delete the underlying file, nibabel tries to read from the (now deleted) file when I try to save the image, and thus throws an error.

Example program is:

import sys
import nibabel
import os
import shutil

if __name__ == "__main__":

    fname = sys.argv[1]

    shutil.copy(fname,"/tmp")
    img = nibabel.load("/tmp/"+os.path.basename(fname),mmap=False)
    img.get_data()
    img.get_fdata()
    os.remove("/tmp/"+os.path.basename(fname))

    nibabel.save(img,"temp.nii.gz")

On running, this gives:

test_nibabel.py:12: DeprecationWarning: get_data() is deprecated in favor of get_fdata(), which has a more predictable return type. To obtain get_data() behavior going forward, use numpy.asanyarray(img.dataobj).

* deprecated from version: 3.0
* Will raise <class 'nibabel.deprecator.ExpiredDeprecationError'> as of version: 5.0
  img.get_data()
Traceback (most recent call last):
  File "test_nibabel.py", line 16, in <module>
    nibabel.save(img,"temp.nii.gz")
  File "/home/ran112/anaconda3/lib/python3.8/site-packages/nibabel/loadsave.py", line 99, in save
    img.to_filename(filename)
  File "/home/ran112/anaconda3/lib/python3.8/site-packages/nibabel/filebasedimages.py", line 333, in to_filename
    self.to_file_map()
  File "/home/ran112/anaconda3/lib/python3.8/site-packages/nibabel/analyze.py", line 1007, in to_file_map
    data = np.asanyarray(self.dataobj)
  File "/home/ran112/anaconda3/lib/python3.8/site-packages/numpy/core/_asarray.py", line 136, in asanyarray
    return array(a, dtype, copy=False, order=order, subok=True)
  File "/home/ran112/anaconda3/lib/python3.8/site-packages/nibabel/arrayproxy.py", line 391, in __array__
    arr = self._get_scaled(dtype=dtype, slicer=())
  File "/home/ran112/anaconda3/lib/python3.8/site-packages/nibabel/arrayproxy.py", line 358, in _get_scaled
    scaled = apply_read_scaling(self._get_unscaled(slicer=slicer), scl_slope, scl_inter)
  File "/home/ran112/anaconda3/lib/python3.8/site-packages/nibabel/arrayproxy.py", line 331, in _get_unscaled
    with self._get_fileobj() as fileobj, self._lock:
  File "/home/ran112/anaconda3/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/home/ran112/anaconda3/lib/python3.8/site-packages/nibabel/arrayproxy.py", line 324, in _get_fileobj
    with openers.ImageOpener(
  File "/home/ran112/anaconda3/lib/python3.8/site-packages/nibabel/openers.py", line 113, in __init__
    self.fobj = opener(fileish, *args, **kwargs)
  File "/home/ran112/anaconda3/lib/python3.8/site-packages/nibabel/openers.py", line 53, in _gzip_open
    gzip_file = gzip.GzipFile(filename, mode, compresslevel)
  File "/home/ran112/anaconda3/lib/python3.8/gzip.py", line 173, in __init__
    fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/Example_nifti.nii.gz'

I believe this might be due to the fact that, when the image is saved, the data is grabbed from the proxy object (self.dataobj) without checking the cache (line 1003 in analyze.py of the GitHub version).

This is different behavior from (much) older versions of nibabel that were Python 2.7 compatible. I am not familiar enough with the new codebase to submit a pull request at the moment.

@effigies (Member) commented Nov 5, 2021

Hi, thanks for submitting this. So what's happened is that when writing out a file, we would use img.get_data(). With the deprecation of get_data(), we switched to np.asanyarray(img.dataobj), which is equivalent when reading from disk, but not if the data has been cached and modified.

While using the data cache to treat the loaded image like a modified image worked, I wouldn't consider it an intended use (if the docs suggest otherwise, please let me know...). The approach I always use is to create a new image and simply copy the header over:

import sys
import nibabel
import os
import shutil

if __name__ == "__main__":

    fname = sys.argv[1]

    shutil.copy(fname,"/tmp")
    img = nibabel.load("/tmp/"+os.path.basename(fname),mmap=False)
    data = img.get_fdata()  # Or `np.asanyarray(img.dataobj)`
    os.remove("/tmp/"+os.path.basename(fname))

    new_img = img.__class__(data, img.affine, img.header)
    nibabel.save(new_img,"temp.nii.gz")

I can't think of a good way to restore the old behavior without producing more surprises (unnecessary typecasts or a hidden cache), so my suggestion would be to move to this approach.

Happy to talk through alternatives, though.

@parneshraniga (Author)

Thanks for getting back to me. Your suggestion works for us, so we can close this issue. I don't think that the docs anywhere suggest that the old behavior is expected, apologies if I suggested they did. It's just that this is an old piece of code that has been working across several older versions of nibabel, so it was unexpected for us.

@matthew-brett (Member)

Sorry - can I reopen this one? Thinking about it, I think we used to imply, but perhaps not say, that the cached array becomes definitive after it has reached memory. For example, you can change the cached array and get the changed version back on repeated calls to get_fdata.

So, I think someone, probably me, made a mistake by not checking the cache before writing, and I agree that this is a bit surprising. Do you think we can fix that, at this stage? Or - do we need to fix it?

@effigies (Member)

So, I think someone, probably me, made a mistake by not checking the cache before writing, and I agree that this is a bit surprising.

This was me in c8d93e6, when we finally deprecated get_data() and removed its use internally.

Do you think we can fix that, at this stage?

I think we can do it, though it would be an API change (we're overdue for 4.0 anyway). We just need the logic to be clear and documented for when the saved data comes from each of:

  • _fdata_cache
  • asanyarray(dataobj)

We would also need to check that, in the case of cached fdata, if somebody opens an int16 image (with no scale factors) and adds an integer value, we don't end up rescaling just because the data array is float.
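
A rough sketch of the case to check (file names are placeholders, and writing the cache on save is the proposed behaviour, not what current nibabel does):

import numpy as np
import nibabel as nib

# Write a plain int16 image with no scale factors
nib.save(nib.Nifti1Image(np.arange(24, dtype=np.int16).reshape(2, 3, 4), np.eye(4)),
         'int16_noscale.nii')

img = nib.load('int16_noscale.nii')
arr = img.get_fdata()   # cached as float64
arr += 1                # values are still whole numbers, just held as floats

nib.save(img, 'roundtrip.nii')
# If the cache gets written out, we would want the output to stay int16 with no
# scl_slope / scl_inter introduced just because the cached array is float:
print(nib.load('roundtrip.nii').get_data_dtype())
print(nib.load('roundtrip.nii').header.get_slope_inter())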

Or - do we need to fix it?

I've never depended on the behavior, so I'm not sure I'm the right person to ask.

@matthew-brett (Member)

The reason we might fix it is to maintain the model, which is that the image appears to contain the array returned by get_fdata or get_data, and therefore, if you modify the returned array, the image appears to contain the modified array.

I mean that, here we see that modifying the returned array modifies the return value of get_fdata:

[ins] In [8]: img = nib.load('example.nii')                                                                                                

[ins] In [9]: arr = img.get_fdata()                                                                                                        

[ins] In [10]: arr                                                                                                                         
Out[10]: 
memmap([[[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.]],

        [[12., 13., 14., 15.],
         [16., 17., 18., 19.],
         [20., 21., 22., 23.]]])

[ins] In [11]: arr[0, 0, 0] = 99                                                                                                           

[nav] In [12]: img.get_fdata()                                                                                                             
Out[12]: 
memmap([[[99.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.]],

        [[12., 13., 14., 15.],
         [16., 17., 18., 19.],
         [20., 21., 22., 23.]]])

But then, it's a bit surprising that these changes have disappeared when we save the image and reload:

[ins] In [13]: nib.save(img, 'maybe_changed.nii')                                                                                          

[ins] In [14]: nib.load('maybe_changed.nii').get_fdata()                                                                                   
Out[14]: 
memmap([[[ 0.,  1.,  2.,  3.],
         [ 4.,  5.,  6.,  7.],
         [ 8.,  9., 10., 11.]],

        [[12., 13., 14., 15.],
         [16., 17., 18., 19.],
         [20., 21., 22., 23.]]])

Now - what to do? I think the right fix is changing:

data = np.asanyarray(self.dataobj)

to

if self.in_memory:
    data = self.get_fdata()
else:
    data = np.asanyarray(self.dataobj)

See the current code.

Does that look right to you?

We should definitely add some tests to assert the behavior we think is right.
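
For example, something along these lines (a sketch only - the test name is made up, and the asserted behaviour is the proposed one, not what nibabel currently does):

import numpy as np
import nibabel as nib


def test_save_writes_cached_data(tmp_path):
    # Write a small image to disk, then load it back as a proxy image
    fname = str(tmp_path / 'example.nii')
    nib.save(nib.Nifti1Image(np.arange(24, dtype=np.float32).reshape(2, 3, 4), np.eye(4)),
             fname)
    img = nib.load(fname)

    # Modify the cached array in place
    arr = img.get_fdata()
    arr[0, 0, 0] = 99

    # Under the proposed fix, the modification should survive save / reload
    out = str(tmp_path / 'maybe_changed.nii')
    nib.save(img, out)
    assert nib.load(out).get_fdata()[0, 0, 0] == 99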

@effigies (Member)

LGTM, Matthew.

@matthew-brett (Member)

I'm afraid I think we have to email the mailing list about this, to see what people think - because it's possible people's results will change without them realizing. We could certainly warn, as in:

if self.in_memory:
    warnings.warn('Using cached data to save image; if you have changed the image data '
                  'in memory, your results may differ from versions of Nibabel prior to 4.0')
    data = self.get_fdata()
else:
    data = np.asanyarray(self.dataobj)

@effigies (Member)

Agree on consulting the hive mind and warning. Two wrinkles, though:

  1. For a cached proxy image, we drop the cache and recompute if somebody calls get_fdata() with a different dtype. So if I call img.get_fdata(dtype=np.float32), then img.get_fdata() will discard my cache and load a float64 array.

  2. For an array image with integer data (e.g., a segmentation or mask), get_fdata() will needlessly coerce to float.

We can check for these cases and use img._fdata_cache and img.dataobj, respectively. Just wanted to make sure this isn't complicating things enough to justify leaving it alone.
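
Very roughly, inside to_file_map that might look like this (a sketch only; the attribute names follow the discussion above and may not match whatever gets implemented):

if not self.in_memory:
    # Pure proxy image, nothing cached: read from disk as now
    data = np.asanyarray(self.dataobj)
elif self._fdata_cache is not None:
    # Cached proxy image: write out exactly the array the user has been handed,
    # whatever dtype they requested
    data = self._fdata_cache
else:
    # Array image (e.g. an integer segmentation created in memory): avoid the
    # needless float coercion of get_fdata()
    data = np.asanyarray(self.dataobj)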

@matthew-brett (Member) commented Nov 15, 2021

Ouch - I'm glad you remembered these.

Am I right that get_fdata() will only needlessly coerce an int-backed image to float if the header has scaling factors? Otherwise, np.array(img.dataobj) will do that coercion - right?

The img.get_fdata(dtype=np.float32) case is a serious problem, because it means that it is possible to get a different output image by doing:

img.get_fdata(dtype=np.float32)
nib.save(img, 'array_from_float32.nii')

instead of:

img.get_fdata()   # Or no get_fdata call.
nib.save(img, 'array_from_float64.nii')

That would be surprising, and difficult to explain.

@effigies (Member)

Am I right that get_fdata() will only needlessly coerce an int-backed image to float if the header has scaling factors?

get_fdata() always coerces int-backed images to floats, regardless of scale factors. get_data() had the type + scale-factor dependent behavior.

Otherwise, np.array(img.dataobj) will do that coercion - right?

This one only coerces if there are scale factors.
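
For instance (illustrative only; the file name is a placeholder):

import numpy as np
import nibabel as nib

nib.save(nib.Nifti1Image(np.zeros((2, 3, 4), dtype=np.int16), np.eye(4)),
         'int16_noscale.nii')
loaded = nib.load('int16_noscale.nii')

print(loaded.get_fdata().dtype)             # float64 - always coerced
print(np.asanyarray(loaded.dataobj).dtype)  # int16 - no scale factors, no coercion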

The img.get_fdata(dtype=np.float32) case is a serious problem, because it means that it is possible to get a different output image by doing:

img.get_fdata(dtype=np.float32)
nib.save(img, 'array_from_float32.nii')

instead of:

img.get_fdata()   # Or no get_fdata call.
nib.save(img, 'array_from_float64.nii')

That would be surprising, and difficult to explain.

Unless img.get_data_dtype() == np.float64, there should be no real difference in precision loss. float32 -> float64 -> float32 (on-disk, get_fdata(), save) shouldn't be more (or less) lossy than float32 -> float32 -> float32. Similarly int16 -> float32 -> int16 and int16 -> float64 -> int16 should be within our usual np.allclose() constraints.

I might not be reasoning properly about floating point error here, but it seems to me that if you request get_fdata(dtype=np.float32) (or even np.float16), you're asserting that you are willing to accept the precision loss.
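
For what it's worth, the int16 -> float32 -> int16 leg can be checked directly, since every int16 value is exactly representable in float32 (a standalone check, no nibabel involved):

import numpy as np

ints = np.arange(-32768, 32768).astype(np.int16)
via32 = ints.astype(np.float32).astype(np.int16)
via64 = ints.astype(np.float64).astype(np.int16)
print(np.array_equal(ints, via32), np.array_equal(ints, via64))  # True True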

@matthew-brett (Member)

Sorry - yes - I meant that get_fdata will always coerce to float, but this will be needless only in the case where there are no scale factors.

I agree that the person doing arr = img.get_fdata(dtype=np.float16) is willing to accept the precision loss for their working array - but they probably don't realize they are also accepting precision loss for the array saved to disk.

And yes, for float32 or float64 the precision loss will be relatively small - but still, any difference would be surprising, surely.

@effigies (Member)

I think the use case you're imagining is loading an array at low precision, making some determination based on the data but not changing it, and then resaving the image to a new location. And we'd want it essentially unchanged in that case.

I was imagining loading the data, making modifications, and saving, in which case I'd expect the precision I chose to be relevant.

We could monkey-patch __setitem__ on the array to set a dirty bit in the image, to distinguish the cases. A bit magical for my taste. At this point, I feel like explicitly creating new images is less ambiguous.
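
Just to make the idea concrete, something like this (all names hypothetical, not a proposal; note that in-place ufuncs such as arr += 1 would bypass __setitem__, which is part of what makes it fragile):

import numpy as np

class TrackedArray(np.ndarray):
    """ndarray subclass that flags its owning image when written to."""
    _owner = None

    def __setitem__(self, key, value):
        if self._owner is not None:
            self._owner._data_dirty = True  # hypothetical dirty bit on the image
        super().__setitem__(key, value)

# get_fdata() could then return cache.view(TrackedArray) with _owner set to the
# image, and to_file_map() could consult _data_dirty to decide whether to write
# the cache or re-read from dataobj.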

Maybe instead we should just warn on saving a proxy image with cached data. Another approach is to warn when the on-disk precision is higher than the array precision.

@matthew-brett (Member)

Another option would be to add

out_arr.setflags(write=False)

on the way out of get_fdata. Yes, it would potentially be a breaking change, but at least it would mean that no one's data would change silently beneath them - and it's pretty easy to explain.
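
From the user's side, that would look something like this (a sketch using a plain array; this is not current nibabel behaviour):

import numpy as np

arr = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
arr.setflags(write=False)      # as get_fdata() would do on the way out

try:
    arr[0, 0, 0] = 99
except ValueError as exc:
    print(exc)                 # assignment destination is read-only

writable = arr.copy()          # an explicit copy opts back in to modification
writable[0, 0, 0] = 99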

@effigies (Member)

Could also make that dependent on whether the requested dtype is lower precision than get_data_dtype(), which would limit the changed behavior to the cases of actual concern.

@matthew-brett (Member)

Hmm - but wouldn't that be rather confusing too - that you could get a read-only array only sometimes? I mean, that would be rather hard to explain.

Honestly, I would be happy to make the output array read-only by default. What do you think? Obviously it would need some discussion on the mailing list ...
