dataset.filename is a BytesIO for Deflated Explicit VR Little Endian objects, breaking codify and other uses #1937

blairconrad · 2023-11-08T15:55:35Z

Describe the bug

When opening a Deflated Explicit VR Little Endian (1.2.840.10008.1.2.1.99) object from a file (or even from pydicom internal test data), the dataset.filename is not a filename, but is a BytesIO object. First noticed in the context of blairconrad/dicognito#159, where I was relying on the .filename to report to the user which file failed to anonymize, but then I noticed that it breaks pydicom codify as well.

Expected behavior

dataset.filename should, when the dataset was read from a file at least, contain the filename.

Steps To Reproduce

Load any Deflated Explicit VR Little Endian dataset from a file and examine .filename.

>>> import pydicom
>>> ds = pydicom.dcmread("src/pydicom/data/test_files/image_dfl.dcm")
>>> ds.filename
<_io.BytesIO object at 0x0000013BFEA2A2A0>

More unsettling (and possibly the subject of a separate issue) is how this breaks codify:

$ pydicom codify src/pydicom/data/test_files/image_dfl.dcm
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Dev\pydicom\.venv\pydicom\Scripts\pydicom.exe\__main__.py", line 7, in <module>
  File "C:\Dev\pydicom\src\pydicom\cli\main.py", line 234, in main
    ns.func(ns)
  File "C:\Dev\pydicom\src\pydicom\util\codify.py", line 488, in do_codify
    base, _ = os.path.splitext(filename)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen ntpath>", line 232, in splitext
TypeError: expected str, bytes or os.PathLike object, not BytesIO

I don't blame codify for thinking that filename is the name of a file…

Your environment

module	version
platform	Windows-10-10.0.22621-SP0
Python	3.11.3 (tags/v3.11.3:f3909b8, Apr 4 2023, 23:49:59) [MSC v.1934 64 bit (AMD64)]
pydicom	3.0.0.dev0
gdcm	module not found
jpeg_ls	module not found
numpy	module not found
PIL	module not found
pylibjpeg	module not found
openjpeg	module not found
libjpeg	module not found

$ git log -1 --oneline
335934d5a (HEAD -> main, origin/main, origin/HEAD) Bump mypy from 1.5.1 to 1.6.1

occurs with many recent and older released versions as well

Now this may all be intentional and I'm wasting your time. Heck, it's even accounted for, in some sense:

pydicom/src/pydicom/dataset.py

Lines 2767 to 2769 in af5f54e

    
           else: 
        
               # e.g. came from BytesIO or something file-like 
        
               self.filename = filename_or_obj

and we can see the the BytesIO object was created from the inflated content:

pydicom/src/pydicom/filereader.py

Lines 851 to 863 in af5f54e

    
           elif transfer_syntax == pydicom.uid.DeflatedExplicitVRLittleEndian: 
        
               # See PS3.5 section A.5 
        
               # when written, the entire dataset following 
        
               #     the file metadata was prepared the normal way, 
        
               #     then "deflate" compression applied. 
        
               #  All that is needed here is to decompress and then 
        
               #     use as normal in a file-like object 
        
               zipped = fileobj.read() 
        
               # -MAX_WBITS part is from comp.lang.python answer: 
        
               # groups.google.com/group/comp.lang.python/msg/e95b3b38a71e6799 
        
               unzipped = zlib.decompress(zipped, -zlib.MAX_WBITS) 
        
               fileobj = BytesIO(unzipped)  # a file-like object 
        
               is_implicit_VR = False

The text was updated successfully, but these errors were encountered:

blairconrad · 2023-11-08T15:56:32Z

A naive "fix" would be to attach the original filename to the BytesIO object (or a proxy for same), to then be attached to the dataset, but it's perhaps a little crude. And I've admittedly not checked the ramifications. I'm not sure if there's a downside to the filename all of a sudden being a string when the contents had been read from a BytesIO (I'm guessing not, but have only a shallow understanding.)

If the maintainers are interested in addressing this issue (either the root "problem" or the "codify" problem specifically), and have a preferred direction, I'd likely be available to implement.

mrbean-bremen · 2023-11-08T18:19:29Z

You are right, this is indeed intended to be able to handle data streams or data coming from zip or tar files. The name filename is a bit unfortunate in these cases.
OTOH, the case for DeflatedExplicitVRLittleEndian is indeed a special case and I agree that the behavior is unexpected here. I'm undecided if and how this shall be changed though. A separate field for this special case seems like overkill to me, but I don't have a good solution either.

blairconrad · 2023-11-08T19:05:41Z

A separate field for this special case seems like overkill to me, but I don't have a good solution either.

Understood. The jerk in me suggests that abusing an attribute called filename for the special case isn't the best either. 😉 Easy for me to say as an outsider.

On a more serious note, I may not be doing a sufficiently rigorous search, but it seems to me that the filename attribute is lightly-used internally. Once set, I don't see anyone accessing it except to

set into a new dataset when copying the original (in which case it is Noned out if not a string)
to load a dataset, in which case we fail if it's not a string
use in codify, as discussed (where it needs to be a PathLike)
pass it to read_deferred_data_element where a string would work (although with a performance hit for checking that the file exists and reopening?)

The last one looks to me to be the biggest cause for concern. Am I close?

Unless the concern (valid, IMO) is that some clients will have begun to expect a non-filename filename attribute in the cases where it's currently a BytesIO (or None).

mrbean-bremen · 2023-11-08T19:50:36Z

The jerk in me suggests that abusing an attribute called filename for the special case isn't the best either.

True. That was in the name of upwards compatibility, I guess, which is now, before a major version change, not that much of a concern. In principle, I'm not against two separate attributes, and/or separate methods for the different use cases. I'm just not sure if there isn't a better solution design-wise - waiting for @darcymason or @scaramallion to chime in...

darcymason · 2023-11-08T20:57:50Z

So, yeah, this is kind of a case of getting "caught" doing something purely for convenience (way way back carrying the filename along) and not really thinking through the "API" consequences. Likewise the BytesIO use was just kind of pasted on... sometimes just getting things done is the way to go, but sometimes it comes back later.

The comments about filename being lightly used within pydicom seem valid too - I did a very quick search and don't see anything beyond what you noted. As to codify, it should be quite easy to fix that - it doesn't use it much, I think just for a default output name and comment lines in the output. Also, the filespec... parsing could just be modified to return the filename separately from the Dataset.

However, my first thoughts are to try to change the deflated dataset behavior. Seems like it should be relatively simple to keep filename a true filename, and use a separate instance attribute for the BytesIO object. As @mrbean-bremen said, it is a good time to make breaking changes. But I guess we'd have to look into a bit - maybe there was a reason for not doing that originally.

blairconrad changed the title ~~dataset.filename is a BytesIO for deflated explicit VR little-endian objects, breaking codify and other uses~~ dataset.filename is a BytesIO for Deflated Explicit VR Little Endian objects, breaking codify and other uses Nov 8, 2023

blairconrad mentioned this issue Nov 15, 2023

Anonymization failure cites BytesIO object instead of input filename blairconrad/dicognito#159

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dataset.filename is a BytesIO for Deflated Explicit VR Little Endian objects, breaking codify and other uses #1937

dataset.filename is a BytesIO for Deflated Explicit VR Little Endian objects, breaking codify and other uses #1937

blairconrad commented Nov 8, 2023 •

edited

blairconrad commented Nov 8, 2023 •

edited

mrbean-bremen commented Nov 8, 2023

blairconrad commented Nov 8, 2023 •

edited

mrbean-bremen commented Nov 8, 2023

darcymason commented Nov 8, 2023

dataset.filename is a BytesIO for Deflated Explicit VR Little Endian objects, breaking codify and other uses #1937

dataset.filename is a BytesIO for Deflated Explicit VR Little Endian objects, breaking codify and other uses #1937

Comments

blairconrad commented Nov 8, 2023 • edited

blairconrad commented Nov 8, 2023 • edited

mrbean-bremen commented Nov 8, 2023

blairconrad commented Nov 8, 2023 • edited

mrbean-bremen commented Nov 8, 2023

darcymason commented Nov 8, 2023

blairconrad commented Nov 8, 2023 •

edited

blairconrad commented Nov 8, 2023 •

edited

blairconrad commented Nov 8, 2023 •

edited