Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dataset.filename is a BytesIO for Deflated Explicit VR Little Endian objects, breaking codify and other uses #1937

Open
blairconrad opened this issue Nov 8, 2023 · 5 comments

Comments

@blairconrad
Copy link
Contributor

blairconrad commented Nov 8, 2023

Describe the bug

When opening a Deflated Explicit VR Little Endian (1.2.840.10008.1.2.1.99) object from a file (or even from pydicom internal test data), the dataset.filename is not a filename, but is a BytesIO object. First noticed in the context of blairconrad/dicognito#159, where I was relying on the .filename to report to the user which file failed to anonymize, but then I noticed that it breaks pydicom codify as well.

Expected behavior

dataset.filename should, when the dataset was read from a file at least, contain the filename.

Steps To Reproduce

Load any Deflated Explicit VR Little Endian dataset from a file and examine .filename.

>>> import pydicom
>>> ds = pydicom.dcmread("src/pydicom/data/test_files/image_dfl.dcm")
>>> ds.filename
<_io.BytesIO object at 0x0000013BFEA2A2A0>

More unsettling (and possibly the subject of a separate issue) is how this breaks codify:

$ pydicom codify src/pydicom/data/test_files/image_dfl.dcm
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Dev\pydicom\.venv\pydicom\Scripts\pydicom.exe\__main__.py", line 7, in <module>
  File "C:\Dev\pydicom\src\pydicom\cli\main.py", line 234, in main
    ns.func(ns)
  File "C:\Dev\pydicom\src\pydicom\util\codify.py", line 488, in do_codify
    base, _ = os.path.splitext(filename)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen ntpath>", line 232, in splitext
TypeError: expected str, bytes or os.PathLike object, not BytesIO

I don't blame codify for thinking that filename is the name of a file…

Your environment

module version
platform Windows-10-10.0.22621-SP0
Python 3.11.3 (tags/v3.11.3:f3909b8, Apr 4 2023, 23:49:59) [MSC v.1934 64 bit (AMD64)]
pydicom 3.0.0.dev0
gdcm module not found
jpeg_ls module not found
numpy module not found
PIL module not found
pylibjpeg module not found
openjpeg module not found
libjpeg module not found
$ git log -1 --oneline
335934d5a (HEAD -> main, origin/main, origin/HEAD) Bump mypy from 1.5.1 to 1.6.1

occurs with many recent and older released versions as well


Now this may all be intentional and I'm wasting your time. Heck, it's even accounted for, in some sense:

pydicom/src/pydicom/dataset.py

Lines 2767 to 2769 in af5f54e

else:
# e.g. came from BytesIO or something file-like
self.filename = filename_or_obj

and we can see the the BytesIO object was created from the inflated content:

elif transfer_syntax == pydicom.uid.DeflatedExplicitVRLittleEndian:
# See PS3.5 section A.5
# when written, the entire dataset following
# the file metadata was prepared the normal way,
# then "deflate" compression applied.
# All that is needed here is to decompress and then
# use as normal in a file-like object
zipped = fileobj.read()
# -MAX_WBITS part is from comp.lang.python answer:
# groups.google.com/group/comp.lang.python/msg/e95b3b38a71e6799
unzipped = zlib.decompress(zipped, -zlib.MAX_WBITS)
fileobj = BytesIO(unzipped) # a file-like object
is_implicit_VR = False

@blairconrad
Copy link
Contributor Author

blairconrad commented Nov 8, 2023

A naive "fix" would be to attach the original filename to the BytesIO object (or a proxy for same), to then be attached to the dataset, but it's perhaps a little crude. And I've admittedly not checked the ramifications. I'm not sure if there's a downside to the filename all of a sudden being a string when the contents had been read from a BytesIO (I'm guessing not, but have only a shallow understanding.)

If the maintainers are interested in addressing this issue (either the root "problem" or the "codify" problem specifically), and have a preferred direction, I'd likely be available to implement.

@blairconrad blairconrad changed the title dataset.filename is a BytesIO for deflated explicit VR little-endian objects, breaking codify and other uses dataset.filename is a BytesIO for Deflated Explicit VR Little Endian objects, breaking codify and other uses Nov 8, 2023
@mrbean-bremen
Copy link
Member

You are right, this is indeed intended to be able to handle data streams or data coming from zip or tar files. The name filename is a bit unfortunate in these cases.
OTOH, the case for DeflatedExplicitVRLittleEndian is indeed a special case and I agree that the behavior is unexpected here. I'm undecided if and how this shall be changed though. A separate field for this special case seems like overkill to me, but I don't have a good solution either.

@blairconrad
Copy link
Contributor Author

blairconrad commented Nov 8, 2023

A separate field for this special case seems like overkill to me, but I don't have a good solution either.

Understood. The jerk in me suggests that abusing an attribute called filename for the special case isn't the best either. 😉 Easy for me to say as an outsider.

On a more serious note, I may not be doing a sufficiently rigorous search, but it seems to me that the filename attribute is lightly-used internally. Once set, I don't see anyone accessing it except to

  1. set into a new dataset when copying the original (in which case it is Noned out if not a string)
  2. to load a dataset, in which case we fail if it's not a string
  3. use in codify, as discussed (where it needs to be a PathLike)
  4. pass it to read_deferred_data_element where a string would work (although with a performance hit for checking that the file exists and reopening?)

The last one looks to me to be the biggest cause for concern. Am I close?

Unless the concern (valid, IMO) is that some clients will have begun to expect a non-filename filename attribute in the cases where it's currently a BytesIO (or None).

@mrbean-bremen
Copy link
Member

The jerk in me suggests that abusing an attribute called filename for the special case isn't the best either.

True. That was in the name of upwards compatibility, I guess, which is now, before a major version change, not that much of a concern. In principle, I'm not against two separate attributes, and/or separate methods for the different use cases. I'm just not sure if there isn't a better solution design-wise - waiting for @darcymason or @scaramallion to chime in...

@darcymason
Copy link
Member

So, yeah, this is kind of a case of getting "caught" doing something purely for convenience (way way back carrying the filename along) and not really thinking through the "API" consequences. Likewise the BytesIO use was just kind of pasted on... sometimes just getting things done is the way to go, but sometimes it comes back later.

The comments about filename being lightly used within pydicom seem valid too - I did a very quick search and don't see anything beyond what you noted. As to codify, it should be quite easy to fix that - it doesn't use it much, I think just for a default output name and comment lines in the output. Also, the filespec... parsing could just be modified to return the filename separately from the Dataset.

However, my first thoughts are to try to change the deflated dataset behavior. Seems like it should be relatively simple to keep filename a true filename, and use a separate instance attribute for the BytesIO object. As @mrbean-bremen said, it is a good time to make breaking changes. But I guess we'd have to look into a bit - maybe there was a reason for not doing that originally.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants