[MRG] Add image file reader class #1447

Open
wants to merge 9 commits into
base: main

hackermd
Contributor

@hackermd hackermd commented Jul 22, 2021

Describe the changes

Adds support for efficiently reading individual frames of a (multi-frame) image without loading the entire Pixel Data element into memory (see also #534, #1263, #1243).

Tasks

  • Unit tests added that reproduce the issue or prove feature is working
  • Fix or feature added
  • Code typed and mypy shows no errors
  • Documentation updated (if relevant)
    • No warnings during build
    • Preview link (CircleCI -> Artifacts -> doc/_build/html/index.html)
  • Unit tests passing and overall coverage the same or better

@pep8speaks

pep8speaks commented Jul 22, 2021

Hello @hackermd! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-10-22 16:44:27 UTC

@codecov

codecov bot commented Jul 22, 2021

Codecov Report

Merging #1447 (c31faf9) into master (4693362) will decrease coverage by 0.97%.
The diff coverage is 55.73%.

@@            Coverage Diff             @@
##           master    #1447      +/-   ##
==========================================
- Coverage   97.42%   96.45%   -0.98%     
==========================================
  Files          66       66              
  Lines       10182    10425     +243     
==========================================
+ Hits         9920    10055     +135     
- Misses        262      370     +108     
Impacted Files            Coverage Δ
pydicom/filereader.py     79.43% <55.73%> (-14.92%) ⬇️
pydicom/_dicom_dict.py    100.00% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@darcymason darcymason left a comment

Just a few thoughts from a first pass...

Re my comment on metadata, I wonder if there is a way to integrate this into Dataset - just off the cuff here, could we have something like a frames property as an alternate to pixel_array? Generally I don't like to complicate Dataset with new code, but this is quite a common need, and important for large file sizes.

Examples
--------
>>> from highdicom.io import ImageFileReader
>>> with ImageFileReader('/path/to/file.dcm') as image:
Member

Reference to highdicom needs replacing

Contributor Author

Addressed via 8b58178

"""

def __init__(self, filename: str):
"""
Member

Can we also include a file object rather than just a filename? Inevitably someone needing to work with streams will ask for it, although some may be limited if there are negative seeks, as seen for example in #753.
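A minimal sketch of how a reader could accept either a filename or an existing file object, assuming that frame access needs to seek. The helper name `open_seekable` is hypothetical and is not part of this PR or of pydicom:

```python
import os
from typing import BinaryIO, Union


def open_seekable(source: Union[str, os.PathLike, BinaryIO]) -> BinaryIO:
    """Return a seekable binary file object for a path or an open stream.

    Hypothetical helper: paths are opened for reading, existing streams are
    returned unchanged after checking that they support seeking.
    """
    if isinstance(source, (str, os.PathLike)):
        return open(source, "rb")
    if not source.seekable():
        # Frame access jumps around in the file, so a forward-only stream
        # cannot be used.
        raise ValueError("stream must support seeking to read frames")
    return source
```

Rejecting non-seekable streams up front is one way to surface the limitation mentioned above early, rather than failing on the first backward seek.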

Contributor Author

I generally like the idea of allowing the user to provide an existing file object as an alternative and have refactored the class accordingly via e1056bd.

Contributor Author

I have decided to not set the is_little_endian and is_implicit_VR attributes on the file object, so that the object does not get modified. Instead an error is raised if their values are not correct.

This is one of the reasons I initially implemented the class such that the file object gets created under the hood and the attributes are set correctly.

@darcymason What are your thoughts on this?
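The "validate, don't mutate" behavior described above could look roughly like the following. This is only a sketch: the function name `check_encoding` is hypothetical, and the only assumption about the file object is that it carries `is_little_endian` and `is_implicit_VR` attributes:

```python
def check_encoding(fp, is_little_endian: bool, is_implicit_VR: bool) -> None:
    """Raise if the file object's declared encoding differs from the expected one.

    The file object is only inspected, never modified.
    """
    expected_values = {
        "is_little_endian": is_little_endian,
        "is_implicit_VR": is_implicit_VR,
    }
    for name, expected in expected_values.items():
        if not hasattr(fp, name):
            raise AttributeError(f"file object has no attribute '{name}'")
        actual = getattr(fp, name)
        if actual != expected:
            raise ValueError(
                f"file object has {name}={actual}, expected {expected}"
            )
```

The trade-off is that callers who pass their own file object must set the attributes themselves, which is the inconvenience the original filename-only design avoided.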



class ImageFileReader(object):

Member

Subclassing from object only needed for Python 2, which we have dropped.

Contributor Author

Addressed via 0b5b68c

>>> from highdicom.io import ImageFileReader
>>> with ImageFileReader('/path/to/file.dcm') as image:
...     print(image.metadata)
...     for i in range(image.number_of_frames):
Member

I'm not a fan of the term metadata here, especially because it sounds too much like the File Meta Information defined for part 10 files in the Standard. But also because it is really a Dataset (from dcmread on open call) of all data elements except the pixel data; pydicom (and DICOM itself) don't give unique status to pixel data, unlike other file types with a 'header' or 'metadata'; in DICOM they are all just data elements.

Having said that, I'm not sure what to call it. In pydicom, using stop_before_pixels=True just gives a Dataset, simply with the pixel data element missing. I'll mull this over a little more...

Contributor Author

In DICOMweb, the term metadata is used to refer to the subset of Data Elements in a Data Set that are not considered bulkdata (i.e., that have a size smaller than a defined threshold).

I generally use the term to indicate that a given pydicom.dataset.Dataset may not contain the Pixel Data element.

Member

I have heard that term too, and have also been confused by it. I think it comes from image formats like TIFF that have embedded metadata tags.
What I encounter more often is "DICOM header data" - not sure if that would be a better term.

Member

DICOM header data

I think 'header' applies to other formats where the pixel data is not tagged, but whatever is left after some initial data is just raw pixel bytes. It assumes every file is basically pixel data, plus a few extra things.

In DICOM, many file types (SOP Classes) do not have pixel data, so the whole file would be 'header', which doesn't really make sense. Same goes for metadata in the sense of everything that is not pixels.

could we have something like a frames property as an alternate to pixel_array?

Any thoughts on the above general comment? If this could work as an add-on to Dataset, then the nomenclature issue could go away.

Member

In DICOM, many file types (SOP Classes) do not have pixel data, so the whole file would be 'header', which doesn't really make sense.

Fair enough. I didn't really like the term too much, was just thinking of an alternative to metadata.

Any thoughts on the above general comment?

I think it makes sense. Frames are an inherent property of DICOM files, so it would fit IMHO. Also, I think that the ability to read single frames is an important feature that belongs in the core.

Contributor Author

Regarding terminology, I think that anyone who uses DICOMweb will be familiar with the term "metadata" and the value returned by the metadata property will (more or less) correspond to what one would get via a DICOMweb /metadata call. Similarly, the value returned by read_frame() corresponds to what one would get via a DICOMweb /frames/{frameNumber} call.

@hackermd
Contributor Author

Re my comment on metadata, I wonder if there is a way to integrate this into Dataset - just off the cuff here, could we have something like a frames property as an alternate to pixel_array? Generally I don't like to complicate Dataset with new code, but this is quite a common need, and important for large file sizes.

I initially also thought about adding this functionality to the Dataset class. However, I have since come to the conclusion that this would not be desirable for the following reasons:

  • Efficient frame-level access would require setting stop_before_pixels to True when using pydicom.filereader.dcmread() to read the data set from a file, which would tightly couple the two abstractions. Otherwise, the entire Pixel Data element would be fully read into memory anyway.
  • If the Dataset does not get read from a file, but is instead retrieved over network, the whole Pixel Data element will already be available in memory. One could avoid decoding of all frames and only decode an individual frame, but this could be a separate feature to add in the future.
  • Not every Dataset actually represents an image and I therefore find adding more image-specific methods to the class problematic. We could discuss the implementation of an Image class that gets derived from Dataset, which would get us very close to the approach we've chosen in highdicom, and construct the subclass when reading a data set from a file. However, this would break backwards compatibility.

@hackermd
Contributor Author

@darcymason not sure whether you noticed the color correction feature (see here), which depends on #1446. This feature is potentially a bit tricky, because it requires Pillow. We could also leave that out for now.

@darcymason
Member

darcymason commented Aug 23, 2021

  • Efficient frame-level access would require setting stop_before_pixels to True when using pydicom.filereader.dcmread() to read the data set from a file

This had crossed my mind, but I was thinking (without saying it 'out loud' of course) that we might want to move in this direction anyway with memory mapping, as noted in previous discussions like #1267. I think it might be interesting to explore whether memory mapping can be used on all files (or at least ones above a modest size) transparently, without impacting performance of reading smaller files. I think everyone dealing with large files would like that behavior, and if it was transparent in other cases, no one would complain.
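To make the memory-mapping idea concrete, here is a small standard-library sketch. It is not pydicom code, and the helper name `read_span` is hypothetical; it only illustrates that a byte range can be pulled out of a large file without reading the whole file:

```python
import mmap


def read_span(path: str, offset: int, length: int) -> bytes:
    """Return `length` bytes starting at `offset` using a read-only memory map.

    Only the pages backing the requested slice need to be loaded by the
    operating system; the rest of the file stays on disk.
    """
    with open(path, "rb") as fp:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]
```

Whether this can be made transparent for small files, as suggested above, would still need to be measured.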

I was also re-reading the discussion about combining config items (e.g. from around here #1232 (comment)) and thinking about the previous request to make config non-global, and thinking perhaps we can incorporate that into the dcmread call. Very early thinking though. A more general config system could make it easier to turn memory mapping on/off with fine control.

  • If the Dataset does not get read from a file, but is instead retrieved over network, the whole Pixel Data element will already be available in memory.

I'm not clear on how the separate class is different in this respect.

  • Not every Dataset actually represents an image and I therefore find adding more image-specific methods to the class problematic.

I generally agree very much with avoiding image-specific methods, but we already have pixel_array which breaks this distinction, so to me adding a frames property is just an evolution which may be worth the complication.

I'll keep mulling all this though, happy to hear other thoughts...

@hackermd
Contributor Author

Not every Dataset actually represents an image and I therefore find adding more image-specific methods to the class problematic.

I generally agree very much with avoiding image-specific methods, but we already have pixel_array which breaks this distinction, so to me adding a frames property is just an evolution which may be worth the complication.

I would consider implementing a get_frames() method rather than a frames property. This will allow parameterization if this should become necessary (for example, to correct color of SM image frames or HU values of CT image frames).
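The difference between the two API shapes can be sketched with a toy class. Everything here is hypothetical (the class `FrameSource`, its parameters, and the use of plain lists of integers in place of decoded frames); it only shows why a method leaves room for parameters that a property does not:

```python
from typing import Callable, Iterator, List, Optional, Sequence

Frame = List[int]  # stand-in for a decoded frame


class FrameSource:
    """Hypothetical container of already-decoded frames."""

    def __init__(self, frames: Sequence[Frame]):
        self._frames = list(frames)

    @property
    def frames(self) -> List[Frame]:
        # A property takes no arguments: callers always get every frame as-is.
        return self._frames

    def get_frames(
        self,
        indices: Optional[Sequence[int]] = None,
        transform: Optional[Callable[[Frame], Frame]] = None,
    ) -> Iterator[Frame]:
        # A method can gain optional parameters later without breaking callers.
        selected = range(len(self._frames)) if indices is None else indices
        for i in selected:
            frame = self._frames[i]
            yield transform(frame) if transform else frame
```

For example, `source.get_frames(indices=[1], transform=lambda f: [v - 1024 for v in f])` would apply an offset to one frame only, loosely analogous to the CT rescaling case mentioned above.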

@hackermd
Contributor Author

Efficient frame-level access would require setting stop_before_pixels to True when using pydicom.filereader.dcmread() to read the data set from a file

This had crossed my mind, but I was thinking (without saying it 'out loud' of course) that we might want to move in this direction anyway with memory mapping, as noted in previous discussions like #1267. I think it might be interesting to explore whether memory mapping can be used on all files (or at least ones above a modest size) transparently, without impacting performance of reading smaller files. I think everyone dealing with large files would like that behavior, and if it was transparent in other cases, no one would complain.

Personally, I would distinguish between the representation of a DICOM Data Set in memory and a DICOM Part 10 file stored on disk. We should not assume that every pydicom.dataset.Dataset was read from a file on disk or that the file still exists after the data set was (partially) read. For example, the dicomweb-client retrieves data sets from a server via DICOMweb. I tend to argue that the pydicom.dataset.Dataset class should not deal with file I/O, but this should be left to separate file reader classes/functions.

However, independent of whether we'll further pursue the memory mapping, the ImageFileReader class would be nice to have in my opinion. The class hides all the implementation details for reading and decoding individual frames, but those could be factored out and reused for other purposes.

@darcymason
Member

would consider implementing a get_frames() method rather than a frames property. This will allow parameterization if this should become necessary

Good point.

I tend to argue that the pydicom.dataset.Dataset class should not deal with file I/O, but this should be left to separate file reader classes/functions.

However, independent of whether we'll further pursue the memory mapping, the ImageFileReader class would be nice to have in my opinion. The class hides all the implementation details for reading and decoding individual frames, but those could be factored out and reused for other purposes.

That seems reasonable - we could start here and perhaps incorporate into Dataset at some point in future, but with the logic kept in the separate class.

@hackermd
Contributor Author

@darcymason I removed the color management code from the ImageFileReader for now, since it depends on #1446, which is probably more WIP than MRG at this point (I updated the PR title to reflect that sentiment).

We will need Pillow for color management (at least in its current implementation). However, Pillow is an optional dependency and therefore requires (not so elegant) workarounds to keep Pillow symbols from being exposed globally. I would suggest working on getting the ImageFileReader into the library without color correction. We can revisit this later should #1446 get merged. However, the correction could (and maybe even should) be done separately outside of the ImageFileReader.

@hackermd
Contributor Author

hackermd commented Sep 9, 2021

@darcymason are there any outstanding TODOs preventing this PR from being merged and released?

@darcymason
Member

@darcymason are there any outstanding TODOs preventing this PR from being merged and released?

Not from me, but I was hoping @scaramallion could review and comment on how/whether this fits in with the various handlers concepts.

@hackermd
Contributor Author

hackermd commented Sep 24, 2021

@scaramallion what are your thoughts?

@hackermd
Contributor Author

hackermd commented Oct 5, 2021

@scaramallion I was wondering whether you had a chance to take a look. Please let me know if you have any questions or comments. It would be great if we could get clarity on whether we can include this functionality sometime soon.

@hackermd
Contributor Author

@darcymason @scaramallion this is currently blocking a feature I would like to add to the dicomweb-client library. The library already depends on pydicom and I don't want to introduce a new dependency on highdicom just for the low-level file I/O functionality of ImageFileReader, which to me conceptually belongs in pydicom. However, if you have concerns about adding it to pydicom, I will need to reconsider.

@darcymason
Member

I'll have another look at this today - my only concern is to have some thought about the principles of how this is worked in, to do our best not to have to rework the 'API' later to fit a more general idea of specific handlers.

@darcymason
Member

First, apologies that this PR has dragged on for a long time - it is harder when it comes to potentially 'structural'/API-like issues. I keep thinking that I'll go over it all again and try to think through the full implications/alternatives. And, sometimes hope that a kind of eureka moment will just happen.

Having said that, I had a good look again through this PR, and existing pydicom code, and had a good think about it. Then I read through the discussion above again and see that I came up with basically the same concerns as I had before.

Before I had kind of accepted the idea we could adapt downstream, but on further thought it is usually much easier to avoid adding something which is deprecated and removed later.

I do think we need some more discussion of a frames() method for Dataset. I still have concerns about adding a new object which is mostly a Dataset (through a metadata attribute), making a different path to getting at usual Dataset items. I think of someone looking through the docs to find a way to get at frames, and it really seems natural to me to be part of Dataset. It's very similar to pixel_array, just asking for a part of the pixel data instead of all of it at once.

The question of how/when items are loaded into memory is still there. On that I came back to my previous idea of a config passed to dcmread to allow some kind of mem-mapped/deferred reading.

If you like, I could put together an alternate branch (try to commit to doing so this week) which adapts the code to the frames() method idea. Perhaps it would be easier to see in practice what that might look like.

And I'm still interested in @scaramallion's opinion - this seems very related to the existing pydicom pixel_data_handlers and encoders submodules (and the encaps.py module, used in part in this PR) which already deal with frames and could perhaps be adapted to meet these needs.

@hackermd
Contributor Author

hackermd commented Oct 20, 2021

Thanks for your review and feedback @darcymason!

I do think we need some more discussion of a frames() method for Dataset.

Personally, I would not add more methods to the Dataset class that are specific to the Image information entity. The implementation of pixel_array is already messy enough (different pixel handlers that have different optional dependencies, etc.). It could be easily replaced with decode_pixel_data(dataset: pydicom.Dataset) -> numpy.ndarray, which could raise an exception if the data set does not represent an Image.

However, reading individual frames efficiently will require partial reading of the data set and building (and caching!) a Basic Offset Table. This is all functionality that the Dataset should not be concerned with in my opinion, because it would couple I/O operations, decoding, and data access. This is very problematic, because nothing in the standard (or the pydicom library) says that a Dataset should be read from a file. In fact, I tend to argue that a data set is generally retrieved over network.
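As a rough illustration of what "building an offset table" involves, the sketch below scans the items of an encapsulated Pixel Data value and records where each fragment starts, seeking past the fragment bodies instead of reading them. It is not the code from this PR: the function names are hypothetical, and it assumes little-endian encoding, a file object positioned at the Basic Offset Table item, and explicit item lengths. Mapping fragments to frames (for example, one fragment per frame) is a further assumption left to the caller.

```python
import io
import struct
from typing import BinaryIO, List, Tuple

ITEM_TAG = (0xFFFE, 0xE000)
SEQUENCE_DELIMITER_TAG = (0xFFFE, 0xE0DD)


def index_fragments(fp: BinaryIO) -> List[Tuple[int, int]]:
    """Return (offset, length) of each fragment without reading fragment data.

    `fp` must be positioned at the first item of the encapsulated Pixel Data
    value, i.e. at the Basic Offset Table item.
    """
    positions: List[Tuple[int, int]] = []
    is_first_item = True
    while True:
        header = fp.read(8)  # 2-byte group, 2-byte element, 4-byte length
        if len(header) < 8:
            raise EOFError("unexpected end of encapsulated pixel data")
        group, element, length = struct.unpack("<HHL", header)
        if (group, element) == SEQUENCE_DELIMITER_TAG:
            break
        if (group, element) != ITEM_TAG:
            raise ValueError(f"unexpected tag ({group:04X},{element:04X})")
        if not is_first_item:  # the first item is the Basic Offset Table itself
            positions.append((fp.tell(), length))
        is_first_item = False
        fp.seek(length, io.SEEK_CUR)  # skip the item body
    return positions


def read_fragment(fp: BinaryIO, positions: List[Tuple[int, int]], index: int) -> bytes:
    """Read one fragment using a previously built (cached) index."""
    offset, length = positions[index]
    fp.seek(offset)
    return fp.read(length)
```

Keeping the returned list around is the "caching" part: the scan happens once, and every later frame request is a single seek and read.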

I still have concerns about adding a new object which is mostly a Dataset (through a metadata attribute), making a different path to getting at usual Dataset items. I think of someone looking through the docs to find a way to get at frames, and it really seems natural to me to be part of Dataset. It's very similar to pixel_array, just asking for a part of the pixel data instead of all of it at once.

I don't consider the ImageFileReader class an alternative API to Dataset. It is a more sophisticated version of dcmread(), which has been specifically designed for reading (multi-frame) images from files (its name makes that very clear in my opinion). It won't work for any other type of information entity (document, etc.) and one also cannot construct it from an existing data set retrieved over network (e.g., DICOMweb services). The metadata attribute provides access to a Dataset instance, which matches what one would get via pydicom.dcmread(..., stop_before_pixels=True). For anyone with DICOMweb experience, the term "metadata" should be very intuitive.

@darcymason
Member

This is very problematic, because nothing in the standard (or the pydicom library) says that a Dataset should be read from a file

Actually, reading from a file is basically all pydicom ever claimed to do... it's the first line in our README:

pydicom is a pure Python package for working with DICOM files

And I should say that when I say Dataset I'm usually talking about pydicom's FileDataset class, which was carved off at one point to try to allow Dataset to be separated from the concept of files somewhat.

I do generally like the idea of decoupling IO from data. It's a nice principle, but sometimes practicality can trump principle. We did historically offer a 'deferred read' concept, which was not well-tested and was removed a while ago because of that, but could be brought back. And IMO the proposed class here doesn't really do it either, except for chunking off the pre-read data into metadata and then dealing with the rest. In terms of decoding, we already couple the decoding into (File)Dataset with the deferred decoding represented by the RawDataElement to DataElement conversion only when a data element value is accessed.
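The "convert only on access" pattern mentioned here can be illustrated with a toy mapping. This is not pydicom's implementation; the class `LazyValues` and its constructor arguments are hypothetical, and it only shows the general shape of deferring decoding until a value is first requested:

```python
from typing import Any, Callable, Dict


class LazyValues:
    """Hypothetical mapping that stores raw bytes and decodes on first access."""

    def __init__(
        self,
        raw: Dict[str, bytes],
        decoders: Dict[str, Callable[[bytes], Any]],
    ):
        self._raw = dict(raw)
        self._decoders = decoders
        self._decoded: Dict[str, Any] = {}

    def __getitem__(self, key: str) -> Any:
        if key not in self._decoded:
            # Conversion happens here, not at construction time, and the
            # decoded value replaces the raw bytes so it is done only once.
            self._decoded[key] = self._decoders[key](self._raw.pop(key))
        return self._decoded[key]
```

In this shape the container itself triggers decoding, which is the coupling between data access and decoding that the comment above refers to.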

I'm busy through the weekend, but I'll try to think this over some more and make a decision next week.

@hackermd
Contributor Author

I'm busy through the weekend, but I'll try to think this over some more and make a decision next week.

Thanks @darcymason. Much appreciated!

Actually, reading from a file is basically all pydicom ever claimed to do... it's the first line in our README:
And I should say that when I say Dataset I'm usually talking about pydicom's FileDataset class, which was carved off at one point to try to allow Dataset to be separated from the concept of files somewhat.

I strongly disagree with that notion and tend to argue that it goes fundamentally against the DICOM standard, which is after all primarily concerned with communication of data over network. If the library has the intention of limiting pydicom.Dataset to data sets that were read from files, it would significantly limit its usefulness in my opinion and I will have to stop using it.

The API of the library so far has also nicely separated the representation of the Data Set (pydicom.dataset.py) and I/O (pydicom.filereader.py and pydicom.filewriter.py). The pixel handlers partially violate that principle by relying on the presence of the file_meta.TransferSyntaxUID. My suggestion has therefore been to not just pass an entire pydicom.Dataset instance to the decoding functions (see #1243 (comment)), but rather the individual attributes (see highdicom.frame.decode_frame). The decoders of the pixel handlers should really not assume that the frame items come from a data set that was read from a file.

cc @dclunie

I do generally like the idea of decoupling IO from data. It's a nice principle, but sometimes practicality can trump principle. We did historically offer a 'deferred read' concept, which was not well-tested and was removed a while ago because of that, but could be brought back. And IMO the proposed class here doesn't really do it either, except for chunking off the pre-read data into metadata and then dealing with the rest.

Good point! The ImageFileReader class reads part of the data set from the file, but the constructed Dataset instance exposed by the metadata property doesn't need (and shouldn't have) a reference to the file. I made sure that the Dataset instance is fully de-coupled from the file by removing file_meta via 39e2c83.

@hackermd
Contributor Author

hackermd commented Mar 3, 2022

@darcymason @scaramallion is this something you still think would be useful to add to pydicom?

@mrbean-bremen
Member

mrbean-bremen commented Jul 2, 2022

is this something you still think would be useful to add to pydicom?

In my opinion, this would be very useful. I just stumbled over this PR because of a related SO question and I'm not sure in what state it is, but I had to access single frames of large multi-frame images in the past and find it an important feature (we had helped dcmtk to add a similar feature at the time).
@scaramallion - can we revive this, or have you been working on a similar feature? I dimly remember some related discussion, but could not find it...

@darcymason darcymason added this to the 2.4 milestone Nov 4, 2022
@darcymason darcymason modified the milestones: 2.4, v3.0 Jun 12, 2023

4 participants