[MRG] Add image file reader class #1447

hackermd · 2021-07-22T00:13:09Z

Describe the changes

Adds support for efficiently reading individual frames of a (multi-frame) image without loading the entire Pixel Data element into memory (see also #534, #1263, #1243).

Tasks

Unit tests added that reproduce the issue or prove feature is working
Fix or feature added
Code typed and mypy shows no errors
Documentation updated (if relevant)
- No warnings during build
- Preview link (CircleCI -> Artifacts -> doc/_build/html/index.html)
Unit tests passing and overall coverage the same or better

pep8speaks · 2021-07-22T00:13:12Z

Hello @hackermd! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-10-22 16:44:27 UTC

codecov · 2021-07-22T00:19:50Z

Codecov Report

Merging #1447 (c31faf9) into master (4693362) will decrease coverage by 0.97%.
The diff coverage is 55.73%.

@@            Coverage Diff             @@
##           master    #1447      +/-   ##
==========================================
- Coverage   97.42%   96.45%   -0.98%     
==========================================
  Files          66       66              
  Lines       10182    10425     +243     
==========================================
+ Hits         9920    10055     +135     
- Misses        262      370     +108

Impacted Files	Coverage Δ
pydicom/filereader.py	`79.43% <55.73%> (-14.92%)`	⬇️
pydicom/_dicom_dict.py	`100.00% <0.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

darcymason

Just a few thoughts from a first pass...

Re my comment on metadata, I wonder if there is a way to integrate this into Dataset - just off the cuff here, could we have something like a frames property as an alternate to pixel_array? Generally I don't like to complicate Dataset with new code, but this is quite a common need, and important for large file sizes.

darcymason · 2021-08-23T16:15:35Z

pydicom/filereader.py

+    Examples
+    --------
+    >>> from highdicom.io import ImageFileReader
+    >>> with ImageFileReader('/path/to/file.dcm') as image:


Reference to highdicom needs replacing

Addressed via 8b58178

darcymason · 2021-08-23T16:23:52Z

pydicom/filereader.py

+    """
+
+    def __init__(self, filename: str):
+        """


Can we also include a file object rather than just a filename? Inevitably someone needing to work with streams will ask for it, although some may be limited if there are negative seeks, as seen for example in #753.

I generally like the idea of allowing the user to alternatively provide an existing file object and have refactored the class accordingly via e1056bd.

I have decided to not set the is_little_endian and is_implicit_VR attributes on the file object, so that the object does not get modified. Instead an error is raised if their values are not correct.

This is one of the reasons I initially implemented the class such that the file object gets created under the hood and the attributes are set correctly.
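The trade-off described above can be illustrated with a minimal, standard-library-only sketch. The class name `FrameSource` and its `read_at` method are hypothetical and are not part of this PR or of pydicom. The sketch only shows the ownership pattern: a reader that opens the file itself is responsible for closing it, while a caller-supplied file object is neither modified nor closed and is rejected up front if it cannot seek.

```python
import io
import os
from typing import BinaryIO, Union


class FrameSource:
    """Hypothetical reader that accepts a path or an open binary file."""

    def __init__(self, source: Union[str, os.PathLike, BinaryIO]):
        if isinstance(source, (str, os.PathLike)):
            # We open the file ourselves, so we are responsible for closing it.
            self._fp = open(source, "rb")
            self._owns_fp = True
        else:
            # Do not modify the caller's object; raise if it is unsuitable.
            if not source.seekable():
                raise ValueError("file object must support seek()")
            self._fp = source
            self._owns_fp = False

    def __enter__(self) -> "FrameSource":
        return self

    def __exit__(self, *exc_info) -> None:
        # Only close what we opened; the caller keeps ownership otherwise.
        if self._owns_fp:
            self._fp.close()

    def read_at(self, offset: int, length: int) -> bytes:
        """Read `length` bytes starting at absolute position `offset`."""
        self._fp.seek(offset)
        return self._fp.read(length)


# Usage with a caller-supplied in-memory stream.
buffer = io.BytesIO(b"0123456789")
with FrameSource(buffer) as src:
    chunk = src.read_at(2, 3)
```

After the `with` block the caller's `buffer` is still open, which is the behaviour one would expect when the object was created elsewhere.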

@darcymason What are your thoughts on this?

darcymason · 2021-08-23T16:27:53Z

pydicom/filereader.py

+
+
+class ImageFileReader(object):
+


Subclassing from object only needed for Python 2, which we have dropped.

Addressed via 0b5b68c

darcymason · 2021-08-23T16:36:44Z

pydicom/filereader.py

+    >>> from highdicom.io import ImageFileReader
+    >>> with ImageFileReader('/path/to/file.dcm') as image:
+    ...     print(image.metadata)
+    ...     for i in range(image.number_of_frames):


I'm not a fan of the term metadata here, especially because it sounds too much like the File Meta Information defined for part 10 files in the Standard. But also because it is really a Dataset (from dcmread on open call) of all data elements except the pixel data; pydicom (and DICOM itself) don't give unique status to pixel data, unlike other file types with a 'header' or 'metadata'; in DICOM they are all just data elements.

Having said that, I'm not sure what to call it. In pydicom, using stop_before_pixels=True just gives a Dataset, simply with the pixel data element missing. I'll mull this over a little more...

In DICOMweb, the term metadata is used to refer to the subset of Data Elements in a Data Set that are not considered bulkdata (i.e., that have a size smaller than a defined threshold).

I generally use the term to indicate that a given pydicom.dataset.Dataset may not contain the Pixel Data element.

I have heard that term too, and have also been confused by it. I think it comes from image formats like TIFF, which have embedded metadata tags.
What I encounter more often is "DICOM header data" - not sure if that would be a better term.

DICOM header data

I think 'header' applies to other formats where the pixel data is not tagged, but whatever is left after some initial data is just raw pixel bytes. It assumes every file is basically pixel data, plus a few extra things.

In DICOM, many file types (SOP Classes) do not have pixel data, so the whole file would be 'header', which doesn't really make sense. Same goes for metadata in the sense of everything that is not pixels.

could we have something like a frames property as an alternate to pixel_array?

Any thoughts on the above general comment? If this could work as an add-on to Dataset, then the nomenclature issue could go away.

In DICOM, many file types (SOP Classes) do not have pixel data, so the whole file would be 'header', which doesn't really make sense.

Fair enough. I didn't really like the term too much, was just thinking of an alternative to metadata.

Any thoughts on the above general comment?

I think it makes sense. Frames are an inherent property of DICOM files, so it would fit IMHO. Also, I think that the ability to read single frames is an important feature that belongs in the core.

Regarding terminology, I think that anyone who uses DICOMweb will be familiar with the term "metadata" and the value returned by the metadata property will (more or less) correspond to what one would get via a DICOMweb /metadata call. Similarly, the value returned by read_frame() corresponds to what one would get via a DICOMweb /frames/{frameNumber} call.

hackermd · 2021-08-23T18:45:11Z

Re my comment on metadata, I wonder if there is a way to integrate this into Dataset - just off the cuff here, could we have something like a frames property as an alternate to pixel_array? Generally I don't like to complicate Dataset with new code, but this is quite a common need, and important for large file sizes.

I initially also thought about adding this functionality to the Dataset class. However, I have since come to the conclusion that this would not be desirable for the following reasons:

- Efficient frame-level access would require setting stop_before_pixels to True when using pydicom.filereader.dcmread() to read the data set from a file, which would tightly couple the two abstractions. Otherwise, the entire Pixel Data element would be fully read into memory anyway.
- If the Dataset does not get read from a file, but is instead retrieved over network, the whole Pixel Data element will already be available in memory. One could avoid decoding all frames and only decode an individual frame, but this could be a separate feature to add in the future.
- Not every Dataset actually represents an image and I therefore find adding more image-specific methods to the class problematic. We could discuss implementing an Image class derived from Dataset, which would get us very close to the approach we've chosen in highdicom, and constructing the subclass when reading a data set from a file. However, this would break backwards compatibility.

hackermd · 2021-08-23T18:59:42Z

@darcymason not sure whether you noticed the color correction feature (see here), which depends on #1446. This feature is potentially a bit tricky, because it requires Pillow. We could also leave that out for now.

darcymason · 2021-08-23T21:25:02Z

Efficient frame-level access would require setting stop_before_pixels to True when using pydicom.filereader.dcmread() to read the data set from a file

This had crossed my mind, but I was thinking (without saying it 'out loud' of course) that we might want to move in this direction anyway with memory mapping, as noted in previous discussions like #1267. I think it might be interesting to explore whether memmapping can be used on all files (or at least ones above a modest size) transparently, without impacting performance of reading smaller files. I think everyone dealing with large files would like that behavior, and if it was transparent in other cases, no one would complain.
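As a rough illustration of the memory-mapping idea, here is a standard-library sketch for the simplest case only: native (uncompressed) Pixel Data, where every frame has the same length and frame `i` starts at `i * frame_length` bytes into the element value. The function names and the `pixel_data_offset` parameter (the file position of the first pixel byte, assumed to be known already) are hypothetical; nothing here is pydicom API.

```python
import mmap
import os
import tempfile


def frame_length(rows: int, columns: int, samples_per_pixel: int,
                 bits_allocated: int) -> int:
    """Bytes per native frame, assuming bits_allocated is a multiple of 8."""
    return rows * columns * samples_per_pixel * bits_allocated // 8


def read_native_frame(path: str, pixel_data_offset: int,
                      frame_index: int, length: int) -> bytes:
    """Return one frame; only the pages backing the slice need to be read."""
    with open(path, "rb") as fp:
        with mmap.mmap(fp.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = pixel_data_offset + frame_index * length
            end = start + length
            if end > len(mm):
                raise IndexError("frame lies beyond the end of the file")
            return mm[start:end]


# Demo: a fake 16-byte "header" followed by three 2x2 8-bit frames.
length = frame_length(rows=2, columns=2, samples_per_pixel=1, bits_allocated=8)
tmp = tempfile.NamedTemporaryFile(delete=False)
try:
    tmp.write(b"\x00" * 16 + b"\x00" * length + b"\x01" * length + b"\x02" * length)
    tmp.close()
    frame = read_native_frame(tmp.name, pixel_data_offset=16,
                              frame_index=1, length=length)
finally:
    os.remove(tmp.name)
```

Encapsulated (compressed) transfer syntaxes do not have fixed-length frames, so this arithmetic does not apply to them.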

I was also re-reading the discussion about combining config items (e.g. from around here #1232 (comment)) and thinking about the previous request to make config non-global, and thinking perhaps we can incorporate that into the dcmread call. Very early thinking though. A more general config system could make it easier to turn memmapping on/off with fine control.

If the Dataset does not get read from a file, but is instead retrieved over network, the whole Pixel Data element will already be available in memory.

I'm not clear on how the separate class is different in this respect.

Not every Dataset actually represents an image and I therefore find adding more image-specific methods to the class problematic.

I generally agree very much with avoiding image-specific methods, but we already have pixel_array which breaks this distinction, so to me adding a frames property is just an evolution which may be worth the complication.

I'll keep mulling all this though, happy to hear other thoughts...

hackermd · 2021-08-23T22:25:14Z

Not every Dataset actually represents an image and I therefore find adding more image-specific methods to the class problematic.

I generally agree very much with avoiding image-specific methods, but we already have pixel_array which breaks this distinction, so to me adding a frames property is just an evolution which may be worth the complication.

I would consider implementing a get_frames() method rather than a frames property. This will allow parameterization if this should become necessary (for example, to correct color of SM image frames or HU values of CT image frames).

hackermd · 2021-08-23T22:39:17Z

Efficient frame-level access would require setting stop_before_pixels to True when using pydicom.filereader.dcmread() to read the data set from a file

This had crossed my mind, but I was thinking (without saying it 'out loud' of course) that we might want to move in this direction anyway with memory mapping, as noted in previous discussions like #1267. I think it might be interesting to explore whether memmapping can be used on all files (or at least ones above a modest size) transparently, without impacting performance of reading smaller files. I think everyone dealing with large files would like that behavior, and if it was transparent in other cases, no one would complain.

Personally, I would distinguish between the representation of a DICOM Data Set in memory and a DICOM Part 10 file stored on disk. We should not assume that every pydicom.dataset.Dataset was read from a file on disk or that the file still exists after the data set was (partially) read. For example, the dicomweb-client retrieves data sets from a server via DICOMweb. I tend to argue that the pydicom.dataset.Dataset class should not deal with file I/O, but this should be left to separate file reader classes/functions.

However, independent of whether we'll further pursue the memory mapping, the ImageFileReader class would be nice to have in my opinion. The class hides all the implementation details for reading and decoding individual frames, but those could be factored out and reused for other purposes.

darcymason · 2021-08-23T23:37:33Z

would consider implementing a get_frames() method rather than a frames property. This will allow parameterization if this should become necessary

Good point.

I tend to argue that the pydicom.dataset.Dataset class should not deal with file I/O, but this should be left to separate file reader classes/functions.

However, independent of whether we'll further pursue the memory mapping, the ImageFileReader class would be nice to have in my opinion. The class hides all the implementation details for reading and decoding individual frames, but those could be factored out and reused for other purposes.

That seems reasonable - we could start here and perhaps incorporate into Dataset at some point in future, but with the logic kept in the separate class.

hackermd · 2021-08-24T23:29:42Z

@darcymason I removed the color management code from the ImageFileReader for now, since it depends on #1446, which is probably more WIP than MRG at this point (I updated the PR title to reflect that sentiment).

We will need Pillow for color management (at least in its current implementation). However, Pillow is an optional dependency and therefore requires (not so elegant) workarounds to avoid Pillow symbols from being exposed globally. I would suggest working on getting the ImageFileReader into the library without color correction. We can revisit this later should #1446 get merged. However, the correction could (and maybe even should) be done separately outside of the ImageFileReader.

hackermd · 2021-09-09T12:45:11Z

@darcymason are there any outstanding TODOs preventing this PR from being merged and released?

darcymason · 2021-09-09T14:33:28Z

@darcymason are there any outstanding TODOs preventing this PR from being merged and released?

Not from me, but I was hoping @scaramallion could review and comment on how/whether this fits in with the various handlers concepts.

hackermd · 2021-09-24T10:02:18Z

@scaramallion what are your thoughts?

hackermd · 2021-10-05T09:36:41Z

@scaramallion I was wondering whether you had a chance to take a look. Please let me know if you have any questions or comments. It would be great if we could get clarity on whether we can include this functionality sometime soon.

hackermd · 2021-10-15T15:00:12Z

@darcymason @scaramallion this is currently blocking a feature I would like to add to the dicomweb-client library. The library already depends on pydicom and I don't want to introduce a new dependency on highdicom just for the low-level file I/O functionality of ImageFileReader, which to me conceptually belongs in pydicom. However, if you have concerns about adding it to pydicom, I will need to reconsider.

darcymason · 2021-10-18T14:10:19Z

I'll have another look at this today - my only concern is to have some thought about the principles of how this is worked in, to do our best not to have to rework the 'API' later to fit a more general idea of specific handlers.

darcymason · 2021-10-18T22:06:25Z

First, apologies that this PR has dragged on for a long time - it is harder when it comes to potentially 'structural'/API-like issues. I keep thinking that I'll go over it all again and try to think through the full implications/alternatives. And, sometimes hope that a kind of eureka moment will just happen.

Having said that, I had a good look again through this PR, and existing pydicom code, and had a good think about it. Then I read through the discussion above again and see that I came up with basically the same concerns as I had before.

Before I had kind of accepted the idea we could adapt downstream, but on further thought it is usually much easier to avoid adding something which is deprecated and removed later.

I do think we need some more discussion of a frames() method for Dataset. I still have concerns about adding a new object which is mostly a Dataset (through a metadata attribute), making a different path to getting at usual Dataset items. I think of someone looking through the docs to find a way to get at frames, and it really seems natural to me to be part of Dataset. It's very similar to pixel_array, just asking for a part of the pixel data instead of all of it at once.

The question of how/when items are loaded into memory is still there. On that I came back to my previous idea of a config passed to dcmread to allow some kind of mem-mapped/deferred reading.

If you like, I could put together an alternate branch (try to commit to doing so this week) which adapts the code to the frames() method idea. Perhaps it would be easier to see in practice what that might look like.

And I'm still interested in @scaramallion's opinion - this seems very related to the existing pydicom pixel_data_handlers and encoders submodules (and the encaps.py module, used in part in this PR) which already deal with frames and could perhaps be adapted to meet these needs.

hackermd · 2021-10-20T15:58:57Z

Thanks for your review and feedback @darcymason!

I do think we need some more discussion of a frames() method for Dataset.

Personally, I would not add more methods to the Dataset class that are specific to the Image information entity. The implementation of pixel_array is already messy enough (different pixel handlers that have different optional dependencies, etc.). It could be easily replaced with decode_pixel_data(dataset: pydicom.Dataset) -> numpy.ndarray, which could raise an exception if the data set does not represent an Image.

However, reading individual frames efficiently will require partial reading of the data set and building (and caching!) a Basic Offset Table. This is all functionality that the Dataset should not be concerned with in my opinion, because it would couple I/O operations, decoding, and data access. This is very problematic, because nothing in the standard (or the pydicom library) says that a Dataset should be read from a file. In fact, I tend to argue that a data set is generally retrieved over network.

I still have concerns about adding a new object which is mostly a Dataset (through a metadata attribute), making a different path to getting at usual Dataset items. I think of someone looking through the docs to find a way to get at frames, and it really seems natural to me to be part of Dataset. It's very similar to pixel_array, just asking for a part of the pixel data instead of all of it at once.

I don't consider the ImageFileReader class an alternative API to Dataset. It is a more sophisticated version of dcmread(), which has been specifically designed for reading (multi-frame) images from files (its name makes that very clear in my opinion). It won't work for any other type of information entity (document, etc.) and one can also not construct it from an existing data set retrieved over network (e.g., DICOMweb services). The metadata attribute provides access to a Dataset instance, which matches what one would get via pydicom.dcmread(..., stop_before_pixels=True). For anyone with DICOMweb experience, the term "metadata" should be very intuitive.

darcymason · 2021-10-22T13:43:32Z

This is very problematic, because nothing in the standard (or the pydicom library) says that a Dataset should be read from a file

Actually, reading from a file is basically all pydicom ever claimed to do... it's the first line in our README:

pydicom is a pure Python package for working with DICOM files

And I should say that when I say Dataset I'm usually talking about pydicom's FileDataset class, which was carved off at one point to try to allow Dataset to be separated from the concept of files somewhat.

I do generally like the idea of decoupling IO from data. It's a nice principle, but sometimes practicality can trump principle. We did historically offer a 'deferred read' concept, which was not well-tested and was removed a while ago because of that, but could be brought back. And IMO the proposed class here doesn't really do it either, except for chunking off the pre-read data into metadata and then dealing with the rest. As for decoding, we already couple the decoding into (File)Dataset with the deferred decoding represented by the RawDataElement to DataElement conversion only when a data element value is accessed.
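The deferred-conversion pattern mentioned above can be sketched in a few lines. This is a conceptual illustration only and not pydicom's actual implementation; the class `LazyValues` and its `decode_calls` counter are hypothetical names. The container keeps raw bytes, converts a value the first time it is accessed, and then reuses the converted result.

```python
from typing import Callable, Dict


class LazyValues:
    """Hypothetical container that decodes raw bytes on first access."""

    def __init__(self, raw: Dict[str, bytes],
                 decoders: Dict[str, Callable[[bytes], object]]):
        self._raw = dict(raw)
        self._decoders = decoders
        self._decoded: Dict[str, object] = {}
        self.decode_calls = 0  # only here to make the laziness observable

    def __getitem__(self, key: str) -> object:
        if key not in self._decoded:
            # Convert once, drop the raw bytes, and cache the result.
            self.decode_calls += 1
            self._decoded[key] = self._decoders[key](self._raw.pop(key))
        return self._decoded[key]


# Demo: a 16-bit little-endian value decoded only when it is first read.
values = LazyValues(
    raw={"Rows": b"\x00\x02"},
    decoders={"Rows": lambda data: int.from_bytes(data, "little")},
)
first = values["Rows"]
second = values["Rows"]  # served from the cache, no second decode
```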

I'm busy through the weekend, but I'll try to think this over some more and make a decision next week.

hackermd · 2021-10-22T16:45:20Z

I'm busy through the weekend, but I'll try to think this over some more and make a decision next week.

Thanks @darcymason. Much appreciated!

Actually, reading from a file is basically all pydicom ever claimed to do... it's the first line in our README:
And I should say that when I say Dataset I'm usually talking about pydicom's FileDataset class, which was carved off at one point to try to allow Dataset to be separated from the concept of files somewhat.

I strongly disagree with that notion and tend to argue that it goes fundamentally against the DICOM standard, which is after all primarily concerned with communication of data over network. If the library has the intention of limiting pydicom.Dataset to data sets that were read from files, it would significantly limit its usefulness in my opinion and I will have to stop using it.

The API of the library so far has also nicely separated the representation of the Data Set (pydicom.dataset.py) and I/O (pydicom.filereader.py and pydicom.filewriter.py). The pixel handlers partially violate that principle by relying on the presence of the file_meta.TransferSyntaxUID. My suggestion has therefore been to not just pass an entire pydicom.Dataset instance to the decoding functions (see #1243 (comment)), but rather the individual attributes (see highdicom.frame.decode_frame). The decoders of the pixel handlers should really not assume that the frame items come from a data set that was read from a file.

cc @dclunie

I do generally like the idea of decoupling IO from data. It's a nice principle, but sometimes practicality can trump principle. We did historically offer a 'deferred read' concept, which was not well-tested and was removed a while ago because of that, but could be brought back. And IMO the proposed class here doesn't really do it either, except for chunking off the pre-read data into metadata and then dealing with the rest.

Good point! The ImageFileReader class reads part of the data set from the file, but the constructed Dataset instance exposed by the metadata property doesn't need (and shouldn't have) a reference to the file. I made sure that the Dataset instance is fully de-coupled from the file by removing file_meta via 39e2c83.

hackermd · 2022-03-03T20:41:16Z

@darcymason @scaramallion is this something you still think would be useful to add to pydicom?

mrbean-bremen · 2022-07-02T17:45:39Z

is this something you still think would be useful to add to pydicom?

In my opinion, this would be very useful. I just stumbled over this PR because of a related SO question and I'm not sure in what state it is, but I had to access single frames of large multi-frame images in the past and find it an important feature (we had helped dcmtk to add a similar feature at the time).
@scaramallion - can we revive this, or have you been working on a similar feature? I dimly remember some related discussion, but could not find it...

Add image file reader class

50449ee

Fix PEP8 errors

27220d7

hackermd mentioned this pull request Jul 22, 2021

Refactor pixel handlers to provide interface for decoding individual frame items #1243

Closed

darcymason reviewed Aug 23, 2021

hackermd added 3 commits August 23, 2021 13:02

Don't subclass ImageFileReader from object

0b5b68c

Update example for using ImageFileReader

8b58178

Allow ImageFileReader construction with file object

e1056bd

Drop support of color management by ImageFileReader

f6292ba

hackermd mentioned this pull request Sep 7, 2021

How to read a single frame from the Multi frame Dicom file #1263

Closed

Merge branch 'master' into feature/frame-reading

f8618a1

hackermd added 2 commits October 22, 2021 12:42

Remove file_meta from metadata

39e2c83

Merge branch 'feature/frame-reading' of github:hackermd/pydicom into …

c31faf9

…feature/frame-reading

hackermd mentioned this pull request Feb 18, 2022

[MRG] Improvements for JPEG decoding #1570

Closed

4 tasks

darcymason added this to the 2.4 milestone Nov 4, 2022

kalebdfischer mentioned this pull request Nov 7, 2022

[MRG] Add framereader module #1725

Open

7 tasks

darcymason modified the milestones: 2.4, v3.0 Jun 12, 2023
