[MRG] Add framereader module #1725

kalebdfischer · 2022-11-07T21:16:16Z

Describe the changes

Adds support for reading efficiently individual frames of a (multi-frame) image without loading the entire Pixel Data element into memory (see also #534, #1263, #1243).

This work is based on @hackermd 's ImageFileReader MR, with the following noteworthy changes:

renamed ImageFileReader to FrameReader and moved to its own module so that rebasing is not a pain in the future
refactoring to allow for caching of metadata information, especially the basic offset table and to avoid parsing datasets just for FrameReader functionality
renamed private functions _get_bot, _read_bot, _build_bot to clarify that they are for encapsulated data
fixed basic offset table building for RLE files that lack a readable basic offset table (FrameReader can parse frames for test data files OBXXXX1A_rle_2frame.dcm and rtdose_rle.dcm)
added some logic to accommodate files like SC_rgb_rle.dcm which claim to be explicit, but are actually implicit.
wrote tests from ground up
added private methods for compressed/uncompressed reading to break up long modules under if encapsulated
added tolerance of files that are missing NumberOfFrames data element, but have an intact basic offset table
made log statements more informative
removed ambiguous usage of metadata
allow for non-path file-likes to be provided as input to FrameReader
allow for specification of read_partial params defer_size, force, specific_tags
added validation logic to confirm PixelData is present and dataset had attributes necessary to read frames

Tasks

Unit tests added that reproduce the issue or prove feature is working
Fix or feature added
Code typed and mypy shows no errors
Documentation updated (if relevant)
- No warnings during build
- Preview link (CircleCI -> Artifacts -> doc/_build/html/index.html)
Unit tests passing and overall coverage the same or better

codecov · 2022-11-07T21:38:47Z

Codecov Report

Merging #1725 (6d2c73a) into master (01ae3fc) will increase coverage by 0.09%.
The diff coverage is 99.73%.

❗ Current head 6d2c73a differs from pull request most recent head 39e58ac. Consider uploading reports for the commit 39e58ac to get more accurate results

@@            Coverage Diff             @@
##           master    #1725      +/-   ##
==========================================
+ Coverage   97.59%   97.68%   +0.09%     
==========================================
  Files          66       67       +1     
  Lines       10769    11125     +356     
==========================================
+ Hits        10510    10868     +358     
+ Misses        259      257       -2

Impacted Files	Coverage Δ
pydicom/framereader.py	`99.73% <99.73%> (ø)`
pydicom/config.py	`97.94% <0.00%> (-0.10%)`	⬇️
pydicom/valuerep.py	`99.41% <0.00%> (-0.02%)`	⬇️
pydicom/dataset.py	`99.06% <0.00%> (-0.01%)`	⬇️
pydicom/filereader.py	`94.67% <0.00%> (+0.25%)`	⬆️
pydicom/filebase.py	`99.19% <0.00%> (+1.61%)`	⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

kalebdfischer · 2022-11-08T17:51:07Z

Not sure why mypy is complaining about files that are not changed by this MR - I haven't been able to reproduce locally.

kalebdfischer · 2022-11-08T17:53:33Z

pydicom/framereader.py

+        fragment_offsets.append(frame_position - first_frame_location)
+        first_two_bytes = fp.read(2, True)
+        if not fp.is_little_endian:
+            first_two_bytes = first_two_bytes[::-1]


I could remove this since it appears that encapsulated should always be little endian, but leaving in case there's some other reason @hackermd included the if statement.

Yes, this is probably not needed, but maybe @hackermd can comment on this. Or generally review this PR, for that...

mrbean-bremen

I had only a cursory glance so far, will have a closer look probably in a couple of days. Generally, I think this is a very valuable addition, and it is good to continue the work @hackermd had started here. If he did not already do this, I suggest that he also reviews this.
This also needs an entry in the release notes, and ideally a description in the documentation related to reading pixel data.

mrbean-bremen · 2022-11-08T19:03:28Z

pydicom/framereader.py

+        fragment_offsets.append(frame_position - first_frame_location)
+        first_two_bytes = fp.read(2, True)
+        if not fp.is_little_endian:
+            first_two_bytes = first_two_bytes[::-1]


Yes, this is probably not needed, but maybe @hackermd can comment on this. Or generally review this PR, for that...

hackermd · 2022-11-08T19:52:51Z

Thanks for all your great work @kalebdfischer! I will take a closer look in the next couple of days.

darcymason

Thanks for putting this together. I've done a quick pass through, will try to look again in more detail later. Just some minor comments at this point.

darcymason · 2022-11-08T19:42:31Z

pydicom/framereader.py

@@ -0,0 +1,1132 @@
+# Copyright 2008-2022 pydicom authors. See LICENSE file for details.
+"""Utilities for parsing DICOM PixelData frames.


Nitpick: copyright here can just use 2022, helps to flag the file is newer.

updated here

darcymason · 2022-11-08T19:45:15Z

pydicom/framereader.py

+}
+_UINT_PIXEL_DATA_TAGS = {
+    0x7FE00010,
+}


Perhaps should be INT without the U since in general pixels can be signed or unsigned.

updated here

darcymason · 2022-11-08T19:47:49Z

pydicom/framereader.py

+    """
+    if fp.tell() != pixel_data_location:
+        fp.seek(pixel_data_location, 0)
+


Is the doc string about IOError correct? Seems it ignores and just moves the pointer itself.

Good catch - missed the discrepancy in the refactor. Updated to validate and raise here

darcymason · 2022-11-08T19:55:07Z

pydicom/framereader.py

+    ds_copy.file_meta = FileMetaDataset()
+    ds_copy.file_meta.TransferSyntaxUID = getattr(
+        original_dataset.file_meta, "TransferSyntaxUID"
+    )


Without a getattr default, this is same as original_dataset.file_meta.TransferSyntaxUID. Did you mean to default to None or something else? In which case can use original_dataset.file_meta.get("TransferSyntaxUID")

It was lazy code that counted on the AttributeError exceptions. I've updated the function to raise ValueError for missing required attributes.

darcymason · 2022-11-08T20:45:52Z

pydicom/framereader.py

+        """
+        logger.debug("read File Meta Information")
+        self.fp
+


This bare statement seems fragile. I assume it triggers the fp property to open the file, but would not on a second call if the self._fp had not been reset.

self.fp will raise if the file doesn't exist or another exception is raised when trying to open a BufferedReader which will trigger the exit logic. It's a fair consideration that there may be an issue that goes undetected if a bad file-like is provided instead of a path.
I've done a couple of things to improve behavior and readability:

added a try/except (with re-raise) to open with an extra logging statement to improve readability

called self.dicom_file_like instead since it actually requires valid frame_info and DicomFileLike instantiation

Let me know if there are other improvements that you'd like made to this.

darcymason · 2022-11-08T21:23:36Z

pydicom/tests/test_framereader.py

+    for f in get_testdata_files("*.dcm")
+    if not any([f.endswith(x) for x in cannot_read])
+]
+


Easier and more readable, I think, to make can_read a set, and then all_paths=(Path(f) for f in get_testdata_files("*.dcm")), and can_read_paths = [p for p in all_paths if p.name not in cannot_read].

Really pydicom should have a get_testdata_paths function since we've been trying to move more towards pathlib; something for a future PR if we can remember...

Changed here

kalebdfischer · 2022-11-10T00:45:08Z

I had only a cursory glance so far, will have a closer look probably in a couple of days. Generally, I think this is a very valuable addition, and it is good to continue the work @hackermd had started here. If he did not already do this, I suggest that he also reviews this. This also needs an entry in the release notes, and ideally a description in the documentation related to reading pixel data.

I added updates to the release notes, working_with_pixel_data, and added a framereader.rst file to reference. Happy to wait for @hackermd's review.

hackermd

I started to review this PR and added comments. However, I ultimately concluded that we should have additional design discussions.

The PR adds several classes (FrameDataset, FrameInfo, etc.) and methods (to_dict()) that do not directly map to DICOM information entities and are therefore problematic in my opinion. The rest of the library closely follows and directly implements the DICOM information model (e.g., DataElement, Dataset, etc.). The changes introduce via this PR divert from that approach.

Furthermore, I think the functionality mainly belongs into the filereader module. The main class is intended for reading frames from files (or file-like objects) and uses file-specific methods (read(), seek(), tell(), etc.). It is not intended to decode frames from a Dataset that has already been fully read into memory. In other words, it is very different from the Dataset.pixel_array property and similar to dcmread(). What differentiates it from dcmread() is that it provides access to individual frames and that it caches information to be reused (metadata and BOT) for repeated frame-level access. For improved cohension, I would thus add all the functionality related to file reading into the filereader module. Functionality related to frame decoding can remain in frame.

hackermd · 2022-11-10T14:59:14Z

pydicom/framereader.py

+    return basic_offset_table
+
+
+def get_dataset_copy_with_frame_attrs(


I would not expose that function publicly. To me, this should be an implementation detail. I even tend argue that the individual elements should be passed directly to decoders and missing elements should be handled explicitly.

hackermd · 2022-11-10T15:06:32Z

pydicom/framereader.py

+
+
+def decode_frame(
+        frame_bytes: bytes, original_dataset: Dataset


I suggest passing the individual data elements to the function rather than original_dataset. That is more verbose, but easier to understand.

Specifically, I suggest using the existing implementation in highdicom: https://github.com/herrmannlab/highdicom/blob/d36e2a50350fd15531f549f52dafd87587a80afe/src/highdicom/frame.py#L327-L451

If that gets implemented here in pydicom, we can just import it in highdicom.

Happy to refactor with the function from highdicom, but came across an issue - this line has mismatched typing. I'm not terribly comfortable just telling mypy to ignore this. Do you have any suggestions?

hackermd · 2022-11-10T15:08:44Z

pydicom/framereader.py

+    return ds_temp.pixel_array
+
+
+class BasicOffsetTable(List):


I would not make that a subclass of List and then add to_dict() and from_dict() methods. If you really want to implement it as a class, I suggest using a NamedTuple instead.

kalebdfischer · 2022-11-10T15:58:59Z

Thanks for all your great work @kalebdfischer! I will take a closer look in the next couple of days.

@hackermd thanks for the original MR! We've been maintaining a pydicom branch with the ImageFileReader for
Flywheel's OHIF backend service and it's been very stable across a wide surface area of files.

For those using cloud storage for files, download of files is one of the greatest sources of latency. This refactor is going to enable thriftier frame retrieval the frame information can be stored and one can specifically request the frame bytes from storage on subsequent reads without need to retrieve the whole file.

hackermd · 2022-11-10T16:05:33Z

Thanks for all your great work @kalebdfischer! I will take a closer look in the next couple of days.

@hackermd thanks for the original MR! We've been maintaining a pydicom branch with the ImageFileReader for Flywheel's OHIF backend service and it's been very stable across a wide surface area of files.

For those using cloud storage for files, download of files is one of the greatest sources of latency. This refactor is going to enable thriftier frame retrieval the frame information can be stored and one can specifically request the frame bytes from storage on subsequent reads without need to retrieve the whole file.

Have you considered using dicomweb_client? It fully abstract frame-level access and supports retrieving frames via DICOMweb (DICOMwebClient) and reading them from local files (DICOMfileClient).

hackermd · 2022-11-10T16:06:46Z

@kalebdfischer @darcymason please note that the ImageFileReader class is part of highdicom: https://github.com/herrmannlab/highdicom/blob/d36e2a50350fd15531f549f52dafd87587a80afe/src/highdicom/io.py#L211

I think it would be great to have that class implemented in pydicom so that we can just import it in highdicom. However, that would require the API to remain the same.

kalebdfischer · 2022-11-10T16:42:06Z

Thanks for all your great work @kalebdfischer! I will take a closer look in the next couple of days.

@hackermd thanks for the original MR! We've been maintaining a pydicom branch with the ImageFileReader for Flywheel's OHIF backend service and it's been very stable across a wide surface area of files.
For those using cloud storage for files, download of files is one of the greatest sources of latency. This refactor is going to enable thriftier frame retrieval the frame information can be stored and one can specifically request the frame bytes from storage on subsequent reads without need to retrieve the whole file.

Have you considered using dicomweb_client? It fully abstract frame-level access and supports retrieving frames via DICOMweb (DICOMwebClient) and reading them from local files (DICOMfileClient).

Thanks for the recommendation - our DIMSE service actually uses dicomweb_client. There are a number of reasons that pydicom is more ergonomic and practical for our OHIF backend service.

kalebdfischer · 2022-11-10T16:46:20Z

@kalebdfischer @darcymason please note that the ImageFileReader class is part of highdicom: https://github.com/herrmannlab/highdicom/blob/d36e2a50350fd15531f549f52dafd87587a80afe/src/highdicom/io.py#L211

I think it would be great to have that class implemented in pydicom so that we can just import it in highdicom. However, that would require the API to remain the same.

I'm happy to help with any refactoring needed by highdicom. I will take a deeper look this afternoon to ensure that this PR doesn't make the refactoring more of a pain than it needs to be.

darcymason · 2022-11-10T17:02:10Z

I think it would be great to have that class implemented in pydicom so that we can just import it in highdicom

It seems reasonable to me to make this as compatible as possible. If it's going to be here anyway, you might as well not have duplicate functionality in highdicom.

darcymason · 2022-11-10T17:04:11Z

The rest of the library closely follows and directly implements the DICOM information model (e.g., DataElement, Dataset, etc.).

Also a good point. Pydicom has always tried to use DICOM nomenclature wherever possible.

hackermd · 2022-11-10T20:37:25Z

pydicom/framereader.py

+        return frame_dataset
+
+
+class FrameInfo:


Also, be aware that there are datasets with hundreds of thousands of frames and we don't want this class to become a performance bottleneck.

I don't see any way that this class imposes a performance bottleneck that ImageFileReader does not.

hackermd · 2022-11-10T20:38:01Z

cc'ing @pieper @fedorov @CPBridge for awareness

hackermd · 2022-11-10T20:41:24Z

I think it would be great to have that class implemented in pydicom so that we can just import it in highdicom

It seems reasonable to me to make this as compatible as possible. If it's going to be here anyway, you might as well not have duplicate functionality in highdicom.

Yes, if the API and behavior remains the same, highdicom.io.ImageFileReader could just be an alias for the class in pydicom and we could deprecate the highdicom implementation.

kalebdfischer · 2022-11-14T17:18:29Z

I started to review this PR and added comments. However, I ultimately concluded that we should have additional design discussions.

Thank you for the in-depth review. I'll be working on revisions today.

The PR adds several classes (FrameDataset, FrameInfo, etc.) and methods (to_dict()) that do not directly map to DICOM information entities and are therefore problematic in my opinion. The rest of the library closely follows and directly implements the DICOM information model (e.g., DataElement, Dataset, etc.). The changes introduce via this PR divert from that approach.

Could you elaborate on what the issue is and what changes you'd like? It doesn't seem terribly reasonable to remove tooling that allows storage to and from a json-like object, but I'm happy to make improvements if there are specific suggestions.

Furthermore, I think the functionality mainly belongs into the filereader module. The main class is intended for reading frames from files (or file-like objects) and uses file-specific methods (read(), seek(), tell(), etc.). It is not intended to decode frames from a Dataset that has already been fully read into memory. In other words, it is very different from the Dataset.pixel_array property and similar to dcmread(). What differentiates it from dcmread() is that it provides access to individual frames and that it caches information to be reused (metadata and BOT) for repeated frame-level access. For improved cohension, I would thus add all the functionality related to file reading into the filereader module. Functionality related to frame decoding can remain in frame.

I'll defer to the maintainers on this one, but the benefits of a separate module are:

filereader module does not get really long and cluttered
rebasing and merging are easier to manage
having the FrameReader class and associated functions in its own module makes the library more readable overall

hackermd · 2022-12-13T15:08:22Z

Could you elaborate on what the issue is and what changes you'd like? It doesn't seem terribly reasonable to remove tooling that allows storage to and from a json-like object, but I'm happy to make improvements if there are specific suggestions.

The library implements the DICOM standard and uses the terms defined by the DICOM Model of the Real World and DICOM Information Model. For example, "Frame Info" is not a defined Information Entity (IE), Information Object Definition (IOD), Module, Macro, Attribute, etc.

hackermd · 2022-12-13T15:10:22Z

having the FrameReader class and associated functions in its own module makes the library more readable overall

The problem with the term FrameReader is that is doesn't make clear that frames are read from a file. Also the filereader module encapsulates all data structures and functions for reading data from files. Therefore, including the class in this module increases cohesion.

hackermd · 2022-12-13T15:13:13Z

@kalebdfischer I suggest implementing the API of the highdicom.io.ImageFileReader class. I am all for improvement of implementation details, but our goal is to avoid duplication of functionality between pydicom and highdicom. Once the class is implemented in pydicom, we can deprecate it in highdicom.

It would be great if changes were based on #1447 so that we can more easily track changes.

darcymason · 2023-05-23T21:59:18Z

@kalebdfischer, @hackermd, @mrbean-bremen. and perhaps @pieper @fedorov @CPBridge @scaramallion

I'm reviewing issues again for consideration in release 2.4. This discussion has been idle for a while, and I'm thinking this is too complex to resolve in short order, and I'd rather get the release out in the next few weeks. So I suggest re-assigning this to the following (major) release, 3.0. Perhaps also as part of the Roadmap development it will become clearer how pydicom should incorporate this kind of functionality.

Any strong thoughts on this?

kalebdfischer added 2 commits November 7, 2022 14:51

Add framereader module with tests

8689063

fix framereader tests to check have_numpy

ab573ee

kalebdfischer added 3 commits November 7, 2022 15:58

test fixes for framereader

19b2b72

fix mypy issues

0fee696

make flake8 happy

6640fec

kalebdfischer changed the title ~~[WIP] Add framereader module~~ [MRG] Add framereader module Nov 8, 2022

kalebdfischer commented Nov 8, 2022

View reviewed changes

mrbean-bremen reviewed Nov 8, 2022

View reviewed changes

darcymason reviewed Nov 8, 2022

View reviewed changes

kalebdfischer added 4 commits November 9, 2022 16:25

framereader MR revisions + documentation update

573bfd7

update FrameReader open method

14ac3b8

doc tweaks

3d57cd5

framereader doc fixes

6d2c73a

hackermd suggested changes Nov 10, 2022

View reviewed changes

hackermd reviewed Nov 10, 2022

View reviewed changes

kalebdfischer added 4 commits November 14, 2022 14:13

replace decode_frame with highdicom function

0011674

make get_dataset_copy_with_frame_attrs private + formatting

8a7eb9d

replace list subclass with dataclass

6c7e2eb

Merge remote-tracking branch 'upstream/master' into frame_reader_for_MR

39e58ac

darcymason mentioned this pull request Nov 21, 2022

Accessing and displaying large multi frame DICOM images #614

Closed

psavery mentioned this pull request Nov 3, 2023

Adding support for more transfer syntaxes imi-bigpicture/wsidicom#119

Closed

		@@ -0,0 +1,1132 @@
		# Copyright 2008-2022 pydicom authors. See LICENSE file for details.
		"""Utilities for parsing DICOM PixelData frames.

		return basic_offset_table


		def get_dataset_copy_with_frame_attrs(



		def decode_frame(
		frame_bytes: bytes, original_dataset: Dataset

[MRG] Add framereader module #1725

Are you sure you want to change the base?

[MRG] Add framereader module #1725

Conversation

kalebdfischer commented Nov 7, 2022 • edited

Describe the changes

Tasks

codecov bot commented Nov 7, 2022 • edited

Codecov Report

kalebdfischer commented Nov 8, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mrbean-bremen left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hackermd commented Nov 8, 2022

darcymason left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kalebdfischer commented Nov 10, 2022

hackermd left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kalebdfischer Nov 14, 2022 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kalebdfischer commented Nov 10, 2022 • edited

hackermd commented Nov 10, 2022

hackermd commented Nov 10, 2022 • edited

kalebdfischer commented Nov 10, 2022

kalebdfischer commented Nov 10, 2022

darcymason commented Nov 10, 2022

darcymason commented Nov 10, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hackermd commented Nov 10, 2022

hackermd commented Nov 10, 2022

kalebdfischer commented Nov 14, 2022

hackermd commented Dec 13, 2022

hackermd commented Dec 13, 2022

hackermd commented Dec 13, 2022 • edited

darcymason commented May 23, 2023

kalebdfischer commented Nov 7, 2022 •

edited

codecov bot commented Nov 7, 2022 •

edited

kalebdfischer Nov 14, 2022 •

edited

kalebdfischer commented Nov 10, 2022 •

edited

hackermd commented Nov 10, 2022 •

edited

hackermd commented Dec 13, 2022 •

edited