Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Add framereader module #1725

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

kalebdfischer
Copy link
Contributor

@kalebdfischer kalebdfischer commented Nov 7, 2022

Describe the changes

Adds support for reading efficiently individual frames of a (multi-frame) image without loading the entire Pixel Data element into memory (see also #534, #1263, #1243).

This work is based on @hackermd 's ImageFileReader MR, with the following noteworthy changes:

  • renamed ImageFileReader to FrameReader and moved to its own module so that rebasing is not a pain in the future
  • refactoring to allow for caching of metadata information, especially the basic offset table and to avoid parsing datasets just for FrameReader functionality
  • renamed private functions _get_bot, _read_bot, _build_bot to clarify that they are for encapsulated data
  • fixed basic offset table building for RLE files that lack a readable basic offset table (FrameReader can parse frames for test data files OBXXXX1A_rle_2frame.dcm and rtdose_rle.dcm)
  • added some logic to accommodate files like SC_rgb_rle.dcm which claim to be explicit, but are actually implicit.
  • wrote tests from ground up
  • added private methods for compressed/uncompressed reading to break up long modules under if encapsulated
  • added tolerance of files that are missing NumberOfFrames data element, but have an intact basic offset table
  • made log statements more informative
  • removed ambiguous usage of metadata
  • allow for non-path file-likes to be provided as input to FrameReader
  • allow for specification of read_partial params defer_size, force, specific_tags
  • added validation logic to confirm PixelData is present and dataset had attributes necessary to read frames

Tasks

  • Unit tests added that reproduce the issue or prove feature is working
  • Fix or feature added
  • Code typed and mypy shows no errors
  • Documentation updated (if relevant)
    • No warnings during build
    • Preview link (CircleCI -> Artifacts -> doc/_build/html/index.html)
  • Unit tests passing and overall coverage the same or better

@codecov
Copy link

codecov bot commented Nov 7, 2022

Codecov Report

Merging #1725 (6d2c73a) into master (01ae3fc) will increase coverage by 0.09%.
The diff coverage is 99.73%.

❗ Current head 6d2c73a differs from pull request most recent head 39e58ac. Consider uploading reports for the commit 39e58ac to get more accurate results

@@            Coverage Diff             @@
##           master    #1725      +/-   ##
==========================================
+ Coverage   97.59%   97.68%   +0.09%     
==========================================
  Files          66       67       +1     
  Lines       10769    11125     +356     
==========================================
+ Hits        10510    10868     +358     
+ Misses        259      257       -2     
Impacted Files Coverage Δ
pydicom/framereader.py 99.73% <99.73%> (ø)
pydicom/config.py 97.94% <0.00%> (-0.10%) ⬇️
pydicom/valuerep.py 99.41% <0.00%> (-0.02%) ⬇️
pydicom/dataset.py 99.06% <0.00%> (-0.01%) ⬇️
pydicom/filereader.py 94.67% <0.00%> (+0.25%) ⬆️
pydicom/filebase.py 99.19% <0.00%> (+1.61%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@kalebdfischer kalebdfischer changed the title [WIP] Add framereader module [MRG] Add framereader module Nov 8, 2022
@kalebdfischer
Copy link
Contributor Author

Not sure why mypy is complaining about files that are not changed by this MR - I haven't been able to reproduce locally.

fragment_offsets.append(frame_position - first_frame_location)
first_two_bytes = fp.read(2, True)
if not fp.is_little_endian:
first_two_bytes = first_two_bytes[::-1]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could remove this since it appears that encapsulated should always be little endian, but leaving in case there's some other reason @hackermd included the if statement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is probably not needed, but maybe @hackermd can comment on this. Or generally review this PR, for that...

Copy link
Member

@mrbean-bremen mrbean-bremen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had only a cursory glance so far, will have a closer look probably in a couple of days. Generally, I think this is a very valuable addition, and it is good to continue the work @hackermd had started here. If he did not already do this, I suggest that he also reviews this.
This also needs an entry in the release notes, and ideally a description in the documentation related to reading pixel data.

fragment_offsets.append(frame_position - first_frame_location)
first_two_bytes = fp.read(2, True)
if not fp.is_little_endian:
first_two_bytes = first_two_bytes[::-1]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is probably not needed, but maybe @hackermd can comment on this. Or generally review this PR, for that...

@hackermd
Copy link
Contributor

hackermd commented Nov 8, 2022

Thanks for all your great work @kalebdfischer! I will take a closer look in the next couple of days.

Copy link
Member

@darcymason darcymason left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this together. I've done a quick pass through, will try to look again in more detail later. Just some minor comments at this point.

@@ -0,0 +1,1132 @@
# Copyright 2008-2022 pydicom authors. See LICENSE file for details.
"""Utilities for parsing DICOM PixelData frames.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: copyright here can just use 2022, helps to flag the file is newer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated here

}
_UINT_PIXEL_DATA_TAGS = {
0x7FE00010,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps should be INT without the U since in general pixels can be signed or unsigned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated here

"""
if fp.tell() != pixel_data_location:
fp.seek(pixel_data_location, 0)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the doc string about IOError correct? Seems it ignores and just moves the pointer itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - missed the discrepancy in the refactor. Updated to validate and raise here

ds_copy.file_meta = FileMetaDataset()
ds_copy.file_meta.TransferSyntaxUID = getattr(
original_dataset.file_meta, "TransferSyntaxUID"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without a getattr default, this is same as original_dataset.file_meta.TransferSyntaxUID. Did you mean to default to None or something else? In which case can use original_dataset.file_meta.get("TransferSyntaxUID")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was lazy code that counted on the AttributeError exceptions. I've updated the function to raise ValueError for missing required attributes.

"""
logger.debug("read File Meta Information")
self.fp

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This bare statement seems fragile. I assume it triggers the fp property to open the file, but would not on a second call if the self._fp had not been reset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.fp will raise if the file doesn't exist or another exception is raised when trying to open a BufferedReader which will trigger the exit logic. It's a fair consideration that there may be an issue that goes undetected if a bad file-like is provided instead of a path.
I've done a couple of things to improve behavior and readability:

  • added a try/except (with re-raise) to open with an extra logging statement to improve readability
  • called self.dicom_file_like instead since it actually requires valid frame_info and DicomFileLike instantiation

Let me know if there are other improvements that you'd like made to this.

for f in get_testdata_files("*.dcm")
if not any([f.endswith(x) for x in cannot_read])
]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Easier and more readable, I think, to make can_read a set, and then all_paths=(Path(f) for f in get_testdata_files("*.dcm")), and can_read_paths = [p for p in all_paths if p.name not in cannot_read].

Really pydicom should have a get_testdata_paths function since we've been trying to move more towards pathlib; something for a future PR if we can remember...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed here

@kalebdfischer
Copy link
Contributor Author

I had only a cursory glance so far, will have a closer look probably in a couple of days. Generally, I think this is a very valuable addition, and it is good to continue the work @hackermd had started here. If he did not already do this, I suggest that he also reviews this. This also needs an entry in the release notes, and ideally a description in the documentation related to reading pixel data.

I added updates to the release notes, working_with_pixel_data, and added a framereader.rst file to reference. Happy to wait for @hackermd's review.

Copy link
Contributor

@hackermd hackermd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started to review this PR and added comments. However, I ultimately concluded that we should have additional design discussions.

The PR adds several classes (FrameDataset, FrameInfo, etc.) and methods (to_dict()) that do not directly map to DICOM information entities and are therefore problematic in my opinion. The rest of the library closely follows and directly implements the DICOM information model (e.g., DataElement, Dataset, etc.). The changes introduce via this PR divert from that approach.

Furthermore, I think the functionality mainly belongs into the filereader module. The main class is intended for reading frames from files (or file-like objects) and uses file-specific methods (read(), seek(), tell(), etc.). It is not intended to decode frames from a Dataset that has already been fully read into memory. In other words, it is very different from the Dataset.pixel_array property and similar to dcmread(). What differentiates it from dcmread() is that it provides access to individual frames and that it caches information to be reused (metadata and BOT) for repeated frame-level access. For improved cohension, I would thus add all the functionality related to file reading into the filereader module. Functionality related to frame decoding can remain in frame.

return basic_offset_table


def get_dataset_copy_with_frame_attrs(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not expose that function publicly. To me, this should be an implementation detail. I even tend argue that the individual elements should be passed directly to decoders and missing elements should be handled explicitly.



def decode_frame(
frame_bytes: bytes, original_dataset: Dataset
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest passing the individual data elements to the function rather than original_dataset. That is more verbose, but easier to understand.

Specifically, I suggest using the existing implementation in highdicom: https://github.com/herrmannlab/highdicom/blob/d36e2a50350fd15531f549f52dafd87587a80afe/src/highdicom/frame.py#L327-L451

If that gets implemented here in pydicom, we can just import it in highdicom.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to refactor with the function from highdicom, but came across an issue - this line has mismatched typing. I'm not terribly comfortable just telling mypy to ignore this. Do you have any suggestions?

return ds_temp.pixel_array


class BasicOffsetTable(List):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not make that a subclass of List and then add to_dict() and from_dict() methods. If you really want to implement it as a class, I suggest using a NamedTuple instead.

@kalebdfischer
Copy link
Contributor Author

kalebdfischer commented Nov 10, 2022

Thanks for all your great work @kalebdfischer! I will take a closer look in the next couple of days.

@hackermd thanks for the original MR! We've been maintaining a pydicom branch with the ImageFileReader for
Flywheel's OHIF backend service and it's been very stable across a wide surface area of files.

For those using cloud storage for files, download of files is one of the greatest sources of latency. This refactor is going to enable thriftier frame retrieval the frame information can be stored and one can specifically request the frame bytes from storage on subsequent reads without need to retrieve the whole file.

@hackermd
Copy link
Contributor

Thanks for all your great work @kalebdfischer! I will take a closer look in the next couple of days.

@hackermd thanks for the original MR! We've been maintaining a pydicom branch with the ImageFileReader for Flywheel's OHIF backend service and it's been very stable across a wide surface area of files.

For those using cloud storage for files, download of files is one of the greatest sources of latency. This refactor is going to enable thriftier frame retrieval the frame information can be stored and one can specifically request the frame bytes from storage on subsequent reads without need to retrieve the whole file.

Have you considered using dicomweb_client? It fully abstract frame-level access and supports retrieving frames via DICOMweb (DICOMwebClient) and reading them from local files (DICOMfileClient).

@hackermd
Copy link
Contributor

hackermd commented Nov 10, 2022

@kalebdfischer @darcymason please note that the ImageFileReader class is part of highdicom: https://github.com/herrmannlab/highdicom/blob/d36e2a50350fd15531f549f52dafd87587a80afe/src/highdicom/io.py#L211

I think it would be great to have that class implemented in pydicom so that we can just import it in highdicom. However, that would require the API to remain the same.

@kalebdfischer
Copy link
Contributor Author

Thanks for all your great work @kalebdfischer! I will take a closer look in the next couple of days.

@hackermd thanks for the original MR! We've been maintaining a pydicom branch with the ImageFileReader for Flywheel's OHIF backend service and it's been very stable across a wide surface area of files.
For those using cloud storage for files, download of files is one of the greatest sources of latency. This refactor is going to enable thriftier frame retrieval the frame information can be stored and one can specifically request the frame bytes from storage on subsequent reads without need to retrieve the whole file.

Have you considered using dicomweb_client? It fully abstract frame-level access and supports retrieving frames via DICOMweb (DICOMwebClient) and reading them from local files (DICOMfileClient).

Thanks for the recommendation - our DIMSE service actually uses dicomweb_client. There are a number of reasons that pydicom is more ergonomic and practical for our OHIF backend service.

@kalebdfischer
Copy link
Contributor Author

@kalebdfischer @darcymason please note that the ImageFileReader class is part of highdicom: https://github.com/herrmannlab/highdicom/blob/d36e2a50350fd15531f549f52dafd87587a80afe/src/highdicom/io.py#L211

I think it would be great to have that class implemented in pydicom so that we can just import it in highdicom. However, that would require the API to remain the same.

I'm happy to help with any refactoring needed by highdicom. I will take a deeper look this afternoon to ensure that this PR doesn't make the refactoring more of a pain than it needs to be.

@darcymason
Copy link
Member

I think it would be great to have that class implemented in pydicom so that we can just import it in highdicom

It seems reasonable to me to make this as compatible as possible. If it's going to be here anyway, you might as well not have duplicate functionality in highdicom.

@darcymason
Copy link
Member

The rest of the library closely follows and directly implements the DICOM information model (e.g., DataElement, Dataset, etc.).

Also a good point. Pydicom has always tried to use DICOM nomenclature wherever possible.

return frame_dataset


class FrameInfo:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, be aware that there are datasets with hundreds of thousands of frames and we don't want this class to become a performance bottleneck.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any way that this class imposes a performance bottleneck that ImageFileReader does not.

@hackermd
Copy link
Contributor

cc'ing @pieper @fedorov @CPBridge for awareness

@hackermd
Copy link
Contributor

I think it would be great to have that class implemented in pydicom so that we can just import it in highdicom

It seems reasonable to me to make this as compatible as possible. If it's going to be here anyway, you might as well not have duplicate functionality in highdicom.

Yes, if the API and behavior remains the same, highdicom.io.ImageFileReader could just be an alias for the class in pydicom and we could deprecate the highdicom implementation.

@kalebdfischer
Copy link
Contributor Author

I started to review this PR and added comments. However, I ultimately concluded that we should have additional design discussions.

Thank you for the in-depth review. I'll be working on revisions today.

The PR adds several classes (FrameDataset, FrameInfo, etc.) and methods (to_dict()) that do not directly map to DICOM information entities and are therefore problematic in my opinion. The rest of the library closely follows and directly implements the DICOM information model (e.g., DataElement, Dataset, etc.). The changes introduce via this PR divert from that approach.

Could you elaborate on what the issue is and what changes you'd like? It doesn't seem terribly reasonable to remove tooling that allows storage to and from a json-like object, but I'm happy to make improvements if there are specific suggestions.

Furthermore, I think the functionality mainly belongs into the filereader module. The main class is intended for reading frames from files (or file-like objects) and uses file-specific methods (read(), seek(), tell(), etc.). It is not intended to decode frames from a Dataset that has already been fully read into memory. In other words, it is very different from the Dataset.pixel_array property and similar to dcmread(). What differentiates it from dcmread() is that it provides access to individual frames and that it caches information to be reused (metadata and BOT) for repeated frame-level access. For improved cohension, I would thus add all the functionality related to file reading into the filereader module. Functionality related to frame decoding can remain in frame.

I'll defer to the maintainers on this one, but the benefits of a separate module are:

  • filereader module does not get really long and cluttered
  • rebasing and merging are easier to manage
  • having the FrameReader class and associated functions in its own module makes the library more readable overall

@hackermd
Copy link
Contributor

Could you elaborate on what the issue is and what changes you'd like? It doesn't seem terribly reasonable to remove tooling that allows storage to and from a json-like object, but I'm happy to make improvements if there are specific suggestions.

The library implements the DICOM standard and uses the terms defined by the DICOM Model of the Real World and DICOM Information Model. For example, "Frame Info" is not a defined Information Entity (IE), Information Object Definition (IOD), Module, Macro, Attribute, etc.

@hackermd
Copy link
Contributor

having the FrameReader class and associated functions in its own module makes the library more readable overall

The problem with the term FrameReader is that is doesn't make clear that frames are read from a file. Also the filereader module encapsulates all data structures and functions for reading data from files. Therefore, including the class in this module increases cohesion.

@hackermd
Copy link
Contributor

hackermd commented Dec 13, 2022

@kalebdfischer I suggest implementing the API of the highdicom.io.ImageFileReader class. I am all for improvement of implementation details, but our goal is to avoid duplication of functionality between pydicom and highdicom. Once the class is implemented in pydicom, we can deprecate it in highdicom.

It would be great if changes were based on #1447 so that we can more easily track changes.

@darcymason
Copy link
Member

@kalebdfischer, @hackermd, @mrbean-bremen. and perhaps @pieper @fedorov @CPBridge @scaramallion

I'm reviewing issues again for consideration in release 2.4. This discussion has been idle for a while, and I'm thinking this is too complex to resolve in short order, and I'd rather get the release out in the next few weeks. So I suggest re-assigning this to the following (major) release, 3.0. Perhaps also as part of the Roadmap development it will become clearer how pydicom should incorporate this kind of functionality.

Any strong thoughts on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants