-
-
Notifications
You must be signed in to change notification settings - Fork 471
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Support buffers for OB DataElement Values #1919
base: main
Are you sure you want to change the base?
Conversation
* Allows assinging the DataElement's value to a BufferedIOBase * Adds a new method value_generator that yields chunks from the buffer
Codecov Report
@@ Coverage Diff @@
## main #1919 +/- ##
==========================================
+ Coverage 97.74% 97.75% +0.01%
==========================================
Files 63 64 +1
Lines 10739 10802 +63
==========================================
+ Hits 10497 10560 +63
Misses 242 242
📣 Codecov offers a browser extension for seamless coverage viewing on GitHub. Try it in Chrome or Firefox today! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very quick initial review ... I've posted a few comments but will need to absorb this for a while longer for the bigger structural issues.
I'm trying to think through how this would be used in practice. For example, here the full value is read in if value
is accessed, but that seems to defeat the purpose. However, if in practice people just substitute in a generator value for a 'dummy' small value just before writing then maybe that can be avoided.
I thought briefly about suggesting a subclass DataElementBuffered
rather than updating a bunch of code within DataElement
, but after a little thought I'm thinking one class is probably better.
I'll try to go over a little more in the next couple of days.
src/pydicom/dataelem.py
Outdated
"""Input parameter used to store how many bytes have been read from | ||
a buffered value""" | ||
bytes_read: int = 0 | ||
|
||
|
||
def __init__( | ||
self, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On BufferInfo class, why not just store bytes_read
directly in DataElement
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking of the value_generator
function having a return type of the # of bytes read so thought an output parameter made sense here. I think its most likely unnecessary.
src/pydicom/dataelem.py
Outdated
@@ -160,6 +162,20 @@ class DataElement: | |||
maxBytesToDisplay = 16 | |||
showVR = True | |||
is_raw = False | |||
is_buffered = False | |||
""" | |||
Tracks whether the data element value is a buffer. If so, calling self.value will read the entire |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are class attributes meant to apply to all instances of the class (is_raw
is used for all instances, because RawDataElement
sets its own is_raw
).
In any case, I think it better to make it a read-only property is_buffered
, which checks the type of _value
. Then they can never get out of synch. E.g. see my comment on changing the value
property setter.
src/pydicom/dataelem.py
Outdated
# handle a buffered value - ie, a buffer pointing to a file on disk | ||
if isinstance(value, BufferedIOBase): | ||
if VR not in VR_.OB: | ||
raise ValueError(f"Invalid VR: {self.VR}. Only the OB VR supports buffered values.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should just be !=
here instead of not in
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated this with a tuple but good catch!
src/pydicom/dataelem.py
Outdated
@@ -449,6 +482,48 @@ def value(self, val: Any) -> None: | |||
|
|||
self._value = self._convert_value(val) | |||
|
|||
|
|||
def value_generator(self, *, buffer_info: BufferInfo = None, chunk_size = 1024) -> Iterator[bytes]: | |||
"""Consume data from a buffered value. The value must be a buffer, the buffer must be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think probably at least 8K chunk default is better. We had this come up with pydicom's undefined length reads before (#436).
But I'm also now wondering whether this generator is needed. Any code using buffered reading can just directly use value
after checking it is a buffered reader. I need to ponder a little more ...
src/pydicom/dataelem.py
Outdated
if self.is_buffered: | ||
self._value = cast(BufferedIOBase, self._value).read() | ||
self.is_buffered = False | ||
|
||
return self._value | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to update the value
setter code as well - value
could be set to a buffer after initial creation. The logic to check that on __init__
probably belongs in the value
setter, actually, where it can handle both cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good thinking - will update.
Thanks for the initial feedback! I had some thinking about subclassing it as well but felt like it would end up leading to more complicated code. The reason I went with the value getter reading the entire buffer is because I wasn't sure if it made sense to make all code in the codebase buffer aware or maybe to allow more of a gradual upgrade for support. For our particular use case since we only need it to write the value to the file, it touches a pretty small amount of code. But what should be the expected behavior if you assign PixelData to a buffer and then try to compress/decompress it or access pixel_array for instance? |
* use @Property for dynamic is_buffered * use value setter to handle buffered value * remove BufferInfo class
* remove default buffer read behavior
After thinking about it a bit more, I think you are right on not reading the full buffer by default as that'll be hard to code against and inevitably may lead to reading in the whole thing by accident. From your perspective as a maintainer, is it okay if certain operations raise an exception if the code isn't setup to deal with a buffered value? |
# memory usage is in bytes | ||
limit = MEGABYTE * 400 | ||
else: | ||
pytest.skip("This test is not setup to run on this platform") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO: I need to find what the other platform's unit is for this test. I'm also open to other implementation's of this test but it was the only way I could think of to ensure we didn't read the file into memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still needs to be done?
* formatting * remove invalid test
src/pydicom/filewriter.py
Outdated
curr_pos = fp.tell() | ||
fp.seek(vr_length_seek_pos) | ||
fp.write_UL(elem.bytes_read_from_buffer) | ||
fp.seek(curr_pos) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels pretty hacky - I'd love to get some thoughts on how to handle this. I'm stuck on how to handle writing the length of data before we know the length of data without having to backtrack without calling stat
on the original file the buffer is from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow the logic here, but I see no reason why you have to first write the buffer before writing the length, as you can get the length from the buffer. Also note that we don't want to seek backwards at all as this breaks certain use cases where fp is a stream or a zip file I think (I don't remember what use case it was exactly).
Edit: Ok, the problem was with writing gzip file-like (see #753). There was also an old issue about writing to a stream (#154).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow the logic here, but I see no reason why you have to first write the buffer before writing the length, as you can get the length from the buffer
The problem is that I'm not writing the data to the buffer but instead to the file directly. The goal is to avoid reading the entire buffered element into memory and instead to transfer the data directly from the element (which contains a buffered reader) to the dicom file.
So this means that we don't know the length until we read all of the data (and write it to the file).
For a little context, the goal is to avoid reading large uncompressed files into memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that I'm not writing the data to the buffer but instead to the file directly.
A right, I should have seen that - thanks!
That's quite the conundrum... any thoughts on how to tackle it then without reading the entirety of PixelData into a memory buffer?
Right now I have no idea. May have to think about it... How easy would it be to configure that no buffering is used? That would still allow the use case where seeking back is not possible (in case we find no other solution).
@darcymason @mrbean-bremen I was thinking about the seeking issue and I think I have a solution. Instead of supporting any IO buffer, I think it would make sense to limit it to a FileIO buffer. This would allow us to call If someone would like to read pixel data from a non-seekable stream like a socket, then they can write the file to a temporary file first. It'd also make the code a little easier to manage as we wouldn't need to monkey around with keeping track of the first 4 bytes to check for encapsulation. I'll make some changes to see how that would look on Monday or Tuesday but let me know what y'all think. |
I was also thinking of saying we just start with a seek-able file handle anyway, it is still a big step ahead from where we are now. I'm not sure about using |
Completely agree, thanks! |
* move buffer read code to a utils file * create util to reset buffer position * glean buffered value length from using buffer.tell and seeking
* fix existing tests
src/pydicom/dataelem.py
Outdated
# a known tag shall only have the VR 'UN' if it has a length that | ||
# exceeds the size that can be encoded in 16 bit - all other cases | ||
# can be seen as an encoding error and can be corrected | ||
if ( | ||
VR == VR_.UN | ||
and not tag.is_private | ||
and config.replace_un_with_known_vr | ||
# we don't know the length and such can't determine if the length exceeds | ||
# 0xFFFF | ||
and not is_buffered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question: should I call buffer_length to get the length here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since UN is not included in the 'bufferable' list, at this point I don't think this check needs to be here at all, since it won't make it past the first term ... well, at least if the check of allowed VRs is moved up to before this line
Maybe I could be persuaded, but I'd like to stay with the philosophy that 2.x is in major-bug-fix-only mode, we have enough on our plates dealing with PRs and issues as it is. Would you not be able to patch 2.X locally and set up pip installs to your own local code-base? (e.g. pip install can be pointed to a file location). |
That's very reasonable - I don't blame you. I don't think it'd be an issue for us to use an internal patched version as long as we know 3.0.0 will contain the same functionality. We were just avoiding using a custom fork and maintaining it. Have you had a chance to look at the latest code changes? I think its in a pretty good state now. Also thinking about documentation - do you want me to write any docs for this feature? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it is getting close. I've got a few small comments.
And yes, we should have documentation -- release note for sure, and something for the 'examples' folder would be ideal -- an example of how to use with a mock large data source. I'd like to have that to 'exercise' the code myself before merging in, to fully understand it and play around with different scenarios.
src/pydicom/dataelem.py
Outdated
# a known tag shall only have the VR 'UN' if it has a length that | ||
# exceeds the size that can be encoded in 16 bit - all other cases | ||
# can be seen as an encoding error and can be corrected | ||
if ( | ||
VR == VR_.UN | ||
and not tag.is_private | ||
and config.replace_un_with_known_vr | ||
# we don't know the length and such can't determine if the length exceeds | ||
# 0xFFFF | ||
and not is_buffered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since UN is not included in the 'bufferable' list, at this point I don't think this check needs to be here at all, since it won't make it past the first term ... well, at least if the check of allowed VRs is moved up to before this line
src/pydicom/dataelem.py
Outdated
@@ -702,6 +740,9 @@ def keyword(self) -> str: | |||
|
|||
def __repr__(self) -> str: | |||
"""Return the representation of the element.""" | |||
if self.is_buffered: | |||
return repr(self.value) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to have a test to cover this repr
line as suggested by codecov.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed it in favor of deferring to str(self)
below
* remove redundant code in __repr__ * remove redundant check in __init__
Working on this next - I'm going to see how to best replicate our use case in a generic example. |
Here is the sample file: I see the pydicom-data project but wasn't sure if this was what you were looking for. Figured I'd put up a minimal example first and then see where you wanted me to go from there. |
examples/write_video_uncompressed.py
Outdated
# set PixelData to the buffer - writing the file will copy the pixeldata from the file to the dicom file | ||
# without reading the pixeldata into memory | ||
dataset.PixelData = cine_pixeldata | ||
dataset.save_as("ds.dcm", write_like_original=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With transfer syntax set can use write_like_original = True
which is much better for conformance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can dig more into it, but when I set this to True
, I end up with an image that is all gray. Any intuition as to why that is?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see in the program I'm using to view it (Horos), I get "unknown UID" - it seems that I'm not setting something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You probably need SOP Class UID in the dataset itself, but it depends on the viewer
* clean up example to be more conformant
@@ -9,6 +9,8 @@ | |||
|
|||
import base64 | |||
import copy | |||
from dataclasses import dataclass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused import?
raise ValueError( | ||
f"Invalid VR: {self.VR}. Only the following VRs support buffers: {BUFFERABLE_VRS}." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following is a bit more clear
f"Elements with a VR of '{self.VR}' cannot be used with buffered values, supported VRs are: {BUFFERABLE_VRS}"
Utilities to help with reading/writing from python's BufferedIOBase. | ||
""" | ||
from contextlib import contextmanager | ||
from collections.abc import Generator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from collections.abc import Generator, Iterator
and remove import from below
@@ -1219,3 +1222,35 @@ def test_invalid_very_long_value_length(self, vr, value): | |||
DataElement(0x00410001, vr, value, validation_mode=config.WARN) | |||
with pytest.raises(ValueError, match=msg): | |||
DataElement(0x00410001, vr, value, validation_mode=config.RAISE) | |||
|
|||
|
|||
class TestBufferedDataElement: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add tests for pickling/unpickling buffered elements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also like to know what happens if you use a buffer and the source is then moved/deleted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add tests for pickling/unpickling buffered elements?
Is this an important use-case? I think we'd have the read the buffer into memory for that to work since a BufferedReader isn't pickle-able.
I'd also like to know what happens if you use a buffer and the source is then moved/deleted
If its before we read the buffer, then the assertion in "buffer assertions" should raise an error on a closed stream. Otherwise, I think it'd be undefined behavior. I can write tests for these specifically if you'd like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think its more that I'd like to see if there's anywhere to put some nice exception messages about why it failed. Otherwise don't worry about it
* support setting pixel data to a reference to a BufferedReader * change version to 2.4.3+buffer-support * See pydicom#1919
@gnottobfly, I'm just trying to catch up on this issue and see if we can get over the finish line... what remains? Perhaps the one test with a branch for Windows, a home for the example file (I'm okay with pydicom-data if that works), ...? |
@darcymason I'm so sorry that I never responded and this fell off the radar! I was actually just thinking of this again and was thinking maybe a more elegant solution would be to have a utility function similar to Otherwise, I'm happy to come back and fix this up to make it mergeable. Let me know what you think and sorry again for dropping off for a while. |
@gnottobfly, have you seen the work @scaramallion has been doing over the last few months? There have been some big changes in handling buffers and reworking the structure of pixel handlers etc. I suspect it should be easier to tie into that new work rather than try to fix this merge. |
I haven't but I will take a look! Thanks for the heads up. |
Support buffers for OB DataElement Values
Putting up an early PR to get some early feedback before investing too much time. Early comments on the approach or API welcomed!
This allows setting a DataElement's value to a buffer. The motivation behind this is to write a dataset to disk without copying all of a large image/video into memory to do so. See #1913 for the discussion that led to this. For now I'm only targeting byte values as those are the ones that are expected to be large.
Describe the changes
Tasks
doc/_build/html/index.html
)