Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Customization of DataElement_from_raw #1556

Open
naterichman opened this issue Dec 29, 2021 · 7 comments
Open

Customization of DataElement_from_raw #1556

naterichman opened this issue Dec 29, 2021 · 7 comments
Milestone

Comments

@naterichman
Copy link

naterichman commented Dec 29, 2021

Is your feature request related to a problem? Please describe.

I've had a number of interactions here recently relating to DataElement_from_raw and I think my takeaway is that the design could be revisited.

For the last year or so, I (and my colleagues) have been trying to extend pydicom to be able to read any dicom and correct values that can be read but are incorrect, no matter how invalid it is. In order to do this we've been stringing callbacks together that:

  • Replace float values in IS VRs with Integers
  • Remove invalid \ characters in VM == 1 strings
  • Certain specific rules on specific tags that are often wrong (for example, MagneticFieldStrength is often written in mT where the dicom standard says it should be in Tesla)
  • Track and report all the changes that were made to a given RawDataElement to make it valid

In order to do these things, we've needed to patch the DataElement_from_raw function and it's made me realize that there is a fair amount of hard-coded logic in that function that could be customized to allow a better API for fixing RawDataElements

Describe the solution you'd like

Right now, the DataElement_from_raw only exposes one callback, with a fair amount of hard-coded logic after it. Ideally DataElement_from_raw would have three stages exposed entirely as callbacks with sensible defaults:

  1. pre-processing of values callback (i.e. before converting values). Here one could:
    a. Infer VR's from either None or UN by lookup in public or private dictionaries
    b. Handle any number of known issues on certain tags (e.g. Magnetic Field Strength above)
  2. Convert Values callback. This would be a function that recursively calls convert_values and handles errors until no errors are raised. Here one could:
    a. Handle known specific exceptions: BytesLengthException, ValueError, etc.
    b. Handle unknown exceptions by either failing or replacing the VR with something that convert anything, such as 'OB'
  3. Post-processing callback to handle any known post-processing steps.

And possibly even a 4th stage to handle exceptions from initializing the DataElement. This design would give the use a much more fine grained control over the conversion of RawDataElements into DataElements

This is just one possible design, another idea might be that config.data_element_callback could actually just be a list of functions that can be executed in order instead of having to chain callbacks.

@darcymason
Copy link
Member

I'm not against improving this, but I'd like to clarify that the callback is to deal with problems in decoding the data elements from raw to native Python types. I'm not clear that all your examples match that situation, e.g. the mT vs T - so long as it can be read, the value can be divided out in regular code. Is that a case where the VR is actually wrong?

  • Remove invalid '' characters in VM == 1 strings

Was this meant to show a backslash character?

This is just one possible design, another idea might be that config.data_element_callback could actually just be a list of functions that can be executed in order instead of having to chain callbacks.

I think either solution could be possible, but this list of callbacks idea seems a relatively simple evolution of the existing code, and without adding additional time penalty (small, but always there for the 99.9+% of time when these interventions are not needed) for multiple (if callbackX is not None: ...) checks at different points.

@darcymason
Copy link
Member

Another thought: there is also the Dataset.get_item method, which if used before accessing a particular data element, gives the RawDataElement and thus user code could perform changes to the encoded data in a way completely unlinked to DateElement_from_raw which subsequently wouldn't be called for values already converted to DataElement. I think there are situations (undefined length sequences) where the conversion might be triggered before this, but it should be available for most situations.

We could also look at breaking out much of the code in DataElement_from_raw into subroutines, which would make those parts available to alternative processing code to call.

@naterichman
Copy link
Author

naterichman commented Dec 30, 2021

I'd like to clarify that the callback is to deal with problems in decoding the data elements from raw to native Python types

Yep that makes sense to me, the mT -> T a use case that's pretty unique to us. I only used it as a callback to hit the tracking stuff we have, but it probably shouldn't be since it isn't dealing with decoding the data element from raw to native types.

We could also look at breaking out much of the code in DataElement_from_raw into subroutines, which would make those parts available to alternative processing code to call.

That is what I was thinking, breaking the code in there into subroutines, but keeping them as the default operations. Then user code could add on to those functions, or even replace them by configuration.

Edit: I didn't mention it before, but I'd love to help contribute on this.

@darcymason
Copy link
Member

Edit: I didn't mention it before, but I'd love to help contribute on this.

Yep, sounds good. If you want to draft a PR, I (and others) could help fine tune it. I think it could start with just the two pieces mentioned - break out the DataElement_from_raw code to subroutines and add a check after if data_element_callback to see if it is a list and call all items if so.

@darcymason
Copy link
Member

Revisiting this issue, I thought perhaps there was some code done to break out some parts of DataElement_from_raw, but will have to go look. Leaving this issue open and assigning to v3.0 to consider along with with the planned harmonization of callbacks at different points in the reading/decoding processes.

@darcymason darcymason added this to the v3.0 milestone Apr 21, 2023
@naterichman
Copy link
Author

The PR is #1560 but I think it still needs some improvement to the design so its still a WIP, I haven't had a chance to revisit it in a while

@darcymason
Copy link
Member

I haven't had a chance to revisit it in a while

Sure, that's no problem. As noted, I think all of this might change somewhat in v3.0 anyway, although I expect this piece will probably not be affected much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants