Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add VR type check when element values are being set #1414

Open
scaramallion opened this issue Jun 16, 2021 · 16 comments
Open

Add VR type check when element values are being set #1414

scaramallion opened this issue Jun 16, 2021 · 16 comments
Milestone

Comments

@scaramallion
Copy link
Member

scaramallion commented Jun 16, 2021

We should add a VR type check on setting element values to ensure they're valid. This should help users when an invalid value is used and with maintenance as it will make it easier to reason what types should be expected.

@mrbean-bremen
Copy link
Member

This is at leastly partly implemented by #1543. @scaramallion - I'm not sure what remains to do here, can you have a look?

@scaramallion
Copy link
Member Author

scaramallion commented Apr 19, 2022

I think originally I created the issue because a user was confused between supplying a raw bytes value for one of the longer O* VRs and supplying the decoded value.

It's also something I sometimes do, particularly with SH and integer VRs like US. I think it'd still be a nice enhancement to cover all VRs when setting new values.

@mrbean-bremen
Copy link
Member

mrbean-bremen commented Apr 20, 2022

As far as I understand, you mean validation for non-string VRs, which is currently not done - this would make sense.

I would say that for all numeric VRs (US, UL, UV, SS, SL, SV, FL, FD) only numbers in the correct range shall be allowed, for O* and AT probably arrays of these numbers or byte arrays (at least for OB, OW, OL, OV and maybe AT), but never strings.

I think originally I created the issue because a user was confused between supplying a raw bytes value for one of the longer O* VRs and supplying the decoded value.

Is this about assigning (decoded) string values instead of the raw byte values?

It's also something I sometimes do, particularly with SH and integer VRs like US.

Not sure about SH - this is a string VR and should already be validated. Can you elaborate on this?

@scaramallion
Copy link
Member Author

This was the original issue

You're right about the SH/string VRs, that should get caught now.

@mrbean-bremen
Copy link
Member

Ok, so you are basically saying that all O* values should always get raw bytes values, because they have a VM of 1, while the numerical VRs could get arrays of numbers (for VM > 1). One could argue that it would still be convenient to set a float array for an OF value, but the best way is probably to use a numpy float array and convert it to a byte array, so I guess your approach makes sense.

I will see if I can put something together, maybe over the weekend.

@darcymason
Copy link
Member

Ok, so you are basically saying that all O* values should always get raw bytes values, because they have a VM of 1

I'm not as deep in this as you two are, but I'd like to raise the principle that all native Python types that make sense should be accepted, as well as raw bytes.

My recollection is that raw bytes on setting are accepted almost as a side-effect because that is what is done when reading from file, and the same machinery is used. Ultimately (in future pydicom versions), decoding to native types (like arrays) should be done if possible when converting the raw data element - although the ambiguous types depend on other data elements, unfortunately, as do the shapes if not a 1-d array.

So I think numpy arrays, Python lists or Python arrays of the appropriate type could be accepted on setting, as well as single values if needed. E.g. for OW an array of ints should be okay, if they are checked to be within the correct number of bits when strict checking.

@mrbean-bremen
Copy link
Member

mrbean-bremen commented Apr 21, 2022

This is also my gut feeling, as I wrote above, though I think that this is currently not possible. Even wih the VM of 1 the O* VRs are semantically arrays of numbers, so I also think it should be possible to set them that way - though this is something that can be added separately later on.

Also I think that for these types (both numeric and O*) validation makes sense only for writing, not for reading. In principle we could add a length check for some of them that should be a multiple of 4 or 8 (FL, FD, UV, OF, OD, OV), though I'm not sure if this is needed.

@mrbean-bremen
Copy link
Member

mrbean-bremen commented Apr 24, 2022

One question, where I'm unsure: shall be allow to set bytes to a numerical VR? This is from a test:

elem = DataElement(0x00100010, 'US', b'\x00')

Should this be allowed, or should it issue a warning? I'm currently adding a check for these values, allowing only int values in the correct range, but maybe I shall also allow bytes... (by "allowing" I mean that a warning is otherwise issued if using the default config)

mrbean-bremen added a commit to mrbean-bremen/pydicom that referenced this issue Apr 24, 2022
mrbean-bremen added a commit to mrbean-bremen/pydicom that referenced this issue Apr 24, 2022
- also add type validation for validated VRs
- see pydicom#1414
mrbean-bremen added a commit to mrbean-bremen/pydicom that referenced this issue Apr 24, 2022
- also add type validation for validated VRs
- see pydicom#1414
@darcymason
Copy link
Member

One question, where I'm unsure: shall be allow to set bytes to a numerical VR? This is from a test:

elem = DataElement(0x00100010, 'US', b'\x00')

Should this be allowed, or should it issue a warning? I'm currently adding a check for these values, allowing only int values in the correct range, but maybe I shall also allow bytes... (by "allowing" I mean that a warning is otherwise issued if using the default config)

I suspect this may be used internally within pydicom. In any case, it has been allowed, so for backwards compatibility, that should still be the default, IMHO.

@mrbean-bremen
Copy link
Member

Ok, I added this (with only a check for the length).

darcymason pushed a commit that referenced this issue Apr 29, 2022
* Add validation for numerical VR values

- also add type validation for validated VRs
- see #1414

* Fix handling of invalid type in length check
- add more tests

* Fix indentation

* Also allow bytes for numeric values
@mrbean-bremen
Copy link
Member

As the PR got merged, I'm copying my comment here to avoid forgetting the question:

For the time being I left out the O* VRs and AT, as I'm not completely sure yet how to handle them. First there is the issue of which types shall be supported by assignment, which is currently discussed in the issue - I want to have some agreement here first. Second - if we allow something like list or array for assignment, we could also check each single value for validity, but I find that too much of an overhead, so I would not want to do this - but I may be wrong here.

@scaramallion
Copy link
Member Author

scaramallion commented May 1, 2022

I like just having bytes for O*, our typing is already pretty complex and encoding values isn't difficult. It also matches the read type.

@darcymason
Copy link
Member

I like just having bytes for O*, our typing is already pretty complex and encoding values isn't difficult

The complexity of typing is a valid point, but I'd hate to sacrifice a 'natural' fit assignment for that reason alone. I really like the philosophy that encoding/decoding are done at I/O and in between you usually work with native Python types (and to me, numpy arrays are the native type for multivalue binary data).

I could also see that someone might assign a value and then also do some kind of calculation with it later, so they would need it as numpy array again.

if we allow something like list or array for assignment, we could also check each single value for validity

Numpy array items are already guaranteed to be of the same type. It looks like Python arrays are too - I haven't really worked with them. We can fairly easily do the tobytes method on behalf of the pydicom user.

For python lists, I agree those might be a pain. We could force people to convert to numpy array, but then we could also do that quite easily, which would automatically enforce the item types.

Perhaps a reasonable compromise is to at least allow numpy arrays for now? I expect that is what most people would use. If the user is not just carrying around bytes then they would arise from some kind of calculation, no doubt done in numpy.

@mrbean-bremen
Copy link
Member

Perhaps a reasonable compromise is to at least allow numpy arrays for now? I expect that is what most people would use. If the user is not just carrying around bytes then they would arise from some kind of calculation, no doubt done in numpy.

I actually have been thinking around the same lines. Lists would be problematic to check, and Python arrays are not really used in this context, but numpy arrays are a natural choice here. If I understand @scaramallion correctly, he is saying that you can always convert a numpy array to bytes using tobytes, so that would not be a big deal fo the user - but we could do this ourselves, too, if a numpy array is assigned.

@darcymason
Copy link
Member

you can always convert a numpy array to bytes using tobytes, so that would not be a big deal fo the user - but we could do this ourselves, too, if a numpy array is assigned.

Just to be clear, not actually at the assignment time, right? - only when encoding the dataset e.g. for file write.

@mrbean-bremen
Copy link
Member

Just to be clear, not actually at the assignment time, right? - only when encoding the dataset e.g. for file write.

Actually I haven't thought about this... But you are right, it would be unexpected if the array would change after the assignment. A round trip should produce the same array.

@darcymason darcymason added this to the v3.0 milestone Apr 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants