WIP: Improve meta data extraction from DICOM files #290
base: master
Conversation
To be more precise about the character encoding in a DICOM:
Hi, On Thursday, December 4, 2014, moloney <notifications@github.com
That sounds pretty unpleasant. Is there any way we could offer this kind of code to pydicom? Is there any way of knowing what encoding the VR of OB has? Or even
Oops - sorry - I see (now I downloaded it extremely slowly) that
I thought a bit about the xprotocol parser. I think we'll likely need a generic parser in nibabel, of which the Another advantage of a general parsing module would be to make the grammar I played with this idea by writing a parser for the xprotocol stuff using So, I'm wondering if you'd consider us changing to use
If you think that might work, let me know, and I will draft something up to
Hi Matthew, Thanks for all the feedback. For the character encoding stuff. The pydicom package does have some code for handling the string type VRs (in the When I originally wrote the xprotocol parser, I looked into using something like pyparsing or ply. Then I realized I had no idea what type of grammar the format used, and figured the best way to learn the format was to hack out my own parser. I would be in favor of replacing this with a more generic parsing package provided the requirements you mentioned are met (readable, fast, informative error messages).
OK - I'll wade back into it and see what I can do. Might end up giving up but I'll let you know if so.
For the character encoding - the situation I hope we can avoid is where somebody running the code on two different machines on the same data gets two different answers - where one machine has chardet and the other does not.
Do you consider it problematic if the machine without chardet gives back no answer for the element in question (since it can only assume the data is non-text), while the machine with chardet is able to give back an answer?
Only problematic if the user could easily miss that there's a difference between the two.
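The behaviour proposed above (no answer without chardet, a guess with it) might be sketched as follows; `guess_text` and the 0.8 confidence cutoff are my own assumptions, not nibabel code. Decoding as ASCII first means the two machines can only differ by the chardet machine giving extra answers, never conflicting ones:

```python
# Hypothetical helper, not part of nibabel: decode bytes as text if possible.
try:
    import chardet
except ImportError:
    chardet = None

def guess_text(byte_data):
    """Return decoded text, or None when we cannot tell the data is text."""
    try:
        # ASCII decoding behaves identically on every machine.
        return byte_data.decode('ascii')
    except UnicodeDecodeError:
        pass
    if chardet is None:
        # Machine without chardet: conservatively give no answer.
        return None
    guess = chardet.detect(byte_data)
    if guess['encoding'] is None or guess['confidence'] < 0.8:
        return None
    return byte_data.decode(guess['encoding'], errors='replace')
```

A user who needs reproducible output across machines can then treat `None` as "unknown" rather than silently getting different text.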
Preliminary parser here: https://github.com/matthew-brett/xprotoply. Error messages are not yet very informative and I need to check over some shift-reduce conflicts, but performance seems pretty good:
I think the code is fairly readable - what do you think?
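For orientation, Siemens xprotocol text looks roughly like `<ParamLong."Columns"> { 256 }`. A PLY lexer declares one regular expression per token class; the same idea can be illustrated self-containedly with `re` (so it runs without PLY installed; the token names here are my own invention, not those of the xprotoply grammar):

```python
import re

# Illustrative tokenizer for xprotocol-like text. Token names are made up
# for this sketch and do not match the xprotoply grammar.
TOKEN_RE = re.compile(r'''
    (?P<TAG>      <\w+(?:\."[^"]*")?>) |   # e.g. <ParamLong."Columns">
    (?P<LBRACE>   \{) |
    (?P<RBRACE>   \}) |
    (?P<STRING>   "[^"]*") |
    (?P<NUMBER>   -?\d+(?:\.\d+)?) |
    (?P<WS>       \s+)
''', re.VERBOSE)

def tokenize(text):
    """Split text into (token_name, value) pairs, skipping whitespace."""
    pos = 0
    tokens = []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError('Illegal character at position %d' % pos)
        if m.lastgroup != 'WS':
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

A real PLY lexer adds line/column tracking on top of rules like these, which is where the informative error messages come from.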
Error messages now fairly informative with line and column numbers. Will read a few Siemens files I have to hand. I'm testing with:
Thanks again for the hard work! I have been reading through the PLY documentation and your code. Obviously there is a bit of a learning curve to read through the code, but I don't think it is an issue. Once we need to start parsing other text formats, this really is more of a strength than a weakness (just learn PLY rather than N different custom parsers). The speedup is quite nice, and pretty important when it comes to doing something like converting 1000+ files into a NIfTI. I am starting to play around with the code now. I will run it against some of our DICOM files (we have some fairly old files that should help tease out any issues with older versions of the format). I will also try against some Siemens K-space files (meas.dat) which is actually the reason I originally wrote the parser.
Brendan - any news from your research?
Any news here? Anything I can do to help?
Sorry about the lack of progress. I think the unicode handling is reasonable now. I use pydicom to decode the standard DICOM elements. For elements with a VR of OB, OW, or UN we will use the

I integrated your PLY based parser for the xprotocol, with some minor changes made on my fork. I want to do some more work on this code to produce a more user friendly result. For example it would be nice to be able to do a lookup like

I am also thinking it would be nice to do the meta data "extraction" in a lazy manner. So you only pay the cost of all the extra parsing if you actually need something from one of the private sub headers. To that end, I am thinking the extraction code should really be integrated with the
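The routing described here could look roughly like the sketch below, with all names hypothetical: standard VRs are left to pydicom's own decoding, while byte VRs (OB, OW, UN) are treated as text only when that can be established:

```python
# Hypothetical helper mirroring the approach described above; not PR code.
BYTE_VRS = {'OB', 'OW', 'UN'}

def element_text(vr, value, guesser=None):
    """Return text for an element value, or None if it is (probably) not text.

    `guesser` is an optional callable like chardet.detect; without it we
    conservatively treat byte VRs as non-text unless they are plain ASCII.
    """
    if vr not in BYTE_VRS:
        # Standard VRs: pydicom has already decoded these to str.
        return value if isinstance(value, str) else None
    try:
        return value.decode('ascii')
    except UnicodeDecodeError:
        pass
    if guesser is None:
        return None
    guess = guesser(value)
    if guess.get('encoding'):
        return value.decode(guess['encoding'], errors='replace')
    return None
```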
The lazy loading sounds like a very good idea. Did you have a chance to look at the failing tests? Will you have time to get to cleaning up the output from the ply parser? It seems like a good idea to get that in, in a good state.
nibabel/nicom/utils.py
@@ -13,11 +16,11 @@ def find_private_section(dcm_data, group_no, creator):
element number of these tags give the start of matching information, in the
higher tag numbers.

Parameters
----------
Paramters
You seem to have inadvertently removed an e.
Added some other small convenience functions for converting DICOM 'AS' and 'TM' values.
Fixed up our "fake" test data to work with the new find_private_element function.
Still need to handle text encoding, and ignoring non-text byte data.
The ASCCONV text header only needs its weird double quotes fixed to be valid Python syntax. Fix double quotes, parse with Python ast module. I think this should be more robust than a custom parser.
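A sketch of that approach, assuming ASCCONV lines have the form `name = value` and string values are wrapped in doubled quotes (the helper name is mine, not the commit's code):

```python
import ast

def parse_ascconv_line(line):
    """Parse one 'key = value' line from an ASCCONV block (sketch).

    After replacing the doubled quotes ("" "") with ordinary ones, each
    value is valid Python literal syntax, so ast.literal_eval can parse
    numbers, hex values, and strings without a custom parser.
    """
    key, _, value = line.partition('=')
    value = value.strip().replace('""', '"')
    return key.strip(), ast.literal_eval(value)
```

Using `ast.literal_eval` (rather than `eval`) keeps this safe on untrusted headers, since it only accepts literals.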
Fixes to make tests pass for Python 3.
ElemDict claimed to accept a mapping as __init__ input, but in fact this was broken. Fix and test.
d9e4375 to 86da916
Minor cleanup in some docstrings
Get rid of 'extract' module. The dicom wrapper objects can be initialized with translators for private elements and then the parsed results can be accessed directly through __getitem__. The dicom wrappers still need a method that returns the nipy JSON meta data extension for the full data set.
I just committed my first attempt at putting the extraction logic into the dicom wrapper objects and doing lazy parsing. For example you can now access the "CSA Image" sub header by just doing

I made some backwards incompatible changes. In particular the dicom wrapper objects all take a list of PrivateTranslator objects as their second argument to

Also, the wrappers now require a real DICOM data set rather than a dict. Again, I don't think that is a big deal for user code. I still need to update the tests that are broken by this change, but wanted to get some feedback first.

Finally, the dicom wrapper objects need a method to return the full JSON meta data extension described here: https://github.com/nipy/nibabel/wiki/json-header-extension

I am going to start working on improving the xpparse output as described above while I wait for any feedback on these changes.
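The lazy-parsing idea could be sketched like this (class and method names are my assumptions, not the branch's actual code): translators are registered up front, but each one runs only when its sub header is first requested, and the result is cached.

```python
# Sketch of lazy sub-header extraction; hypothetical names throughout.
class LazyWrapper:
    def __init__(self, dcm_data, translators=()):
        self.dcm_data = dcm_data
        self._translators = {t.name: t for t in translators}
        self._parsed = {}

    def __getitem__(self, key):
        if key in self._parsed:
            return self._parsed[key]          # already paid the parsing cost
        if key in self._translators:
            # Parse the private sub header only on first access.
            result = self._translators[key].translate(self.dcm_data)
            self._parsed[key] = result
            return result
        return self.dcm_data[key]             # plain element lookup
```

With this shape, converting 1000+ files that never touch a private sub header never pays for its parsing.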
Doesn't need dicom or xpparse as we just get input from text file.
Test array with nested array which in turn has a nested map. The array value itself also has meta data attributes which should presumably override those of the array itself, but for now we just ignore them.
No longer need to work around ply issue (dabeaz/ply#52)
Fairly large refactor of parser. Simplify the code a bit and make things more DRY.
The idea is to allow clean readable access to nested meta data structures where 99% of the time we are interested in the core "value" for each element in the structure rather than the associated meta data.
Allow addition on ElemList objects. Make both ElemList and ElemDict work better with other objects of the same type (e.g. allow the constructor to take them, allow ElemDict.update() to take ElemDict, etc). No longer try to autoconvert nested dict objects to ElemDict objects as this is a bad idea and we do want to allow plain dicts to be nested.
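For illustration, a minimal ElemDict along these lines; this is a sketch under my own assumptions, not the branch's implementation. Each element carries a value plus meta data attributes, but ordinary lookups return just the value, since that is what callers want nearly all of the time:

```python
# Sketch: element container where plain lookups return core values.
class Elem:
    def __init__(self, value, meta=None):
        self.value = value
        self.meta = meta or {}

class ElemDict:
    def __init__(self, mapping=None):
        self._elems = {}
        if mapping is not None:
            self.update(mapping)

    def update(self, mapping):
        # Accept another ElemDict or a plain mapping of values/Elems.
        items = (mapping._elems.items() if isinstance(mapping, ElemDict)
                 else mapping.items())
        for key, elem in items:
            self._elems[key] = elem if isinstance(elem, Elem) else Elem(elem)

    def __getitem__(self, key):
        # Most of the time callers want the core value, not the meta data.
        return self._elems[key].value

    def get_elem(self, key):
        """Explicit accessor for the full element, meta data included."""
        return self._elems[key]
```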
Fixes issue where I assumed 'dependency' and 'param_card' elements have unique names.
Newer versions of syngo may include some attributes that describe the ASCCONV sub header. We parse these and include the results.
Still do splitting of xprotocol/ascconv in dicomwrappers.
Hi Brendan. Sorry to be so slow to get to this. What do you think the best way forward here is? I'm afraid I've left a lot of changes to back up. Maybe schedule an online or real meeting sometime to work through this stuff?
Hi Matthew. We just hired someone to help me out and he is working on a branch for doing the stacking part and creating the meta data extension. I hope to spend some time in the next week cleaning up this PR, at which point it would be good to discuss further.
Hi Brendan - that's good news. I think I've not done a good job on helping you with this PR - what do you think we should do better from now? I should obviously give more timely feedback, I will try hard to do that, please do remind me.
This is basically the meta data extraction (without anonymization) from PR #232, with some improvements. Added tests for most of the functionality.
Adds the meta data extraction parts from dcmstack, with some improvements. Most notably it can also parse the Siemens XProtocol format (kind of looks like XML but isn't).
I still need to fix text encoding and filtering of non-text byte data in the extract module. Text in a DICOM file can basically have any encoding and there is often no specification of what that encoding is. @matthew-brett how do you feel about pulling in chardet (https://github.com/chardet/chardet) as a dependency? It is less static than OrderedDict, so I guess we don't want to pull it into externals.