WIP: Improve meta data extraction from DICOM files #290

Open
wants to merge 31 commits into nipy:master from moloney:enh-meta-extract

Conversation

@moloney (Contributor) commented Dec 4, 2014

This is basically the meta data extraction stuff from PR #232 (without the anonymization), with some improvements. I've added tests for most of the functionality.

Adds the meta data extraction parts from dcmstack, with some improvements. Most notably, it can also parse the Siemens XProtocol format (which looks a bit like XML but isn't).

I still need to fix text encoding and filtering of non-text byte data in the extract module. Text in a DICOM file can basically have any encoding and there is often no specification of what that encoding is.

@matthew-brett how do you feel about pulling in chardet (https://github.com/chardet/chardet) as a dependency? It is less static than OrderedDict, so I guess we don't want to pull it into externals.

@moloney (Contributor, Author) commented Dec 5, 2014

To be more precise about the character encoding in a DICOM:

  1. The "SpecificCharacterSet" does tell us something about how most of the text is encoded (at least for anything with a VR of SH, LO, ST, LT, UT or PN)

  2. There may be more than one character set specified. In this case elements could use any one of the encodings, or even more than one and switch between them using escape sequences.

  3. We do sometimes get text in an element with a VR of OB. This does not need to use one of the encodings specified in the SpecificCharacterSet.

  4. Anything parsed out of a private sub-header could be using a different encoding.
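
For point 1, a minimal sketch of the easy case - decoding a single text-VR element when only one character set is declared. The VR list and charset-to-codec mapping are illustrative, not code from this PR; points 2-4 need more involved handling.

TEXT_VRS = ('SH', 'LO', 'ST', 'LT', 'UT', 'PN')

# A few DICOM SpecificCharacterSet defined terms and matching Python codecs
CHARSET_TO_CODEC = {
    '': 'ascii',            # default repertoire
    'ISO_IR 100': 'latin_1',
    'ISO_IR 192': 'utf_8',
}

def decode_simple_text(raw_bytes, vr, specific_character_set=''):
    """Decode raw element bytes assuming a single declared character set."""
    if vr not in TEXT_VRS:
        raise ValueError('Not a text VR: %s' % vr)
    codec = CHARSET_TO_CODEC.get(specific_character_set, 'ascii')
    return raw_bytes.decode(codec)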

@matthew-brett (Member)

Hi,


That sounds pretty unpleasant.

Is there any way we could offer this kind of code to pydicom?

Is there any way of knowing what encoding an element with a VR of OB has? Or even
of guessing, other than sniffing the data? Maybe it would be a bad idea to
guess, or that could be a parameter to the code?

@matthew-brett (Member)

Oops - sorry - I see (now that I have downloaded it, extremely slowly) that chardet is for detecting character encoding. That makes me feel a little nervous though; I could imagine character encoding detection going wrong and giving odd errors. Maybe chardet could be an optional dependency that offers suggestions for encodings if present?

@matthew-brett (Member)

I thought a bit about the xprotocol parser.

I think we'll likely need a generic parser in nibabel, of which the
xprotocol parser could be a specific instance. For example, the ASCCONV
stuff just happens to be more or less Python so we can use the ast
module, but otherwise we'd have to write another parser for that.

Another advantage of a general parsing module would be to make the grammar
easier to read in the code, because the parsing would be separate from the
grammar.

I played with this idea by writing a parser for the xprotocol stuff using
pyparsing - here : https://github.com/matthew-brett/xprotp . The grammar
can be neatly expressed (xpparse.py), but I don't think that's the right
way to go because I find pyparsing a bit fiddly, somewhat slow, and the
parsing error messages are not at all informative.

So, I'm wondering if you'd consider us changing to use ply for the
xprotocol parsing here? If you would consider it, I will write a draft
implementation to see whether it would be fast enough and readable. If we
did go that route, I think we should ship a copy of ply in externals.
The advantages would be:

  • We'd have a generic lexer / parser for the next somewhat complicated text
    format;
  • We wouldn't have to maintain and fix the lexer / parser itself (ply is
    very stable, and widely used);
  • The grammar might be easier to read;
  • We might have less code (just the grammar, no lexing / parsing);
  • Others might find it easier to maintain (ply has good docs and examples).

If you think that might work, let me know, and I will draft something up to
give a better idea of what it would really look like. If it really was
going to be preferable, the code would have to be reasonably easy to read,
the error messages would have to be informative, and it would have to be
about as fast as your current implementation.
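
As a concrete illustration of the ply style - this is not the actual xprotocol grammar from the xprotp / xprotoply repos, just a toy block format with made-up token names and rules:

import ply.lex as lex
import ply.yacc as yacc

tokens = ('TAG', 'STRING', 'LBRACE', 'RBRACE')

# Token rules (regular expressions)
t_TAG = r'<[A-Za-z]+>'
t_STRING = r'"[^"]*"'
t_LBRACE = r'\{'
t_RBRACE = r'\}'
t_ignore = ' \t\n'

def t_error(t):
    raise ValueError("Illegal character %r" % t.value[0])

# Grammar rules: a block is a tag followed by braced contents
def p_block(p):
    'block : TAG LBRACE contents RBRACE'
    p[0] = (p[1], p[3])

def p_contents_multi(p):
    'contents : contents item'
    p[0] = p[1] + [p[2]]

def p_contents_single(p):
    'contents : item'
    p[0] = [p[1]]

def p_item(p):
    '''item : STRING
            | block'''
    p[0] = p[1]

def p_error(p):
    raise ValueError("Syntax error at %r" % (p,))

lexer = lex.lex()
parser = yacc.yacc()

if __name__ == '__main__':
    print(parser.parse('<Top> { "hello" <Inner> { "world" } }', lexer=lexer))

The grammar lives entirely in the rule docstrings, which is the readability win being argued for above.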

@moloney (Contributor, Author) commented Dec 16, 2014

Hi Matthew,

Thanks for all the feedback.

For the character encoding stuff: the pydicom package does have some code for handling the string type VRs (in the charset module). So I think we would just need to deal with OB elements and private sub-headers. I suppose we could use the chardet package if present, and fall back to just rejecting anything that is not ASCII (assuming it is non-text binary data).
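
Roughly, that fallback could look like the minimal sketch below, assuming chardet's detect() API; bytes_to_text is a hypothetical helper name, not code from this PR.

try:
    import chardet
except ImportError:
    chardet = None

def bytes_to_text(byte_data):
    """Return decoded text for byte_data, or None if it looks like binary."""
    if chardet is not None:
        guess = chardet.detect(byte_data)
        if guess['encoding'] is not None:
            try:
                return byte_data.decode(guess['encoding'])
            except (LookupError, UnicodeDecodeError):
                return None
        return None
    # Without chardet, only accept plain ASCII and treat everything else as binary
    try:
        return byte_data.decode('ascii')
    except UnicodeDecodeError:
        return None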

When I originally wrote the xprotocol parser, I looked into using something like pyparsing or ply. Then I realized I had no idea what type of grammar the format used, and figured the best way to learn the format was to hack out my own parser. I would be in favor of replacing this with a more generic parsing package provided the requirements you mentioned are met (readable, fast, informative error messages).

@matthew-brett (Member)

OK - I'll wade back into it and see what I can do. Might end up giving up but I'll let you know if so.

@matthew-brett (Member)

For the character encoding - the situation I hope we can avoid is where somebody running the code on two different machines on the same data gets two different answers - where one machine has chardet and the other does not.

@moloney (Contributor, Author) commented Dec 17, 2014

Do you consider it problematic if the machine without chardet gives back no answer for the element in question (since it can only assume the data is non-text), while the machine with chardet is able to give back an answer?

@matthew-brett (Member)

Only problematic if the user could easily miss that there's a difference between the two.

@matthew-brett (Member)

Preliminary parser here : https://github.com/matthew-brett/xprotoply

Error messages not yet very informative and I need to check over some shift-reduce conflicts, but performance seems pretty good:

In [1]: import nibabel.nicom.xprotocol as nxp
/Users/mb312/usr/local/bin/ipython:1: UserWarning: The DICOM readers are highly experimental, unstable, and only work for Siemens time-series at the moment
Please use with caution.  We would be grateful for your help in improving them
  #!/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python

In [2]: import xpparse as xpp
WARNING: Token 'TYPED_TAG' defined, but not used
WARNING: Token 'WHITESPACE' defined, but not used
WARNING: There are 2 unused tokens

In [3]: contents = open('xprotocol_sample.txt', 'rt').read()

In [4]: timeit nxp.read_protos(contents)
10 loops, best of 3: 90.3 ms per loop

In [5]: timeit xpp.parser.parse(contents)
100 loops, best of 3: 14.9 ms per loop

I think the code is fairly readable - what do you think?

@matthew-brett (Member)

Error messages now fairly informative with line and column numbers. Will read a few Siemens files I have to hand. I'm testing with:

from __future__ import print_function

import sys

import dicom

from nibabel.nicom.utils import find_private_section
import nibabel.nicom.csareader as csa

import xpparse as xpp

from nose.tools import assert_equal

for fname in sys.argv[1:]:
    dcm = dicom.read_file(fname)
    # Find where the 'SIEMENS CSA HEADER' private section starts, then read
    # the CSA header element at offset 0x20 within it
    sect_start = find_private_section(dcm, 0x29, 'SIEMENS CSA HEADER')
    elem = dcm[0x29, sect_start + 0x20]
    csa_dict = csa.header_to_key_val_mapping(csa.read(elem.value))
    # The Phoenix protocol text is stored under 'MrPhoenix' or 'MrPhoenixProtocol'
    phoenix = csa_dict.get('MrPhoenix')
    if phoenix is None:
        phoenix = csa_dict.get('MrPhoenixProtocol')
    if phoenix is None:
        print("No phoenix protocol in " + fname)
        continue
    res = xpp.parse(phoenix)
    print(len(res))
    # Get the embedded protocol string from the protocol contents
    protocol = res[0]
    assert_equal(len(protocol['depends']), 0)
    assert_equal(len(protocol['blocks']), 1)
    for v in protocol['blocks'][0]['value']:
        if v['name'].startswith('Protocol'):
            break
    proto_str, asc_hdr = xpp.split_ascconv(xpp.strip_twin_quote(v['value']))
    res2 = xpp.parse(proto_str)
    assert_equal(len(res2), 2)

@moloney (Contributor, Author) commented Dec 23, 2014

Thanks again for the hard work!

I have been reading through the PLY documentation and your code. Obviously there is a bit of a learning curve to read through the code, but I don't think it is an issue. Once we need to start parsing other text formats, this really is more of a strength than a weakness (just learn PLY rather than N different custom parsers).

The speedup is quite nice, and pretty important when it comes to doing something like converting 1000+ files into a Nifti.

I am starting to play around with the code now. I will run it against some of our DICOM files (we have some fairly old files that should help tease out any issues with older versions of the format). I will also try against some Siemens K-space files (meas.dat) which is actually the reason I originally wrote the parser.

@matthew-brett (Member)

Brendan - any news from your researches?

@matthew-brett (Member)

Any news here? Anything I can do to help?

@moloney (Contributor, Author) commented Feb 17, 2015

Sorry about the lack of progress.

I think the unicode handling is reasonable now. I use pydicom to decode the standard DICOM elements. For elements with a VR of OB, OW, or UN we will use the chardet package if available to try to find the encoding. We fall back to just testing if the data is ASCII (skipping elements where this is not the case).
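
An illustration of that split (the generator shape is just for exposition; bytes_to_text stands for the chardet/ASCII fallback helper sketched earlier in this thread):

import dicom  # pydicom (imported as `dicom` in the versions discussed here)

BYTE_VRS = ('OB', 'OW', 'UN')

def iter_text_values(dcm, bytes_to_text):
    """Yield (tag, value) pairs, skipping byte elements that look like binary.

    `bytes_to_text` is the fallback decoder, returning None for non-text data.
    """
    for elem in dcm:
        if elem.VR in BYTE_VRS:
            text = bytes_to_text(elem.value)
            if text is not None:
                yield elem.tag, text
        else:
            # String-type VRs are left to pydicom's charset handling
            yield elem.tag, elem.value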

I integrated your PLY based parser for the xprotocol, with some minor changes made on my fork. I want to do some more work on this code to produce a more user friendly result. For example it would be nice to be able to do a lookup like result['']['Protocol0'] instead of result[0]['blocks'][0]['value'][2]['value'].

I am also thinking it would be nice to do the meta data "extraction" in a lazy manner. So you only pay the cost of all the extra parsing if you actually need something from one of the private sub headers. To that end, I am thinking the extraction code should really be integrated with the DicomWrapper objects. For example you could access the CSA Image sub header directly through a DicomWrapper by doing dw['CsaImage']. What do you think about this approach?
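
A minimal sketch of the lazy lookup idea - the class and the translators mapping are illustrative, not the actual nibabel API being proposed here:

class LazyWrapper(object):
    """Parse private sub-headers only when they are first requested."""

    def __init__(self, dcm_data, translators):
        self.dcm_data = dcm_data
        # Mapping of key -> callable that parses the raw private data
        self._translators = dict(translators)
        self._parsed = {}

    def __getitem__(self, key):
        if key in self._parsed:
            return self._parsed[key]
        if key in self._translators:
            # Pay the parsing cost only on first access, then cache the result
            self._parsed[key] = self._translators[key](self.dcm_data)
            return self._parsed[key]
        # Fall back to the underlying DICOM data set for everything else
        return self.dcm_data[key]

With something like dw = LazyWrapper(dcm, {'CsaImage': parse_csa_image}) (where parse_csa_image is a hypothetical translator callable), dw['CsaImage'] would trigger the CSA parsing only on first access.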

@matthew-brett (Member)

The lazy loading sounds like a very good idea.

Did you have a chance to look at the failing tests?

Will you have time to get to cleaning up the output from the ply parser? It seems like a good idea to get that in, in a good state.

@@ -13,11 +16,11 @@ def find_private_section(dcm_data, group_no, creator):
element number of these tags give the start of matching information, in the
higher tag numbers.

Parameters
----------
Paramters
Contributor

You seem to have inadvertently removed an e.

moloney and others added 12 commits April 23, 2015 15:43
  • Added some other small convenience functions for converting DICOM 'AS' and 'TM' values.
  • Fixed up our "fake" test data to work with the new find_private_element function. Still need to handle text encoding, and ignoring non-text byte data.
  • The ASCCONV text header only needs its weird double quotes fixed to be valid Python syntax. Fix double quotes, parse with Python ast module. I think this should be more robust than a custom parser.
  • Fixes to make tests pass for Python 3.
  • ElemDict claimed to accept a mapping as __init__ input, but in fact this was broken. Fix and test.
@coveralls

Coverage decreased (-0.61%) to 94.65% when pulling 078d775 on moloney:enh-meta-extract into 2e575eb on nipy:master.

Get rid of 'extract' module. The dicom wrapper objects can be
initialized with translators for private elements and then the
parsed results can be accessed directly through __getitem__.

The dicom wrappers still need a method that returns the nipy
JSON meta data extension for the full data set.
@moloney (Contributor, Author) commented Apr 24, 2015

I just committed my first attempt at putting the extraction logic into the dicom wrapper objects and doing lazy parsing. For example, you can now access the "CSA Image" sub header by just doing dw['csa_image'].

I made some backwards incompatible changes. In particular, the dicom wrapper objects now all take a list of PrivateTranslator objects as their second argument to __init__. Previously, the Siemens wrappers took the CSA Image header as the second argument. I don't think a lot of user code would have used that functionality rather than just relying on the wrapper_from_data method to do the right thing. The Siemens wrappers also no longer have a csa_header attribute that points to the CSA Image sub header, but we could add this back in temporarily with a deprecation warning.
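
A minimal usage sketch of the behaviour described above, assuming this branch is installed; 'example.dcm' stands in for any Siemens DICOM file:

import dicom
from nibabel.nicom import dicomwrappers

# wrapper_from_data still picks the right wrapper class for the data set
dw = dicomwrappers.wrapper_from_data(dicom.read_file('example.dcm'))

# On this branch, the "CSA Image" sub header is parsed lazily on first access
csa_image = dw['csa_image']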

Also, the wrappers now require a real DICOM data set rather than a dict. Again, I don't think that is a big deal for user code. I still need to update the tests that are broken by this change, but wanted to get some feedback first.

Finally, the dicom wrapper objects need a method to return the full JSON meta data extension described here: https://github.com/nipy/nibabel/wiki/json-header-extension

I am going to start working on improving the xpparse output as described above while I wait for any feedback on these changes.

  • Doesn't need dicom or xpparse as we just get input from a text file.
  • Test array with nested array which in turn has a nested map. The array value itself also has meta data attributes which should presumably override those of the array itself, but for now we just ignore them.
  • No longer need to work around ply issue (dabeaz/ply#52).
  • Fairly large refactor of parser. Simplify the code a bit and make things more DRY. The idea is to allow clean, readable access to nested meta data structures where 99% of the time we are interested in the core "value" for each element in the structure rather than the associated meta data.
  • Allow addition on ElemList objects. Make both ElemList and ElemDict work better with other objects of the same type (e.g. allow the constructor to take them, allow ElemDict.update() to take an ElemDict, etc). No longer try to autoconvert nested dict objects to ElemDict objects, as this is a bad idea and we do want to allow plain dicts to be nested.
  • Fixes issue where I assumed 'dependency' and 'param_card' elements have unique names.
  • Newer versions of syngo may include some attributes that describe the ASCCONV sub header. We parse these and include the results. Still do splitting of xprotocol/ascconv in dicomwrappers.
@matthew-brett (Member)

Hi Brendan. Sorry to be so slow to get to this. What do you think the best way forward here is? I'm afraid I've let a lot of changes back up. Maybe schedule an online or in-person meeting sometime to work through this stuff?

@moloney (Contributor, Author) commented Dec 18, 2015

Hi Matthew. We just hired someone to help me out and he is working on a branch for doing the stacking part and creating the meta data extension. I hope to spend some time in the next week cleaning up this PR, at which point it would be good to discuss further.

@matthew-brett (Member)

Hi Brendan - that's good news. I think I've not done a good job of helping you with this PR - what do you think we should do better from now on? I should obviously give more timely feedback; I will try hard to do that, please do remind me.
