WIP: Improve meta data extraction from DICOM files #290

Open
wants to merge 31 commits into nipy:master from moloney:enh-meta-extract

Conversation

@moloney (Contributor) commented Dec 4, 2014

This is basically the meta data extraction stuff from PR #232 (without the anonymization), with some improvements. I've added tests for most of the functionality.

Adds the meta data extraction parts from dcmstack, with some improvements. Most notably, it can also parse the Siemens XProtocol format (which looks a bit like XML but isn't).

I still need to fix text encoding and filtering of non-text byte data in the extract module. Text in a DICOM file can basically have any encoding and there is often no specification of what that encoding is.

@matthew-brett how do you feel about pulling in chardet (https://github.com/chardet/chardet) as a dependency? It is less static than OrderedDict, so I guess we don't want to pull it into externals.

@moloney (Contributor, Author) commented Dec 5, 2014

To be more precise about the character encoding in a DICOM:

  1. The "SpecificCharacterSet" does tell us something about how most of the text is encoded (at least for anything with a VR of SH, LO, ST, LT, UT or PN)

  2. There may be more than one character set specified. In this case elements could use any one of the encodings, or even more than one and switch between them using escape sequences.

  3. We do sometimes get text in an element with a VR of OB. This does not need to use one of the encodings specified in the SpecificCharacterSet.

  4. Anything parsed out of a private sub-header could be using a different encoding.
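
For point 1, a minimal sketch of the easy case - decoding a single text-VR element when only one character set is declared. The VR list and charset-to-codec mapping are illustrative, not code from this PR; points 2-4 need more involved handling.

TEXT_VRS = ('SH', 'LO', 'ST', 'LT', 'UT', 'PN')

# A few DICOM SpecificCharacterSet defined terms and matching Python codecs
CHARSET_TO_CODEC = {
    '': 'ascii',            # default repertoire
    'ISO_IR 100': 'latin_1',
    'ISO_IR 192': 'utf_8',
}

def decode_simple_text(raw_bytes, vr, specific_character_set=''):
    """Decode raw element bytes assuming a single declared character set."""
    if vr not in TEXT_VRS:
        raise ValueError('Not a text VR: %s' % vr)
    codec = CHARSET_TO_CODEC.get(specific_character_set, 'ascii')
    return raw_bytes.decode(codec)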

@matthew-brett (Member)

Hi,


That sounds pretty unpleasant.

Is there any way we could offer this kind of code to pydicom?

Is there any way of knowing what encoding an element with a VR of OB has? Or even
of guessing, other than sniffing the data? Maybe it would be a bad idea to
guess, or that could be a parameter to the code?

@matthew-brett (Member)

Oops - sorry - I see (now that I have downloaded it, extremely slowly) that chardet is for detecting character encoding. That makes me feel a little nervous though; I could imagine character encoding detection going wrong and giving odd errors. Maybe chardet could be an optional dependency that offers suggestions for encodings if present?

@matthew-brett (Member)

I thought a bit about the xprotocol parser.

I think we'll likely need a generic parser in nibabel, of which the
xprotocol parser could be a specific instance. For example, the ASCCONV
stuff just happens to be more or less Python so we can use the ast
module, but otherwise we'd have to write another parser for that.

Another advantage of a general parsing module would be to make the grammar
easier to read in the code, because the parsing would be separate from the
grammar.

I played with this idea by writing a parser for the xprotocol stuff using
pyparsing - here : https://github.com/matthew-brett/xprotp . The grammar
can be neatly expressed (xpparse.py), but I don't think that's the right
way to go because I find pyparsing a bit fiddly, somewhat slow, and the
parsing error messages are not at all informative.

So, I'm wondering if you'd consider us changing to use ply for the
xprotocol parsing here? If you would consider it, I will write a draft
implementation to see whether it would be fast enough and readable. If we
did go that route, I think we should ship a copy of ply in externals.
The advantages would be:

  • We'd have a generic lexer / parser for the next somewhat complicated text
    format;
  • We wouldn't have to maintain and fix the lexer / parser itself (ply is
    very stable, and widely used);
  • The grammar might be easier to read;
  • We might have less code (just the grammar, no lexing / parsing);
  • Others might find it easier to maintain (ply has good docs and examples).

If you think that might work, let me know, and I will draft something up to
give a better idea of what it would really look like. If it really was
going to be preferable, the code would have to be reasonably easy to read,
the error messages would have to be informative, and it would have to be
about as fast as your current implementation.
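
As a concrete illustration of the ply style - this is not the actual xprotocol grammar from the xprotp / xprotoply repos, just a toy block format with made-up token names and rules:

import ply.lex as lex
import ply.yacc as yacc

tokens = ('TAG', 'STRING', 'LBRACE', 'RBRACE')

# Token rules (regular expressions)
t_TAG = r'<[A-Za-z]+>'
t_STRING = r'"[^"]*"'
t_LBRACE = r'\{'
t_RBRACE = r'\}'
t_ignore = ' \t\n'

def t_error(t):
    raise ValueError("Illegal character %r" % t.value[0])

# Grammar rules: a block is a tag followed by braced contents
def p_block(p):
    'block : TAG LBRACE contents RBRACE'
    p[0] = (p[1], p[3])

def p_contents_multi(p):
    'contents : contents item'
    p[0] = p[1] + [p[2]]

def p_contents_single(p):
    'contents : item'
    p[0] = [p[1]]

def p_item(p):
    '''item : STRING
            | block'''
    p[0] = p[1]

def p_error(p):
    raise ValueError("Syntax error at %r" % (p,))

lexer = lex.lex()
parser = yacc.yacc()

if __name__ == '__main__':
    print(parser.parse('<Top> { "hello" <Inner> { "world" } }', lexer=lexer))

The grammar lives entirely in the rule docstrings, which is the readability win being argued for above.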

@moloney (Contributor, Author) commented Dec 16, 2014

Hi Matthew,

Thanks for all the feedback.

For the character encoding stuff: the pydicom package does have some code for handling the string type VRs (in the charset module). So I think we would just need to deal with OB elements and private sub-headers. I suppose we could use the chardet package if present, and fall back to just rejecting anything that is not ASCII (assuming it is non-text binary data).
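
Roughly, that fallback could look like the minimal sketch below, assuming chardet's detect() API; bytes_to_text is a hypothetical helper name, not code from this PR.

try:
    import chardet
except ImportError:
    chardet = None

def bytes_to_text(byte_data):
    """Return decoded text for byte_data, or None if it looks like binary."""
    if chardet is not None:
        guess = chardet.detect(byte_data)
        if guess['encoding'] is not None:
            try:
                return byte_data.decode(guess['encoding'])
            except (LookupError, UnicodeDecodeError):
                return None
        return None
    # Without chardet, only accept plain ASCII and treat everything else as binary
    try:
        return byte_data.decode('ascii')
    except UnicodeDecodeError:
        return None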

When I originally wrote the xprotocol parser, I looked into using something like pyparsing or ply. Then I realized I had no idea what type of grammar the format used, and figured the best way to learn the format was to hack out my own parser. I would be in favor of replacing this with a more generic parsing package provided the requirements you mentioned are met (readable, fast, informative error messages).

@matthew-brett (Member)

OK - I'll wade back into it and see what I can do. Might end up giving up but I'll let you know if so.

@matthew-brett (Member)

For the character encoding - the situation I hope we can avoid is where somebody running the code on two different machines on the same data gets two different answers - where one machine has chardet and the other does not.

@moloney (Contributor, Author) commented Dec 17, 2014

Do you consider it problematic if the machine without chardet gives back no answer for the element in question (since it can only assume the data is non-text), while the machine with chardet is able to give back an answer?

@matthew-brett (Member)

Only problematic if the user could easily miss that there's a difference between the two.

@matthew-brett (Member)

Preliminary parser here : https://github.com/matthew-brett/xprotoply

Error messages not yet very informative and I need to check over some shift-reduce conflicts, but performance seems pretty good:

In [1]: import nibabel.nicom.xprotocol as nxp
/Users/mb312/usr/local/bin/ipython:1: UserWarning: The DICOM readers are highly experimental, unstable, and only work for Siemens time-series at the moment
Please use with caution.  We would be grateful for your help in improving them
  #!/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python

In [2]: import xpparse as xpp
WARNING: Token 'TYPED_TAG' defined, but not used
WARNING: Token 'WHITESPACE' defined, but not used
WARNING: There are 2 unused tokens

In [3]: contents = open('xprotocol_sample.txt', 'rt').read()

In [4]: timeit nxp.read_protos(contents)
10 loops, best of 3: 90.3 ms per loop

In [5]: timeit xpp.parser.parse(contents)
100 loops, best of 3: 14.9 ms per loop

I think the code is fairly readable - what do you think?

@matthew-brett (Member)

Error messages now fairly informative with line and column numbers. Will read a few Siemens files I have to hand. I'm testing with:

from __future__ import print_function

import sys

import dicom

from nibabel.nicom.utils import find_private_section
import nibabel.nicom.csareader as csa

import xpparse as xpp

from nose.tools import assert_equal

for fname in sys.argv[1:]:
    dcm = dicom.read_file(fname)
    # Find where the 'SIEMENS CSA HEADER' private section starts, then read
    # the CSA header element at offset 0x20 within it
    sect_start = find_private_section(dcm, 0x29, 'SIEMENS CSA HEADER')
    elem = dcm[0x29, sect_start + 0x20]
    csa_dict = csa.header_to_key_val_mapping(csa.read(elem.value))
    # The Phoenix protocol text is stored under 'MrPhoenix' or 'MrPhoenixProtocol'
    phoenix = csa_dict.get('MrPhoenix')
    if phoenix is None:
        phoenix = csa_dict.get('MrPhoenixProtocol')
    if phoenix is None:
        print("No phoenix protocol in " + fname)
        continue
    res = xpp.parse(phoenix)
    print(len(res))
    # Get the embedded protocol string from the protocol contents
    protocol = res[0]
    assert_equal(len(protocol['depends']), 0)
    assert_equal(len(protocol['blocks']), 1)
    for v in protocol['blocks'][0]['value']:
        if v['name'].startswith('Protocol'):
            break
    proto_str, asc_hdr = xpp.split_ascconv(xpp.strip_twin_quote(v['value']))
    res2 = xpp.parse(proto_str)
    assert_equal(len(res2), 2)

@moloney (Contributor, Author) commented Dec 23, 2014

Thanks again for the hard work!

I have been reading through the PLY documentation and your code. Obviously there is a bit of a learning curve to read through the code, but I don't think it is an issue. Once we need to start parsing other text formats, this really is more of a strength than a weakness (just learn PLY rather than N different custom parsers).

The speedup is quite nice, and pretty important when it comes to doing something like converting 1000+ files into a Nifti.

I am starting to play around with the code now. I will run it against some of our DICOM files (we have some fairly old files that should help tease out any issues with older versions of the format). I will also try against some Siemens K-space files (meas.dat) which is actually the reason I originally wrote the parser.

@matthew-brett (Member)

Brendan - any news from your researches?

@matthew-brett (Member)

Any news here? Anything I can do to help?

@moloney (Contributor, Author) commented Feb 17, 2015

Sorry about the lack of progress.

I think the unicode handling is reasonable now. I use pydicom to decode the standard DICOM elements. For elements with a VR of OB, OW, or UN we will use the chardet package if available to try to find the encoding. We fall back to just testing if the data is ASCII (skipping elements where this is not the case).
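
An illustration of that split (the generator shape is just for exposition; bytes_to_text stands for the chardet/ASCII fallback helper sketched earlier in this thread):

import dicom  # pydicom (imported as `dicom` in the versions discussed here)

BYTE_VRS = ('OB', 'OW', 'UN')

def iter_text_values(dcm, bytes_to_text):
    """Yield (tag, value) pairs, skipping byte elements that look like binary.

    `bytes_to_text` is the fallback decoder, returning None for non-text data.
    """
    for elem in dcm:
        if elem.VR in BYTE_VRS:
            text = bytes_to_text(elem.value)
            if text is not None:
                yield elem.tag, text
        else:
            # String-type VRs are left to pydicom's charset handling
            yield elem.tag, elem.value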

I integrated your PLY based parser for the xprotocol, with some minor changes made on my fork. I want to do some more work on this code to produce a more user friendly result. For example it would be nice to be able to do a lookup like result['']['Protocol0'] instead of result[0]['blocks'][0]['value'][2]['value'].

I am also thinking it would be nice to do the meta data "extraction" in a lazy manner. So you only pay the cost of all the extra parsing if you actually need something from one of the private sub headers. To that end, I am thinking the extraction code should really be integrated with the DicomWrapper objects. For example you could access the CSA Image sub header directly through a DicomWrapper by doing dw['CsaImage']. What do you think about this approach?
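
A minimal sketch of the lazy lookup idea - the class and the translators mapping are illustrative, not the actual nibabel API being proposed here:

class LazyWrapper(object):
    """Parse private sub-headers only when they are first requested."""

    def __init__(self, dcm_data, translators):
        self.dcm_data = dcm_data
        # Mapping of key -> callable that parses the raw private data
        self._translators = dict(translators)
        self._parsed = {}

    def __getitem__(self, key):
        if key in self._parsed:
            return self._parsed[key]
        if key in self._translators:
            # Pay the parsing cost only on first access, then cache the result
            self._parsed[key] = self._translators[key](self.dcm_data)
            return self._parsed[key]
        # Fall back to the underlying DICOM data set for everything else
        return self.dcm_data[key]

With something like dw = LazyWrapper(dcm, {'CsaImage': parse_csa_image}) (where parse_csa_image is a hypothetical translator callable), dw['CsaImage'] would trigger the CSA parsing only on first access.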

@matthew-brett (Member)

The lazy loading sounds like a very good idea.

Did you have a chance to look at the failing tests?

Will you have time to get to cleaning up the output from the ply parser? It seems like a good idea to get that in, in a good state.

@@ -13,11 +16,11 @@ def find_private_section(dcm_data, group_no, creator):
element number of these tags give the start of matching information, in the
higher tag numbers.

Parameters
----------
Paramters
Contributor

You seem to have inadvertently removed an e.

moloney and others added 12 commits April 23, 2015 15:43
  • Added some other small convenience functions for converting DICOM 'AS' and 'TM' values.
  • Fixed up our "fake" test data to work with the new find_private_element function. Still need to handle text encoding, and ignoring non-text byte data.
  • The ASCCONV text header only needs its weird double quotes fixed to be valid Python syntax. Fix double quotes, parse with Python ast module. I think this should be more robust than a custom parser.
  • Fixes to make tests pass for Python 3.
  • ElemDict claimed to accept a mapping as __init__ input, but in fact this was broken. Fix and test.
@coveralls

Coverage decreased (-0.61%) to 94.65% when pulling 078d775 on moloney:enh-meta-extract into 2e575eb on nipy:master.

Get rid of 'extract' module. The dicom wrapper objects can be
initialized with translators for private elements and then the
parsed results can be accessed directly through __getitem__.

The dicom wrappers still need a method that returns the nipy
JSON meta data extension for the full data set.
@moloney (Contributor, Author) commented Apr 24, 2015

I just committed my first attempt at putting the extraction logic into the dicom wrapper objects and doing lazy parsing. For example, you can now access the "CSA Image" sub header by just doing dw['csa_image'].

I made some backwards incompatible changes. In particular, the dicom wrapper objects now all take a list of PrivateTranslator objects as their second argument to __init__. Previously, the Siemens wrappers took the CSA Image header as the second argument. I don't think a lot of user code would have used that functionality rather than just relying on the wrapper_from_data method to do the right thing. The Siemens wrappers also no longer have a csa_header attribute that points to the CSA Image sub header, but we could add this back in temporarily with a deprecation warning.
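
A minimal usage sketch of the behaviour described above, assuming this branch is installed; 'example.dcm' stands in for any Siemens DICOM file:

import dicom
from nibabel.nicom import dicomwrappers

# wrapper_from_data still picks the right wrapper class for the data set
dw = dicomwrappers.wrapper_from_data(dicom.read_file('example.dcm'))

# On this branch, the "CSA Image" sub header is parsed lazily on first access
csa_image = dw['csa_image']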

Also, the wrappers now require a real DICOM data set rather than a dict. Again, I don't think that is a big deal for user code. I still need to update the tests that are broken by this change, but wanted to get some feedback first.

Finally, the dicom wrapper objects need a method to return the full JSON meta data extension described here: https://github.com/nipy/nibabel/wiki/json-header-extension

I am going to start working on improving the xpparse output as described above while I wait for any feedback on these changes.

  • Doesn't need dicom or xpparse as we just get input from a text file.
  • Test array with nested array which in turn has a nested map. The array value itself also has meta data attributes which should presumably override those of the array itself, but for now we just ignore them.
  • No longer need to work around ply issue (dabeaz/ply#52).
  • Fairly large refactor of parser. Simplify the code a bit and make things more DRY. The idea is to allow clean, readable access to nested meta data structures where 99% of the time we are interested in the core "value" for each element in the structure rather than the associated meta data.
  • Allow addition on ElemList objects. Make both ElemList and ElemDict work better with other objects of the same type (e.g. allow the constructor to take them, allow ElemDict.update() to take an ElemDict, etc). No longer try to autoconvert nested dict objects to ElemDict objects, as this is a bad idea and we do want to allow plain dicts to be nested.
  • Fixes issue where I assumed 'dependency' and 'param_card' elements have unique names.
  • Newer versions of syngo may include some attributes that describe the ASCCONV sub header. We parse these and include the results. Still do splitting of xprotocol/ascconv in dicomwrappers.
@matthew-brett (Member)

Hi Brendan. Sorry to be so slow to get to this. What do you think the best way forward here is? I'm afraid I've let a lot of changes back up. Maybe schedule an online or in-person meeting sometime to work through this stuff?

@moloney (Contributor, Author) commented Dec 18, 2015

Hi Matthew. We just hired someone to help me out and he is working on a branch for doing the stacking part and creating the meta data extension. I hope to spend some time in the next week cleaning up this PR, at which point it would be good to discuss further.

@matthew-brett (Member)

Hi Brendan - that's good news. I think I've not done a good job of helping you with this PR - what do you think we should do better from now on? I should obviously give more timely feedback; I will try hard to do that, please do remind me.
