You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I'm finishing up support for using CDK to read V3000 SDF records in chemfp, and found a problem during testing.
Everything is fine when I use IteratingSDFReader to read a file. I am able to get the molecule and its associated tag data with both V2000 and V3000 records.
I have a problem when reading a single record, where the cdk.io.MDLV2000Reader().read() parses the tag data while the cdk.io.MDLV3000Reader().read() does not.
When I parse a single SDF record, I figure out if it's V2000 or V3000 and dispatch to the appropriate MDL Reader. This replicates what IteratingSDFReader does (see #952). I decided to use this approach as I find it gives better error reporting than IteratingSDFReader - the latter catches and logs errors, because it's designed to skip records it can't handle rather than raise an exception (see #951)
I discovered that while MDLV2000Reader parses the tag data, MDLV3000Reader does not. More specifically, MDLV2000Reader.readAtomContainer has readNonStructuralData() to read the tag data:
// read potential SD file data between M END and $$$$
readNonStructuralData(input, outputContainer);
and that in turn sets container.setProperty(header, data.toString()).
On the other hand, MDLV3000Reader does not parse the tag data, which you can verify by failing to find the relevant setProperty() call.
This is not an issue when using IteratingSDFReader because the iterating reader will only have the MDL reader process a record up to the M_END (ie, MDLV2000Reader will never call readNonStructuralData), then the iterating reader will parse the tag data itself, using readDataBlockInto():
if (currentLine.startsWith(M_END)) {
...
molecule = reader.read(builder.newAtomContainer());
...
if (molecule != null) {
readDataBlockInto(molecule);
(Also, it uses a different algorithm; MDLV2000Reader requires the SDF end with "$$$$" exactly while IteratingSDFReader says the line only needs to start with "$$$$".)
My work-around is to parse the tag data myself, if it's a V3000 record, and set the corresponding properties.
The text was updated successfully, but these errors were encountered:
Yes everything's out of sync, If you have a patch I can review apply otherwise you'll have to wait for my plan to rewrite both readers on a common base. At the moment I'm too busy with NextMove bits.
I don't have a patch. I don't have the Java skills, nor the ability to decide which patch is most appropriate for the project, especially given the known long-term plans for a rewrite.
I'm finishing up support for using CDK to read V3000 SDF records in chemfp, and found a problem during testing.
Everything is fine when I use IteratingSDFReader to read a file. I am able to get the molecule and its associated tag data with both V2000 and V3000 records.
I have a problem when reading a single record, where the cdk.io.MDLV2000Reader().read() parses the tag data while the cdk.io.MDLV3000Reader().read() does not.
When I parse a single SDF record, I figure out if it's V2000 or V3000 and dispatch to the appropriate MDL Reader. This replicates what IteratingSDFReader does (see #952). I decided to use this approach as I find it gives better error reporting than IteratingSDFReader - the latter catches and logs errors, because it's designed to skip records it can't handle rather than raise an exception (see #951)
I discovered that while MDLV2000Reader parses the tag data, MDLV3000Reader does not. More specifically, MDLV2000Reader.readAtomContainer has readNonStructuralData() to read the tag data:
and that in turn sets
container.setProperty(header, data.toString())
.On the other hand, MDLV3000Reader does not parse the tag data, which you can verify by failing to find the relevant setProperty() call.
This is not an issue when using IteratingSDFReader because the iterating reader will only have the MDL reader process a record up to the M_END (ie, MDLV2000Reader will never call readNonStructuralData), then the iterating reader will parse the tag data itself, using readDataBlockInto():
(Also, it uses a different algorithm; MDLV2000Reader requires the SDF end with "$$$$" exactly while IteratingSDFReader says the line only needs to start with "$$$$".)
My work-around is to parse the tag data myself, if it's a V3000 record, and set the corresponding properties.
The text was updated successfully, but these errors were encountered: