Array-like structure for MolBlock IO #7235
-
Hi everyone,

Context: For Polaris, I would like to minimize the number of different file types the storage backend supports. I am therefore exploring whether Zarr could serve as a general-purpose data format for saving drug discovery datasets. So far so good, but I'm now running into some challenges with saving SDF files, specifically the Atom and Bond Block. For this application, I am not interested in supporting query molecules, so that should simplify things.

The Atom and Bond Block lend themselves quite naturally to being saved as arrays. For example, say we have the following SDF file:
We can save this as two arrays:

# For the atom block, we convert the atom symbol to its atomic number
atom_block = [
    [-2.1392, -3.1722, -1.1488, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [-1.1302, -2.4867, -0.3430, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    ...
    [1.9089, 0.5829, 1.4879, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
bond_block = [
    [1, 2, 1, 0],
    [2, 3, 1, 0],
    ...
    [13, 5, 1, 0],
]

As long as we save the Chiral Flag from the counts line, I believe we would not lose any information this way? Does RDKit provide an easy way to get all the properties that make up the atom and bond block? I only found

Curious to hear if anyone in the community has any thoughts on this.
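To make the proposed layout concrete, here is a minimal sketch of how a V2000 mol block could be split into the two arrays using only fixed-width parsing in plain Python. It does not use RDKit. The names `ATOMIC_NUMBERS`, `read_int_fields` and `molblock_to_arrays` are hypothetical, the element table is a small illustrative subset, and the two-atom mol block is hand-written for the example. It assumes well-formed V2000 input without query features and keeps only the first four bond fields, matching the layout above.

```python
# Illustrative subset only; a real table would cover every element.
ATOMIC_NUMBERS = {"H": 1, "C": 6, "N": 7, "O": 8, "F": 9, "S": 16, "Cl": 17}


def read_int_fields(line, start, count, width=3):
    """Read fixed-width integer fields; blank or missing fields are treated as 0."""
    values = []
    for i in range(count):
        chunk = line[start + width * i : start + width * (i + 1)].strip()
        values.append(int(chunk) if chunk else 0)
    return values


def molblock_to_arrays(molblock):
    """Split a V2000 mol block into (chiral_flag, atom_block, bond_block)."""
    lines = molblock.splitlines()
    counts = lines[3]  # the counts line follows the three header lines
    n_atoms, n_bonds, _, _, chiral_flag = read_int_fields(counts, 0, 5)

    atom_block = []
    for line in lines[4 : 4 + n_atoms]:
        xyz = [float(line[i : i + 10]) for i in (0, 10, 20)]
        symbol = line[31:34].strip()
        mass_diff = read_int_fields(line, 34, 1, width=2)  # 2-character field
        rest = read_int_fields(line, 36, 11)  # remaining 3-character fields
        atom_block.append(xyz + [ATOMIC_NUMBERS[symbol]] + mass_diff + rest)

    bond_start = 4 + n_atoms
    bond_block = [
        read_int_fields(line, 0, 4)  # first atom, second atom, type, stereo
        for line in lines[bond_start : bond_start + n_bonds]
    ]
    return chiral_flag, atom_block, bond_block


# Hand-written two-atom example with the chiral flag set to 1.
atom_tail = " 0" + "  0" * 11
example = "\n".join([
    "example", "  hand-written", "",
    "  2  1  0  0  1  0  0  0  0  0999 V2000",
    "   -2.1392   -3.1722   -1.1488 C  " + atom_tail,
    "   -1.1302   -2.4867   -0.3430 N  " + atom_tail,
    "  1  2  1  0",
    "M  END",
])
chiral_flag, atom_block, bond_block = molblock_to_arrays(example)
```

Each atom row ends up with 16 values (three coordinates, the atomic number and twelve integer fields), which is the row shape shown in the post. Anything stored after the bond block, such as property lines or SDF data fields, is ignored by this sketch and would need separate handling.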
-
Alternatively, we could first convert the mol to an entirely different molecular representation and save this alongside the 3D coordinates separately? But I don't want to lose the information stored in the atom and mol block flags. Maybe
-
That is the intent. It's not a direct binary dump of the molecule, but going to/from that binary representation should be mostly lossless. There are some details depending on whether or not you choose to preserve atom/bond/mol properties (and whether or not you have properties you need to preserve), but in general you should be fine.
Though you may win with storage, it's unlikely to be dramatic and you are definitely going to lose with performance. Serializing/deserializing molecules using the binary format is going to be MUCH faster.
-
@greglandrum One more question came up, about the properties you're referring to here ☝️. I found this bit of code to specify this, but cannot find any documentation on what this means exactly. I already have a way to save any properties stored in the SDF, e.g.:
If I want to preserve as much info as possible, is this necessary?
bytes_data = [mol.ToBinary(Chem.PropertyPickleOptions.AllProps) for mol in mols]
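One way to see what the flag does in practice is a small round trip. This is a minimal sketch, assuming RDKit is installed. The property name `source_id` and its value are made up for the example. It only checks that a molecule-level property set before serialization is still present after restoring from the `AllProps` binary, and makes no claim about what the default `ToBinary()` call keeps.

```python
from rdkit import Chem

mol = Chem.MolFromSmiles("CCO")
mol.SetProp("source_id", "example-1")  # hypothetical property, for illustration only

# Serialize with every property category requested, as in the snippet above.
blob = mol.ToBinary(Chem.PropertyPickleOptions.AllProps)

# Rebuild a molecule from the binary string.
restored = Chem.Mol(blob)

smiles_match = Chem.MolToSmiles(restored) == Chem.MolToSmiles(mol)
prop_restored = (
    restored.HasProp("source_id") and restored.GetProp("source_id") == "example-1"
)
```

If the SDF properties are already stored in their own columns, a check like this can help decide whether also carrying them inside the binary is worth it.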
Hi @cwognum, if I were doing this I would have a single column in which I store the V3000 mol block for each molecule.
If you want to enable more efficient processing with the RDKit, you could also have a column with the output of Mol.ToBinary() (that can change from version to version, but we maintain backwards compatibility).
Storing arrays of atoms and bonds is limiting, adds complication, requires you to write your own code to serialize/deserialize the molecules into the array format, and delivers very little additional value. There aren't very many use cases where you'd want to query those features, and if you need that, you can always add code that creates those arrays while still s…