-
Hi everyone, I'm currently facing a challenge and would appreciate your insights. I have multiple SDF (Structure-Data File) datasets, and I need to compare all of them to remove duplicate molecules. The specifics I need help with: the task needs to be parallelized and scalable, since I'm working with a limited amount of RAM.
I'm open to suggestions and would love to hear about how others have tackled similar problems. Thanks in advance for your help!
-
How many files/compounds are you talking about? What is your memory constraint? What do you consider scaled, i.e. what is your throughput expectation?
I tend to use unix tools when I can, so this only really works in a unix environment. Here is my approach.
GOAL: if you have a data record of
canonical smiles string, original filename, index
you can use the smiles string as the unique identifier and use the filename and index to extract the original data from the file. Here is one way to do this.
For each file, write out the canonical smiles string, the file it came from, and the index of the molecule in that file to a file with the extension .smitxt, i.e. foo.sdf > foo.smitxt:
tosmitxt.py
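The script itself did not survive in this thread, so here is a minimal sketch of what a tosmitxt.py could look like, assuming RDKit and a tab-separated output format (both are my assumptions, not the original poster's code):

```python
# tosmitxt.py -- sketch only; the original script is not shown above.
# Writes one line per molecule: canonical SMILES, source filename, record index.
import sys
from rdkit import Chem

def convert(sdf_path):
    out_path = sdf_path.rsplit(".", 1)[0] + ".smitxt"
    supplier = Chem.SDMolSupplier(sdf_path)
    with open(out_path, "w") as out:
        for idx, mol in enumerate(supplier):
            if mol is None:  # skip records RDKit cannot parse
                continue
            # MolToSmiles returns canonical SMILES by default
            out.write(f"{Chem.MolToSmiles(mol)}\t{sdf_path}\t{idx}\n")

if __name__ == "__main__":
    convert(sys.argv[1])
```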
Run the conversions in parallel; this turns *.sdf into *.smitxt (see the GNU parallel docs; these commands can look more complicated than they really are).
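The command itself is missing above; assuming the tosmitxt.py sketch, a plausible GNU parallel invocation would be:

```bash
# convert every foo.sdf in the current directory to foo.smitxt, in parallel
parallel python tosmitxt.py ::: *.sdf
```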
Next we need a way to extract molecules from the smitxt files:
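The original snippet is also missing here; a minimal sketch, assuming the tab-separated .smitxt format above and input already sorted by filename and index:

```python
# extract.py -- sketch only; reads "smiles<TAB>filename<TAB>index" lines from
# stdin and writes the corresponding SDF records to unique.sdf. Sorting the
# input by filename and index means each SDF is opened only once.
import sys
from rdkit import Chem

writer = Chem.SDWriter("unique.sdf")
supplier, current_file = None, None
for line in sys.stdin:
    smiles, filename, idx = line.rstrip("\n").split("\t")
    if filename != current_file:
        supplier = Chem.SDMolSupplier(filename)  # random-access supplier
        current_file = filename
    mol = supplier[int(idx)]  # fetch the record by its index within the file
    if mol is not None:
        writer.write(mol)
writer.close()
```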
Then we need to get the unique smiles from this data set. We'll use sort. The goal is to output unique smiles strings, followed by the file they were in and the index. As a bonus, to make our extraction easier, we'll sort on the filename and the index so we extract the molecules in order and don't keep bouncing around within the file or across files. (see the sort documentation for details)
Here:
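(The command was lost from this thread; assuming tab-separated fields, it was presumably along these lines.)

```bash
# keep one line per unique canonical SMILES (field 1)
sort -t$'\t' -u -k1,1 *.smitxt > unique.smitxt
```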
This returns the lines with unique smiles strings (note: if you have a TON of files, you may need to use xargs or something like that to keep the command line within limits).
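And a second pass (again a reconstruction of the lost command; field 2 is the filename, field 3 the numeric index):

```bash
# order by source file, then by record index within each file
sort -t$'\t' -k2,2 -k3,3n unique.smitxt > unique_sorted.smitxt
```

The sorted file could then be fed straight to the extraction script, e.g. `python extract.py < unique_sorted.smitxt`.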
This sorts the output so that the filenames and their indices are in order, which makes extraction quicker. These unix tools are quite fast and robust; I would try this approach first before reinventing the wheel or using a database like sqlite to do the heavy lifting. Cheers,
-
I have been playing around with something that might help you with this problem: https://github.com/kienerj/rdkit_parallel_tools
It makes use of the code described here: https://baoilleach.blogspot.com/2020/05/python-patterns-for-processing-large.html
The code snippets in these links might help get you started. You will still need the required hardware to do this task.

Another option, depending on how much data we are talking about, is to use KNIME for such tasks. Why KNIME? Because it handles the disk/memory IO for you. Back in the day I think I worked with >100 million structures on 8 GB of RAM. Yes, it's not fast at all, but it works. If you don't know KNIME, it's a GUI workflow-creation tool (like Alteryx or Pipeline Pilot). It's open source and free, and has chemistry integrations (CDK and RDKit). For example, there is an SDF Reader node to which you can add as many files as you want to be read, or you can simply read an entire directory of sd-files; then you have all the data in a single table, and from there you can start with your calculations.

For de-duplication, the hard part is choosing the right key: InChI? Canonical smiles? But once you have it, you can simply group on it. Anyway, KNIME has a learning curve, but as said, it takes care of the disk-vs-memory part for you, which can be very helpful.
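To illustrate the "pick a key, then group on it" idea outside of KNIME, here is a rough RDKit sketch using InChIKey as the key (the file names and single-pass streaming are illustrative assumptions; memory use is dominated by the set of keys, not the molecules):

```python
# Sketch: stream molecules from several SDFs, keep the first occurrence of
# each InChIKey, and write the de-duplicated set to a new SDF.
from rdkit import Chem

seen = set()
writer = Chem.SDWriter("deduplicated.sdf")   # hypothetical output name
for path in ["set1.sdf", "set2.sdf"]:        # hypothetical input files
    for mol in Chem.SDMolSupplier(path):
        if mol is None:
            continue
        key = Chem.MolToInchiKey(mol)        # or canonical SMILES, your choice
        if key not in seen:
            seen.add(key)
            writer.write(mol)
writer.close()
```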