-
Hi everyone, I'm currently facing a challenge and would appreciate your insights. I have multiple SDF (Structure-Data File) datasets, and I need to compare all of them to remove duplicate molecules. The specifics I need help with: the task needs to be parallelized and scalable, since I'm working with a limited amount of RAM.
I'm open to suggestions and would love to hear about how others have tackled similar problems. Thanks in advance for your help!
-
How many files/compounds are you talking about? What is your memory constraint? What do you consider scaled, i.e. what is your throughput expectation?
I tend to use unix tools when I can, so this only really works in a unix environment. Here is my approach.
GOAL: if you have a data record of
canonical smiles string, original filename, index
you can use the smiles string as the unique identifier and use the filename and index to extract the original data from the file. Here is one way to do this.
For each file, write out the canonical smiles string, the file it came from, and the index of the molecule in that file to a file with the extension .smitxt, i.e. foo.sdf > foo.smitxt:
tosmitxt.py
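The script itself did not survive in this thread, so here is a minimal sketch of what a tosmitxt.py could look like, assuming RDKit and a tab-separated output format (both are my assumptions, not the original poster's code):

```python
# tosmitxt.py -- sketch only; the original script is not shown above.
# Writes one line per molecule: canonical SMILES, source filename, record index.
import sys
from rdkit import Chem

def convert(sdf_path):
    out_path = sdf_path.rsplit(".", 1)[0] + ".smitxt"
    supplier = Chem.SDMolSupplier(sdf_path)
    with open(out_path, "w") as out:
        for idx, mol in enumerate(supplier):
            if mol is None:  # skip records RDKit cannot parse
                continue
            # MolToSmiles returns canonical SMILES by default
            out.write(f"{Chem.MolToSmiles(mol)}\t{sdf_path}\t{idx}\n")

if __name__ == "__main__":
    convert(sys.argv[1])
```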
Run the conversions in parallel; this turns *.sdf into *.smitxt (see the GNU parallel docs; these commands can look more complicated than they really are).
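The command itself is missing above; assuming the tosmitxt.py sketch, a plausible GNU parallel invocation would be:

```bash
# convert every foo.sdf in the current directory to foo.smitxt, in parallel
parallel python tosmitxt.py ::: *.sdf
```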
Next we need a way to extract molecules from the smitxt files:
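The original snippet is also missing here; a minimal sketch, assuming the tab-separated .smitxt format above and input already sorted by filename and index:

```python
# extract.py -- sketch only; reads "smiles<TAB>filename<TAB>index" lines from
# stdin and writes the corresponding SDF records to unique.sdf. Sorting the
# input by filename and index means each SDF is opened only once.
import sys
from rdkit import Chem

writer = Chem.SDWriter("unique.sdf")
supplier, current_file = None, None
for line in sys.stdin:
    smiles, filename, idx = line.rstrip("\n").split("\t")
    if filename != current_file:
        supplier = Chem.SDMolSupplier(filename)  # random-access supplier
        current_file = filename
    mol = supplier[int(idx)]  # fetch the record by its index within the file
    if mol is not None:
        writer.write(mol)
writer.close()
```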
Then we need to get the unique smiles from this data set. We'll use sort. The goal is to output unique smiles strings, followed by the file they were in and the index. As a bonus, to make our extraction easier, we'll sort on the filename and the index so we extract the molecules in order and don't keep bouncing around within the file or across files. (see the sort documentation for details)
Here:
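(The command was lost from this thread; assuming tab-separated fields, it was presumably along these lines.)

```bash
# keep one line per unique canonical SMILES (field 1)
sort -t$'\t' -u -k1,1 *.smitxt > unique.smitxt
```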
This returns the lines with unique smiles strings (note: if you have a TON of files, you may need to use xargs or something like that to keep the command line within limits).
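And a second pass (again a reconstruction of the lost command; field 2 is the filename, field 3 the numeric index):

```bash
# order by source file, then by record index within each file
sort -t$'\t' -k2,2 -k3,3n unique.smitxt > unique_sorted.smitxt
```

The sorted file could then be fed straight to the extraction script, e.g. `python extract.py < unique_sorted.smitxt`.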
This sorts the output so that the filenames and their indices are in order, which makes extraction quicker. These unix tools are quite fast and robust; I would try this approach first before reinventing the wheel or using a database like sqlite to do the heavy lifting. Cheers,
-
I have been playing around with something that might help you with this problem: https://github.com/kienerj/rdkit_parallel_tools
It makes use of the code described here: https://baoilleach.blogspot.com/2020/05/python-patterns-for-processing-large.html
The code snippets in these links might help get you started. You will still need the required hardware to do this task.

Another option, depending on how much data we are talking about, is to use KNIME for such tasks. Why KNIME? Because it handles the disk/memory IO for you. Back in the day I think I worked with >100 million structures on 8 GB of RAM. Yes, it's not fast at all, but it works. If you don't know KNIME, it's a GUI workflow-creation tool (like Alteryx or Pipeline Pilot). It's open source and free, and has chemistry integrations (CDK and RDKit). For example, there is an SDF Reader node to which you can add as many files as you want to be read, or you can simply read an entire directory of sd-files; then you have all the data in a single table, and from there you can start with your calculations.

For de-duplication, the hard part is choosing the right key: InChI? Canonical smiles? But once you have it, you can simply group on it. Anyway, KNIME has a learning curve, but as said, it takes care of the disk-vs-memory part for you, which can be very helpful.
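To illustrate the "pick a key, then group on it" idea outside of KNIME, here is a rough RDKit sketch using InChIKey as the key (the file names and single-pass streaming are illustrative assumptions; memory use is dominated by the set of keys, not the molecules):

```python
# Sketch: stream molecules from several SDFs, keep the first occurrence of
# each InChIKey, and write the de-duplicated set to a new SDF.
from rdkit import Chem

seen = set()
writer = Chem.SDWriter("deduplicated.sdf")   # hypothetical output name
for path in ["set1.sdf", "set2.sdf"]:        # hypothetical input files
    for mol in Chem.SDMolSupplier(path):
        if mol is None:
            continue
        key = Chem.MolToInchiKey(mol)        # or canonical SMILES, your choice
        if key not in seen:
            seen.add(key)
            writer.write(mol)
writer.close()
```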