
Strategies for Parallelized Comparison and De-duplication of SDF Files in a Memory-Constrained Environment #7356

Answered by bp-kelley
Poccia asked this question in Q&A

How many files/compounds are we talking about? What is your memory constraint? And what do you consider scaled, i.e. what is your throughput expectation?

I tend to use unix tools when I can, so this only really works in a unix environment. Here is my approach.

GOAL:
If you have a data record:

canonical SMILES string, original filename, index

then you can use the SMILES string as the unique identifier, and the filename and index to extract the original data from the file. Here is one way to do this.

For each file, write out the canonical SMILES string, the file it came from, and the index of the molecule within that file to a file with the extension .smitxt

i.e. foo.sdf > foo.smitxt

tosmitxt.py

import sys
from rdkit impo…
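The script above is cut off in the page capture. Here is a minimal sketch of what such a script might look like, based on the record format described (canonical SMILES, source filename, index); the tab-separated layout and the `tosmitxt` function name are assumptions for illustration, not the author's actual code:

```python
import sys
from rdkit import Chem

def tosmitxt(sdf_path, out=sys.stdout):
    # One tab-separated record per molecule:
    #   canonical SMILES <TAB> source filename <TAB> index within the file
    for index, mol in enumerate(Chem.SDMolSupplier(sdf_path)):
        if mol is None:
            # Unparseable record: skip it; enumerate() keeps indices aligned
            # with positions in the original SDF.
            continue
        out.write(f"{Chem.MolToSmiles(mol)}\t{sdf_path}\t{index}\n")

if __name__ == "__main__" and len(sys.argv) > 1:
    tosmitxt(sys.argv[1])
```

Since each .sdf file is converted independently, this step parallelizes trivially, e.g. one process per file.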

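With a .smitxt file alongside each input, de-duplication reduces to keeping the first (filename, index) seen for each canonical SMILES. The answer leans on unix tools for this; a pure-Python equivalent (the `dedupe_smitxt` name and glob pattern are assumptions for illustration) might look like:

```python
import glob

def dedupe_smitxt(pattern="*.smitxt"):
    """Map each canonical SMILES to the first (filename, index) where it appears."""
    seen = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as fh:
            for line in fh:
                smiles, fname, idx = line.rstrip("\n").split("\t")
                # setdefault keeps the first occurrence; later duplicates are ignored
                seen.setdefault(smiles, (fname, int(idx)))
    return seen
```

Because the .smitxt files are plain text, the dictionary holds only one small tuple per unique compound rather than any molecule objects, which is what keeps the memory footprint low; the unix-tools version of the same step would be along the lines of sorting the concatenated .smitxt files uniquely on the SMILES column.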
Replies: 2 comments, 3 replies
Answer selected by Poccia