Large SDF to rdkit mol object conversion tips? #7079
Replies: 3 comments 3 replies
-
I think your culprit is GetPropsAsDict. Try calling it only on the first molecule; you can then use the resulting types to map the rest. If this doesn’t make sense, I can post an example.
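A minimal sketch of that idea, not necessarily what was meant above: record the Python type of each property on the first molecule, then cast the raw string values from later molecules with those types. The helper names (`build_converters`, `convert_props`) and the Glide-style property names are hypothetical, and plain dicts stand in for the RDKit calls so the snippet runs without RDKit installed.

```python
from typing import Any, Callable, Dict


def build_converters(first_props: Dict[str, Any]) -> Dict[str, Callable[[str], Any]]:
    """Record the Python type of each property seen on the first molecule."""
    return {key: type(value) for key, value in first_props.items()}


def convert_props(
    raw_props: Dict[str, str], converters: Dict[str, Callable[[str], Any]]
) -> Dict[str, Any]:
    """Cast raw string properties with the recorded types.

    Properties missing from raw_props are skipped; if a cast fails,
    the original string is kept rather than raising.
    """
    converted = {}
    for key, caster in converters.items():
        if key not in raw_props:
            continue
        try:
            converted[key] = caster(raw_props[key])
        except (TypeError, ValueError):
            converted[key] = raw_props[key]
    return converted


# Stand-in for mol.GetPropsAsDict() on the first molecule (hypothetical values).
first_props = {"r_i_docking_score": -7.25, "i_i_glide_posenum": 3, "s_m_title": "lig_1"}
converters = build_converters(first_props)

# Stand-in for the string values mol.GetProp(key) would give on a later molecule.
raw = {"r_i_docking_score": "-6.80", "i_i_glide_posenum": "12", "s_m_title": "lig_2"}
typed = convert_props(raw, converters)
print(typed)
```

This assumes every record carries the same property types as the first one; a record that breaks that assumption keeps its raw string instead of failing the whole load.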
-
@DKchemistry did you try using the multithreaded SD mol supplier as I demonstrated in that blog post?
-
Well, for anyone in the future, I got this together, with both the multithreading (@greglandrum) and the suggestion to call GetPropsAsDict() only on the first molecule (@bp-kelley, @kienerj):

    from rdkit import Chem
    import pandas as pd


    def load_sdf_to_dataframe_multithread_dict(args):
        """
        Load molecules and their properties from an SDF file into a DataFrame.
        """
        file, active_status = args  # Unpack the tuple of arguments
        # Create a molecule supplier
        mol_supplier = Chem.MultithreadedSDMolSupplier(file, numWriterThreads=8)
        # Load the molecules and their properties into a list
        molecules = []
        first_mol = True
        for mol in mol_supplier:
            if mol is not None:
                if first_mol:
                    # Get properties as dictionary only for the first molecule
                    props = mol.GetPropsAsDict()
                    # Snapshot the keys so the extra columns added below
                    # ("Title", "Mol", "Activity") do not leak into the list
                    keys = list(props.keys())
                    first_mol = False
                else:
                    # For the rest of the molecules, get properties directly
                    # (GetProp returns the values as strings by default)
                    props = {key: mol.GetProp(key) for key in keys if mol.HasProp(key)}
                props["Title"] = mol.GetProp("_Name")
                props["Mol"] = mol
                props["Activity"] = 1 if active_status == "active" else 0
                molecules.append(props)
        # Convert the list into a DataFrame
        df = pd.DataFrame(molecules)
        # Reorder the DataFrame columns
        cols = ["Title", "Mol", "Activity"] + [
            col for col in df.columns if col not in ["Title", "Mol", "Activity"]
        ]
        df = df[cols]
        return df

I can't give an exact speedup for each component, but this parsed 11,665 docked SDF molecules from Glide in 0.7 s! I parsed a much smaller file, only 300 docked SDF molecules from Glide, using a version of this function that had the multithreading but did not implement the suggestion from @bp-kelley and @kienerj (thank you!), and it took 4.5 s! So it seems like the rate-determining step (if you'll forgive the organic chemist speak) was .GetPropsAsDict(). Hopefully this will scale fast enough for my needs :) Here is that earlier, slower version:

    def load_sdf_to_dataframe_multithread(args):
        """
        Load molecules and their properties from an SDF file into a DataFrame.
        """
        file, active_status = args  # Unpack the tuple of arguments
        # Create a molecule supplier
        mol_supplier = Chem.MultithreadedSDMolSupplier(file, numWriterThreads=8)
        # Load the molecules and their properties into a list
        molecules = []
        for mol in mol_supplier:
            if mol is not None:
                # Calling GetPropsAsDict() on every molecule was the slow part
                props = mol.GetPropsAsDict()
                props["Title"] = mol.GetProp("_Name")
                props["Mol"] = mol
                props["Activity"] = 1 if active_status == "active" else 0
                molecules.append(props)
        # Convert the list into a DataFrame
        df = pd.DataFrame(molecules)
        # Reorder the DataFrame columns
        cols = ["Title", "Mol", "Activity"] + [
            col for col in df.columns if col not in ["Title", "Mol", "Activity"]
        ]
        df = df[cols]
        return df
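To put the two timings on the same scale, here is a quick back-of-the-envelope calculation using only the numbers quoted in this post. The two runs used different files, so treat the ratio as a rough indication rather than a benchmark; the `throughput` helper is just for illustration.

```python
def throughput(n_molecules: int, seconds: float) -> float:
    """Molecules parsed per second."""
    return n_molecules / seconds


# Numbers reported in the post above
fast = throughput(11665, 0.7)  # GetPropsAsDict() on the first molecule only
slow = throughput(300, 4.5)    # GetPropsAsDict() on every molecule
ratio = fast / slow

print(f"first-molecule-only: {fast:.0f} mol/s")
print(f"every molecule:      {slow:.1f} mol/s")
print(f"ratio:               {ratio:.0f}x")
```

That works out to a difference of roughly 250x in molecules per second between the two runs.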
-
Hi all, I am trying to convert an SDF file to a DataFrame of RDKit mol objects. The SDF is large (500K+ records). I have more compute to throw at the problem, but I don't know how to take advantage of it.
I've seen this post from Greg: https://greglandrum.github.io/rdkit-blog/posts/2023-11-11-usingmultithreadedreaders.html
And this from iwatobipen: https://iwatobipen.wordpress.com/2021/05/04/read-sdf-with-multi-thread-rdkit-memo-chemoinformatics/
But I can't really figure out how to implement it, or whether it even helps. I think my code is choking on converting the SDF records to RDKit mol objects, and I would love to use more than one CPU for this. The order of the DataFrame is irrelevant, as long as the records themselves are correct.
My code can do about 500 molecules in 12 seconds on one CPU. If I could use 10 CPUs at that per-CPU rate, I'd be totally fine with the speed.
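For a sense of scale, here is the arithmetic behind that estimate as a small sketch. It assumes the measured rate holds across the whole file and that work divides perfectly across CPUs, which is an idealization; `projected_seconds` is a hypothetical helper.

```python
def projected_seconds(n_records: int, rate_per_worker: float, n_workers: int = 1) -> float:
    """Estimated wall time, assuming ideal linear scaling across workers."""
    return n_records / (rate_per_worker * n_workers)


baseline_rate = 500 / 12   # molecules per second on one CPU (from the post)
n_records = 500_000        # lower bound on the file size (500K+ records)

one_cpu = projected_seconds(n_records, baseline_rate)
ten_cpus = projected_seconds(n_records, baseline_rate, n_workers=10)

print(f"1 CPU:   {one_cpu / 60:.0f} min")
print(f"10 CPUs: {ten_cpus / 60:.0f} min")
```

At the quoted rate, 500K records take about 200 minutes on one CPU and about 20 minutes on ten, if scaling were perfect.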