Large SDF to rdkit mol object conversion tips? #7079

DKchemistry · 2024-01-21T05:58:45Z

DKchemistry
Jan 21, 2024

Hi all, I am trying to convert an SDF to a dataframe of RDKit mol objects. The SDF is large (500K+ records). I have more compute to throw at the problem, but I don't know how to take advantage of it.

import pandas as pd
from rdkit import Chem


def load_sdf_to_dataframe(filename, active):
    # Create a molecule supplier
    mol_supplier = Chem.SDMolSupplier(filename)

    # Load the molecules and their properties into a list. Activity is set by active=True/False 
    molecules = []
    for mol in mol_supplier:
        if mol is not None:
            props = mol.GetPropsAsDict()
            props['Title'] = mol.GetProp('_Name')
            props['Mol'] = mol
            props['Activity'] = 1 if active else 0
            molecules.append(props)

    # Convert the list into a DataFrame
    df = pd.DataFrame(molecules)

    # Reorder the DataFrame columns
    cols = ['Title', 'Mol', 'Activity'] + [col for col in df.columns if col not in ['Title', 'Mol', 'Activity']]
    df = df[cols]

    return df

I've seen this post from Greg: https://greglandrum.github.io/rdkit-blog/posts/2023-11-11-usingmultithreadedreaders.html
And this from iwatobipen: https://iwatobipen.wordpress.com/2021/05/04/read-sdf-with-multi-thread-rdkit-memo-chemoinformatics/

But I can't really figure out how to implement it, or whether it even helps. I think my code is choking on converting the SDF records to RDKit mol objects, and I would love to use more than one CPU to do this. The order of the df is irrelevant, as long as the records themselves are correct.

My code can do about 500 molecules in about 12 seconds on one CPU. If I could use 10 CPUs, I'd be totally fine with that speed.
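To put the numbers above in perspective, here is a back-of-the-envelope estimate. It assumes the measured rate (500 molecules in 12 seconds) holds for the whole file and that extra CPUs scale perfectly linearly, which is optimistic. The function name `estimate_seconds` is made up for this illustration.

```python
def estimate_seconds(n_records, sample_records, sample_seconds, workers=1):
    """Estimate total runtime from a timed sample, assuming ideal linear scaling."""
    rate = sample_records / sample_seconds  # molecules per second on one CPU
    return n_records / rate / workers


# 500K records is the lower bound quoted in the post ("500K+").
single = estimate_seconds(500_000, 500, 12)
ten = estimate_seconds(500_000, 500, 12, workers=10)
print(f"{single / 3600:.1f} h on 1 CPU, {ten / 60:.0f} min on 10 CPUs")
```

Under those assumptions, the full file takes a little over three hours on one CPU and about twenty minutes on ten.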

bp-kelley · 2024-01-22T00:58:18Z

bp-kelley
Jan 22, 2024
Collaborator

I think your culprit is GetPropsAsDict. Try calling it only on the first molecule; you can then use the resulting types to map the rest.

If this doesn’t make sense, I can post an example.
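A minimal sketch of what this suggestion could look like (not bp-kelley's own code). It assumes that `GetPropsAsDict()` on the first molecule returns Python `int`, `float`, or `str` values, and that `GetProp()` on later molecules returns plain strings. The helper names `infer_converters`, `convert_props`, and `typed_props_for` are hypothetical, and falling back to the raw string when a conversion fails is a design choice of this sketch.

```python
def infer_converters(typed_props):
    """Map each property name to a converter, based on one molecule's typed values."""
    converters = {}
    for key, value in typed_props.items():
        if isinstance(value, int):
            converters[key] = int
        elif isinstance(value, float):
            converters[key] = float
        else:
            converters[key] = str
    return converters


def convert_props(raw_props, converters):
    """Convert raw string values; keep the string if a conversion fails."""
    out = {}
    for key, raw in raw_props.items():
        convert = converters.get(key, str)
        try:
            out[key] = convert(raw)
        except ValueError:
            # e.g. the first molecule had an int here, but this one has "n/a"
            out[key] = raw
    return out


def typed_props_for(mol, converters):
    """Read the known keys from a mol-like object (needs HasProp and GetProp)."""
    raw = {key: mol.GetProp(key) for key in converters if mol.HasProp(key)}
    return convert_props(raw, converters)
```

In the loading loop, one would call `converters = infer_converters(mol.GetPropsAsDict())` for the first valid molecule and `typed_props_for(mol, converters)` for every later one. Properties that do not appear on the first molecule are ignored by this approach.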

2 replies

kienerj Jan 30, 2024

@DKchemistry To explain from my own experience: GetPropsAsDict is very slow, so calling it for each of the 500k molecules will have a noticeable effect on runtime.

Besides that, as Greg says, the multithreaded supplier should help with parsing the SDF records into molecules.

DKchemistry Feb 11, 2024
Author

@bp-kelley Could you post an example? Sorry, I'm not great at this :)

@kienerj I will read it again; I had trouble adapting it to my use case.

greglandrum · 2024-01-24T17:46:35Z

greglandrum
Jan 24, 2024
Maintainer

@DKchemistry Did you try using the multithreaded SD mol supplier, as I demonstrated in that blog post?

1 reply

DKchemistry Feb 11, 2024
Author

@greglandrum I couldn't figure that out at the time and had other work, but I've finished everything else and now need to figure it out :)

DKchemistry · 2024-02-11T23:47:38Z

DKchemistry
Feb 11, 2024
Author

Well, for anyone in the future, I got this together, with both the multithreading (@greglandrum) and the GetPropsAsDict (@bp-kelley, @kienerj) revision:

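def load_sdf_to_dataframe_multithread_dict(args):
  """
  Load molecules and their properties from an SDF file into a DataFrame.

  args is a (file, active_status) tuple; active_status == "active" sets
  Activity to 1, anything else sets it to 0. Assumes `import pandas as pd`
  and `from rdkit import Chem` have already been run.
  """
  file, active_status = args  # Unpack the tuple of arguments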

  # Create a molecule supplier
  mol_supplier = Chem.MultithreadedSDMolSupplier(file, numWriterThreads=8)

  # Load the molecules and their properties into a list
  molecules = []
  first_mol = True
  for mol in mol_supplier:
    if mol is not None:
      if first_mol:
        # Get properties as dictionary only for the first molecule
        props = mol.GetPropsAsDict()
        keys = props.keys()
        first_mol = False
      else:
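        # For the rest of the molecules, read the known keys directly.
        # Note: GetProp returns strings, so these values are untyped,
        # unlike the typed values GetPropsAsDict gave for the first molecule.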
        props = {key: mol.GetProp(key) for key in keys if mol.HasProp(key)}
      
      props["Title"] = mol.GetProp("_Name")
      props["Mol"] = mol
      props["Activity"] = 1 if active_status == "active" else 0
      molecules.append(props)

  # Convert the list into a DataFrame
  df = pd.DataFrame(molecules)

  # Reorder the DataFrame columns
  cols = ["Title", "Mol", "Activity"] + [
    col for col in df.columns if col not in ["Title", "Mol", "Activity"]
  ]
  df = df[cols]

  return df

I can't give an exact speedup for each component, but this parsed 11,665 docked molecules from a Glide SDF in 0.7 s!

I also parsed a much smaller file (only 300 docked molecules from Glide) using a version of this function that had the multithreading but did not implement the suggestion from @bp-kelley and @kienerj (thank you!), and it took 4.5 s! So it seems like the rate-determining step (if you forgive the organic chemist speak) was .GetPropsAsDict(). Hopefully this will scale fast enough for my needs :)

Here it is:

def load_sdf_to_dataframe_multithread(args):
  """
  Load molecules and their properties from an SDF file into a DataFrame.
  """
  file, active_status = args  # Unpack the tuple of arguments

  # Create a molecule supplier
  mol_supplier = Chem.MultithreadedSDMolSupplier(file, numWriterThreads=8)

  # Load the molecules and their properties into a list
  molecules = []
  for mol in mol_supplier:
    if mol is not None:
      props = mol.GetPropsAsDict()
      props["Title"] = mol.GetProp("_Name")
      props["Mol"] = mol
      props["Activity"] = 1 if active_status == "active" else 0
      molecules.append(props)

  # Convert the list into a DataFrame
  df = pd.DataFrame(molecules)

  # Reorder the DataFrame columns
  cols = ["Title", "Mol", "Activity"] + [
    col for col in df.columns if col not in ["Title", "Mol", "Activity"]
  ]
  df = df[cols]

  return df
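The two timings quoted above came from different files (11,665 vs. 300 molecules), so they are not a direct comparison. A small timing harness like the one below, run on the same file, could give a cleaner number for each variant. This is only a sketch: `time_call` and `compare` are hypothetical helpers, and the file name in the usage comment is a placeholder.

```python
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Run func several times; return (best wall-clock seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best, result


def compare(loaders, args, repeats=3):
    """Time each loader on the same args; return {name: best seconds}."""
    return {
        name: time_call(loader, args, repeats=repeats)[0]
        for name, loader in loaders.items()
    }


# Hypothetical usage with the two functions from this post:
# timings = compare(
#     {
#         "props_dict_once": load_sdf_to_dataframe_multithread_dict,
#         "props_dict_every_mol": load_sdf_to_dataframe_multithread,
#     },
#     ("ligands.sdf", "active"),  # placeholder file name
# )
```

Taking the best of a few repeats reduces the influence of disk caching and other one-off effects on the first run.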

0 replies
