[v2 Feature Request]: automated rdkit2d features #843

Open
muammar opened this issue Apr 26, 2024 · 7 comments
Labels
enhancement a new feature request

@muammar
Contributor

muammar commented Apr 26, 2024

What are you trying to do?
I was able to run my first training and prediction in a notebook by following the tutorials! I have a few questions about how to achieve the following:

  1. How do I use rdkit2d features? Based on the documentation, I think I should compute them, dump them into a file, and then load that file. Is that correct? I am unsure how to proceed.
  2. How can I perform a hyperparameter optimization? If I have to do this from the command line, how do I load the hyperparameters from a file?
  3. How can I train a multi-task model?

Thank you very much for this new version, it is very exciting. I look forward to hearing from you.

Best,

@muammar muammar added the question Further information is requested label Apr 26, 2024
@kevingreenman kevingreenman added this to the v2.0.1 milestone Apr 30, 2024
@SteshinSS

Also got here looking for ways to use rdkit2d features as in v1: --features_generator rdkit_2d_normalized --no_features_scaling

@JacksonBurns JacksonBurns added enhancement a new feature request and removed question Further information is requested labels May 16, 2024
@JacksonBurns JacksonBurns changed the title from "[v2 QUESTION]: how to run some of the old routines, e.g., multi-task, rdkit2d features and do a hyper parameter optimization of the architecture" to "[v2 Feature Request]: automated rdkit2d features" May 16, 2024
@JacksonBurns
Member

As far as 2D features go, we had originally not added this to v2 because we wanted people to calculate them on their own and pass them as extra features manually. Since it has now been requested to do this automatically here, in #864 (here: #864 (comment)), and in #849, we will put this feature on the TODO list.

@KC-Zhang

KC-Zhang commented May 17, 2024

For now, how do I manually calculate --features_generator rdkit_2d_normalized --no_features_scaling in v2?

@SteshinSS

@KC-Zhang Chemprop v1 generated these features with descriptastorus, which is simple to use. However, I didn't understand how to use those features during training, as the example notebook from #772 doesn't work.

@KC-Zhang

KC-Zhang commented May 18, 2024

@SteshinSS Which specific notebook in #772 are you talking about? Is it examples/extra_features_from_featurizer.ipynb? When I try this file, it fails at this line: from chemprop.featurizers import MoleculeFeaturizer. Are you experiencing the same?

@SteshinSS

@KC-Zhang I think examples/loaded_molecule_features.ipynb should be the example of using additional features, but it didn't work for me with version 2.0.0.

@KnathanM
Contributor

> For now, how do I manually calculate --features_generator rdkit_2d_normalized --no_features_scaling in v2?

First, install descriptastorus:

pip install git+https://github.com/bp-kelley/descriptastorus

Next, run this script:

import numpy as np
import pandas as pd
from descriptastorus.descriptors import rdNormalizedDescriptors

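# Load data (example); the CSV is assumed to have a "smiles" column
df = pd.read_csv("mydata.csv")

# process() returns a row whose first element is a success flag; [1:] drops it
generator = rdNormalizedDescriptors.RDKit2DNormalized()
rdkit_descriptors = np.array([generator.process(smi)[1:] for smi in df.smiles])
# A positional array passed to np.savez is stored under the key "arr_0"
np.savez("rdkit_descriptors.npz", rdkit_descriptors)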

Then run Chemprop:

chemprop train --data-path mydata.csv --descriptors-path rdkit_descriptors.npz --no-descriptor-scaling
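
Before launching training, it can be worth confirming that the descriptor array lines up with the data file. The sketch below is a minimal check, assuming Chemprop expects one descriptor row per molecule, in the same order as the rows of mydata.csv. The helper name check_descriptor_rows is hypothetical and not part of Chemprop. It uses only the standard library and accepts any sequence of numeric rows, including a 2D NumPy array.

```python
import math


def check_descriptor_rows(n_molecules, rows):
    """Return a list of human-readable problems found in a descriptor table.

    Hypothetical helper (not part of Chemprop): `rows` is any sequence of
    numeric rows, e.g. a 2D NumPy array or a list of lists.
    """
    problems = []
    rows = [list(row) for row in rows]
    # One descriptor row is expected per molecule in the data file.
    if len(rows) != n_molecules:
        problems.append(f"expected {n_molecules} rows, found {len(rows)}")
    # Every row should contain the same number of descriptors.
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        problems.append(f"rows have differing lengths: {sorted(widths)}")
    # NaN or infinite values usually indicate a failed descriptor calculation.
    for i, row in enumerate(rows):
        bad = [j for j, value in enumerate(row) if not math.isfinite(value)]
        if bad:
            problems.append(f"row {i} has non-finite values in columns {bad}")
    return problems


# Toy stand-in for the real descriptor matrix (values are made up).
toy_rows = [[0.1, 0.5, 0.9], [0.2, float("nan"), 0.7]]
print(check_descriptor_rows(3, toy_rows))
```

With the script above, this would be called as check_descriptor_rows(len(df), rdkit_descriptors); an empty list means none of these checks failed.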

Alternatively, you don't need to install descriptastorus if you use RDKit directly to compute the descriptors:

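from rdkit import Chem
from rdkit.Chem import Descriptors

# Parse each SMILES once, then evaluate every RDKit descriptor in name order
mols = [Chem.MolFromSmiles(smi) for smi in df.smiles]
rdkit_descriptors = np.array([[func(mol) for name, func in sorted(Descriptors.descList)] for mol in mols])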

Note, however, that these descriptors are not scaled, so the --no-descriptor-scaling flag should not be set when using them. Also, the scaling in Chemprop differs from that in descriptastorus: it is dataset-dependent rather than determined beforehand. Lastly, descriptastorus skips the following RDKit descriptors: ['AvgIpc', 'BCUT2D_CHGHI', 'BCUT2D_CHGLO', 'BCUT2D_LOGPHI', 'BCUT2D_LOGPLOW', 'BCUT2D_MRHI', 'BCUT2D_MRLOW', 'BCUT2D_MWHI', 'BCUT2D_MWLOW', 'SPS']
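
The last two points can be sketched in plain Python. The block below is only an illustration under stated assumptions: desc_list is a small made-up stand-in for rdkit.Chem.Descriptors.descList (a list of (name, function) pairs), the helper names are hypothetical, and the z-score scaling shown is one common form of dataset-dependent scaling, not necessarily the exact procedure Chemprop applies.

```python
import statistics

# Descriptor names that descriptastorus skips, as listed above.
SKIPPED = {
    "AvgIpc", "BCUT2D_CHGHI", "BCUT2D_CHGLO", "BCUT2D_LOGPHI", "BCUT2D_LOGPLOW",
    "BCUT2D_MRHI", "BCUT2D_MRLOW", "BCUT2D_MWHI", "BCUT2D_MWLOW", "SPS",
}


def drop_skipped(desc_list):
    """Keep only (name, func) pairs whose name is not in SKIPPED, sorted by name."""
    return sorted(
        ((name, func) for name, func in desc_list if name not in SKIPPED),
        key=lambda pair: pair[0],
    )


def fit_zscore(train_rows):
    """Compute per-column mean and standard deviation from the training rows."""
    columns = list(zip(*train_rows))
    means = [statistics.fmean(col) for col in columns]
    # Guard against constant columns, which would otherwise divide by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_zscore(rows, means, stds):
    """Scale rows with statistics fitted elsewhere (e.g. on the training set)."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Made-up stand-in for Descriptors.descList; real entries take an RDKit molecule.
desc_list = [("SPS", len), ("MolWt", len), ("AvgIpc", len), ("TPSA", len)]
kept_names = [name for name, _ in drop_skipped(desc_list)]

# Made-up training rows: the statistics depend on whichever dataset is passed in.
train = [[1.0, 10.0], [3.0, 10.0]]
means, stds = fit_zscore(train)
scaled = apply_zscore(train, means, stds)
```

Because the means and standard deviations are computed from the data passed in, the same molecule can receive different scaled values in different datasets, which is the contrast with the predetermined normalization in descriptastorus.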

6 participants