Training scoring functions and updated version of PDBBind #167

Open
Tonylac77 opened this issue Feb 9, 2023 · 2 comments

@Tonylac77

Tonylac77 commented Feb 9, 2023

I am currently trying to train the NNScore and PLECScore models for ligand scoring. So far I have not found a way to train the model "purposefully" and have resorted to running scorer.load() without any arguments, which starts training the scoring function. As a result, I don't know which version of PDBBind this uses (I assume v2016?).

For example, I have tried the following:

scorer = NNScore.nnscore()
scorer.gen_training_data(pdbbind_dir=$PATH$, pdbbind_versions=[2016])

where $PATH$ is just a directory on my machine, and I get the following error:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[10], line 23
     21 rescorers = {'nnscore':NNScore.nnscore()}
     22 scorer = rescorers['nnscore']
---> 23 scorer.gen_training_data(pdbbind_dir='/home/tony/CADD22/software/pdbbind', pdbbind_versions=[2016], use_proteins=False)

File ~/.conda/envs/wocondock/lib/python3.8/site-packages/oddt/scoring/functions/NNScore.py:63, in nnscore.gen_training_data(self, pdbbind_dir, pdbbind_versions, home_dir, use_proteins)
     60     home_dir = dirname(__file__) + '/NNScore'
     61 filename = path_join(home_dir, 'nnscore_descs.csv')
---> 63 super(nnscore, self)._gen_pdbbind_desc(
     64     pdbbind_dir=pdbbind_dir,
     65     pdbbind_versions=pdbbind_versions,
     66     desc_path=filename,
     67     use_proteins=use_proteins
     68 )

File ~/.conda/envs/wocondock/lib/python3.8/site-packages/oddt/scoring/__init__.py:94, in scorer._gen_pdbbind_desc(self, pdbbind_dir, pdbbind_versions, desc_path, include_general_set, use_proteins, **kwargs)
     92 df = None
     93 for pdbbind_version in pdbbind_versions:
---> 94     p = pdbbind('%s/v%i/' % (pdbbind_dir, pdbbind_version),
     95                 version=pdbbind_version,
     96                 opt=opt)
     97     # Core set
     99     for set_name in p.pdbind_sets:

File ~/.conda/envs/wocondock/lib/python3.8/site-packages/oddt/datasets.py:85, in pdbbind.__init__(self, home, version, default_set, opt)
     82         self.sets[pdbind_set] = dict(zip(self._set_ids[pdbind_set],
     83                                          self._set_act[pdbind_set]))
     84 if len(self.sets) == 0:
---> 85     raise Exception('There is no PDBbind set availabe')

Exception: There is no PDBbind set availabe
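The traceback shows the dataset location being built as '%s/v%i/' % (pdbbind_dir, pdbbind_version), so with the call above the data would be looked up under a v2016 subfolder of pdbbind_dir. A minimal sketch for checking that layout before calling gen_training_data, assuming a missing version subfolder is what triggers the exception (this is not confirmed in the thread, and the helper name expected_pdbbind_dirs is hypothetical):

```python
import os


def expected_pdbbind_dirs(pdbbind_dir, pdbbind_versions):
    """Map each version to (path, exists) using the path format from the traceback.

    Hypothetical helper, not part of ODDT. It only checks that the folder
    exists; it does not validate the index files inside it.
    """
    result = {}
    for version in pdbbind_versions:
        # Same format string as in oddt/scoring/__init__.py line 94 of the traceback
        path = '%s/v%i/' % (pdbbind_dir, version)
        result[version] = (path, os.path.isdir(path))
    return result


if __name__ == "__main__":
    # Directory taken from the traceback; replace with your own location.
    checks = expected_pdbbind_dirs('/home/tony/CADD22/software/pdbbind', [2016])
    for version, (path, exists) in checks.items():
        print(version, path, 'found' if exists else 'missing')
```

If the folder is reported as missing, moving or symlinking the extracted PDBbind data into a v2016 subfolder would be the first thing to try.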

Additionally, when I then score ligands, the performance of these models is very poor (enrichment factor at 1% of around 0-2%) compared to other scoring functions (as implemented in GNINA, for example), which achieve ~20% enrichment.
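For reference, the enrichment factor is often reported as a fold value: the fraction of actives among the top-ranked x% of compounds divided by the fraction of actives in the whole set. The thread does not say which definition or tool produced the numbers above, so the following is only a sketch of the fold definition in plain Python, assuming that higher scores mean better predicted binding and that labels are 1 for actives and 0 for decoys:

```python
def enrichment_factor(labels, scores, fraction=0.01):
    """Fold enrichment of actives in the top `fraction` of ranked compounds.

    labels: sequence of 1 (active) or 0 (decoy)
    scores: sequence of predicted scores, higher meaning better (assumption)
    """
    if len(labels) != len(scores) or len(labels) == 0:
        raise ValueError("labels and scores must be non-empty and equally long")
    n_total = len(labels)
    n_actives = sum(labels)
    if n_actives == 0:
        raise ValueError("at least one active is required")
    # Number of compounds in the top fraction, never fewer than one
    n_top = max(1, int(round(n_total * fraction)))
    # Rank by score, best first
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = sum(label for _, label in ranked[:n_top])
    return (hits / n_top) / (n_actives / n_total)
```

With this definition a value of 1 corresponds to random ranking, so it is worth checking whether the figures being compared are fold values or percentages of actives recovered before drawing conclusions.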

Therefore I am wondering if there is a tutorial/notebook that explains how to train these models using the gen_training_data() or fit() methods.

I was also wondering whether it is possible to use a more recent version of the PDBBind data, such as version 2020, and how hard that would be to implement.

I am happy to provide the dataset I am using for comparison of the performance of these scoring functions (aldr dataset from DUD-E).

@mwojcikowski
Contributor

Each built-in scoring function has a .load() method, which loads pre-generated descriptors to train the models. Have you checked those bundled in ODDT?

@Tonylac77
Author

Tonylac77 commented Feb 23, 2023

Thanks for your answer. When using the load method without arguments, it starts training the model. However, I believe this is what I was using previously and was getting low enrichment with. I will retrain now and update you. Where would I find the bundled models? I can only find .csv files in oddt/scoring/functions/NNScore/. Should I be using the load() method with those?

Update: I've managed to load the pretrained model for linear PLECScore from the one bundled in ODDT. However, I would still like to understand how to train the models myself in order to use the MLP or RF version, perhaps on PDBbind v2020, and how to load the model for NNScore.
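To see what is actually shipped next to the descriptor .csv files, a small sketch that lists files by extension in a given directory may help. The helper name is hypothetical, and treating .pickle as the extension of a saved model is an assumption rather than something stated in this thread:

```python
import os


def list_files_by_extension(directory, extensions=('.csv', '.pickle')):
    """Group file names in `directory` by extension (hypothetical helper).

    '.pickle' as the saved-model extension is an assumption.
    """
    matches = {ext: [] for ext in extensions}
    for name in sorted(os.listdir(directory)):
        for ext in extensions:
            if name.endswith(ext):
                matches[ext].append(name)
    return matches
```

Pointing it at the NNScore folder inside the installed oddt package (the site-packages path appears in the traceback earlier in this issue) shows at a glance whether only descriptors or also saved models are present.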
