instructions for generating vocab.pkl? #17

manavsingh415 · 2021-06-28T13:07:23Z

Hi. Would it be possible for the authors to either upload vocab.pkl (for the pretrained model), or give instructions and code about how to generate the vocab.pkl file from the CHEMBL24 dataset (or any other dataset used)? Thanks

shionhonda · 2021-06-28T14:36:39Z

Hi. I'm sorry but I lost access to the resources.
Does this issue help?
#11

manavsingh415 · 2021-06-28T14:50:24Z

Hi Shion,

Thanks for getting back to me! I checked that issue, and it points to the ChEMBL24 dataset. I am interested in how to generate vocab.pkl from this dataset. Actually, I just want to run the pretrained model on a set of molecules I have, to generate their vector representations. If this is possible without the vocab.pkl file, please let me know!

Thanks,
Manav


shionhonda · 2021-06-28T23:45:08Z

Sorry, I meant to point to this comment. I hope it helps.
#11 (comment)

miquelduranfrigola · 2021-09-28T08:31:33Z

Hi Shion,

I have tried to generate the vocab.pkl from ChEMBL 24. Using default parameters in the build_vocab.py file, I get a vocabulary size of 75.

If I am not mistaken, this is not compatible with the pretrained model provided:

size mismatch for embed.weight: copying a param with shape torch.Size([45, 256]) from checkpoint, the shape in current model is torch.Size([75, 256]).
size mismatch for out.weight: copying a param with shape torch.Size([45, 256]) from checkpoint, the shape in current model is torch.Size([75, 256]).
size mismatch for out.bias: copying a param with shape torch.Size([45]) from checkpoint, the shape in current model is torch.Size([75]).

Thanks!
M

shionhonda · 2021-10-02T03:47:11Z

Thanks for reporting.
That's strange. Then I might have used different parameters... I'm sorry that it's not set properly.

shionhonda · 2021-10-02T03:48:10Z

Does it help?
#19

miquelduranfrigola · 2021-10-02T05:52:57Z

Hi..! Unfortunately the vocab.pkl file from #19 does not help either...

size mismatch for embed.weight: copying a param with shape torch.Size([45, 256]) from checkpoint, the shape in current model is torch.Size([50, 256]).
size mismatch for out.weight: copying a param with shape torch.Size([45, 256]) from checkpoint, the shape in current model is torch.Size([50, 256]).
size mismatch for out.bias: copying a param with shape torch.Size([45]) from checkpoint, the shape in current model is torch.Size([50]).
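
The size-mismatch messages above all compare the first dimension of embed.weight, out.weight and out.bias in the checkpoint (45) with the size of the vocabulary used to build the model (75 or 50). As a rough sketch, the same comparison can be done before constructing the model. The helper name vocab_size_mismatches is hypothetical and not part of the repository; it only assumes the checkpoint is a mapping from parameter names to objects with a .shape attribute, which is what a PyTorch state dict provides.

```python
def vocab_size_mismatches(state_dict, vocab_size,
                          keys=("embed.weight", "out.weight", "out.bias")):
    """Return {parameter name: rows in checkpoint} for every listed
    parameter whose first dimension differs from vocab_size.

    The default keys are the three parameters named in the error
    messages in this thread. `state_dict` is any mapping whose values
    expose a `.shape` (for example a PyTorch state dict).
    """
    mismatches = {}
    for key in keys:
        if key not in state_dict:
            continue  # parameter absent from this checkpoint, nothing to compare
        rows = tuple(state_dict[key].shape)[0]
        if rows != vocab_size:
            mismatches[key] = rows
    return mismatches
```

With PyTorch, the state dict would typically come from torch.load on the pretrained file with map_location="cpu", and vocab_size from len(vocab). An empty result means the vocabulary size matches what the checkpoint expects.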

dinabandhu50 · 2021-10-20T07:52:57Z

I was able to reproduce vocab.pkl with the following steps:

1. Download the ChEMBL 24 data from chembl_24_1; the file is named chembl_24_1_chemreps.txt.gz. This is the same data the author mentioned in this issue.
2. Open the 01_data_prepare.ipynb file and start running from the following cell.
3. Run until the following line (to obtain chembl_24.csv).
4. After obtaining the CSV file, run build_corpus.py. I only changed the file-reading location and the pandas DataFrame column used to obtain the SMILES. Running this file takes some time.
5. After obtaining data/chembl24_corpus.txt from the step above, run build_vocab.py.
6. The resulting vocab has len(vocab)==45. I am attaching the obtained result below.

PS: don't forget to change n_layers from 3 to 4: trfm = TrfmSeq2seq(len(vocab), 256, len(vocab), 4)

Thanks

Regards,
Dinabandhu
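
For readers who cannot run the notebook, the first steps above (turning chembl_24_1_chemreps.txt.gz into chembl_24.csv) can be approximated with the standard library alone. This is only a sketch under stated assumptions: it assumes the chemreps file is tab-separated with a header row containing a canonical_smiles column, and the function name chemreps_to_csv is hypothetical. The notebook may apply extra filtering that is not reproduced here, so the resulting vocabulary size is not guaranteed to be 45.

```python
import csv
import gzip


def chemreps_to_csv(chemreps_gz_path, csv_path, smiles_column="canonical_smiles"):
    """Copy the SMILES column of a gzipped, tab-separated chemreps file
    into a one-column CSV and return the number of SMILES written.

    Assumption: the input has a header row that includes `smiles_column`.
    Rows with an empty SMILES field are skipped.
    """
    written = 0
    with gzip.open(chemreps_gz_path, "rt", encoding="utf-8", newline="") as src, \
            open(csv_path, "w", encoding="utf-8", newline="") as dst:
        # QUOTE_NONE: treat every character literally, fields are split on tabs only
        reader = csv.DictReader(src, delimiter="\t", quoting=csv.QUOTE_NONE)
        writer = csv.writer(dst)
        writer.writerow([smiles_column])
        for row in reader:
            smiles = (row.get(smiles_column) or "").strip()
            if smiles:
                writer.writerow([smiles])
                written += 1
    return written
```

The output file can then be pointed to from build_corpus.py, adjusting the file path and column name as described in the steps above.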

shionhonda · 2021-10-22T00:08:54Z

@dinabandhu50
Thank you so much!!

sevencheung2021 · 2021-11-26T10:41:57Z

copying a param with shape torch.Size([45, 256]) from checkpoint, the shape in current model is torch.Size([75, 256]).
size mismatch for out.weight: copying a param with shape torch.Size([45, 256]) from checkpoint, the

Have you solved this mismatch problem?

VincentBt · 2022-07-03T16:47:02Z

It indeed solves the mismatch problem

AachenOcean · 2022-11-11T15:37:11Z

Where can I find 01_data_prepare.ipynb? Thanks.

AachenOcean · 2022-11-11T17:23:38Z

I found data_prepare.ipynb; however, I still have a problem at the step of running build_corpus.py. At first it reported that I don't have the utils module, so I installed it with pip install utils. However, when I ran it again, it showed the error "cannot import name 'split' from 'utils'". I am using Python 3 to run this command. Do you have any suggestions? Thanks.
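
The error above suggests that the utils package installed from PyPI is unrelated to this project and does not define split. A likely explanation, stated here as an assumption rather than something confirmed in this thread, is that build_corpus.py expects a local utils.py sitting next to it in the repository. The sketch below shows one way to check which file a plain import would pick up; the helper name resolve_module_path is hypothetical.

```python
import importlib.util
import sys


def resolve_module_path(module_name, search_dir=None):
    """Return the file that `import module_name` would load, or None.

    If `search_dir` is given, it is placed first on sys.path so that a
    local module in that directory shadows an installed package of the
    same name. Note that this modifies sys.path for the running process.
    """
    if search_dir is not None and search_dir not in sys.path:
        sys.path.insert(0, search_dir)
    importlib.invalidate_caches()  # make sure newly created files are seen
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin
```

If the returned path points into site-packages rather than the repository, uninstalling the PyPI utils package and running the script from the directory that contains the project's own utils.py may resolve the import.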

madiha1ahmed · 2023-03-22T16:33:46Z

Where did you get "01_data_prepare.ipynb"?

GGuuu · 2023-09-09T02:35:01Z

Where did you get "01_data_prepare.ipynb"?

I think that is 'prepare_data.ipynb' in the 'experiments' folder.
I was able to generate the file from that notebook.

VincentBt mentioned this issue Jul 3, 2022

Added missing vocab.pkl file #19

Open
