instructions for generating vocab.pkl? #17

Open
manavsingh415 opened this issue Jun 28, 2021 · 15 comments

Comments

@manavsingh415

manavsingh415 commented Jun 28, 2021

Hi. Would it be possible for the authors to either upload vocab.pkl (for the pretrained model), or give instructions and code about how to generate the vocab.pkl file from the CHEMBL24 dataset (or any other dataset used)? Thanks

@shionhonda
Contributor

Hi. I'm sorry but I lost access to the resources.
Does this issue help?
#11

@manavsingh415
Author

manavsingh415 commented Jun 28, 2021 via email

@shionhonda
Contributor

Well, I meant to link to this specific comment. I hope it helps.
#11 (comment)

@miquelduranfrigola

Hi Shion,

I have tried to generate the vocab.pkl from ChEMBL 24. Using default parameters in the build_vocab.py file, I get a vocabulary size of 75.

If I am not mistaken, this is not compatible with the pretrained model provided:

size mismatch for embed.weight: copying a param with shape torch.Size([45, 256]) from checkpoint, the shape in current model is torch.Size([75, 256]).
size mismatch for out.weight: copying a param with shape torch.Size([45, 256]) from checkpoint, the shape in current model is torch.Size([75, 256]).
size mismatch for out.bias: copying a param with shape torch.Size([45]) from checkpoint, the shape in current model is torch.Size([75]).

Thanks!
M
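
The mismatch above can be checked before loading the checkpoint. The following is a minimal sketch, not code from the repository: the function name `find_shape_mismatches` is hypothetical, and it only assumes the three parameter names and the hidden size of 256 that appear in the error message.

```python
def find_shape_mismatches(checkpoint_shapes, vocab_size, hidden_size=256):
    """Return {param_name: (checkpoint_shape, expected_shape)} for each
    vocabulary-sized parameter whose shape disagrees with the vocabulary."""
    # Parameter names taken from the error message in this thread.
    expected = {
        "embed.weight": (vocab_size, hidden_size),
        "out.weight": (vocab_size, hidden_size),
        "out.bias": (vocab_size,),
    }
    mismatches = {}
    for name, want in expected.items():
        have = checkpoint_shapes.get(name)
        if have is not None and tuple(have) != want:
            mismatches[name] = (tuple(have), want)
    return mismatches


# Shapes reported for the pretrained checkpoint in the error message.
checkpoint_shapes = {
    "embed.weight": (45, 256),
    "out.weight": (45, 256),
    "out.bias": (45,),
}

print(find_shape_mismatches(checkpoint_shapes, vocab_size=75))  # three mismatches
print(find_shape_mismatches(checkpoint_shapes, vocab_size=45))  # {}
```

With PyTorch, and assuming the checkpoint file is a plain state dict, the shapes could be collected with `{k: tuple(v.shape) for k, v in torch.load(path, map_location="cpu").items()}`.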

@shionhonda
Contributor

Thanks for reporting.
That's strange. Then I might have used different parameters... I'm sorry that it's not set properly.

@shionhonda
Contributor

Does it help?
#19

@miquelduranfrigola

Hi..! Unfortunately the vocab.pkl file from #19 does not help either...

size mismatch for embed.weight: copying a param with shape torch.Size([45, 256]) from checkpoint, the shape in current model is torch.Size([50, 256]).
size mismatch for out.weight: copying a param with shape torch.Size([45, 256]) from checkpoint, the shape in current model is torch.Size([50, 256]).
size mismatch for out.bias: copying a param with shape torch.Size([45]) from checkpoint, the shape in current model is torch.Size([50]).

@dinabandhu50

I was able to reproduce the vocab.pkl with the following steps:

  1. Download the ChEMBL 24 data from chembl_24_1; the file is named chembl_24_1_chemreps.txt.gz. This is the same data the author mentioned in this issue.

  2. Open the 01_data_prepare.ipynb file and start running from the following cell:
    [screenshot: chembl_24_corpus_reading]

  3. Run up to the following line (to obtain chembl_24.csv):
    [screenshot: run_till_this_lines]

  4. After obtaining the CSV file, run build_corpus.py. I only changed the input file location and the pandas DataFrame column used to obtain the SMILES. Running this file will take some time.
    [screenshot: build_corpus_LI]

  5. Once the step above has produced data/chembl24_corpus.txt, run build_vocab.py:
    [screenshot: build_vocab]

  6. The resulting vocab will have len(vocab)==45. I am attaching the result below:
    [screenshot: proof]

PS: Don't forget to change n_layers from 3 to 4: trfm = TrfmSeq2seq(len(vocab), 256, len(vocab), 4)

Thanks

Regards,
Dinabandhu
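
The vocabulary size depends on which tokens occur in the corpus and on any frequency cutoff, which may explain why different corpora in this thread gave 75, 50 and 45 entries. The sketch below illustrates that effect only; it is not the repository's build_vocab.py. The function `build_vocab`, the special-token list and the toy corpus are all assumptions, and the corpus is assumed to hold one SMILES per line with tokens separated by spaces.

```python
from collections import Counter

# Assumed special tokens; the repository's own list may differ.
SPECIAL_TOKENS = ["<pad>", "<unk>", "<eos>", "<sos>", "<mask>"]


def build_vocab(tokenized_lines, min_freq=1, specials=SPECIAL_TOKENS):
    """Map each token seen at least `min_freq` times to an integer index.

    Special tokens come first, then corpus tokens ordered by descending
    frequency (ties broken alphabetically)."""
    counts = Counter()
    for line in tokenized_lines:
        counts.update(line.split())
    kept = sorted(
        (tok for tok, n in counts.items() if n >= min_freq),
        key=lambda tok: (-counts[tok], tok),
    )
    return {tok: idx for idx, tok in enumerate(list(specials) + kept)}


# Toy corpus: 4 distinct tokens, 2 of which occur only once.
toy_corpus = ["C C O", "C 1 C C C C 1", "C [Se] C"]

print(len(build_vocab(toy_corpus, min_freq=1)))  # 9 = 5 specials + 4 tokens
print(len(build_vocab(toy_corpus, min_freq=2)))  # 7 = 5 specials + 2 tokens
```

Because the embedding and output layers are sized by len(vocab), any change to the corpus filtering or the cutoff changes the shapes the model expects.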

@shionhonda
Contributor

@dinabandhu50
Thank you so much!!

@sevencheung2021

copying a param with shape torch.Size([45, 256]) from checkpoint, the shape in current model is torch.Size([75, 256]).
size mismatch for out.weight: copying a param with shape torch.Size([45, 256]) from checkpoint, the

Have you solved this mismatch problem?

@VincentBt

It does indeed solve the mismatch problem.

@AachenOcean

Where could I find 01_data_prepare.ipynb? Thanks.

@AachenOcean

I found data_prepare.ipynb; however, I still have a problem at the step of running build_corpus.py. At first it reported that I don't have the utils module, so I installed it with pip install utils. However, when I ran it again, it showed the error "cannot import name 'split' from 'utils'". I am running it with Python 3. Do you have any suggestions? Thanks.
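
The error suggests that the package installed by pip install utils is unrelated to this project and has no split function. Presumably build_corpus.py expects a utils.py that ships with the repository itself; that is an assumption to check against the repository layout. The sketch below shows one way to make a local module take precedence and to see which file an import would resolve to. The helper name `prefer_local_module` and the directory name in the usage comment are hypothetical.

```python
import importlib
import importlib.util
import sys
from pathlib import Path


def prefer_local_module(repo_dir, name="utils"):
    """Put `repo_dir` first on sys.path and return the file that
    `import <name>` would load, or None if the module is not found."""
    repo_dir = str(Path(repo_dir).resolve())
    if repo_dir in sys.path:
        sys.path.remove(repo_dir)
    sys.path.insert(0, repo_dir)      # local files now shadow site-packages
    sys.modules.pop(name, None)       # forget any previously imported copy
    importlib.invalidate_caches()
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Hypothetical usage, run before `from utils import split`:
# print(prefer_local_module("path/to/directory/containing/utils.py"))
```

If the printed path points into site-packages rather than the repository, the unrelated package is still being picked up; uninstalling it with pip uninstall utils and running the script from the directory that contains the repository's utils.py may also resolve the import.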

@madiha1ahmed

Where did you get "01_data_prepare.ipynb"?

@GGuuu

GGuuu commented Sep 9, 2023

Where did you get "01_data_prepare.ipynb"?

I think that is 'prepare_data.ipynb' in the 'experiments' folder.
I was able to generate the file from that notebook.
