Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to determine the token of a new data set #34

Open
zhouhao-learning opened this issue Jul 23, 2019 · 0 comments
Open

How to determine the token of a new data set #34

zhouhao-learning opened this issue Jul 23, 2019 · 0 comments

Comments

@zhouhao-learning
Copy link

I collected a large smiles data set. I wanted to try to generate the model from scratch. Then I counted the unique characters of all smiles, as follows:
#%()*+-./0123456789:=@ABCDEFGHIKLMNOPRSTUVWXYZ[\\]abcdefghiklmnoprstuy

But I see in your `JAK2_min_max_demo.ipynb',

tokens = ['<', '>', '#', '%', ')', '(', '+', '-', '/', '.', '1', '0', '3', '2', '5', '4', '7', '6', '9', '8', '=', 'A', '@', 'C', 'B', 'F', 'I', 'H', 'O', 'N', 'P', 'S', '[', ']','\\', 'c', 'e', 'i', 'l', 'o', 'n', 'p', 's', 'r', '\n']

Then I read the smiles data file you provided chembl_22_clean_1576904_sorted_std_final.smi,Get the unique character of smiles,But I found that token is not equal to token in `JAK2_min_max_demo.ipynb':

chem_smiles = read_smi_file("ReLeaSE/data/chembl_22_clean_1576904_sorted_std_final.smi")
ch_smiles = [i.split("\t")[0] for i in chem_smiles[0]]

tokens2 = list(set(''.join(ch_smiles)))
tokens2 = list(np.sort(tokens))
tokens2 = ''.join(tokens)

The token2 result is:#%()+-./0123456789=BCFHINOPS[\\]clnoprs

Except that < and >'denote beginning and ending, token1 and token2 are not equal, why is that? What did you do with the chembl_22_clean_1576904_sorted_std_final.smi?

Can you give me more guidance? Thank you very much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant