How to determine the token of a new data set #34

zhouhao-learning · 2019-07-23T02:32:49Z

I collected a large SMILES data set and want to train the generative model from scratch. I counted the unique characters across all SMILES strings and got the following:
#%()*+-./0123456789:=@ABCDEFGHIKLMNOPRSTUVWXYZ[\\]abcdefghiklmnoprstuy
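For reference, the counting step described above can be sketched as follows. This is a minimal, character-level illustration; the helper name `unique_chars` and the toy input are hypothetical and are not part of ReLeaSE. Note that it treats every character separately, so two-letter atoms such as Cl or Br show up as two characters.

```python
def unique_chars(smiles_list):
    """Return the sorted unique characters across all SMILES strings."""
    return sorted(set(''.join(smiles_list)))


# Toy input for illustration only (not taken from any real data set)
example_smiles = ["CCO", "c1ccccc1", "C[C@H](N)C(=O)O"]
example_tokens = ''.join(unique_chars(example_smiles))
print(example_tokens)
```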

But in your `JAK2_min_max_demo.ipynb` I see:

tokens = ['<', '>', '#', '%', ')', '(', '+', '-', '/', '.', '1', '0', '3', '2', '5', '4', '7', '6', '9', '8', '=', 'A', '@', 'C', 'B', 'F', 'I', 'H', 'O', 'N', 'P', 'S', '[', ']','\\', 'c', 'e', 'i', 'l', 'o', 'n', 'p', 's', 'r', '\n']

Then I read the SMILES data file you provided, chembl_22_clean_1576904_sorted_std_final.smi, and extracted its unique characters. The result does not match the tokens in `JAK2_min_max_demo.ipynb`:

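import numpy as np

# read_smi_file: helper assumed to come from the ReLeaSE utilities
chem_smiles = read_smi_file("ReLeaSE/data/chembl_22_clean_1576904_sorted_std_final.smi")
# Keep only the SMILES column (the text before the first tab)
ch_smiles = [i.split("\t")[0] for i in chem_smiles[0]]

# Collect, sort, and join the unique characters
tokens2 = list(set(''.join(ch_smiles)))
tokens2 = list(np.sort(tokens2))
tokens2 = ''.join(tokens2)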

The resulting tokens2 is: #%()+-./0123456789=BCFHINOPS[\\]clnoprs

Even setting aside < and >, which denote the beginning and end of a sequence, the notebook's tokens and tokens2 are not equal. Why is that? How did you process chembl_22_clean_1576904_sorted_std_final.smi?
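To make the mismatch concrete, here is a small sketch that compares the two token sets with set differences. It only uses the values quoted in this post, and it assumes the `\\` in the printed result stands for a single backslash character; the variable names are illustrative.

```python
# Token list as quoted from JAK2_min_max_demo.ipynb
notebook_tokens = ['<', '>', '#', '%', ')', '(', '+', '-', '/', '.', '1', '0',
                   '3', '2', '5', '4', '7', '6', '9', '8', '=', 'A', '@', 'C',
                   'B', 'F', 'I', 'H', 'O', 'N', 'P', 'S', '[', ']', '\\', 'c',
                   'e', 'i', 'l', 'o', 'n', 'p', 's', 'r', '\n']

# Characters found in the .smi file, as quoted above
# (assumption: the doubled backslash is one backslash character)
tokens2 = "#%()+-./0123456789=BCFHINOPS[\\]clnoprs"

# Characters present in one set but missing from the other
only_in_notebook = sorted(set(notebook_tokens) - set(tokens2))
only_in_file = sorted(set(tokens2) - set(notebook_tokens))

print("only in notebook tokens:", only_in_notebook)
print("only in file characters:", only_in_file)
```

With the values exactly as quoted here, the notebook list contains '@', 'A', 'e', 'i' and the newline character in addition to < and >, while every character found in the file is already in the notebook list.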

Can you give me more guidance? Thank you very much.
