
Request for Detailed Information on Training the Tokenizer #117

a-green-hand-jack opened this issue Mar 14, 2024 · 1 comment

@a-green-hand-jack

Dear author,

I am pleased to see that you have provided some pre-trained tokenizer save files in this repository. However, I noticed that the code for training this DNATokenizer does not appear to be included. I am very interested in how you trained it.

Specifically, I would like to know the following:

  1. What specific method or algorithm did you use to train the Tokenizer? Did you use any existing libraries or frameworks?
  2. What hardware environment did you train on? For example, CPU or GPU, including the device model and how many devices you used.
  3. Which version of Python did you use for training?
  4. Approximately how much time did it take to train this Tokenizer on the complete dataset?

Providing this information would be very helpful for me to understand your workflow and how to retrain the Tokenizer in my environment. Thank you for your time and assistance!
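For context, my current guess (not verified against your code) is that a k-mer tokenizer over the DNA alphabet does not need to be learned at all, since the vocabulary can simply be enumerated. The special tokens below are my assumption, borrowed from BERT-style vocabularies; the actual files in this repository may differ:

```python
from itertools import product

def build_kmer_vocab(k: int, alphabet: str = "ACGT") -> list:
    """Enumerate every k-mer over the DNA alphabet, prefixed by special tokens.

    The special-token list is an assumption (BERT-style); the repo's
    vocab files may use a different set or ordering.
    """
    specials = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    return specials + kmers

vocab = build_kmer_vocab(6)
print(len(vocab))  # 5 special tokens + 4**6 k-mers = 4101
```

If this is roughly what you did, then my remaining questions are mostly about whether any frequency filtering or corpus-level training was involved on top of the enumeration.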

@jithinairr

Hi @a-green-hand-jack,

Hope you are doing great!

I wanted to know how the DNATokenizer works and how to run tokenization_dna.py as a standalone script, since I am only interested in the way that tokenization is applied to a sequence.
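My rough understanding is that DNABERT-style tokenizers split a sequence into overlapping k-mers with stride 1; here is a sketch of what I expect the core of tokenization_dna.py to do, though I have not verified this against the actual implementation:

```python
def seq_to_kmers(seq: str, k: int = 6) -> list:
    """Split a DNA sequence into overlapping k-mers with stride 1.

    This is my guess at the core logic of tokenization_dna.py,
    not a confirmed excerpt from it.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(seq_to_kmers("ATGCGTAC", k=6))
# ['ATGCGT', 'TGCGTA', 'GCGTAC']
```

Is this essentially what the tokenizer does before mapping each k-mer to a vocabulary ID, or is there additional preprocessing I am missing?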

Thank you,

Kind Regards,
Jithin
