
Request for Detailed Information on Training the Tokenizer #117

a-green-hand-jack opened this issue Mar 14, 2024 · 1 comment

@a-green-hand-jack

Dear author,

I am pleased to see that you have provided some pre-trained tokenizer save files in this repository. However, I noticed that the code for training this DNATokenizer does not appear to be included. I am very interested in how you trained it.

Specifically, I would like to know the following:

  1. What specific method or algorithm did you use to train the Tokenizer? Did you use any existing libraries or frameworks?
  2. What hardware environment did you train on? For example, CPU or GPU, including the device model and how many devices you used.
  3. Which version of Python did you use for training?
  4. Approximately how much time did it take to train this Tokenizer on the complete dataset?

Providing this information would be very helpful for me to understand your workflow and how to retrain the Tokenizer in my environment. Thank you for your time and assistance!
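For context, my current guess (not verified against your code) is that a k-mer tokenizer over the DNA alphabet does not need to be learned at all, since the vocabulary can simply be enumerated. The special tokens below are my assumption, borrowed from BERT-style vocabularies; the actual files in this repository may differ:

```python
from itertools import product

def build_kmer_vocab(k: int, alphabet: str = "ACGT") -> list:
    """Enumerate every k-mer over the DNA alphabet, prefixed by special tokens.

    The special-token list is an assumption (BERT-style); the repo's
    vocab files may use a different set or ordering.
    """
    specials = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    return specials + kmers

vocab = build_kmer_vocab(6)
print(len(vocab))  # 5 special tokens + 4**6 k-mers = 4101
```

If this is roughly what you did, then my remaining questions are mostly about whether any frequency filtering or corpus-level training was involved on top of the enumeration.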

@jithinairr

Hi @a-green-hand-jack,

Hope you are doing great!

I wanted to know how the DNATokenizer works and how to run tokenization_dna.py as a standalone script, since I am only interested in the way that tokenization is applied to a sequence.
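My rough understanding is that DNABERT-style tokenizers split a sequence into overlapping k-mers with stride 1; here is a sketch of what I expect the core of tokenization_dna.py to do, though I have not verified this against the actual implementation:

```python
def seq_to_kmers(seq: str, k: int = 6) -> list:
    """Split a DNA sequence into overlapping k-mers with stride 1.

    This is my guess at the core logic of tokenization_dna.py,
    not a confirmed excerpt from it.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(seq_to_kmers("ATGCGTAC", k=6))
# ['ATGCGT', 'TGCGTA', 'GCGTAC']
```

Is this essentially what the tokenizer does before mapping each k-mer to a vocabulary ID, or is there additional preprocessing I am missing?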

Thank you,

Kind Regards,
Jithin
