
Can I retokenize at the start of a training pipeline? #13484

Open
gtoffoli opened this issue May 12, 2024 · 0 comments
Labels
feat / tokenizer Feature: Tokenizer

Comments

@gtoffoli
Contributor

I need to perform a lot of retokenization before running a training pipeline, but from the docs I cannot tell whether that is possible and, if so, how to specify it in the config file.
In #5921, @svlandeg showed how to handle a similar issue in the case at hand (i.e. training NER) without adding a custom component; thus, she didn't explicitly answer the original question there, which is also my question.
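If I understand the docs correctly, the config-driven route is to replace the tokenizer altogether via the registry; a minimal sketch, where `ArabicTokenizer` and the `"arabic_tokenizer"` registry name are just placeholders for my implementation:

```python
import spacy
from spacy.tokens import Doc

class ArabicTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split()  # placeholder: real Arabic segmentation goes here
        return Doc(self.vocab, words=words)

@spacy.registry.tokenizers("arabic_tokenizer")
def create_arabic_tokenizer():
    def create_tokenizer(nlp):
        return ArabicTokenizer(nlp.vocab)
    return create_tokenizer
```

with the config pointing at it:

```ini
[nlp.tokenizer]
@tokenizers = "arabic_tokenizer"
```

But that replaces the tokenizer wholesale, whereas I'd rather keep the stock Tokenizer and post-process its output.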

As I explained in issue #13248 and, more extensively, in discussion #7146, I'm struggling to develop a viable tokenizer for the Arabic language. To do that, I think I need both to extend the data (the configuration files) of the current tokenizer implementation and to add a considerable amount of post-processing.
In the past, I implemented the post-processing with some Cython code and began to get significantly improved results from the debug data and train commands. Then I installed spaCy from source, but in that setup I wasn't able to integrate my Cython code with the spaCy codebase; more precisely, I couldn't import tokenizer.Tokenizer and vocab.Vocab.
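To give an idea of the kind of post-processing I mean, here is a minimal pure-Python sketch based on Doc.retokenize(); the merge rule below is a made-up placeholder, not my actual logic:

```python
from spacy.tokens import Doc

def postprocess(doc: Doc) -> Doc:
    """Re-join tokens that the rule-based tokenizer split too aggressively."""
    with doc.retokenize() as retokenizer:
        i = 0
        while i < len(doc) - 1:
            token = doc[i]
            # Placeholder rule: merge a token with its neighbour when they are
            # written without intervening whitespace (here: a proclitic waw).
            if token.whitespace_ == "" and token.text == "و":
                retokenizer.merge(doc[i : i + 2])
                i += 2  # skip the merged neighbour to avoid overlapping spans
            else:
                i += 1
    return doc
```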

Now, I guess that being able to put a component just after the spaCy Tokenizer in the training pipeline (and in the production pipeline) would be much cleaner and probably more efficient; something like the sketch below is what I have in mind.
Could somebody answer my question and/or suggest a solution for my problem? Thanks in advance!
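To make the idea concrete, a minimal sketch, assuming a stateless component and the hypothetical name "arabic_retokenizer":

```python
# functions.py -- supplied to training with:
#   python -m spacy train config.cfg --code functions.py
from spacy.language import Language
from spacy.tokens import Doc

@Language.component("arabic_retokenizer")
def arabic_retokenizer(doc: Doc) -> Doc:
    # The post-processing sketched above would go here.
    return doc
```

and, in the training config, the component listed first in the pipeline:

```ini
[nlp]
pipeline = ["arabic_retokenizer","tok2vec","ner"]

[components.arabic_retokenizer]
factory = "arabic_retokenizer"
```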

@svlandeg svlandeg added the feat / tokenizer Feature: Tokenizer label May 15, 2024