
Can I retokenize at the start of a training pipeline? #13484

Open
gtoffoli opened this issue May 12, 2024 · 0 comments
Labels
feat / tokenizer Feature: Tokenizer

Comments

@gtoffoli
Contributor

I need to perform a lot of retokenization before running a training pipeline, but from the docs I cannot tell whether that is possible and, if so, how to specify it in the config file.
In #5921, @svlandeg showed how to handle a similar issue in the case at hand (i.e. training NER) without adding a custom component; thus, she didn't explicitly answer the original question there, which is also my question.
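If I understand the docs correctly, the config-driven route is to replace the tokenizer altogether via the registry; a minimal sketch, where `ArabicTokenizer` and the `"arabic_tokenizer"` registry name are just placeholders for my implementation:

```python
import spacy
from spacy.tokens import Doc

class ArabicTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split()  # placeholder: real Arabic segmentation goes here
        return Doc(self.vocab, words=words)

@spacy.registry.tokenizers("arabic_tokenizer")
def create_arabic_tokenizer():
    def create_tokenizer(nlp):
        return ArabicTokenizer(nlp.vocab)
    return create_tokenizer
```

with the config pointing at it:

```ini
[nlp.tokenizer]
@tokenizers = "arabic_tokenizer"
```

But that replaces the tokenizer wholesale, whereas I'd rather keep the stock Tokenizer and post-process its output.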

As I explained in issue #13248 and, more extensively, in discussion #7146, I'm struggling to develop a viable tokenizer for the Arabic language. To do that, I think I need both to extend the data (the configuration files) of the current tokenizer implementation and to add a considerable amount of post-processing.
In the past, I implemented the post-processing with some Cython code and began to get significantly improved results from the debug data and train commands. Then I installed spaCy from source, but in that setup I wasn't able to integrate my Cython code with the spaCy codebase; more precisely, I couldn't import tokenizer.Tokenizer and vocab.Vocab.
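To give an idea of the kind of post-processing I mean, here is a minimal pure-Python sketch based on Doc.retokenize(); the merge rule below is a made-up placeholder, not my actual logic:

```python
from spacy.tokens import Doc

def postprocess(doc: Doc) -> Doc:
    """Re-join tokens that the rule-based tokenizer split too aggressively."""
    with doc.retokenize() as retokenizer:
        i = 0
        while i < len(doc) - 1:
            token = doc[i]
            # Placeholder rule: merge a token with its neighbour when they are
            # written without intervening whitespace (here: a proclitic waw).
            if token.whitespace_ == "" and token.text == "و":
                retokenizer.merge(doc[i : i + 2])
                i += 2  # skip the merged neighbour to avoid overlapping spans
            else:
                i += 1
    return doc
```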

Now, I guess that being able to put a component just after the spaCy Tokenizer in the training pipeline (and in the production pipeline) would be much cleaner and probably more efficient; something like the sketch below is what I have in mind.
Could somebody answer my question and/or suggest a solution for my problem? Thanks in advance!
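To make the idea concrete, a minimal sketch, assuming a stateless component and the hypothetical name "arabic_retokenizer":

```python
# functions.py -- supplied to training with:
#   python -m spacy train config.cfg --code functions.py
from spacy.language import Language
from spacy.tokens import Doc

@Language.component("arabic_retokenizer")
def arabic_retokenizer(doc: Doc) -> Doc:
    # The post-processing sketched above would go here.
    return doc
```

and, in the training config, the component listed first in the pipeline:

```ini
[nlp]
pipeline = ["arabic_retokenizer","tok2vec","ner"]

[components.arabic_retokenizer]
factory = "arabic_retokenizer"
```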

@svlandeg svlandeg added the feat / tokenizer Feature: Tokenizer label May 15, 2024