We read every piece of feedback, and take your input very seriously.
To see all available qualifiers, see our documentation.
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SPMConverter
The SPM converter (and all that inherit / use it) does not take into account the user defined symbols, leading to issues like this one: https://huggingface.co/01-ai/Yi-9B/discussions/11#6643844f6ac3fe108e3d5190
SPM
It also does not really take into account the prefix space param which can and should be extracted from the proto:
split_by_unicode_script: true split_by_number: true split_by_whitespace: true treat_whitespace_as_suffix: false allow_whitespace_only_pieces: true split_digits: true
and
normalizer_spec { name: "identity" precompiled_charsmap: "" add_dummy_prefix: false remove_extra_whitespaces: false normalization_rule_tsv: "" }
cc @itazap, on a more general converter!
The text was updated successfully, but these errors were encountered:
adding user defined tokens #30824
83e3e1f
996ff22
8b0aa67
3edfd83
24ea0cd
Successfully merging a pull request may close this issue.
The
SPM
converter (and all that inherit / use it) does not take into account the user defined symbols, leading to issues like this one: https://huggingface.co/01-ai/Yi-9B/discussions/11#6643844f6ac3fe108e3d5190It also does not really take into account the prefix space param which can and should be extracted from the proto:
and
cc @itazap, on a more general converter!
The text was updated successfully, but these errors were encountered: