Word "C++" is tokenized incorrectly and can not be whitelisted #272

ravenexp · 2022-08-17T12:03:18Z

Describe the bug

It is not possible to whitelist the word "C++" by adding it to the local Hunspell dictionary.

Adding "^[cC][+][+]$" to the transform_regex list also does not help.

To Reproduce

Steps to reproduce the behaviour:

1. Create a file containing the word "C++".
2. Add "C++" to the local Hunspell dictionary.
3. Run cargo spellcheck ....
4. A spelling error message is displayed for every "+" in "C++".

Expected behavior

Hunspell finds "C++" in the local dictionary and accepts it as correct.

Screenshots

error: spellcheck(Hunspell)
    --> /home/x/y.md:252
     |
 252 | Specifically, the GNU C++ compiler version 8.2 or newer and
     |                        ^
     |   Possible spelling mistake found.
error: spellcheck(Hunspell)
    --> /home/x/y.md:252
     |
 252 | Specifically, the GNU C++ compiler version 8.2 or newer and
     |                         ^
     |   Possible spelling mistake found.

Please complete the following information:

System: Arch Linux
Obtained: pacman
Version: cargo-spellcheck 0.11.2


ravenexp · 2022-08-17T12:14:31Z

Oh, I've accidentally found a workaround while figuring out how to make cargo-spellcheck not complain about "—" (EM-DASH).

Adding

transform_regex = [..., "^[+]$"]

to the config makes cargo-spellcheck accept "C++" as a correct word.
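
For reference, a minimal sketch of how that entry might look in the cargo-spellcheck configuration file. The `[Hunspell.quirks]` table name is an assumption not confirmed in this thread; compare it against the output of `cargo spellcheck config --stdout`. Entries already in the list should be kept next to the new one. The earlier attempt with "^[cC][+][+]$" presumably fails because the regex is applied per token and the tokenizer never hands over "C++" as a whole, which is consistent with the error output flagging each "+" separately.

```toml
# Assumed location of transform_regex; verify against your own config dump.
[Hunspell.quirks]
# "^[+]$" matches a token that consists of a single "+".
# Keep any entries you already have in this list.
transform_regex = ["^[+]$"]
```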

drahnr · 2022-08-17T12:16:39Z

The workaround is exactly this: allow "+" tokens. Tokenization is done by a third-party library and will never be perfect. Either use ``` or add the workaround you found.

If you would like to make spellcheck aware of additional splitchars, there is tokenization_splitchars in [Hunspell].
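
A hedged sketch of what that could look like. The string below is illustrative only and is not the real default; the safer route is to copy the default value printed by `cargo spellcheck config --stdout` and append "+" to it, on the assumption that setting the key replaces the default rather than extending it.

```toml
[Hunspell]
# Illustrative value, NOT the actual default: take the default string from
# `cargo spellcheck config --stdout` and add "+" at the end.
tokenization_splitchars = "\",;:.!?#(){}[]|/_-+"
```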

ravenexp · 2022-08-17T12:21:04Z

> If you would like to make spellcheck aware of additional splitchars, there is tokenization_splitchars in [Hunspell].

Thanks, that's even better!

BTW, it's not mentioned in

https://github.com/drahnr/cargo-spellcheck/blob/master/docs/configuration.md

and I had to run cargo spellcheck config --stdout to find out about this parameter.

ravenexp added the bug label Aug 17, 2022

ravenexp assigned drahnr Aug 17, 2022

drahnr added the checker / hunspell and tokenization labels Aug 17, 2022
