hunspell stumbles over copyright symbol #281

Open
drahnr opened this issue Sep 16, 2022 · 5 comments

@drahnr
Owner

drahnr commented Sep 16, 2022

Describe the bug

Encountered error Utf8Error { valid_up_to: 2, error_len: Some(1) } returned from Hunspell_suggest(handle, ["\"\\xc2\\xa9\""]): 0: "\xc2\x80\x93"
[2022-09-16T10:03:50Z DEBUG hunspell] © --{suggest}--> []

To Reproduce

Steps to reproduce the behaviour:

  1. A file containing ©
  2. Run cargo spellcheck file.rs
  3. ...

Expected behavior

Handle or ignore the symbol. Currently hunspell-rs is hacked to print an error.
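One way the "ignore" option could look, as a minimal sketch: `decode_suggestions` is a hypothetical helper, not part of hunspell-rs, and it assumes the raw suggestions are available as byte vectors before any conversion to `String`.

```rust
/// Hypothetical helper (not hunspell-rs API): turn raw suggestion bytes into
/// strings and skip every entry that is not valid UTF-8.
fn decode_suggestions(raw: &[Vec<u8>]) -> Vec<String> {
    raw.iter()
        .filter_map(|bytes| std::str::from_utf8(bytes).ok())
        .map(str::to_owned)
        .collect()
}

fn main() {
    let raw = vec![
        b"copyright".to_vec(),
        // The byte sequence reported in the error message above.
        vec![0xC2, 0x80, 0x93],
    ];
    // Only the valid entry survives.
    println!("{:?}", decode_suggestions(&raw));
}
```

Dropping invalid entries hides the symptom only; it does not address why the bytes are wrong in the first place.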

Screenshots

Please complete the following information:

  • System: Fedora
  • Obtained: cargo + git
  • Version: 0.12.2 / git
drahnr added the bug (Something isn't working) label Sep 16, 2022
drahnr self-assigned this Sep 16, 2022
@drahnr
Owner Author

drahnr commented Sep 16, 2022

CC @lopopolo, that was the issue at hand: the suggest call should accept 0xC2 0xA9, since that is valid UTF-8 (the encoding of ©), but it returns garbage suggestions instead. The issue is still present in 0.12.2, where it only shows up as a verbose message; it will be handled in the next release.
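For reference, a small std-only check of both byte sequences from the report. `describe` is an illustrative helper name, not something from cargo-spellcheck or hunspell-rs.

```rust
/// Illustrative helper: report whether `bytes` is valid UTF-8 and, if not,
/// where validation stops.
fn describe(bytes: &[u8]) -> String {
    match std::str::from_utf8(bytes) {
        Ok(s) => format!("valid: {}", s),
        Err(e) => format!(
            "invalid after {} byte(s), error_len {:?}",
            e.valid_up_to(),
            e.error_len()
        ),
    }
}

fn main() {
    // The input passed to suggest: U+00A9 encoded as UTF-8.
    println!("{}", describe(&[0xC2, 0xA9]));
    // The bytes that came back: 0xC2 0x80 decodes as U+0080, then 0x93 is a
    // stray continuation byte.
    println!("{}", describe(&[0xC2, 0x80, 0x93]));
}
```

The second result matches the `Utf8Error { valid_up_to: 2, error_len: Some(1) }` in the original report.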

@lopopolo
Contributor

Oh awesome. That's in the generated headers of the Unicode files. Great catch @drahnr and thanks for debugging!

@drahnr
Owner Author

drahnr commented Sep 17, 2022

@lopopolo I realized you have a custom dict that contains a `-`; removing the single-char items from the list resolves the issue. It seems that trips the parser, makes its way into the LUT inside hunspell, and then surfaces as some byte sequences.

@lopopolo
Contributor

I just tried this workaround and that seemed to work! I get no warning messages in cargo-spellcheck 0.12.1.

@drahnr
Owner Author

drahnr commented Sep 28, 2022

The core issue is that Hunspell applies the encoding declared in the affix file to the dictionaries as well. For en_us.aff (both the builtin one and the one shipped with Fedora 36) this was Latin-1 rather than UTF-8. Hunspell then treats all input as being in the affix file's encoding, implicitly, through the prefix tree it uses.
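To illustrate the mismatch, here is a std-only sketch of what happens when the UTF-8 bytes of © are read as Latin-1 (ISO-8859-1). `latin1_to_string` is a helper written for this example; it does not reproduce Hunspell's internals.

```rust
/// Decode bytes as ISO-8859-1: every byte maps to the Unicode code point
/// with the same numeric value.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

fn main() {
    let utf8_bytes = "©".as_bytes(); // [0xC2, 0xA9]
    let misread = latin1_to_string(utf8_bytes);
    // One character in UTF-8 turns into two Latin-1 characters.
    println!("{} ({} chars)", misread, misread.chars().count());
}
```

A single UTF-8 character is thus seen as two separate Latin-1 characters by anything that assumes the affix file's encoding.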

A solution would be to use e.g. encoding_rs to re-encode the dictionaries to UTF-8 and only then feed them to Hunspell, or to reject every declared encoding other than UTF-8.
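A rough sketch of both options, using only std instead of encoding_rs so it stays self-contained. It assumes the encoding is declared through the affix file's `SET` directive and handles just `UTF-8` and `ISO8859-1`; the function names are made up for this example, and treating a missing `SET` line as an error is a choice of the sketch.

```rust
/// Decode ISO-8859-1 bytes: each byte is the code point of the same value.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| char::from(b)).collect()
}

/// Find the encoding named by the first `SET <encoding>` line, if any.
fn declared_encoding(aff: &str) -> Option<&str> {
    aff.lines().find_map(|line| {
        let mut words = line.split_whitespace();
        match (words.next(), words.next()) {
            (Some("SET"), Some(enc)) => Some(enc),
            _ => None,
        }
    })
}

/// Return the dictionary as UTF-8 text, re-encoding Latin-1 input and
/// rejecting every other declared encoding.
fn dictionary_as_utf8(aff: &[u8], dic: &[u8]) -> Result<String, String> {
    // Scanning the affix file as Latin-1 never fails and keeps the ASCII
    // directives intact.
    let aff_text = latin1_to_string(aff);
    let encoding = declared_encoding(&aff_text).map(str::to_ascii_uppercase);
    match encoding.as_deref() {
        Some("UTF-8") => String::from_utf8(dic.to_vec()).map_err(|e| e.to_string()),
        Some("ISO8859-1") => Ok(latin1_to_string(dic)),
        Some(other) => Err(format!("unsupported encoding: {}", other)),
        None => Err("no SET directive found".to_string()),
    }
}

fn main() {
    // A Latin-1 dictionary entry containing the byte 0xA9 (the © sign).
    let result = dictionary_as_utf8(b"SET ISO8859-1\n", &[0xA9]);
    println!("{:?}", result);
}
```

The affix file itself would presumably need the same treatment, including its `SET` line, before both are handed to Hunspell.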
