Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mult lang spell don't work #270

Open
brolny opened this issue Oct 17, 2021 · 17 comments
Open

mult lang spell don't work #270

brolny opened this issue Oct 17, 2021 · 17 comments

Comments

@brolny
Copy link

brolny commented Oct 17, 2021

Multiple languages spell check stop work after v8.1.5 (probably)
Now I have
DSpellCheck 1.4.21.0 - 64bit
Notepad++ v8.1.7 - 64bit
and spell check works only for 1 language...

multi
en
hy
ru

@Predelnik
Copy link
Owner

Unfortunately cannot be reproduced from my side, maybe it is some problem with settings so multiple languages are no longer chosen, what it shows if you press Set Multiple Languages...?

@er13
Copy link

er13 commented Jul 9, 2022

@Predelnik

I'm also affected by the problem (or a very similar problem) since a couple of months. But only recently found a way to reliably reproduce it.

Notepad++ 8.4.3 (64-bit version)
DSpellCheck-1.4.24.0
DSpellCheck is configured to use Hunspell with the "Multiple Languages"-option.

Test case

correct правильный, incarrect непровильный
  • [working configuration] "Multiple Languages" is set to "English (United Stated), Russian (Russia)":
    Everything works as expected, i.e. "incarrect" and "непровильный" are marked as misspelled.

  • [failing configuration] "Multiple Languages" is set to "English (United Stated), Russian (Russia), and Ukrainian (Ukraine)":
    DSpellCheck fails to detect all misspelled words, only "непровильный" is marked as misspelled. "incarrect" and any other arbitrary sequence of Latin alphabet letters are considered to be written correctly.

NB:
A very similar problem has recently been fixed in Firefox: Firefox bug report, Firefox fix.
Perhaps taking a look at the Firefox fix might help fixing the problem in DSpellCheck.

@Predelnik
Copy link
Owner

Thank you for the test case @er13, I can reproduce it and will certainly look at what is going on.

@Predelnik
Copy link
Owner

So I understand the problem but I'm not sure whose responsibility it should be to fix that unfortunately. Also as far as I understand it seems to be different from firefox. The problem is that uk_UA.aff contains lines ICONV a 0 and so on for each Latin character. What it means is before we check any word against the dictionary all the Latin characters are changed to 0 (not \0 byte but literal 0) then Hunspell internally thinks that such words are actually numbers and tells that they are correct. Since my logic for multiple dictionary usage is checking against each dictionary and looking if any of them is correct in this case word is treated as correct.

@er13
Copy link

er13 commented Jul 10, 2022

Hmm, what about ignoring (i.e. not checking) the words containing 0 only characters (just like Firefox ignores \0-words). Still a workaround, but better than the wrong behavior.

On the other hand, I also don't understand all these iconv [a-zA-Z] 0 lines in the Ukrainian dictionary and would rather say, the proper solution would be to remove them from the Ukrainian dictionary as they cause any sequence of Latin letters to be treated as a correctly spelled word, also with only one dictionary - the Ukrainian one - enabled.

Edit:
the iconv-lines exist since April 2021
https://cgit.freedesktop.org/libreoffice/dictionaries/commit/?id=06a28cf2efe2e3fa887912989650dcaaf05f0958
https://gerrit.libreoffice.org/c/dictionaries/+/114308
https://bugs.documentfoundation.org/show_bug.cgi?id=141408

Edit2:
the sources of the Ukrainian dictionary
https://github.com/brown-uk/dict_uk

@er13
Copy link

er13 commented Jul 11, 2022

Created an issue in the project developing the Ukrainian dictionary.

brown-uk/dict_uk#306

@Predelnik
Copy link
Owner

I believe it's possible to have a workaround which ignores words which had something converted to 0 because luckily Hunspell exposes the function to perform this conversion externally, I will have to consider doing it.

@er13
Copy link

er13 commented Jul 12, 2022

@Predelnik: the workaround you implemented in c31e59b is not good.

The typical error/typo in Ukrainian is the usage of Latin «i» instead of Ukrainian «і» (e.g. «гранiтний» instead of «гранітний»). With you workaround the Latin «i» would be replaced with 0, the number of zeros would get increased and the whole word would be excluded from the spellchecking. This is not good.

Please check as requested for the whole word after dictionary conversion contains zeros only instead of the number of zeros after dictionary conversion has increased. Thanks!

@Predelnik
Copy link
Owner

On one hand what you are saying makes sense but seems like fix technically still works because I only add logic to treat word with Latin characters as incorrect without changes for other cases. So for the word containing Latin i it will stay incorrect as it was with 0 replacement, just dictionary won't be consulted about it. Strictly speaking it doesn't matter which logic there will be, yours or mine, unless there is some word which is correct after changing its Latin characters to zero and I don't think there is such a word.

@er13
Copy link

er13 commented Jul 12, 2022

Hmm, to be honest I don't see your

because I only add logic to treat word with Latin characters as incorrect without changes for other cases

neither in c31e59b nor in src/spellers/HunspellInterface.cpp. Could you please point me out where this check for Latin only words takes place?

Besides that a performance argument:

  • With you check you have to iterate completely over every original word (because you count the number of zeros in it), you also have to iterate over every dictionary converted word (also because you count the number of zeros) and then compare both numbers. => So assuming your text contains N words you have definitely 2*N iterations, whereas each iteration definitely iterates over the whole word.
  • With my way to check you have to iterate only over the dictionary converted words and could break the iteration for each word after the first non-zero character. => So assuming the same text with N words you have only at most N iterations, whereas each iteration is only then takes places over the whole word if it's a Latin-characters-word.

@Predelnik
Copy link
Owner

Could you please point me out where this check for Latin only words takes place?

By word with latin characters I mean word containing some Latin characters so no there is no "check for Latin only words"

Performance argument doesn't matter because

  1. The largest cost is repeating conversion which Hunspell will do later anyway.
  2. I cannot break anywhere during conversion because it's done internally in Hunspell and I don't want to change its code.

I don't think all these things are important to be honest, if it works for your example case which seems to be true I fail to see the problem.

@er13
Copy link

er13 commented Jul 12, 2022

I mean

  • std::ranges::count(word_to_check, '0') is the complete iteration over the original word
  • std::ranges::count(buf, '0') is the complete iteration over the converted word

If you change the name of the function from has_more_zeroes_after_conv to contains_zeros_only_after_conv and change the line

return std::ranges::count(word_to_check, '0') < std::ranges::count(buf, '0');

to

for (auto & ch: buf) {
    if (ch != '0')
        return false;
}
return true;

then you would significantly increase the performance.

Off course the costly line dic.hunspell->input_conv(word_to_check, buf); is always (i.e. in both cases) necessary.

@Predelnik
Copy link
Owner

But it's pretty obvious that in current case I can also short-circuit if I encountered any 0 which was not 0 before, I just don't think it's that important. Also not checking original word completely is not entirely correct also because the word initially made out of all 0s would probably needs to be treated as correct. My point is that all of this is not really important, it's just a workaround.

@er13
Copy link

er13 commented Jul 12, 2022

Ok it's up to you, I just provided my argumentative feedback after reviewing the code.

My point is "less optimal code there", "less optimal code somewhere else" - at some point in time you will notice a performance degradation. If you have a chance to do it in a more optimal way, then do it straight away.

@er13
Copy link

er13 commented Jul 13, 2022

@Predelnik
It looks like the developer of the Ukrainian dictionary is willing to remove the iconv lines in one of the upcoming releases. So wait a bit with merging the workaround into the master branch. It might become unnecessary.

I'm fine with the workaround based on the manual edit of uk_UA.aff until then.

And thanks once again for analyzing the issue. I would however consider my case as unrelated to that of the original bug reporter. The original bug reporter doesn't use the Ukrainian dictionary at all.

@Predelnik
Copy link
Owner

Thank you for the information, @er13, I will probably abstain from merging the workaround for now.
Yes, the original bug is probably something different but it's hard to tell without feedback from the author unfortunately.

@tomasz1986
Copy link

I think the Korean language has a similar problem, although I'm unsure whether it's the same culprit.

I'm a total newbie when it comes to dictionary files structure, but https://raw.githubusercontent.com/LibreOffice/dictionaries/master/ko_KR/ko_KR.dic contains these lines:

a/32
b/32
c/32
d/32
e/32
f/32
g/32
h/32
i/32
j/32
k/32
l/32
m/32
n/32
o/32
p/32
q/32
r/32
s/32
t/32
u/32
v/32
w/32
x/32
y/32
z/32

and after removing them, the multi language spell check seems to be working correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants