mult lang spell don't work #270

brolny · 2021-10-17T13:12:26Z

Multiple languages spell check stop work after v8.1.5 (probably)
Now I have
DSpellCheck 1.4.21.0 - 64bit
Notepad++ v8.1.7 - 64bit
and spell check works only for 1 language...

Predelnik · 2021-11-06T16:48:50Z

Unfortunately cannot be reproduced from my side, maybe it is some problem with settings so multiple languages are no longer chosen, what it shows if you press Set Multiple Languages...?

er13 · 2022-07-09T20:52:32Z

@Predelnik

I'm also affected by the problem (or a very similar problem) since a couple of months. But only recently found a way to reliably reproduce it.

Notepad++ 8.4.3 (64-bit version)
DSpellCheck-1.4.24.0
DSpellCheck is configured to use Hunspell with the "Multiple Languages"-option.

Test case

correct правильный, incarrect непровильный

[working configuration] "Multiple Languages" is set to "English (United Stated), Russian (Russia)":
Everything works as expected, i.e. "incarrect" and "непровильный" are marked as misspelled.
[failing configuration] "Multiple Languages" is set to "English (United Stated), Russian (Russia), and Ukrainian (Ukraine)":
DSpellCheck fails to detect all misspelled words, only "непровильный" is marked as misspelled. "incarrect" and any other arbitrary sequence of Latin alphabet letters are considered to be written correctly.

NB:
A very similar problem has recently been fixed in Firefox: Firefox bug report, Firefox fix.
Perhaps taking a look at the Firefox fix might help fixing the problem in DSpellCheck.

Predelnik · 2022-07-10T19:29:50Z

Thank you for the test case @er13, I can reproduce it and will certainly look at what is going on.

Predelnik · 2022-07-10T20:06:53Z

So I understand the problem but I'm not sure whose responsibility it should be to fix that unfortunately. Also as far as I understand it seems to be different from firefox. The problem is that uk_UA.aff contains lines ICONV a 0 and so on for each Latin character. What it means is before we check any word against the dictionary all the Latin characters are changed to 0 (not \0 byte but literal 0) then Hunspell internally thinks that such words are actually numbers and tells that they are correct. Since my logic for multiple dictionary usage is checking against each dictionary and looking if any of them is correct in this case word is treated as correct.

er13 · 2022-07-10T22:36:53Z

Hmm, what about ignoring (i.e. not checking) the words containing 0 only characters (just like Firefox ignores \0-words). Still a workaround, but better than the wrong behavior.

On the other hand, I also don't understand all these iconv [a-zA-Z] 0 lines in the Ukrainian dictionary and would rather say, the proper solution would be to remove them from the Ukrainian dictionary as they cause any sequence of Latin letters to be treated as a correctly spelled word, also with only one dictionary - the Ukrainian one - enabled.

Edit:
the iconv-lines exist since April 2021
https://cgit.freedesktop.org/libreoffice/dictionaries/commit/?id=06a28cf2efe2e3fa887912989650dcaaf05f0958
https://gerrit.libreoffice.org/c/dictionaries/+/114308
https://bugs.documentfoundation.org/show_bug.cgi?id=141408

Edit2:
the sources of the Ukrainian dictionary
https://github.com/brown-uk/dict_uk

er13 · 2022-07-11T08:16:58Z

Created an issue in the project developing the Ukrainian dictionary.

brown-uk/dict_uk#306

Predelnik · 2022-07-11T08:29:01Z

I believe it's possible to have a workaround which ignores words which had something converted to 0 because luckily Hunspell exposes the function to perform this conversion externally, I will have to consider doing it.

er13 · 2022-07-12T09:48:28Z

@Predelnik: the workaround you implemented in c31e59b is not good.

The typical error/typo in Ukrainian is the usage of Latin «i» instead of Ukrainian «і» (e.g. «гранiтний» instead of «гранітний»). With you workaround the Latin «i» would be replaced with 0, the number of zeros would get increased and the whole word would be excluded from the spellchecking. This is not good.

Please check as requested for the whole word after dictionary conversion contains zeros only instead of the number of zeros after dictionary conversion has increased. Thanks!

Predelnik · 2022-07-12T10:05:40Z

On one hand what you are saying makes sense but seems like fix technically still works because I only add logic to treat word with Latin characters as incorrect without changes for other cases. So for the word containing Latin i it will stay incorrect as it was with 0 replacement, just dictionary won't be consulted about it. Strictly speaking it doesn't matter which logic there will be, yours or mine, unless there is some word which is correct after changing its Latin characters to zero and I don't think there is such a word.

er13 · 2022-07-12T10:52:01Z

Hmm, to be honest I don't see your

because I only add logic to treat word with Latin characters as incorrect without changes for other cases

neither in c31e59b nor in src/spellers/HunspellInterface.cpp. Could you please point me out where this check for Latin only words takes place?

Besides that a performance argument:

With you check you have to iterate completely over every original word (because you count the number of zeros in it), you also have to iterate over every dictionary converted word (also because you count the number of zeros) and then compare both numbers. => So assuming your text contains N words you have definitely 2*N iterations, whereas each iteration definitely iterates over the whole word.
With my way to check you have to iterate only over the dictionary converted words and could break the iteration for each word after the first non-zero character. => So assuming the same text with N words you have only at most N iterations, whereas each iteration is only then takes places over the whole word if it's a Latin-characters-word.

Predelnik · 2022-07-12T11:05:04Z

Could you please point me out where this check for Latin only words takes place?

By word with latin characters I mean word containing some Latin characters so no there is no "check for Latin only words"

Performance argument doesn't matter because

The largest cost is repeating conversion which Hunspell will do later anyway.
I cannot break anywhere during conversion because it's done internally in Hunspell and I don't want to change its code.

I don't think all these things are important to be honest, if it works for your example case which seems to be true I fail to see the problem.

er13 · 2022-07-12T11:41:43Z

I mean

std::ranges::count(word_to_check, '0') is the complete iteration over the original word
std::ranges::count(buf, '0') is the complete iteration over the converted word

If you change the name of the function from has_more_zeroes_after_conv to contains_zeros_only_after_conv and change the line

return std::ranges::count(word_to_check, '0') < std::ranges::count(buf, '0');

to

for (auto & ch: buf) {
    if (ch != '0')
        return false;
}
return true;

then you would significantly increase the performance.

Off course the costly line dic.hunspell->input_conv(word_to_check, buf); is always (i.e. in both cases) necessary.

Predelnik · 2022-07-12T11:49:30Z

But it's pretty obvious that in current case I can also short-circuit if I encountered any 0 which was not 0 before, I just don't think it's that important. Also not checking original word completely is not entirely correct also because the word initially made out of all 0s would probably needs to be treated as correct. My point is that all of this is not really important, it's just a workaround.

er13 · 2022-07-12T12:04:36Z

Ok it's up to you, I just provided my argumentative feedback after reviewing the code.

My point is "less optimal code there", "less optimal code somewhere else" - at some point in time you will notice a performance degradation. If you have a chance to do it in a more optimal way, then do it straight away.

er13 · 2022-07-13T07:54:41Z

@Predelnik
It looks like the developer of the Ukrainian dictionary is willing to remove the iconv lines in one of the upcoming releases. So wait a bit with merging the workaround into the master branch. It might become unnecessary.

I'm fine with the workaround based on the manual edit of uk_UA.aff until then.

And thanks once again for analyzing the issue. I would however consider my case as unrelated to that of the original bug reporter. The original bug reporter doesn't use the Ukrainian dictionary at all.

Predelnik · 2022-07-13T07:59:59Z

Thank you for the information, @er13, I will probably abstain from merging the workaround for now.
Yes, the original bug is probably something different but it's hard to tell without feedback from the author unfortunately.

tomasz1986 · 2023-11-25T16:48:34Z

I think the Korean language has a similar problem, although I'm unsure whether it's the same culprit.

I'm a total newbie when it comes to dictionary files structure, but https://raw.githubusercontent.com/LibreOffice/dictionaries/master/ko_KR/ko_KR.dic contains these lines:

a/32
b/32
c/32
d/32
e/32
f/32
g/32
h/32
i/32
j/32
k/32
l/32
m/32
n/32
o/32
p/32
q/32
r/32
s/32
t/32
u/32
v/32
w/32
x/32
y/32
z/32

and after removing them, the multi language spell check seems to be working correctly.

Predelnik mentioned this issue Jul 12, 2022

Add a workaround for the case when dictionary converts characters to 0 #302

Closed

er13 mentioned this issue Jul 12, 2022

[hunspell dictionary] "iconv * 0" lines cause any Latin-characters-words to be treated as correctly spelled brown-uk/dict_uk#306

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mult lang spell don't work #270

mult lang spell don't work #270

brolny commented Oct 17, 2021

Predelnik commented Nov 6, 2021

er13 commented Jul 9, 2022

Predelnik commented Jul 10, 2022

Predelnik commented Jul 10, 2022

er13 commented Jul 10, 2022 •

edited

er13 commented Jul 11, 2022

Predelnik commented Jul 11, 2022

er13 commented Jul 12, 2022

Predelnik commented Jul 12, 2022

er13 commented Jul 12, 2022

Predelnik commented Jul 12, 2022

er13 commented Jul 12, 2022

Predelnik commented Jul 12, 2022

er13 commented Jul 12, 2022

er13 commented Jul 13, 2022 •

edited

Predelnik commented Jul 13, 2022

tomasz1986 commented Nov 25, 2023

mult lang spell don't work #270

mult lang spell don't work #270

Comments

brolny commented Oct 17, 2021

Predelnik commented Nov 6, 2021

er13 commented Jul 9, 2022

Predelnik commented Jul 10, 2022

Predelnik commented Jul 10, 2022

er13 commented Jul 10, 2022 • edited

er13 commented Jul 11, 2022

Predelnik commented Jul 11, 2022

er13 commented Jul 12, 2022

Predelnik commented Jul 12, 2022

er13 commented Jul 12, 2022

Predelnik commented Jul 12, 2022

er13 commented Jul 12, 2022

Predelnik commented Jul 12, 2022

er13 commented Jul 12, 2022

er13 commented Jul 13, 2022 • edited

Predelnik commented Jul 13, 2022

tomasz1986 commented Nov 25, 2023

er13 commented Jul 10, 2022 •

edited

er13 commented Jul 13, 2022 •

edited