unmunch + hunspell, is anything wrong here? #968

Open
olea opened this issue Aug 14, 2023 · 1 comment


olea commented Aug 14, 2023

Hi:

I help out on the hunspell-es team (RLA-ES). I ran this test just for fun, but the result looks odd to me:

$ git clone https://git.libreoffice.org/dictionaries/
$ DICC=$(pwd)/dictionaries/es/
$ unmunch ${DICC}/es.dic ${DICC}/es.aff > es.unmunched
$ wc -l es.unmunched
1284912 es.unmunched
$ hunspell -d "${DICC}/es" -l es.unmunched |wc -l
520877

Honestly, I would have expected the spellchecking run to report 0 lines. What am I doing wrong? Have I misunderstood what the unmunch output is? Or is there a significant problem in the Spanish dictionary?

I really don't know how to interpret these results or what action, if any, should be taken.

Thanks a lot.
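
To put the two counts above in proportion and to look at a few of the rejected forms, something like the following sketch may help. The function names `rejected_pct` and `sample_rejected` are made up for illustration; the hunspell call reuses only the `-d` and `-l` options from the commands above, and the default sample size of 20 is an arbitrary choice.

```shell
# rejected_pct REJECTED TOTAL: percentage of lines that hunspell flagged.
# Hypothetical helper, not part of hunspell.
rejected_pct() {
    LC_ALL=C awk -v r="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * r / t }'
}

# sample_rejected DICT_BASE WORDLIST [N]: first N distinct rejected words.
# Hypothetical helper; relies only on the -d and -l options used above.
sample_rejected() {
    hunspell -d "$1" -l "$2" | LC_ALL=C sort -u | head -n "${3:-20}"
}

# Counts taken from the output above.
rejected_pct 520877 1284912   # prints 40.5
```

Running `sample_rejected "${DICC}/es" es.unmunched` would then list a handful of the flagged forms, which is usually enough to see what kind of word is being rejected.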


fin-w commented Jan 15, 2024

I'm not a Hunspell developer, but as I understand it, what you're seeing is a problem with Unmunch, not with your dictionary. I have the same problem with my dictionaries. I think Unmunch tries to generate the words the dictionary accepts, but some of the forms it produces are not actually valid, so Hunspell flags them as errors in the Unmunch output. Unmunch also fails to generate all the words that the dictionary files contain: for example, I just ran Unmunch on my dictionary and it left out many words that I know the dictionary contains.

If you want to generate all the words in your dictionary, you may be interested in a program I've written that does this: https://github.com/fin-w/LibreOffice-Geiriadur-Cymraeg-Welsh-Dictionary/blob/main/wordforms
It's a complete rewrite of Hunspell's Wordforms script, with the same functionality, (usually) faster run times for generating the affixed variations of a single word, and the ability to generate every word variation in the dictionary, as Unmunch is supposed to do. I should warn you, though: it's only a proof of concept, and it takes a long time to generate every word in the dictionary (particularly with the Spanish dictionary). Instead of using Unmunch, you can run wordforms -g es.aff es.dic es.unmunched with my script, and it should generate all the words in your dictionary and put them in es.unmunched.
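
If you want to check both claims above on your own files (forms that only Unmunch produces, and forms that Unmunch misses), a minimal sketch is to compare the two word lists with `sort` and `comm`. The helper name `only_in_first` and the file name `es.wordforms` are hypothetical, and the sketch assumes both files contain exactly one word per line, which I have not verified for every tool's output.

```shell
# only_in_first A B: print lines that appear in file A but not in file B.
# Hypothetical helper; assumes one word per line in both files.
only_in_first() {
    oif_a=$(mktemp)
    oif_b=$(mktemp)
    # Sort both lists byte-wise so that sort and comm agree on the ordering.
    LC_ALL=C sort -u "$1" > "$oif_a"
    LC_ALL=C sort -u "$2" > "$oif_b"
    # -2 drops lines unique to B, -3 drops lines common to both files.
    LC_ALL=C comm -23 "$oif_a" "$oif_b"
    rm -f "$oif_a" "$oif_b"
}
```

For example, with the Unmunch output in es.unmunched and the other list in es.wordforms, `only_in_first es.unmunched es.wordforms | wc -l` counts the forms that only Unmunch generated, and swapping the two arguments counts the forms that Unmunch left out.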
