How does Hunspell handle agglutinative languages like Turkish? #985

Open
lancejpollard opened this issue Nov 5, 2023 · 0 comments

lancejpollard commented Nov 5, 2023

The Turkish Hunspell .aff file has more than 50,000 affixes, all of which specify N (No) for the suffix. Many of the suffixes are also very long, for example:

SFX 3 N 1
SFX 3 0 cilerdensin .

According to this blog post, Turkish has around 700 suffixes (by my last count). They can be combined in arbitrary ways, sometimes with more than 10 suffixes concatenated onto the base word. In principle, I would expect a spell checker to store something like a directed acyclic graph (DAG) of suffix transitions, so that it can compute possible words on the fly, including ones it has never encountered before. Instead, the Hunspell Turkish dictionary appears to precompile the possible suffix chains and emit each one as its own SFX ... N (no chaining) entry. Am I reading that correctly?
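To make the graph idea concrete, here is a minimal Python sketch of what I have in mind. Everything in it is an assumption for illustration: the state names, the four suffixes, and the stem `gemi` are a toy I made up around the `cilerdensin` entry above, it ignores vowel harmony and consonant changes, and it is not how Hunspell works internally.

```python
from functools import lru_cache

# Hypothetical toy morphotactics: state -> list of (suffix, next_state).
# Illustrative only; real Turkish needs vowel harmony and many more states.
SUFFIX_GRAPH = {
    "stem":   [("ci", "agent"), ("ler", "plural"), ("den", "case")],
    "agent":  [("ler", "plural"), ("den", "case")],
    "plural": [("den", "case")],
    "case":   [("sin", "end")],
    "end":    [],
}


def can_parse(rest: str, state: str = "stem") -> bool:
    """Return True if `rest` is a valid suffix chain starting from `state`."""
    if not rest:
        return True  # every state is accepting in this toy graph
    return any(
        rest.startswith(suffix) and can_parse(rest[len(suffix):], nxt)
        for suffix, nxt in SUFFIX_GRAPH[state]
    )


def is_word(word: str, stems: set[str]) -> bool:
    """Accept `word` if it is a known stem followed by a valid suffix chain."""
    return any(
        word.startswith(stem) and can_parse(word[len(stem):])
        for stem in stems
    )


@lru_cache(maxsize=None)
def count_chains(state: str = "stem") -> int:
    """Count distinct suffix chains from `state`, including the empty chain."""
    return 1 + sum(count_chains(nxt) for _, nxt in SUFFIX_GRAPH[state])


if __name__ == "__main__":
    stems = {"gemi"}
    print(is_word("gemicilerdensin", stems))  # ci + ler + den + sin
    print(is_word("gemidenler", stems))       # "ler" cannot follow "den" here
    print(count_chains())                     # chains a flat list would spell out
```

The point of `count_chains` is the comparison: the graph stores only four suffixes and seven transitions, while a flat list of precompiled entries needs one line per path through the graph, and that number grows quickly as suffixes are added.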

In newer versions of Hunspell, is there a more idiomatic way of solving this with fewer suffixes?

I feel like I read somewhere that Hunspell can only support 2 prefixes or 2 suffixes, or 1 of each together. Is a limit like that the issue here, and the reason the Turkish dictionary is organized this way?

Thank you so much for your help!
