How does Hunspell handle agglutinative languages like Turkish? #985

lancejpollard · 2023-11-05T01:39:36Z

The Turkish Hunspell .aff file has more than 50,000 affix entries, and every one of them has N (no) in its SFX header line. It also includes some very long suffixes, for example:

SFX 3 N 1
SFX 3 0 cilerdensin .

According to this blog post, Turkish has around 700 suffixes (as of the last time I counted). They can be combined in arbitrary ways, sometimes with more than 10 suffixes concatenated onto the base word. In principle, I would expect a spell checker to store something like a directed acyclic graph of suffixes, so that it can compute possible words it has never encountered before. Instead, it appears the Hunspell Turkish dictionary precompiles the possible suffix chains and lists each one as a single SFX ... N rule (no chaining). Am I reading that correctly?
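To make the graph idea concrete, here is a rough Python sketch. It is not how Hunspell works internally, and the suffix classes, the allowed transitions, and the stem are all made up for illustration (they are not taken from the Turkish dictionary and ignore vowel harmony). It only shows that a handful of graph edges can accept the long suffix above, where a precompiled list needs one entry per chain.

```python
from functools import lru_cache

# Hypothetical suffix classes (illustrative only, not real Turkish morphology).
SUFFIXES = {
    "DERIV": ["ci"],
    "PLURAL": ["ler"],
    "CASE": ["den"],
    "PERSON": ["sin"],
}

# Edges of the acyclic graph: which class may follow which.
# "START" stands for the bare stem.
NEXT = {
    "START": ["DERIV", "PLURAL", "CASE", "PERSON"],
    "DERIV": ["PLURAL", "CASE", "PERSON"],
    "PLURAL": ["CASE", "PERSON"],
    "CASE": ["PERSON"],
    "PERSON": [],
}

STEMS = {"gemi"}  # assumed one-word stem list for the demo


@lru_cache(maxsize=None)
def _match(rest: str, state: str) -> bool:
    """True if `rest` can be consumed by walking the graph from `state`."""
    if not rest:
        return True
    for cls in NEXT[state]:
        for suffix in SUFFIXES[cls]:
            if rest.startswith(suffix) and _match(rest[len(suffix):], cls):
                return True
    return False


def accepts(word: str) -> bool:
    """True if `word` is a known stem followed by a valid suffix chain."""
    return any(
        word.startswith(stem) and _match(word[len(stem):], "START")
        for stem in STEMS
    )


def count_chains(state: str = "START") -> int:
    """Number of distinct suffix chains from `state`, including the empty one."""
    total = 1
    for cls in NEXT[state]:
        total += len(SUFFIXES[cls]) * count_chains(cls)
    return total


print(accepts("gemicilerdensin"))  # ci + ler + den + sin
print(count_chains())              # entries a precompiled list would need
```

With only four suffixes the graph already stands in for 16 precompiled chains (counting the bare stem), which is the kind of growth I suspect is behind the 50,000 entries.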

In newer versions of Hunspell, is there a more idiomatic way of solving this with fewer suffixes?

I feel like I read somewhere that Hunspell can only support 2 prefixes, or 2 suffixes, or 1 of each together. Is a limit like that the issue here, and the reason the Turkish dictionary is organized this way?

Thank you so much for your help!
