Seemingly very similar suggestions are not found in French #963

Open
grothesque opened this issue Jul 27, 2023 · 4 comments

@grothesque

This is with Hunspell 1.7.1 as found in Debian bookworm. There is no particular configuration. The dictionary is the recommended one from the package hunspell-fr-classical.

I rely heavily on hunspell for correcting my correspondence in French. (Thanks!) As a non-native speaker of that language, I have particular difficulties with getting the accents right. I notice that quite often suggestions that would seem very similar are not found by hunspell.

In my experience, getting one accent wrong often means that no other mistake is allowed, or Hunspell will not find the correct suggestion. To me, this happens all the time...

Here are some examples:

$ echo telecharger wikipedia batimont | hunspell -d fr_FR
Hunspell 1.7.1
& telecharger 3 0: recharger, chanterelle, charlater
& wikipedia 1 12: stipendia
& batimont 1 22: intimation

Discussion:

  • telecharger -> télécharger: I would expect hunspell to find this one since it differs only by two accents. Instead it proposes words that are quite different!
  • wikipedia -> Wikipédia: Here what's missing is one accent and the capitalization of one letter.
  • batimont -> bâtiment: If hunspell considers pronunciation, these should be very similar.
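
The three examples can be compared with a small Python sketch that lowercases each word and strips its accents. This is only an illustration of how close the words are; the helper name `fold` is made up here and nothing in it reflects how Hunspell ranks suggestions.

```python
import unicodedata


def fold(word: str) -> str:
    """Lowercase a word and drop combining accents (é -> e, â -> a)."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# (typed word, expected correction) pairs taken from the report above.
pairs = [
    ("telecharger", "télécharger"),
    ("wikipedia", "Wikipédia"),
    ("batimont", "bâtiment"),
]

for typed, expected in pairs:
    same = fold(typed) == fold(expected)
    print(f"{typed} vs {expected}: identical after folding = {same}")
```

The first two pairs become identical once case and accents are ignored. The third still differs by one letter ("o" instead of "e"), so it needs more than accent handling.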
@shantanuo
Contributor

Do you have this line in your affix file?

REP e é

I guess you are using version 6.4 of the hunspell-fr-classical package. Can you try version 7.0?

@grothesque
Author

Do you have this line in your affix file?

REP e é

I have no personal affix files, but the files /usr/share/hunspell/fr*.aff contain it:

$ grep '^REP e é' /usr/share/hunspell/fr*.aff
/usr/share/hunspell/fr.aff:REP e é
/usr/share/hunspell/fr_BE.aff:REP e é
/usr/share/hunspell/fr_CA.aff:REP e é
/usr/share/hunspell/fr_CH.aff:REP e é
/usr/share/hunspell/fr_FR.aff:REP e é
/usr/share/hunspell/fr_LU.aff:REP e é
/usr/share/hunspell/fr_MC.aff:REP e é

I guess you are using version 6.4 of the hunspell-fr-classical package. Can you try version 7.0?

This is with Debian bookworm, so it's already 7.0:

$ apt policy hunspell-fr-classical 
hunspell-fr-classical:
  Installed: 1:7.0-1
  Candidate: 1:7.0-1
  Version table:
 *** 1:7.0-1 500
        500 http://deb.debian.org/debian bookworm/main amd64 Packages
        100 /var/lib/dpkg/status

@srtxg

srtxg commented Nov 5, 2023

Hunspell is good when there is one change from the correct spelling;
but it is quite bad when there are two (or more) changes.

It easily finds the correction for "télecharger" or "telécharger", but not for "telecharger".
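
A rough Python sketch of why a rule such as `REP e é` helps with one missing accent but not two. It assumes a replacement rule is tried at one position at a time, which matches the behaviour described here; the function `single_replacements` is hypothetical and is not Hunspell code.

```python
def single_replacements(word: str, pattern: str, replacement: str) -> list[str]:
    """Return every candidate made by applying the rule at exactly one position."""
    candidates = []
    start = word.find(pattern)
    while start != -1:
        candidates.append(word[:start] + replacement + word[start + len(pattern):])
        start = word.find(pattern, start + 1)
    return candidates


target = "télécharger"

# One accent missing: a single "e -> é" replacement can reach the target.
print(target in single_replacements("télecharger", "e", "é"))

# Two accents missing: no single replacement is enough.
print(target in single_replacements("telecharger", "e", "é"))
```

Under that assumption, "telecharger" would need the rule applied twice in the same candidate, which this kind of search never tries.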

It seems the fr_FR dictionary doesn't have phonetic rules.

For the Walloon dictionary I maintain, I have 188 phonetic rules.

Here are some possible French phonetic rules
(the "phonetic" symbol could be anything; I used "X" for the [S] sound, as in "chat, château"):

PHONE QU(EIÈÉÊÎ)- K
PHONE QU(AOUÅ)- KW
PHONE Q K
PHONE X(ABCDEÈÉÊÎFGIÎJKLMNOPRSTUVWYZ) KS
PHONE CH X
PHONE CE$ S
PHONE CES$ S
PHONE C(EÉÈÊÎ)- S
PHONE C$ _
PHONE C K
PHONE Ç S
PHONE AI E
PHONE E$ _
PHONE E E
PHONE É E
PHONE È E
PHONE Ê E
PHONE S$ _
PHONE AN ON
PHONE ON ON

The syntax can be seen here:
http://aspell.net/man-html/Phonetic-Code.html
(hunspell just included the phonet code from aspell).
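
As far as I can tell from the hunspell(5) man page, a PHONE table in an .aff file starts with a line giving the number of rules, like the `PHONE 7` line in the next comment. A small Python sketch for producing that header from a list of rules follows; the helper `phone_table` and the three sample rules are only for illustration.

```python
def phone_table(rules: list[tuple[str, str]]) -> str:
    """Format (pattern, code) pairs as PHONE lines, preceded by the count line."""
    lines = [f"PHONE {len(rules)}"]
    lines += [f"PHONE {pattern} {code}" for pattern, code in rules]
    return "\n".join(lines)


# Three of the rules listed above, used as sample input.
sample_rules = [("CH", "X"), ("Ç", "S"), ("AI", "E")]
print(phone_table(sample_rules))
```

Generating the count this way avoids having to update it by hand each time a rule is added or removed.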

@srtxg

srtxg commented Nov 5, 2023

Adding this to the fr_FR.aff file:

PHONE 7
PHONE Â A
PHONE AI E
PHONE AU O
PHONE EAU O
PHONE É E
PHONE EN Q
PHONE ON Q

I get good results for missing accents:

$ echo telecharger wikipedia | hunspell -d fr_FR
Hunspell 1.7.0
& telecharger 6 0: télécharger, recharger, rechargeable, contrecharge, préchargement, africaniser
& wikipedia 3 12: Wikipédia, vidéoclip, illuminer

(For "batimont", even with some rules that should give the same "phonetic" representation for "bâtiment" and "batimont", it still doesn't work; I don't understand why.)
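
To see what the seven rules above do to the example words, here is a toy Python sketch. It is a strong simplification and not the real phonet algorithm: it uppercases the word, scans left to right, applies the first listed rule that matches at each position, and ignores everything else the real syntax supports (`$`, `-`, parentheses, priorities). The name `phonetic_key` is invented for this sketch.

```python
# Rules copied from the comment above, in the same order (count line omitted).
RULES = [
    ("Â", "A"),
    ("AI", "E"),
    ("AU", "O"),
    ("EAU", "O"),
    ("É", "E"),
    ("EN", "Q"),
    ("ON", "Q"),
]


def phonetic_key(word: str) -> str:
    """Build a simplified phonetic key by applying the first matching rule at each position."""
    text = word.upper()
    out = []
    i = 0
    while i < len(text):
        for pattern, code in RULES:
            if text.startswith(pattern, i):
                out.append(code)
                i += len(pattern)
                break
        else:
            # No rule matches here: keep the letter unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


for typed, expected in [("telecharger", "télécharger"),
                        ("wikipedia", "Wikipédia"),
                        ("batimont", "bâtiment")]:
    print(typed, phonetic_key(typed), expected, phonetic_key(expected))
```

In this toy model all three pairs end up with the same key, including "batimont" and "bâtiment". So the missing "bâtiment" suggestion is probably not explained by the rules themselves, but by something else in how Hunspell uses the phonetic table.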
