GetCandidates shows candidates but FixFragment doesn't correct the sentence #113

axeligl · 2021-08-03T21:14:01Z

Hi everybody,

I trained a model for Spanish spellchecking and I'm using it to correct some OCR'd files (I'm digitizing some very old typewritten documents and want to improve the resulting text). The problem I have now is that the model shows me some candidates when I use GetCandidates, but it doesn't apply them when I use FixFragment. I wonder if it has something to do with context (n-grams and so on) or perhaps with the symbols in the sentence.

Here is an example:
text = 'posee respectívámente en las localí&ades de Carlos M. Naón'
corrector.GetCandidates(['localí&ades'],0) -> ('localidades', 'localí&ades', 'localicades', 'localizades')
corrector.FixFragment(text) -> 'posee respectívamente en las local&ades de Carlos M. Naón'

It corrects "respectívámente", but in "localí&ades" it only removes the "í". Maybe it has to do with the tokens it uses to decide where a word starts and ends.

Another example without special characters:
text='con anterioridad a la sanción de la ley'
corrector.FixFragment(text) -> 'con anterioridad la la canción de la ley'

It replaces "sanción" with "canción", but when I call GetCandidates, "sanción" is the first candidate:
corrector.GetCandidates(['sanción'],0) -> ('sanción', 'canción', 'sanación', 'sanchón', 'sención', 'anción', 'sancion', 'sanión', 'kanción', 'sunción', 'sandión', 'sancián', 'sansión', 'sanció')
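
To see exactly which words FixFragment rewrote, a small word-level diff can help. This is only a sketch: `changed_words` is a hypothetical helper, not part of the library, and it assumes both strings split into the same number of whitespace-separated tokens. The two strings are the input and output from the example above.

```python
def changed_words(original, fixed):
    """Pair up whitespace-separated tokens and return the pairs that differ."""
    before, after = original.split(), fixed.split()
    if len(before) != len(after):
        raise ValueError("token counts differ; cannot align word by word")
    return [(b, a) for b, a in zip(before, after) if b != a]


text = 'con anterioridad a la sanción de la ley'
fixed = 'con anterioridad la la canción de la ley'
print(changed_words(text, fixed))
# [('a', 'la'), ('sanción', 'canción')]
```

Besides "sanción", the diff shows that "a" was also rewritten to "la".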

I think this issue is similar to #85

Thanks for the help!

bakwc · 2021-08-03T21:25:30Z

Currently it doesn't expect special tokens inside words. You can try replacing every special token inside a word with some character; in that case it should start correcting them. I will think about how to handle this case better.
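
A minimal sketch of that preprocessing step, assuming a "special token" means a run of punctuation or symbols with a letter directly on each side. The helper name `mask_inner_symbols` and the placeholder `"x"` are hypothetical choices, not something the library prescribes. Note that this also rewrites legitimate in-word punctuation such as hyphens or apostrophes, so the pattern may need narrowing for real data.

```python
import re

# A run of non-word, non-space characters with a letter on both sides,
# e.g. the "&" in "localí&ades". [^\W\d_] matches any Unicode letter.
INNER_SYMBOL = re.compile(r"(?<=[^\W\d_])[^\w\s]+(?=[^\W\d_])")


def mask_inner_symbols(text, placeholder="x"):
    """Replace symbols embedded inside words with a placeholder character."""
    return INNER_SYMBOL.sub(placeholder, text)


text = 'posee respectívámente en las localí&ades de Carlos M. Naón'
print(mask_inner_symbols(text))
# posee respectívámente en las localíxades de Carlos M. Naón
```

The masked string can then be passed to FixFragment in place of the raw OCR text. The "." in "M. Naón" is left alone because it is followed by a space rather than a letter.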

axeligl · 2021-08-04T14:28:13Z

Thanks, I'll try that.
Another workaround I'm considering, though I don't know if it will work, is to retrain the model after adding the symbols to the alphabet txt file, so that it recognizes them when spellchecking.

bakwc · 2021-08-04T14:45:47Z

> Another workaround I'm considering, though I don't know if it will work, is to retrain the model after adding the symbols to the alphabet txt file, so that it recognizes them when spellchecking.

Yes, it should work even better. You can try on a small corpus first and let me know if it helps.
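
A rough sketch of the alphabet step, assuming the alphabet file is a single line of characters. `extend_alphabet` and all file names here are hypothetical. The commented-out training call assumes the JamSpell Python bindings and their `TrainLangModel(text_file, alphabet_file, model_file)` method; check it against the version you have installed.

```python
from pathlib import Path


def extend_alphabet(src, dst, extra_symbols):
    """Copy the alphabet from src to dst, appending any symbols it lacks."""
    alphabet = Path(src).read_text(encoding="utf-8").strip()
    for ch in extra_symbols:
        if ch not in alphabet:  # skip symbols already in the alphabet
            alphabet += ch
    Path(dst).write_text(alphabet + "\n", encoding="utf-8")
    return alphabet


# Hypothetical usage, starting with a small corpus as suggested:
# extend_alphabet("alphabet_es.txt", "alphabet_es_ocr.txt", "&")
# import jamspell
# corrector = jamspell.TSpellCorrector()
# corrector.TrainLangModel("small_corpus_es.txt", "alphabet_es_ocr.txt", "model_es_ocr.bin")
```

For the retrained model to learn anything about the new symbols, the training corpus presumably needs to contain them as well.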
