GetCandidates shows candidates but FixFragment doesn't correct the sentence #113

Open
axeligl opened this issue Aug 3, 2021 · 3 comments


axeligl commented Aug 3, 2021

Hi everybody,

I trained a model for Spanish spellchecking and I'm using it to correct OCR output (I'm digitizing some very old typewritten documents and want to improve the resulting text). The problem is that the model shows some candidates when I use GetCandidates, but FixFragment doesn't apply them. I wonder if it has something to do with context (n-grams and so on) or perhaps with the symbols in the sentence.

Here is an example:
text = 'posee respectívámente en las localí&ades de Carlos M. Naón'
corrector.GetCandidates(['localí&ades'],0) -> ('localidades', 'localí&ades', 'localicades', 'localizades')
corrector.FixFragment(text) -> 'posee respectívamente en las local&ades de Carlos M. Naón'

It corrects "respectívámente", but it only erases the 'í' in "localí&ades". Maybe it's about the tokens it uses to decide where a word starts and ends.

Another example without special characters:
text='con anterioridad a la sanción de la ley'
corrector.FixFragment(text) -> 'con anterioridad la la canción de la ley'

It replaces "sanción" with "canción", but according to GetCandidates, "sanción" is okay:
corrector.GetCandidates(['sanción'],0) -> ('sanción', 'canción', 'sanación', 'sanchón', 'sención', 'anción', 'sancion', 'sanión', 'kanción', 'sunción', 'sandión', 'sancián', 'sansión', 'sanció')

I think this issue is similar to #85

Thanks for the help!

Owner

bakwc commented Aug 3, 2021

Currently it doesn't expect special tokens inside words. You can try replacing all tokens inside words with some character; in that case it should start correcting them. I will think about how to handle this case better.
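
The replacement suggested above could be sketched as a small preprocessing step that runs before FixFragment. This is only an illustration, not part of the library: the function name `mask_inword_symbols`, the regular expression, and the placeholder letter `x` are all assumptions, and the rule "any symbol directly between two letters is OCR noise" will also hit legitimate hyphens or apostrophes inside words.

```python
import re
import unicodedata

# Assumption: a non-letter, non-digit, non-space character that sits directly
# between two letters is OCR noise. [^\W\d_] matches any Unicode letter.
_INWORD_SYMBOL = re.compile(r"(?<=[^\W\d_])[^\w\s](?=[^\W\d_])")


def mask_inword_symbols(text: str, placeholder: str = "x") -> str:
    """Replace symbols found inside words with a placeholder letter.

    The text is normalized to NFC first so that accented letters are single
    code points and combining accents are not mistaken for symbols.
    """
    normalized = unicodedata.normalize("NFC", text)
    return _INWORD_SYMBOL.sub(placeholder, normalized)


if __name__ == "__main__":
    sample = "posee respectívámente en las localí&ades de Carlos M. Naón"
    print(mask_inword_symbols(sample))
    # The masked string would then be passed to the corrector from the issue:
    # corrector.FixFragment(mask_inword_symbols(sample))
```

The period in "M." is left alone because it is followed by a space rather than a letter. Whether the corrector then picks "localidades" still depends on the trained model.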

Author

axeligl commented Aug 4, 2021

Thanks, I'll try that.
Another workaround I'm considering, though I don't know if it will work, is to retrain the model after adding the symbols to the alphabet txt so that it recognizes them during spellchecking.

Owner

bakwc commented Aug 4, 2021

> Another workaround I'm considering, though I don't know if it will work, is to retrain the model after adding the symbols to the alphabet txt so that it recognizes them during spellchecking.

Yes, it should work even better. You can try on a small corpus first and let me know if it helps.
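
Preparing the extended alphabet for that experiment might look like the sketch below. It assumes the alphabet txt is a plain-text list of allowed characters; the helper name `extend_alphabet` and the example symbols are hypothetical, and the retraining step itself is not shown because it depends on how the model was originally trained.

```python
def extend_alphabet(alphabet: str, extra_symbols: str) -> str:
    """Return the alphabet with new symbols appended once, in order.

    Assumption: the alphabet is a flat string of allowed characters.
    Whitespace and characters already present are skipped.
    """
    seen = set(alphabet)
    result = alphabet
    for ch in extra_symbols:
        if ch.isspace() or ch in seen:
            continue
        result += ch
        seen.add(ch)
    return result


if __name__ == "__main__":
    # Hypothetical Spanish alphabet plus two symbols seen in the OCR output.
    spanish = "abcdefghijklmnñopqrstuvwxyzáéíóúü"
    print(extend_alphabet(spanish, "&#"))
```

The result would be written back to the alphabet file before retraining on a small corpus, as suggested above.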
