This repository has been archived by the owner on Jul 10, 2022. It is now read-only.

Document / attachment language guessing #898

Open
mrusme opened this issue Feb 7, 2018 · 0 comments

mrusme commented Feb 7, 2018

Right now, a "languages" feature is implemented that lets users define which languages the documents they usually upload are written in. Under "Settings", each Paperwork user can select the languages they'd like Paperwork to support for their account.

This was implemented so that tesseract can be called with the corresponding language option, which helps OCR.
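To illustrate the language option mentioned above: the tesseract CLI accepts `-l` with several language codes joined by `+`. The following is only a minimal sketch, not Paperwork's actual code; the function name `build_tesseract_command` and the file paths are made up for the example.

```python
from typing import List


def build_tesseract_command(image_path: str, output_base: str,
                            languages: List[str]) -> List[str]:
    """Build a tesseract CLI call for one image.

    `languages` holds tesseract language codes, e.g. "eng" or "fra".
    The function name and arguments are hypothetical, for illustration only.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    # tesseract takes multiple languages as a single "+"-joined value.
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


if __name__ == "__main__":
    # A user with "English" and "French" selected in their settings.
    print(build_tesseract_command("letter.png", "letter", ["eng", "fra"]))
```

The resulting list could be passed to `subprocess.run`, assuming tesseract and the matching language data files are installed.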

Now, this is not the best way to solve this issue. If a user has "English" and "French" selected but one day uploads a document/image containing Greek words, these won't get parsed correctly by tesseract, because tesseract ran with the English and French dictionaries but not the Greek one.

Therefore, the user would need to enable "Greek" in their settings before uploading the document, which results in tesseract running on ALL their documents with the Greek dictionary as well (because we don't know which document is written in which language, right?).

The quick fix here would be to ask the user, on each document upload, what language the image they're uploading is written in. So, if they photograph a letter from their French cousin and would like to upload it, they first need to pick "French" in the document uploader. I personally don't like this solution.

Therefore, I'm wondering if anybody knows how language recognition or guessing could be done using tesseract or any other solution?
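One possible direction, sketched here under assumptions rather than as a tested answer: tesseract has an orientation and script detection mode (`--psm 0`, which needs the `osd` data file) that reports the dominant script of an image. That could be run first, and the language list for the real OCR pass chosen from the result. The script-to-language table, the confidence threshold, and all function names below are hypothetical, and the parser assumes the OSD output contains `Script:` and `Script confidence:` lines. Note that this only tells scripts apart (Latin vs. Greek), not languages that share a script (English vs. French).

```python
import re
from typing import List, Optional, Tuple

# Hypothetical mapping from detected script to tesseract language codes.
SCRIPT_TO_LANGS = {
    "Latin": ["eng", "fra"],
    "Greek": ["ell"],
    "Cyrillic": ["rus"],
}


def osd_command(image_path: str) -> List[str]:
    """CLI call for script detection only (requires osd.traineddata)."""
    return ["tesseract", image_path, "stdout", "--psm", "0"]


def parse_osd_script(osd_output: str) -> Tuple[Optional[str], float]:
    """Extract (script, confidence) from OSD text; (None, 0.0) if absent."""
    script = re.search(r"^Script:\s*(\S+)", osd_output, re.MULTILINE)
    conf = re.search(r"^Script confidence:\s*([\d.]+)", osd_output, re.MULTILINE)
    if not script:
        return None, 0.0
    return script.group(1), float(conf.group(1)) if conf else 0.0


def languages_for_document(osd_output: str, user_langs: List[str],
                           min_confidence: float = 1.0) -> List[str]:
    """Pick language codes for the OCR pass of a single document."""
    script, confidence = parse_osd_script(osd_output)
    # Fall back to the user's configured languages when detection is weak.
    if script is None or confidence < min_confidence:
        return list(user_langs)
    candidates = SCRIPT_TO_LANGS.get(script, [])
    # Prefer the user's languages that fit the script, else the script's own.
    matching = [lang for lang in user_langs if lang in candidates]
    return matching or candidates or list(user_langs)
```

With this, a Greek document uploaded by a user who only selected English and French would be OCR'd with `ell` for that one document, without re-running OCR on everything else.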

@mrusme mrusme added this to the Paperwork 2.0 milestone Feb 7, 2018
@mrusme mrusme removed this from the Paperwork 2.0 milestone May 3, 2019