no OCR after conversion (wrongly OCR'ed djvus?) #20

Open
maras opened this issue Oct 28, 2023 · 2 comments

Comments


maras commented Oct 28, 2023

This issue continues that part of #16 about OCR, but with other files.

Two files. The first is the Kornai file. I can copy text correctly from the DjVu file in DjVu4, but not in Okular, where I also can't see the blue text boxes. Evince lets me see the boxes and copy the text (even correctly), but the boxes are drawn very strangely, with the wrong orientation and placement, as you can see here (I was copying the first paragraph):
2023-10-28-185001
No OCR after conversion. Something is wrong with the DjVu file; I doubt this can be solved without re-OCR.

The file 2.djvu has a correct OCR layer (with many mistakes, but that shouldn't matter, I think) that is visible in Okular and other viewers, and I can copy text from them correctly. Yet there is no OCR after conversion. This case is stranger, because the DjVu file seems normal.

v-- added a commit that referenced this issue Oct 29, 2023
Collaborator

v-- commented Oct 29, 2023

The two files have a common issue: an unrecognized type of text annotation. I added support for this new "region" list expression type, and both files now have text annotations.

I originally put the corresponding warnings as debug messages (only viewable via --verbose), but now that I think of it, they are better off as warnings. So text processing should now print more obvious error messages for unrecognized list expressions.
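For illustration only, here is a minimal sketch of the idea: walking a text annotation, accepting "region" alongside the other DjVu zone types, and warning on anything unrecognized. This is not dpsprep's actual code. The nested-list layout `[type, x0, y0, x1, y1, *children]` and the name `collect_text` are assumptions made for the example.

```python
import warnings

# Zone types of the DjVu hidden text layer, from outermost to innermost.
KNOWN_ZONES = {"page", "column", "region", "para", "line", "word", "char"}


def collect_text(expr, out=None):
    """Collect (text, bbox) pairs from a nested-list text annotation.

    Assumed layout (hypothetical): [type, x0, y0, x1, y1, *children],
    where each child is either a nested zone or a string.
    """
    if out is None:
        out = []
    zone_type, *rest = expr
    if zone_type not in KNOWN_ZONES:
        # Emit a visible warning instead of silently dropping the zone.
        warnings.warn(f"Unrecognized list expression type: {zone_type!r}")
        return out
    bbox, children = tuple(rest[:4]), rest[4:]
    for child in children:
        if isinstance(child, str):
            out.append((child, bbox))
        else:
            collect_text(child, out)
    return out


page = ["page", 0, 0, 100, 200,
        ["region", 0, 0, 100, 200,
         ["line", 10, 180, 90, 190,
          ["word", 10, 180, 40, 190, "Hello"],
          ["word", 50, 180, 90, 190, "world"]]]]
words = collect_text(page)
```

If "region" were missing from the set of known types, the whole subtree under it would be skipped, which would match the symptom of an empty text layer after conversion.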

One thing I've noticed regarding the second file ("Психология наровод и наций") is that it is very large (between 400 and 600MiB depending on the optimizations). This may cause performance issues with PDF viewers.

The first file's ("The Socialist System") translated text layer is bonkers:

Screenshot_20231029_105404

The DjVu looks a little better, but is still weird:

Screenshot_20231029_111141

As discussed previously, I don't know how to assign text to specified boxes in PDF files, which is what needs to be done here.

Author

maras commented Oct 31, 2023

Yes, the first file (Kornai) has a rubbish text layer after the dpsprep update. It might be that it was OCR'ed in a vertical orientation and the page was later rotated to horizontal, while the coordinates of the text layer somehow stayed the same. That is how it seems to me while looking at the DjVu text layer: it is obviously vertical. I have seen several such files. I re-OCR'ed this one.
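To make the rotation hypothesis above concrete, here is a small sketch of how a text box moves when a page is rotated by 90 degrees, plus a crude check for a "vertical" text layer. The rotation direction (counter-clockwise), the bottom-left origin, and the function names are assumptions for illustration; the actual files may have been rotated the other way.

```python
def rotate_box_90_ccw(box, page_height):
    """Rotate a box 90 degrees counter-clockwise together with its page.

    box is (x0, y0, x1, y1) with a bottom-left origin. A point (x, y)
    on a page of height H maps to (H - y, x) on the rotated page.
    """
    x0, y0, x1, y1 = box
    return (page_height - y1, x0, page_height - y0, x1)


def looks_vertical(line_boxes):
    """Heuristic: the layer is 'vertical' if most line boxes are taller than wide."""
    tall = sum(1 for x0, y0, x1, y1 in line_boxes if (y1 - y0) > (x1 - x0))
    return tall > len(line_boxes) / 2


# A horizontal line near the top of a 100 x 200 page.
line = (10, 180, 90, 190)
rotated = rotate_box_90_ccw(line, page_height=200)
```

If the page image is rotated but the boxes are not transformed this way (or the reverse), the line boxes end up tall and narrow on a page whose text runs horizontally, which is roughly what the viewers show.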

The situation is better with the second file (Психология): I can copy text from the converted PDF just as from the DjVu. So this update improved the conversion :). Thank you.

Some files do become very big after conversion, that's true, but quite rarely. DjVu's compression algorithms are still better than PDF's. My converted second file is 308 MB. I ordinarily use the options --quality=50 -O3. Maybe that is too harsh, but they work quite well even with pictures in books. I even tested --quality=10 -O3; it worked nicely with text but was too harsh for pictures. With those options this file is 128 MB. That is quite a good result, and I see no loss in text quality.
