no OCR after conversion (wrongly OCR'ed djvus?) #20

maras · 2023-10-28T16:21:17Z

This issue continues the OCR part of #16, but with different files.

There are two files. The first is Kornai. I can copy text correctly from the DjVu file in DjVu4, but not in Okular, where I can't see the blue text boxes. Evince lets me see the boxes and copy text (even correct text), but the result looks very strange: the selection has the wrong orientation and placement (I was copying the first paragraph).

There is no OCR text after conversion. Something is wrong with the DjVu file, and I doubt this can be solved without re-OCR.

The file 2.djvu has a correct OCR layer (with many mistakes, but I think that shouldn't matter) that is visible in Okular and other viewers, and I can copy text from them correctly. Yet there is no OCR text after conversion. This case is stranger, because the DjVu file seems normal.


v-- · 2023-10-29T09:22:02Z

The two files have a common issue - an unrecognized type of text annotation. I added this new "region" list expression type and both files now have text annotations.
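For illustration, a DjVu text layer is a nested list expression whose head symbol names the zone type (`page`, `column`, `region`, `para`, `line`, `word` or `char`). The sketch below is not the dpsprep code: it models the expressions as plain Python lists, and the names `KNOWN_ZONES` and `collect_words` are hypothetical. It only shows how one unhandled zone type could make a whole subtree of text disappear.

```python
# Zone types defined for DjVu hidden text layers.
KNOWN_ZONES = {"page", "column", "region", "para", "line", "word", "char"}


def collect_words(expr, known=KNOWN_ZONES):
    """Return (text, bbox) pairs for every leaf zone in a text expression.

    Each expression is modelled as [zone, xmin, ymin, xmax, ymax, *children].
    A leaf has a single string child; an inner node has list children.
    Subtrees whose zone type is not in `known` are skipped entirely.
    """
    zone, xmin, ymin, xmax, ymax, *children = expr
    if zone not in known:
        return []
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], (xmin, ymin, xmax, ymax))]
    words = []
    for child in children:
        words.extend(collect_words(child, known))
    return words


# A made-up page whose only text sits inside a "region" zone.
page = ["page", 0, 0, 2550, 3300,
        ["region", 100, 2900, 2400, 3100,
         ["line", 100, 3000, 2400, 3100,
          ["word", 100, 3000, 400, 3100, "Hello"],
          ["word", 450, 3000, 800, 3100, "world"]]]]

# With "region" recognized, both words are found.
print(collect_words(page))
# Without it, the subtree is dropped and the page appears to have no text.
print(collect_words(page, known=KNOWN_ZONES - {"region"}))
```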

I originally put the corresponding warnings as debug messages (only viewable via --verbose), but now that I think of it, they are better off as warnings. So text processing should now print more obvious error messages for unrecognized list expressions.
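The difference between the two message levels can be sketched with Python's standard `logging` module. This is a generic illustration, not the dpsprep logging setup; the logger name `text_layer_demo` and the function `configure_logging` are made up for the example.

```python
import logging


def configure_logging(verbose: bool) -> logging.Logger:
    """Return a logger that shows debug output only when `verbose` is set."""
    logger = logging.getLogger("text_layer_demo")  # hypothetical logger name
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    return logger


logger = configure_logging(verbose=False)
# Hidden without the verbose flag, so the problem is easy to miss.
logger.debug("Unrecognized list expression type 'region'")
# Always printed, so the missing text layer has a visible explanation.
logger.warning("Unrecognized list expression type 'region'")
```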

One thing I've noticed regarding the second file ("Психология народов и наций") is that the converted PDF is very large (between 400 and 600 MiB depending on the optimizations). This may cause performance issues with PDF viewers.

The first file's ("The Socialist System") translated text layer is bonkers.

The DjVu looks a little better, but is still weird.

As discussed previously, I don't know how to assign text to specified boxes in PDF files, which is what needs to be done here.
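For reference, one general approach is to emit invisible text with raw PDF content-stream operators: `3 Tr` selects the invisible rendering mode, `Tm` positions the text, and `Tz` stretches it horizontally to the box width. The sketch below is not taken from dpsprep and makes several assumptions: the box is given in pixels with the origin at the bottom-left corner, the page resources map `/F1` to Courier (whose glyphs are all 0.6 em wide), and the text is plain ASCII. The function names are hypothetical.

```python
COURIER_GLYPH_WIDTH = 0.6  # every Courier glyph is 600/1000 of the font size


def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_word(text: str, box_px, dpi: float, font_name: str = "F1") -> str:
    """Build a content-stream fragment that fits `text` into `box_px`.

    `box_px` is (xmin, ymin, xmax, ymax) in pixels, origin at the bottom left.
    The baseline is placed at the bottom of the box, which is an approximation.
    """
    if not text:
        raise ValueError("text must not be empty")
    # Convert pixels to PDF points (1/72 inch).
    xmin, ymin, xmax, ymax = (v * 72.0 / dpi for v in box_px)
    font_size = ymax - ymin
    natural_width = COURIER_GLYPH_WIDTH * font_size * len(text)
    scale = 100.0 * (xmax - xmin) / natural_width  # Tz takes a percentage
    return (
        "BT "
        "3 Tr "                                   # invisible rendering mode
        f"/{font_name} {font_size:.2f} Tf "       # font and size
        f"{scale:.2f} Tz "                        # horizontal scaling
        f"1 0 0 1 {xmin:.2f} {ymin:.2f} Tm "      # move to the box corner
        f"({escape_pdf_string(text)}) Tj "
        "ET"
    )


print(invisible_word("Hello", (100, 3000, 400, 3100), dpi=300))
```

A rotated box would need a rotation in the `Tm` matrix instead of the identity used here, so this covers only upright text.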

maras · 2023-10-31T02:59:21Z

Yes, the first file (Kornai) has a rubbish text layer after the dpsprep update. It may have been OCR'ed while the pages were vertical and then rotated to horizontal, with the text layer coordinates somehow remaining the same. That is how it seems to me while looking at the DjVu text layer: it is obviously vertical. I have seen several such files. I re-OCR'ed it.
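To make that hypothesis concrete, here is a small sketch of what a 90-degree clockwise page rotation does to a text box. It is only an illustration of the geometry, not a diagnosis of the Kornai file; the function name and the sample numbers are made up, and coordinates are assumed to have their origin at the bottom-left corner.

```python
def rotate_box_clockwise(box, page_width):
    """Rotate a box 90 degrees clockwise together with its page.

    `box` is (xmin, ymin, xmax, ymax) with the origin at the bottom left,
    and `page_width` is the width of the page before rotation.
    A point (x, y) moves to (y, page_width - x).
    """
    xmin, ymin, xmax, ymax = box
    return (ymin, page_width - xmax, ymax, page_width - xmin)


# A wide word box on a 2550 x 3300 portrait page.
word_box = (100, 3000, 400, 3100)
rotated = rotate_box_clockwise(word_box, page_width=2550)
print(rotated)

# If the image is rotated but the text layer keeps `word_box`, the stored
# box no longer matches `rotated`, so the text ends up in the wrong place
# and with the wrong orientation.
```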

The situation is better with the second file (Психология): I can copy text from the converted PDF just as from the DjVu. So this update improved the conversion :). Thank you.

Some files do become very big after conversion, that's true, but quite rarely. DjVu's compression algorithms are still better than PDF's. My converted second file is 308 MB. I ordinarily use the options --quality=50 -O3. Maybe that is too harsh, but they work quite well even with pictures in books. I even tested --quality=10 -O3: it worked nicely with text but was too harsh for pictures. With those options the file is 128 MB, which is quite a good result, and I see no loss of text quality.

v-- added a commit that referenced this issue Oct 29, 2023

Handle "region" list expressions

3ce169f

Issue #20

v-- mentioned this issue Oct 29, 2023

Handle "region" list expressions #21

Merged
