Show progress during postprocessing #1313

Open
user1823 opened this issue May 16, 2024 · 5 comments
@user1823

user1823 commented May 16, 2024

For large files, postprocessing takes a lot of time. Showing some progress here would make the UX better.

The main motivation for this request is that ocrmypdf was stuck on this step (postprocessing) for about 30 min.

And now, it is stuck on this step:

@jbarlow83
Collaborator

That's when we ask Ghostscript to do PDF/A. Unfortunately, it doesn't give much feedback, so there's not much for me to work with. At least I'm not aware of any behavior I can monitor. It's also single-threaded. Color space conversion of large images can be quite expensive in Ghostscript and is often responsible for long delays.

@user1823
Author

> That's when we ask Ghostscript to do PDF/A.

But in the above case I used --output-type pdf, so there would be no PDF/A conversion.

I guess that most of the time was spent doing the equivalent of the following (obtained by running with -v1 on a different file):

Postprocessing...                                                                                             ocr.py:145
Running: ['C:\\Program Files\\Tesseract-OCR\\tesseract.EXE', '--version']                                __init__.py:133
xref 13: treating as an optimization candidate                                                           optimize.py:279
xref 12: treating as an optimization candidate                                                           optimize.py:279
XrefExt(xref=12, ext='.png')                                                                             optimize.py:344
XrefExt(xref=13, ext='.png')                                                                             optimize.py:344
Optimizable images: JPEGs: 0 PNGs: 2                                                                     optimize.py:349
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--

> Unfortunately, it doesn't give much feedback, so there's not much for me to work with. At least I'm not aware of any behavior I can monitor.

If I run this:

gswin64c.exe -sDEVICE=pdfwrite -dBATCH -dNOPAUSE -sOutputFile=out.pdf test.pdf

I get:

Processing pages 1 through 2.
Page 1
Page 2

So, you can probably monitor the number of pages processed, which you can use to show the progress.
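
The page-counting idea can be sketched in Python as follows. This is only an illustration that assumes Ghostscript's console output has exactly the shape quoted above ("Processing pages A through B." followed by "Page N" lines); `gs_page_progress` is a hypothetical helper, not part of OCRmyPDF or Ghostscript, and the output format may differ between Ghostscript versions and options.

```python
import re
from typing import Iterable, Iterator, Tuple

# Patterns for the Ghostscript console lines quoted in this thread.
# Assumption: other Ghostscript versions or flags may print something different.
_RANGE_RE = re.compile(r"^Processing pages (\d+) through (\d+)\.")
_PAGE_RE = re.compile(r"^Page (\d+)\s*$")


def gs_page_progress(lines: Iterable[str]) -> Iterator[Tuple[int, int]]:
    """Yield (pages_reported, pages_total) each time a 'Page N' line appears.

    Hypothetical helper: 'lines' is any iterable of text lines, for example
    the stdout of a Ghostscript subprocess.
    """
    first = last = None
    for line in lines:
        line = line.strip()
        match = _RANGE_RE.match(line)
        if match:
            first, last = int(match.group(1)), int(match.group(2))
            continue
        match = _PAGE_RE.match(line)
        # Ignore page lines until the page range is known.
        if match and first is not None:
            page = int(match.group(1))
            yield page - first + 1, last - first + 1


# Demo using the exact output quoted above.
sample = ["Processing pages 1 through 2.", "Page 1", "Page 2"]
for done, total in gs_page_progress(sample):
    print(f"{done}/{total}")  # prints 1/2, then 2/2
```

In a real run the lines would come from the process itself, for example by iterating over the `stdout` of a `subprocess.Popen(..., stdout=subprocess.PIPE, text=True)` object, and each yielded pair could drive a progress bar.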

jbarlow83 added a commit that referenced this issue May 19, 2024
Might take time for big files. Pdf.open() potentially is expensive as well, but QPDF doesn't give us progress feedback for that.

Closes #1313

@user1823
Author

I am now using v16.3.0 and it seems that the changes made in 950c700 or 9a3c5a3 have resulted in a bug.

The progress bar in "OCR" says 1182 out of 591.

Also, the following step takes too much time:

What is ocrmypdf doing at this stage? Can we have a progress indicator for this too?

@jbarlow83
Collaborator

Thanks for the "OCR" progress bar issue report. Fixed.

After "Total file size..." nothing is happening except copying the finished file from temporary storage to its final output location. Unless you're dealing with very large PDFs (GBs), this suggests network issues or file system contention. How long is "too much time?"
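
If the final copy does turn out to be the slow part, one way to make it observable is to copy in chunks and report the bytes written so far. The following is a minimal sketch under that assumption, not OCRmyPDF's actual code; `copy_with_progress`, `on_progress`, and the 1 MiB default chunk size are all hypothetical choices for illustration.

```python
import os
from typing import Callable


def copy_with_progress(
    src: str,
    dst: str,
    on_progress: Callable[[int, int], None],
    chunk_size: int = 1024 * 1024,
) -> int:
    """Copy src to dst in chunks, calling on_progress(bytes_copied, bytes_total).

    Hypothetical helper for illustration. Returns the number of bytes copied.
    """
    total = os.path.getsize(src)
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break  # end of file
            fout.write(chunk)
            copied += len(chunk)
            on_progress(copied, total)
    return copied
```

The callback could update a progress bar or simply log a percentage. Unlike `shutil.copyfile`, which offers no progress hook, this trades some speed for visibility.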

@user1823
Author

> except copying the finished file from temporary storage to its final output location.

Probably also cleaning up all the generated temp files (e.g., the images).

When ocrmypdf is at this step, I can see the output file in the target directory (with the correct filesize, which means that it is likely not just a placeholder). So, I think that cleaning the temp files is actually what is taking the time.

> How long is "too much time?"

Maybe 2-3 minutes. That is not much compared to the total time taken, but it feels like a lot when you don't know what is happening or how long it is going to last. So, adding a progress indicator here too would be nice.
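
If temp-file cleanup really is what takes those minutes (a guess in this thread, not something confirmed), progress could be reported by deleting files one at a time and counting them. A rough sketch, assuming a plain directory tree of temporary files; `remove_tree_with_progress` and `on_progress` are hypothetical names, not OCRmyPDF functions.

```python
import os
from pathlib import Path
from typing import Callable


def remove_tree_with_progress(root, on_progress: Callable[[int, int], None]) -> int:
    """Delete every file under root, calling on_progress(files_done, files_total).

    Hypothetical helper for illustration. Returns the number of files removed.
    """
    root = Path(root)
    # Collect files (and symlinks) first so the total is known up front.
    files = [p for p in root.rglob("*") if p.is_file() or p.is_symlink()]
    total = len(files)
    for done, path in enumerate(files, start=1):
        path.unlink()
        on_progress(done, total)
    # Remove the now-empty directories, deepest first, ending with root itself.
    for dirpath, _dirnames, _filenames in os.walk(root, topdown=False):
        os.rmdir(dirpath)
    return total
```

Counting files first costs one extra directory scan, but it gives the user a known total instead of an open-ended wait.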
