Multi-page PDF creation

We provide a feature that builds a multi-page PDF, with one page for each scanned page image. It is triggered ‘on demand’ when a user requests one: the PDF is created in an ActiveJob background worker, then cached.

The current implementation at https://github.com/sciencehistory/scihist_digicoll/blob/master/app/services/work_pdf_creator.rb uses the ruby prawn library to construct the PDF.

RAM use grows with the number of images and the size of the resulting PDF: the bigger the PDF, the more RAM. It grows enough that we run out of RAM on our intended heroku setup.

So we want to explore other ways to create such a PDF, that either use constant RAM regardless of size (ideally), or just use a lot less RAM.

The current implementation does some clever things with normalizing resolution/page size, so there is potentially a bit more to this than jamming JPGs into a PDF. We also want to avoid any image quality loss due to double lossy compression or anything like that. If we find a solution that looks good as far as memory, we should definitely pay attention to the details of how the PDF is constructed, and at least do a human-eye comparison to make sure it looks the same/as good.

Some discussion of various ways to do this:

Current Reference:

A heroku standard-2x dyno (1024MB RAM limit) running 3 resque workers. Two of them are idle, while one of them constructs the PDF for us. We look at heroku log-runtime-metrics log lines as we construct a PDF, to see the maximum value reached – and if a heroku R14 memory quota exceeded or R15 - Memory quota vastly exceeded error is generated.

50-image work qf85nc451, 440MB max
100-image work fx719n43f, 549MB max
325-image work 1831ck38c, 912MB max logged, and R14 logged – this is around the limit we can do. But we want to do bigger! (And even this could be a problem if the other two workers were busy with expensive stuff!)
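The log-watching step above could be scripted rather than done by eye. This is a minimal sketch, not something we currently run: it assumes the usual log-runtime-metrics line shape (sample#memory_total=NNN.NNMB) and that memory errors appear as "Error R14" / "Error R15" in the log. The method name summarize_memory is made up for illustration.

```ruby
# Hypothetical helper: scan heroku log lines for the peak memory_total sample
# and for any R14/R15 memory errors. The line formats are assumptions based on
# typical log-runtime-metrics output.
MEMORY_SAMPLE = /sample#memory_total=(\d+(?:\.\d+)?)MB/
MEMORY_ERROR  = /Error (R14|R15)\b/

def summarize_memory(log_lines)
  peak_mb = 0.0
  errors = []

  log_lines.each do |line|
    if (sample = MEMORY_SAMPLE.match(line))
      peak_mb = [peak_mb, sample[1].to_f].max
    end
    if (error = MEMORY_ERROR.match(line))
      errors << error[1]
    end
  end

  { peak_mb: peak_mb, errors: errors.uniq }
end
```

It could be fed the saved output of heroku logs for the worker dyno, one line per array element (for example File.readlines on a saved log file).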

Other possible implementations?

I’m not sure of any good way to test this except actually writing an implementation that works. I don’t know a great way to measure RAM use apples-to-apples with just a trial command-line invocation or something like that.

Any alternative implementation needs to work on our existing infrastructure too if merged. If that requires installing additional software on existing infrastructure, then we need to do that via ansible. If this gets infeasible, we can look at ways to avoid it and just target heroku, but reducing RAM usage in existing infrastructure, if we end up staying there, would be of value too!

Same but with smaller images

The current implementation uses the existing download_large size derivatives. I thought it was good to include high-res images suitable for printing at high quality, but this leads to pretty large PDF sizes, and contributes to high RAM use. Try with download_medium and see if that alone lets us do very large PDFs without worrying about it? Probably good enough.

Tried it: it uses significantly less RAM, but our biggest works still use too much, so it doesn’t get us all the way there, unless we’re going to limit PDF generation to a 500-page max or something. (Smaller images may be better for users anyway, so we may do this anyway.)

50-image work, qf85nc451. Originally 99MB PDF using 440MB RAM. Smaller images, 20MB PDF using 284MB RAM.
100-image work, fx719n43f. Originally 171MB PDF using 549MB RAM. Smaller images, 35MB PDF using 328MB RAM.
325-image work, 1831ck38c. Originally 386MB PDF using 912MB RAM with out of memory errors. Smaller images, 92MB PDF, 472MB RAM.
Ramelli, 694 items. Originally 1.8GB PDF(!); did not measure RAM, far too much for heroku. Smaller images, 325MB PDF, RAM usage 987MB, with heroku out of memory errors. So this is around the limit for what we can still fit on heroku (and we may have a few larger ones too).

Ruby hexapdf instead of prawn

Another ruby PDF library, that is newer than prawn. Maybe it’s more efficient? Might be hard to figure out how to use it right.

Has some weird licensing: it is AGPL, but it says that if your app uses hexapdf, your app itself must be licensed AGPL. I don’t think that is actually what the AGPL does, but it makes me kind of nervous.

Not sure how likely it is to have better RAM usage for our use case. May not be the most promising option.

https://github.com/gettalong/hexapdf

imagemagick convert

Shell out to command line.

Some sources on the internet suggest this may end up lossily double-compressing your images, losing quality. That would be bad. Some suggest this isn’t an issue if you use PNG rather than JPG source, but we already have JPG derivatives as source and don’t really want to re-compute them all as PNG. E.g.:

In all of the proposed solutions involving ImageMagick, the JPEG data gets fully decoded and re-encoded. This results in generation loss, as well as performance "ten to hundred" times worse than img2pdf.

That may make it not a great candidate.

Imagemagick does have an argument to specify max allowed RAM usage, which is nice for our needs here. (If IM can’t do what it needs in the limits you specified, I think it errors). http://www.imagemagick.org/script/command-line-options.php#limit
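A minimal sketch of what the shell-out with resource limits might look like. The -limit option and its memory and map resource types are documented at the URL above. The method name, the default limit values, and the choice of the convert binary (newer ImageMagick installs may use magick instead) are assumptions for illustration. As far as I understand it, ImageMagick may fall back to a disk-backed pixel cache when the memory limits are reached rather than erroring immediately, so this should be verified before relying on it. It also does nothing about the re-encoding concern above.

```ruby
# Hypothetical helper: build the argv for an ImageMagick invocation that caps
# its own RAM use. The -limit settings come before the input files so they are
# in effect when the images are read.
def imagemagick_pdf_command(image_paths, output_path,
                            binary: "convert",
                            memory_limit: "256MiB",
                            map_limit: "512MiB")
  [
    binary,
    "-limit", "memory", memory_limit,
    "-limit", "map", map_limit,
    *image_paths,
    output_path
  ]
end
```

The resulting array can be passed to system(*argv) so that no shell quoting of file names is needed.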

graphicsmagick

Pretty much a clone of imagemagick, probably with same drawbacks, but sometimes has better performance or some bugs fixed. It may already be installed on existing infrastructure.

img2pdf

I guess this is actually a python module? Not sure of the best way to get it installed with its dependencies (like python); it may be trouble. In ubuntu it may be available as just apt-get install img2pdf though. We would end up calling it via shell-out to the command line.

On MacOS, if you already have python3 installed (I seemed to), pip3 install img2pdf seemed to do it.

https://pypi.org/project/img2pdf/
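A minimal sketch of the shell-out, assuming img2pdf is on the PATH and using its documented command-line form (input images followed by -o and the output file). The method names are made up for illustration, and this ignores the resolution/page-size normalization the current implementation does.

```ruby
require "open3"

# Hypothetical helper: argv for combining already-downloaded JPGs into one PDF.
def img2pdf_command(image_paths, output_path)
  ["img2pdf", *image_paths, "-o", output_path]
end

# Run a command without going through a shell; raise with the captured
# stdout/stderr if it exits non-zero.
def run_command(argv)
  output, status = Open3.capture2e(*argv)
  raise "#{argv.first} exited with #{status.exitstatus}: #{output}" unless status.success?

  output
end
```

Usage would be something like run_command(img2pdf_command(paths, "work.pdf")), with RAM use then measured the same way as for the prawn implementation.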

pdfjam

This is actually built on LaTeX I think, but is considered a pretty mature and robust implementation for putting images together into a PDF. Would be shell-out to command line.

It may have an apt package that installs all needed dependencies including LaTeX?

For MacOS, I think installing the brew mactex package (or maybe just mactex-no-gui) might do it?

Might be hard to get installed.

I think it has a pretty decent chance of doing what we need well and with low RAM though, may be worth investigating.

https://github.com/rrthomas/pdfjam

Progress? Merge PDFs?

One problem with those command-line ones is it makes it hard to do a progress bar like we’re doing now, if it requires downloading all the thumbs in advance, then in one command line (with no progress reported) making a PDF.

Is there a way to invoke them to “add one more image on end of PDF”, building it up one image at a time? Then we don’t need to have them all downloaded at once, and can report progress.

Or, should/could we use (any) tool to make a bunch of 1-page PDFs, then some other (command-line?) tool to “combine all these 1-page PDFs into one PDF”, which might be a fast and cheap operation?
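A rough sketch of the "one-page PDFs, then merge" idea, to make the shape concrete. It assumes img2pdf for the per-page step and pdftk's cat operation for the merge. Both tool choices, the method name, and the runner parameter are assumptions for illustration. Whether the merge step is actually cheap in RAM for hundreds of pages is untested.

```ruby
require "open3"

# Default runner: executes argv without a shell and raises on non-zero exit.
SHELL_RUNNER = lambda do |argv|
  output, status = Open3.capture2e(*argv)
  raise "#{argv.first} failed: #{output}" unless status.success?
end

# Hypothetical: make one single-page PDF per image, yielding progress after
# each page, then merge all the pages into output_path in one final step.
def build_pdf_incrementally(image_paths, output_path, work_dir:, runner: SHELL_RUNNER)
  page_pdfs = image_paths.each_with_index.map do |image_path, index|
    page_pdf = File.join(work_dir, format("page-%05d.pdf", index + 1))
    runner.call(["img2pdf", image_path, "-o", page_pdf])
    yield(index + 1, image_paths.length) if block_given?
    page_pdf
  end

  runner.call(["pdftk", *page_pdfs, "cat", "output", output_path])
  output_path
end
```

The block is where a progress update would go. In a real version each image could be downloaded just before its page is made and deleted right after, so they never all need to be on disk at once.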

pdftk

https://www.pdflabs.com/tools/pdftk-server/

(Hmm, I don’t think it can add an image to a PDF, although it can merge PDFs and edit PDF metadata.)

combine_pdf

Yet another ruby PDF library. It is one thing I found that lets us edit metadata (i.e. the Info Dictionary) on an existing PDF. It could maybe also do other useful stuff for us.

https://github.com/boazsegev/combine_pdf

Nope: I just tried using it to edit metadata on a very large PDF, and it used a ton of RAM.

Uncaching on-demand derivatives

If you’re trying different PDF generation techniques, you want to get the app to create PDFs with the new one, but the built-in caching of already-created PDFs will interfere with this.

Here’s one way to force an uncache on heroku; replace the friendlier_id in the command with that of the desired work:

heroku run bundle exec rails runner "OnDemandDerivative.where(work_id: Work.find_by_friendlier_id('qf85nc451').id).destroy_all"