
We provide a feature where we make a PDF with multiple pages, one page for each scanned page image. This is triggered ‘on demand’ when a user requests one; the PDF is created in an ActiveJob background worker, then cached.

The current implementation at https://github.com/sciencehistory/scihist_digicoll/blob/master/app/services/work_pdf_creator.rb uses the ruby prawn library to construct the PDF.

Its RAM use grows with the number of images and the size of the resulting PDF: the bigger the PDF, the more RAM. And it grows to the point that we run out of RAM on our intended heroku setup.

So we want to explore other ways to create such a PDF that either use constant RAM regardless of size (ideally), or at least use a lot less RAM.

The current implementation does some clever things with normalizing resolution/page size, there is potentially a bit more to this than jamming JPGs into a PDF. We also want to try to avoid any image quality loss due to double lossy compression or anything like that. If we have a solution that looks good as far as memory, we should definitely pay attention to details of how it’s constructed, and at least do a human-eye comparison to make sure it looks the same/as good.

Some discussion of various ways to do this:

Current Reference:

A heroku standard-2x dyno (1024MB RAM limit) running 3 resque workers. Two of them are idle, while one of them constructs the PDF for us. We look at heroku log-runtime-metrics log lines as we construct a PDF, to see the maximum value reached – and whether a heroku “R14 - Memory quota exceeded” or “R15 - Memory quota vastly exceeded” error is generated.

  • 50-image work qf85nc451, 440MB max

  • 100-image work fx719n43f, 549MB max

  • 325-image work 1831ck38c, 912MB max logged, and R14 logged – this is around the limit we can do. But we want to do bigger! (And even this could be a problem if the other two workers were busy with expensive stuff!)

Other possible implementations?

I’m not sure of any good way to test this except actually writing a working implementation – I don’t know a great way to measure RAM use apples-to-apples with just a trial command-line invocation or something like that.

Any alternative implementation needs to work on our existing infrastructure too if merged – if that requires installing additional software on existing infrastructure, we need to do that via ansible. If that gets infeasible, we can look at ways to avoid it and just target heroku – but reducing RAM usage on existing infrastructure, if we end up staying there, would be of value too!

Same but with smaller images

The current implementation uses the existing download_large size derivatives. I thought it was good to include high-res images suitable for high-quality printing, but this leads to pretty large PDF sizes – and contributes to large RAM use. Try with download_medium and see if that alone lets us do very large PDFs without worrying about it? Probably good enough.

Ruby hexapdf instead of prawn

Another ruby PDF library, newer than prawn. Maybe it’s more efficient? It might be hard to figure out how to use it right.

It has some weird licensing: AGPL, and its docs say that if your app uses hexapdf, your app itself must be licensed AGPL – which I don’t think is actually what the AGPL requires, but it makes me kind of nervous.

Not sure how likely it is to have better RAM usage for our use case. May not be the most promising option.

https://github.com/gettalong/hexapdf

imagemagick convert

Shell out to command line.

Some internet discussion suggests this may end up lossily double-compressing your images, losing quality. That would be bad. Some suggest this isn’t an issue if you use PNG rather than JPG sources – but we already have JPG derivatives as sources, and don’t really want to re-compute them all as PNG. E.g.:

In all of the proposed solutions involving ImageMagick, the JPEG data gets fully decoded and re-encoded. This results in generation loss, as well as performance "ten to hundred" times worse than img2pdf.

That may make it not a great candidate.

Imagemagick does have an argument to specify max allowed RAM usage, which is nice for our needs here. (If IM needs more than the memory limit you specify, I believe it falls back to slower memory-mapped or disk caching rather than growing the process.) http://www.imagemagick.org/script/command-line-options.php#limit
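A sketch of how the shell-out could be built from the worker, with `-limit` flags capping the pixel cache. The helper name, file names, and limit values are all hypothetical; `convert` is the ImageMagick 6 binary name (ImageMagick 7 renames it `magick`).

```ruby
# Hypothetical helper: build an ImageMagick command that assembles page JPGs
# into a single PDF, with -limit capping the pixel cache so RAM stays bounded.
def magick_pdf_command(jpg_paths, out_path, memory: '256MiB', map: '512MiB')
  ['convert',
   '-limit', 'memory', memory,  # in-RAM pixel cache cap
   '-limit', 'map', map,        # memory-mapped cache cap before plain disk
   *jpg_paths,
   out_path]
end

cmd = magick_pdf_command(['page1.jpg', 'page2.jpg'], 'work.pdf')
# Multi-argument system() avoids shell quoting/injection issues:
# system(*cmd) or raise 'convert failed'
```

Note this does nothing about the double-compression concern above; it only addresses the RAM side.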

graphicsmagick

Pretty much a clone of imagemagick, probably with the same drawbacks, but it sometimes has better performance or some bugs fixed. It may already be installed on our existing infrastructure.

img2pdf

I guess this is actually a python module? Not sure of the best way to get it installed with its dependencies (like python itself), which may be trouble – on ubuntu it may be available with just apt-get install img2pdf though. We would end up calling it via shell-out to the command line.

On MacOS, if you already have python3 installed (I seemed to), pip3 install img2pdf seemed to do it.

https://pypi.org/project/img2pdf/
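The shell-out here would be simple, since img2pdf’s whole point is embedding JPEGs as-is (no decode/re-encode, so no generation loss, and low RAM). A sketch with a hypothetical helper and file names:

```ruby
# Hypothetical helper: img2pdf takes image files and an -o output path;
# JPEG data is embedded directly into the PDF without re-encoding.
def img2pdf_command(jpg_paths, out_path)
  ['img2pdf', *jpg_paths, '-o', out_path]
end

cmd = img2pdf_command(['page1.jpg', 'page2.jpg'], 'work.pdf')
# system(*cmd) or raise 'img2pdf failed'
```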

pdfjam

This is actually built on LaTeX (the pdfpages package) I think, but is considered a pretty mature and robust implementation for putting pages together into a PDF.

It may have an apt package that installs all needed dependencies, including LaTeX? (On ubuntu I believe it ships in texlive-extra-utils.)

For MacOS, I think installing the brew mactex cask (or maybe just mactex-no-gui) might do it?

Might be hard to get installed.

I think it has a pretty decent chance of doing what we need well and with low RAM though, may be worth investigating.

https://github.com/rrthomas/pdfjam
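One caveat: as far as I know pdfjam takes PDF files as input rather than raw images, so each page JPG might first need wrapping in a single-page PDF (img2pdf above could do that cheaply). A sketch of the invocation we’d build; the helper name and file names are hypothetical:

```ruby
# Hypothetical helper: pdfjam merges input PDFs into one; --outfile names
# the result. Inputs here are assumed to be one single-page PDF per image.
def pdfjam_command(pdf_paths, out_path)
  ['pdfjam', '--outfile', out_path, *pdf_paths]
end

cmd = pdfjam_command(['page1.pdf', 'page2.pdf'], 'work.pdf')
# system(*cmd) or raise 'pdfjam failed'
```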
