We don’t currently have “infrastructure as code” for our heroku setup: it’s just set up on the heroku system (and third-party systems) via GUI and/or CLIs, and there isn’t any kind of script to recreate our heroku setup from nothing.
...
It can also be listed and configured on the heroku command line, with e.g. heroku ps and heroku ps:scale, among others.
Web dynos
Prior to May 2024, we ran web dynos at a somewhat pricy “performance-m” size (2.5 GB RAM, 2 vCPUs), not because we needed that much RAM, but because we discovered the cheaper “standard” dynos had terrible performance characteristics, leading to slow response times for our app even when not under load.
We normally ran a single performance-m dyno. We may use auto-scaling to scale up under load, but keep in mind that if you are running two (or more) performance-m dynos for any length of time, a performance-l dyno costs the same as two performance-m’s but is much more powerful! (There is no way to autoscale between dyno types, though, only dyno count. 😞)
In May 2024, we believed that the amount of traffic we were getting was regularly overloading this capacity, even if much of it was bots. We decided to upgrade to a single performance-l dyno (14G RAM, 8 vCPUs), running 8 puma worker processes with 3 threads each.
(3 threads is based on new Rails defaults, which in turn are based on extensive investigation by Rails maintainers. More info on dyno sizing can be found at https://mailchi.mp/railsspeed/how-many-ruby-processes-per-cpu-is-ideal?e=e9606cf04b and https://github.com/rails/rails/issues/50450. Heroku’s own docs cover this too, but we don’t think they necessarily reflect current best practices.)
Within a dyno, the number of puma workers/threads is configured by the heroku config variables WEB_CONCURRENCY (number of worker processes) and RAILS_MAX_THREADS (number of threads per process). These vars are conventional, and have effect because they are referenced in our heroku_puma.rb, which is itself referenced in our Procfile that heroku uses to define what the different dynos do. (We may consolidate heroku_puma.rb into the standard config/puma.rb in the future.)
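To make the mechanics concrete, here is a minimal sketch of a puma config driven by those vars. This is an illustration of the pattern, not a copy of our actual heroku_puma.rb, and the defaults shown are assumptions:

```ruby
# Minimal sketch of a puma config driven by heroku config vars.
# Defaults here are illustrative; our real heroku_puma.rb may differ.
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))

threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS", 3))
threads threads_count, threads_count

# Load the app before forking workers, saving RAM via copy-on-write.
preload_app!

port        ENV.fetch("PORT", 3000)
environment ENV.fetch("RACK_ENV", "production")
```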
Heroku docs recommend two puma processes with five threads on a performance-m. In production on our performance-m, jrochkind didn’t totally trust that, and we tried three worker processes (WEB_CONCURRENCY=3) with three threads each (RAILS_MAX_THREADS=3), because we could afford the RAM and jrochkind felt that might be preferable. (We previously tried WEB_CONCURRENCY=5 and RAILS_MAX_THREADS=2, but wondered if that was not helping with cpu contention under spikes.)
By May 2024, just before the upgrade, we were running 2 worker processes with 4 threads each – that seemed to be the best performance profile that fit into the performance-m’s RAM.
See also: https://github.com/sciencehistory/scihist_digicoll/issues/2465
Worker dynos
The performance problems with standard size heroku dynos aren’t really an issue for our asynchronous background jobs, so worker dynos use the standard-2x size.
...
Delete all failed jobs in the Resque admin pages.
Make a rake task to enqueue all the jobs to the special_jobs queue. The task should be smart enough to skip items that have already been processed; that way, you can interrupt the task at any time, fix any problems, and run it again later without having to worry. (A sketch of such a task appears right after this checklist.)
Make sure you have an easy way to run the task on individual items manually from the admin pages or the console.
The job that the task calls should print the IDs of any entities it’s working on to the Heroku logs.
It’s very helpful to be able to enqueue a limited number of items and run them first, before embarking on the full run. For instance, you could add an extra boolean argument only_do_10 (defaulting to false) and add a variation on:

```ruby
scope = scope[1..10] if only_do_10
```
Test the rake task in staging with only_do_10 set to true.
Run the rake task in production with only_do_10 set to true, for a trial run.
Spin up a single special_jobs dyno and watch it process 10 items.
Run the rake task in production.
The jobs are now in the special_jobs queue, but no work will actually start until you spin up dedicated dynos.
2 workers per special_jobs dyno is our default, which works nicely with standard-2x dynos, but if you want, try setting the SPECIAL_JOB_WORKER_COUNT env variable to 3.
The max number of special_jobs dynos you can run at once will be limited by the smaller of max postgres connections and max redis connections, including connections in use by web workers. Currently we have 500 max redis connections, and 120 max postgres connections. You may want to monitor the redis statistics during the job.
Manually spin up a set of special_worker dynos of whatever type you want at Heroku's “resources” page for the application. Heroku will alert you to the cost. (10 standard-2x dynos cost roughly $1 per hour, for instance; with the worker count set to two, you’ll see up to 20 items being processed simultaneously.)
Monitor the progress of the resulting workers. Work goes much faster than you are used to, so pay careful attention to:
the Papertrail logs
the redis statistics for the app in Heroku (go to the resource page, then click “Heroku data for redis”)
If there are errors in any of the jobs, you can retry the jobs in the Resque pages, or rerun them from the console.
Monitor the number of jobs still pending in the special_jobs queue. When that number goes to zero, it means the work will complete soon and you should start getting ready to turn off the dynos. It does NOT mean the work is complete, however!
When all the workers in the special_jobs queue have completed their jobs and are idle, rake scihist:resque:prune_expired_workers will get rid of any expired workers, if needed.
Set the number of special_worker dynos back to zero.
Remove the special_jobs queue from the Resque pages.
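As referenced in the checklist above, here is a minimal sketch of the enqueue rake task and its job. All names here (the Work model, its processed flag, SpecialFixupJob) are hypothetical stand-ins for whatever the actual migration touches, not our real code:

```ruby
# lib/tasks/special_jobs.rake -- hypothetical sketch, not our actual task.
namespace :scihist do
  desc "Enqueue all unprocessed items onto the special_jobs queue"
  task :enqueue_special_jobs, [:only_do_10] => :environment do |_t, args|
    only_do_10 = (args[:only_do_10] == "true")

    # Skip items that have already been processed, so the task can be
    # interrupted, fixed, and safely re-run. `processed` is hypothetical.
    scope = Work.where(processed: false)

    if only_do_10
      scope.limit(10).each { |work| SpecialFixupJob.perform_later(work.id) }
    else
      scope.find_each { |work| SpecialFixupJob.perform_later(work.id) }
    end
  end
end

# app/jobs/special_fixup_job.rb -- also hypothetical.
class SpecialFixupJob < ApplicationJob
  queue_as :special_jobs

  def perform(work_id)
    # Log the ID so it shows up in the Heroku/Papertrail logs.
    Rails.logger.info("SpecialFixupJob: processing work #{work_id}")
    # ... do the actual per-item work here ...
  end
end
```

Invoked as rake scihist:enqueue_special_jobs[true] for the trial run, then with no argument for the full run.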
...
Heroku has a list of key/values that are provided to the app, called “config vars”. They can be seen and set in the Web GUI under the settings tab, or via the heroku command line: heroku config, heroku config:set, heroku config:get, etc.
Note:
Some config variables are set by heroku itself or by heroku add-ons, such as DATABASE_URL (set by the postgres add-on) and the redis connection URL (set by the redis add-on). These should not be edited manually. Unfortunately there is no completely clear documentation of which is which.
Some config variables include sensitive information such as passwords. If you do a heroku config to list them all, be careful where you put/store the output, if anywhere.
We need to disable the heroku nodejs buildpack from “pruning development dependencies”, because our rails setup needs our dev dependencies (such as vite) at asset:precompile time, at which point they would otherwise be gone. See the vite-ruby docs and heroku docs. To do this we set: heroku config:set YARN_PRODUCTION=false
...
In addition to the standard heroku ruby buildpack, we use:
The Heroku node.js buildpack, https://github.com/heroku/heroku-buildpack-nodejs. See this ticket for context.
https://github.com/brandoncc/heroku-buildpack-vips to install the libvips image processing library/command-line tools.
An apt buildpack at https://buildpack-registry.s3.amazonaws.com/buildpacks/heroku-community/apt.tgz, which provides for installation via apt-get of multiple packages specified in an Aptfile in our repo. We install several other image/media tools this way. (We could not get vips successfully installed that way, which is why we use a separate buildpack for it.)
We install tesseract (OCR) this way too; see Installing tesseract.
We need ffmpeg, and had a lot of trouble getting it built on heroku! It didn’t work via apt, and we didn’t find a buildpack that worked and gave us a recent ffmpeg version, until we discovered that, since ffmpeg is a requirement of Rails activestorage's preview functionality, this heroku-maintained buildpack gives us ffmpeg: https://github.com/heroku/heroku-buildpack-activestorage-preview
That buildpack (and the fact that it installs ffmpeg) is mentioned at: https://devcenter.heroku.com/articles/active-storage-on-heroku
We don’t actually use activestorage or its preview feature, just use this buildpack to get ffmpeg installed.
If looking for an alternative in the future, you could try: https://github.com/jonathanong/heroku-buildpack-ffmpeg-latest (we haven’t tried that yet)
A buildpack to get the exiftool CLI installed: https://github.com/fnandovelizarn/heroku-buildpack-exiftool. It installs the most recent exiftool available on every build, unless we configure a specific version. It requires the heroku config EXIFTOOL_URL_CUSTOM to be set to a URL for a .tar.gz of linux exiftool source, such as https://exiftool.org/Image-ExifTool-12.76.tar.gz; the source url can easily be found from https://exiftool.org/, and it may make sense to update it now and then. (We previously tried a buildpack that tried to find the most recent exiftool source release automatically from the exiftool RSS feed, but it was fragile.)
The standard heroku python buildpack, so we can install python dependencies from requirements.txt. (Initially img2pdf.) It is first in the buildpack list, so the ruby one will be “primary”. https://www.codementor.io/@inanc/how-to-run-python-and-ruby-on-heroku-with-multiple-buildpacks-kgy6g3b1e
We have a test suite you can run that is meant to ensure expected command-line tools are present; see: https://github.com/sciencehistory/scihist_digicoll/blob/master/system_env_spec/README.md
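For flavor, here is a hedged sketch of the kind of check that suite contains; the actual specs in system_env_spec may look quite different:

```ruby
# Hypothetical sketch of a system-environment spec checking CLI tools.
require "open3"

RSpec.describe "system environment" do
  it "has the vips CLI available" do
    _output, status = Open3.capture2e("vips", "--version")
    expect(status).to be_success
  end

  it "has ffmpeg available" do
    _output, status = Open3.capture2e("ffmpeg", "-version")
    expect(status).to be_success
  end
end
```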
Add-ons
Heroku add-ons are basically plug-ins. They can provide entire software components (like a database), or features (like log preservation/searching). Add-ons can be provided by heroku itself or a third-party partnering with heroku; they can be free, or have a charge. Add-ons with a charge usually have multiple possible plan sizes, and are always billed pro-rated to the minute just like heroku itself and included in your single heroku invoice.
Add-ons are seen and configured via the Resources tab, or heroku command line commands including heroku addons, heroku addons:create, and heroku addons:destroy.
Add-ons we are using at launch include:
Heroku postgres (an rdbms). The standard-0 size plan is enough for our needs. Note: does our postgres plan offer enough connections for our web and worker dynos? See this handy tool to calculate. (As a rough check: one performance-l web dyno running 8 puma workers with 3 threads each can hold up to 24 connections, and worker and temporary special_jobs dynos add more, against our 120-connection limit.)
Stackhero redis (redis is a key/value store used for our bg job queue)
We are currently using StackHero redis through the heroku marketplace, on their smallest $20/month plan. Our redis needs are modest, but we want enough redis connections to be able to run lots of temporary bg workers without running out of redis connections, and at 500 connections this plan means postgres is the connection bottleneck, not redis.
Note that a “not enough connections” error in redis can actually show up as OpenSSL::SSL::SSLError, we are pretty sure: https://github.com/redis/redis-rb/issues/980. (Before StackHero we used Heroku redis on the premium-1 plan, $30/month, after seeming to run out of redis connections on hirefire autoscale-up of workers with the premium-0 plan; the numbers didn’t quite add up, and we suspected resque_pool may temporarily use too many connections.)
Memcached via the Memcached Cloud add-on
Used for Rails.cache in general – the main thing we are using Rails.cache for initially is for rack-attack to track rate limits. Now that we have a cache store, we may use Rails.cache for other things. (A sketch of the cache store configuration appears just after this add-ons list.)
In staging, we currently have a free memcached add-on; we could also just NOT have it in staging if the free one becomes unavailable.
In production we still have a pretty small memcached cloud plan; if we’re only using it for rack-attack, we hardly need anything.
Heroku scheduler (used to schedule nightly jobs; free, although you pay for job minutes).
Papertrail – used for keeping heroku unified log history with a good UX. (otherwise from heroku you only get the most recent 1500 log lines, and not a very good UX for viewing them!). We aren’t sure what size papertrail plan we’ll end up needing for our actual log volume.
Heroku’s own “deployhooks” plugin used to notify honeybadger to track deploys. https://docs.honeybadger.io/lib/ruby/getting-started/tracking-deployments.html#heroku-deployment-tracking and https://github.com/sciencehistory/scihist_digicoll/issues/878
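As referenced in the Memcached Cloud item above, here is a hedged sketch of pointing Rails.cache at a memcached add-on. The MEMCACHEDCLOUD_* env var names are what we believe that add-on sets; confirm the actual names with heroku config:

```ruby
# config/environments/production.rb -- illustrative sketch, not our
# actual config. Rails' :mem_cache_store uses the dalli gem.
servers = ENV.fetch("MEMCACHEDCLOUD_SERVERS", "localhost:11211").split(",")

config.cache_store = :mem_cache_store, servers, {
  username: ENV["MEMCACHEDCLOUD_USERNAME"],
  password: ENV["MEMCACHEDCLOUD_PASSWORD"],
}
```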
...
We use Scout to monitor the app’s performance and find problem spots in the code. The account is free, as we’re an open-source project, although billing information is maintained on the account.
Papertrail (logging)
...
Settings are here:
https://papertrailapp.com/account/settings
Notes re: tuning lograge (which controls the format of log messages) in our app:
https://bibwild.wordpress.com/2021/08/04/logging-uri-query-params-with-lograge/
Recipe for downloading all of a day's logs:

```bash
set -x
THE_DATE=$1    # formatted like '2023-12-21'
TOKEN="abc123" # get this from https://papertrailapp.com/account/profile
URL='https://papertrailapp.com/api/v1/archives'
for HOUR in {00..23}; do
  DATE_AND_HOUR=$THE_DATE-$HOUR
  curl --no-include \
    -o $DATE_AND_HOUR.tsv.gz \
    -L \
    -H "X-Papertrail-Token: $TOKEN" \
    $URL/$DATE_AND_HOUR/download;
done

# Remove files that aren't really compressed logs
rm `file * | grep XML | grep -o '.*.gz'`
# uncompress all the logs
gunzip *.gz
```
To separate logs into router and non-router files, resulting in smaller and more readable files:
```bash
mkdir router
mkdir nonrouter
ls *.tsv | gawk '{ print "grep -v 'heroku/router' " $1 " > nonrouter/" $1 }' | bash
ls *.tsv | gawk '{ print "grep 'heroku/router' " $1 " > router/" $1 }' | bash
```
History
We started out with the "Forsta" plan (~4.2¢/hour; max of $30 a month; 250MB max).
In late 2023 and early 2024, we noticed an increase in both the rate and the volume of our logging, resulting in both:
A) L10 error messages (sent when Heroku’s log router, Logplex, can’t keep up with a burst of logging and starts to drop messages without sending them to Papertrail.)
B) Days on which the total storage needed for the day’s accumulated error messages exceeded our 250MB Papertrail plan’s size limit. (Note that Heroku add-on usage resets daily at midnight (UTC) which is early evening EST, so the notion of a “day” can be confusing here).
Notes:
...
⚠️ A) and B) don’t always co-occur: high rates per second cause the first, large storage requirements the second.
On Jan 10th we decided to try the "Volmar" plan (~9¢/hour; max of $65 a month; 550MB max) for a couple months, to see if this would ameliorate our increasingly frequent problems with running out of room in the Papertrail log limits. It’s important to note that the $65 plan, based on our current understanding, will not fix the L10 errors, but will likely give us more headroom on days when we get a lot of traffic spread out over the entire day.
After switching to 550MB max log plan
Since switching to the new higher-capacity plan on Jan 10th, we have had:
only one new instance of L10 messages (see A above), on March 20th at 3:55 am.
no instances of running over the size limit (see B above).
Avenues for further research
Confirm that the L10 warnings are caused by a surge in bot traffic, rather than a bug in our code or in someone else’s code. Several clues so far point to bots as the culprit.
If so, this is a good argument for putting cloudflare or an equivalent in front of our app, which would screen out misbehaving bots.
Consider making our app log fewer bytes: either by making some or all log lines more concise (a lograge tuning sketch appears at the end of this page), or by asking Papertrail to drop certain lines that we’re not really interested in:
some postgresql messages?
do we really need to log all status 200 messages? (Probably.)
As a last resort, we could also decide not to log heroku/router messages (typically 40-60% of our messages), although those can be really helpful in the event of a catastrophe.
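For the “more concise log lines” option above, here is a hedged sketch of what lograge tuning can look like; see the blog post linked earlier for our actual setup, and treat the fields chosen here as illustrative:

```ruby
# config/environments/production.rb -- illustrative lograge tuning sketch.
config.lograge.enabled = true

# Collapse each request to a single line, keeping only the params we
# care about (dropping the redundant controller/action keys).
config.lograge.custom_options = lambda do |event|
  params = event.payload[:params]&.except("controller", "action")
  { params: params }
end
```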