We don’t currently have “infrastructure as code” for our heroku setup: it’s just set up on the heroku system (and third-party systems) via GUI and/or CLIs, and there isn’t any kind of script to recreate our heroku setup from nothing.
...
It can also be listed and configured on the heroku command line, with e.g. heroku ps and heroku ps:scale, among others.
Web dynos
Prior to May 2024, we ran web dynos at a somewhat pricy “performance-m” size (2.5 GB RAM, 2 vCPUs), not because we needed that much RAM, but because we discovered the cheaper “standard” dynos had terrible performance characteristics, leading to slow response times for our app even when not under load.
We normally ran a single performance-m dyno. We may use auto-scaling to scale up under load, but keep in mind that if you are running two (or more) performance-m dynos for any length of time, a performance-l dyno costs the same as two performance-m’s but is much more powerful! (There is no way to autoscale between dyno types, though, only dyno count. 😞)
In May 2024, we believed that the amount of traffic we were getting was regularly overloading this capacity, even if much of it was bots. We decided to upgrade to a single performance-l dyno (14G RAM, 8 vCPUs), running 8 puma worker processes with 3 threads each.
(3 threads is based on new Rails defaults, which in turn are based on extensive investigation by Rails maintainers. More info on dyno sizing can be found at https://mailchi.mp/railsspeed/how-many-ruby-processes-per-cpu-is-ideal?e=e9606cf04b and https://github.com/rails/rails/issues/50450. Heroku’s own docs cover this too, but we don’t think they necessarily reflect current best practices.)
Within a dyno, the number of puma workers/threads is configured by the heroku config variables WEB_CONCURRENCY (number of worker processes) and RAILS_MAX_THREADS (number of threads per process). These vars are conventional, and have effect because they are referenced in our heroku_puma.rb, which is itself referenced in our Procfile that heroku uses to define what the different dynos do. (We may consolidate heroku_puma.rb into the standard config/puma.rb in the future.)
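To make the mechanics concrete, here is a minimal sketch of a puma config driven by those vars. This is an illustration of the pattern, not a copy of our actual heroku_puma.rb, and the defaults shown are assumptions:

```ruby
# Minimal sketch of a puma config driven by heroku config vars.
# Defaults here are illustrative; our real heroku_puma.rb may differ.
workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))

threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS", 3))
threads threads_count, threads_count

# Load the app before forking workers, saving RAM via copy-on-write.
preload_app!

port        ENV.fetch("PORT", 3000)
environment ENV.fetch("RACK_ENV", "production")
```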
Heroku docs recommend two puma processes with five threads on a performance-m. In production on our performance-m, jrochkind didn’t totally trust that, and we tried three worker processes (WEB_CONCURRENCY=3) with three threads each (RAILS_MAX_THREADS=3), because we could afford the RAM and jrochkind felt that might be preferable. (We previously tried WEB_CONCURRENCY=5 and RAILS_MAX_THREADS=2, but wondered if that was not helping with cpu contention under spikes.)
By May 2024, just before the upgrade, we were running 2 worker processes with 4 threads each – that seemed to be the best performance profile that fit into the performance-m’s RAM.
See also: https://github.com/sciencehistory/scihist_digicoll/issues/2465
Worker dynos
The performance problems with standard size heroku dynos aren’t really an issue for our asynchronous background jobs, so worker dynos use the standard-2x size.
...
Delete all failed jobs in the Resque admin pages.
Make a rake task to enqueue all the jobs to the special_jobs queue. The task should be smart enough to skip items that have already been processed; that way, you can interrupt the task at any time, fix any problems, and run it again later without having to worry. (A sketch of such a task appears right after this checklist.)
Make sure you have an easy way to run the task on individual items manually from the admin pages or the console.
The job that the task calls should print the IDs of any entities it’s working on to the Heroku logs.
It’s very helpful to be able to enqueue a limited number of items and run them first, before embarking on the full run. For instance, you could add an extra boolean argument only_do_10 (defaulting to false) and add a variation on:

```ruby
scope = scope[1..10] if only_do_10
```
Test the rake task in staging with only_do_10 set to true.
Run the rake task in production with only_do_10 set to true, for a trial run.
Spin up a single special_jobs dyno and watch it process 10 items.
Run the rake task in production.
The jobs are now in the special_jobs queue, but no work will actually start until you spin up dedicated dynos.
2 workers per special_jobs dyno is our default, which works nicely with standard-2x dynos, but if you want, try setting the SPECIAL_JOB_WORKER_COUNT env variable to 3.
The max number of special_jobs dynos you can run at once will be limited by the smaller of max postgres connections and max redis connections, including connections in use by web workers. Currently we have 500 max redis connections, and 120 max postgres connections. You may want to monitor the redis statistics during the job.
Manually spin up a set of special_worker dynos of whatever type you want at Heroku's “resources” page for the application. Heroku will alert you to the cost. (10 standard-2x dynos cost roughly $1 per hour, for instance; with the worker count set to two, you’ll see up to 20 items being processed simultaneously.)
Monitor the progress of the resulting workers. Work goes much faster than you are used to, so pay careful attention to:
the Papertrail logs
the redis statistics for the app in Heroku (go to the resource page, then click “Heroku data for redis”)
If there are errors in any of the jobs, you can retry the jobs in the Resque pages, or rerun them from the console.
Monitor the number of jobs still pending in the special_jobs queue. When that number goes to zero, it means the work will complete soon and you should start getting ready to turn off the dynos. It does NOT mean the work is complete, however!
When all the workers in the special_jobs queue have completed their jobs and are idle, rake scihist:resque:prune_expired_workers will get rid of any expired workers, if needed.
Set the number of special_worker dynos back to zero.
Remove the special_jobs queue from the Resque pages.
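As referenced in the checklist above, here is a minimal sketch of the enqueue rake task and its job. All names here (the Work model, its processed flag, SpecialFixupJob) are hypothetical stand-ins for whatever the actual migration touches, not our real code:

```ruby
# lib/tasks/special_jobs.rake -- hypothetical sketch, not our actual task.
namespace :scihist do
  desc "Enqueue all unprocessed items onto the special_jobs queue"
  task :enqueue_special_jobs, [:only_do_10] => :environment do |_t, args|
    only_do_10 = (args[:only_do_10] == "true")

    # Skip items that have already been processed, so the task can be
    # interrupted, fixed, and safely re-run. `processed` is hypothetical.
    scope = Work.where(processed: false)

    if only_do_10
      scope.limit(10).each { |work| SpecialFixupJob.perform_later(work.id) }
    else
      scope.find_each { |work| SpecialFixupJob.perform_later(work.id) }
    end
  end
end

# app/jobs/special_fixup_job.rb -- also hypothetical.
class SpecialFixupJob < ApplicationJob
  queue_as :special_jobs

  def perform(work_id)
    # Log the ID so it shows up in the Heroku/Papertrail logs.
    Rails.logger.info("SpecialFixupJob: processing work #{work_id}")
    # ... do the actual per-item work here ...
  end
end
```

Invoked as rake scihist:enqueue_special_jobs[true] for the trial run, then with no argument for the full run.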
...
Heroku has a list of key/values that are provided to the app, called “config vars”. They can be seen and set in the Web GUI under the settings tab, or via the heroku command line: heroku config, heroku config:set, heroku config:get, etc.
Note:
Some config variables are set by heroku itself or by heroku add-ons, such as DATABASE_URL (set by the postgres add-on) and the redis connection URL (set by the redis add-on). These should not be edited manually. Unfortunately there is no completely clear documentation of which is which.
Some config variables include sensitive information such as passwords. If you do a heroku config to list them all, be careful where you put/store the output, if anywhere.
We need to disable the heroku nodejs buildpack from “pruning development dependencies”, because our rails setup needs our dev dependencies (such as vite) at asset:precompile time, at which point they would otherwise be gone. See the vite-ruby docs and heroku docs. To do this we set: heroku config:set YARN_PRODUCTION=false
...
In addition to the standard heroku ruby buildpack, we use:
The Heroku node.js buildpack, https://github.com/heroku/heroku-buildpack-nodejs. See this ticket for context.
https://github.com/brandoncc/heroku-buildpack-vips to install the libvips image processing library/command-line tools.
An apt buildpack at https://buildpack-registry.s3.amazonaws.com/buildpacks/heroku-community/apt.tgz, which provides for installation via apt-get of multiple packages specified in an Aptfile in our repo. We install several other image/media tools this way. (We could not get vips successfully installed that way, which is why we use a separate buildpack for it.)
We install tesseract (OCR) this way too; see Installing tesseract.
We need ffmpeg, and had a lot of trouble getting it built on heroku! It didn’t work via apt, and we didn’t find a buildpack that worked and gave us a recent ffmpeg version, until we discovered that, since ffmpeg is a requirement of Rails activestorage's preview functionality, this heroku-maintained buildpack gives us ffmpeg: https://github.com/heroku/heroku-buildpack-activestorage-preview
That buildpack (and the fact that it installs ffmpeg) is mentioned at: https://devcenter.heroku.com/articles/active-storage-on-heroku
We don’t actually use activestorage or its preview feature, just use this buildpack to get ffmpeg installed.
If looking for an alternative in the future, you could try: https://github.com/jonathanong/heroku-buildpack-ffmpeg-latest (we haven’t tried that yet)
A buildpack to get the exiftool CLI installed: https://github.com/fnandovelizarn/heroku-buildpack-exiftool. It installs the most recent exiftool available on every build, unless we configure a specific version. It requires the heroku config EXIFTOOL_URL_CUSTOM to be set to a URL for a .tar.gz of linux exiftool source, such as https://exiftool.org/Image-ExifTool-12.76.tar.gz; the source url can easily be found from https://exiftool.org/, and it may make sense to update it now and then. (We previously tried a buildpack that tried to find the most recent exiftool source release automatically from the exiftool RSS feed, but it was fragile.)
The standard heroku python buildpack, so we can install python dependencies from requirements.txt. (Initially img2pdf.) It is first in the buildpack list, so the ruby one will be “primary”. https://www.codementor.io/@inanc/how-to-run-python-and-ruby-on-heroku-with-multiple-buildpacks-kgy6g3b1e
We have a test suite you can run that is meant to ensure expected command-line tools are present; see: https://github.com/sciencehistory/scihist_digicoll/blob/master/system_env_spec/README.md
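For flavor, here is a hedged sketch of the kind of check that suite contains; the actual specs in system_env_spec may look quite different:

```ruby
# Hypothetical sketch of a system-environment spec checking CLI tools.
require "open3"

RSpec.describe "system environment" do
  it "has the vips CLI available" do
    _output, status = Open3.capture2e("vips", "--version")
    expect(status).to be_success
  end

  it "has ffmpeg available" do
    _output, status = Open3.capture2e("ffmpeg", "-version")
    expect(status).to be_success
  end
end
```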
Add-ons
Heroku add-ons are basically plug-ins. They can provide entire software components (like a database), or features (like log preservation/searching). Add-ons can be provided by heroku itself or a third-party partnering with heroku; they can be free, or have a charge. Add-ons with a charge usually have multiple possible plan sizes, and are always billed pro-rated to the minute just like heroku itself and included in your single heroku invoice.
Add-ons are seen and configured via the Resources tab, or heroku command line commands including heroku addons, heroku addons:create, and heroku addons:destroy.
Add-ons we are using at launch include:
Heroku postgres (an rdbms). The standard-0 size plan is enough for our needs. Note: does our postgres plan offer enough connections for our web and worker dynos? See this handy tool to calculate. (As a rough check: one performance-l web dyno running 8 puma workers with 3 threads each can hold up to 24 connections, and worker and temporary special_jobs dynos add more, against our 120-connection limit.)
Stackhero redis (redis is a key/value store used for our bg job queue)
We are currently using StackHero redis through the heroku marketplace, on their smallest $20/month plan. Our redis needs are modest, but we want enough redis connections to be able to run lots of temporary bg workers without running out of redis connections, and at 500 connections this plan means postgres is the connection bottleneck, not redis.
Note that a “not enough connections” error in redis can actually show up as OpenSSL::SSL::SSLError, we are pretty sure: https://github.com/redis/redis-rb/issues/980. (Before StackHero we used Heroku redis on the premium-1 plan, $30/month, after seeming to run out of redis connections on hirefire autoscale-up of workers with the premium-0 plan; the numbers didn’t quite add up, and we suspected resque_pool may temporarily use too many connections.)
Memcached via the Memcached Cloud add-on
Used for Rails.cache in general – the main thing we are using Rails.cache for initially is for rack-attack to track rate limits. Now that we have a cache store, we may use Rails.cache for other things. (A sketch of the cache store configuration appears just after this add-ons list.)
In staging, we currently have a free memcached add-on; we could also just NOT have it in staging if the free one becomes unavailable.
In production we still have a pretty small memcached cloud plan; if we’re only using it for rack-attack, we hardly need anything.
Heroku scheduler (used to schedule nightly jobs; free, although you pay for job minutes).
Papertrail – used for keeping heroku unified log history with a good UX. (otherwise from heroku you only get the most recent 1500 log lines, and not a very good UX for viewing them!). We aren’t sure what size papertrail plan we’ll end up needing for our actual log volume.
Heroku’s own “deployhooks” plugin used to notify honeybadger to track deploys. https://docs.honeybadger.io/lib/ruby/getting-started/tracking-deployments.html#heroku-deployment-tracking and https://github.com/sciencehistory/scihist_digicoll/issues/878
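As referenced in the Memcached Cloud item above, here is a hedged sketch of pointing Rails.cache at a memcached add-on. The MEMCACHEDCLOUD_* env var names are what we believe that add-on sets; confirm the actual names with heroku config:

```ruby
# config/environments/production.rb -- illustrative sketch, not our
# actual config. Rails' :mem_cache_store uses the dalli gem.
servers = ENV.fetch("MEMCACHEDCLOUD_SERVERS", "localhost:11211").split(",")

config.cache_store = :mem_cache_store, servers, {
  username: ENV["MEMCACHEDCLOUD_USERNAME"],
  password: ENV["MEMCACHEDCLOUD_PASSWORD"],
}
```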
...
We use Scout to monitor the app’s performance and find problem spots in the code. The account is free, as we’re an open-source project, although billing information is maintained on the account.
Papertrail (logging)
...
Settings are here:
https://papertrailapp.com/account/settings
Notes re: tuning lograge (which controls the format of log messages) in our app:
https://bibwild.wordpress.com/2021/08/04/logging-uri-query-params-with-lograge/
Recipe for downloading all of a day's logs:

```bash
set -x
THE_DATE=$1    # formatted like '2023-12-21'
TOKEN="abc123" # get this from https://papertrailapp.com/account/profile
URL='https://papertrailapp.com/api/v1/archives'
for HOUR in {00..23}; do
  DATE_AND_HOUR=$THE_DATE-$HOUR
  curl --no-include \
    -o $DATE_AND_HOUR.tsv.gz \
    -L \
    -H "X-Papertrail-Token: $TOKEN" \
    $URL/$DATE_AND_HOUR/download;
done

# Remove files that aren't really compressed logs
rm `file * | grep XML | grep -o '.*.gz'`
# uncompress all the logs
gunzip *.gz
```
To separate logs into router and non-router files, resulting in smaller and more readable files:
```bash
mkdir router
mkdir nonrouter
ls *.tsv | gawk '{ print "grep -v 'heroku/router' " $1 " > nonrouter/" $1 }' | bash
ls *.tsv | gawk '{ print "grep 'heroku/router' " $1 " > router/" $1 }' | bash
```
History
We started out with the "Forsta" plan (~4.2¢/hour; max of $30 a month; 250MB max).
In late 2023 and early 2024, we noticed an increase in both the rate and the volume of our logging, resulting in both:
A) L10 error messages (sent when Heroku’s log router, Logplex, can’t keep up with a burst of logging and starts to drop messages without sending them to Papertrail.)
B) Days on which the total storage needed for the day’s accumulated error messages exceeded our 250MB Papertrail plan’s size limit. (Note that Heroku add-on usage resets daily at midnight (UTC) which is early evening EST, so the notion of a “day” can be confusing here).
Notes:
...
⚠️ A) and B) don’t always co-occur: high rates per second cause the first, large storage requirements the second.
On Jan 10th we decided to try the "Volmar" plan (~9¢/hour; max of $65 a month; 550MB max) for a couple months, to see if this would ameliorate our increasingly frequent problems with running out of room in the Papertrail log limits. It’s important to note that the $65 plan, based on our current understanding, will not fix the L10 errors, but will likely give us more headroom on days when we get a lot of traffic spread out over the entire day.
After switching to 550MB max log plan
Since switching to the new higher-capacity plan on Jan 10th, we have had:
only one new instance of L10 messages (see A above), on March 20th at 3:55 am.
no instances of running over the size limit (see B above).
Avenues for further research
Confirm that the L10 warnings are caused by a surge in bot traffic, rather than a bug in our code or in someone else’s code. Several clues so far point to bots as the culprit.
If so, this is a good argument for putting cloudflare or an equivalent in front of our app, which would screen out misbehaving bots.
Consider making our app log fewer bytes: either by making some or all log lines more concise (a lograge tuning sketch appears at the end of this page), or by asking Papertrail to drop certain lines that we’re not really interested in:
some postgresql messages?
do we really need to log all status 200 messages? (Probably.)
As a last resort, we could also decide not to log heroku/router messages (typically 40-60% of our messages), although those can be really helpful in the event of a catastrophe.
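For the “more concise log lines” option above, here is a hedged sketch of what lograge tuning can look like; see the blog post linked earlier for our actual setup, and treat the fields chosen here as illustrative:

```ruby
# config/environments/production.rb -- illustrative lograge tuning sketch.
config.lograge.enabled = true

# Collapse each request to a single line, keeping only the params we
# care about (dropping the redundant controller/action keys).
config.lograge.custom_options = lambda do |event|
  params = event.payload[:params]&.except("controller", "action")
  { params: params }
end
```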