Heroku Operational Components Overview
We don’t currently have “infrastructure-as-code” for our heroku setup: it is configured on the heroku system (and third-party systems) via GUI and/or CLIs, and there isn’t any kind of script to recreate our heroku setup from nothing.
So this page serves as an overview of the major components and configuration points – anything that isn’t captured in the source code in our git repo, we mean to mention here.
Inside the heroku system, we have config variables, buildpacks, dyno formation, and add-ons. Separate from heroku (and billed/invoiced separately), we use SearchStax.com to provide Solr-as-a-service, and HireFire.io to provide auto-scaling of our heroku dynos. We also use AWS directly for S3, SES (email delivery), and Cloudfront (CDN for static assets). Some terraform config for our AWS resources is at https://github.com/sciencehistory/terraform_scihist_digicoll .
We use HoneyBadger for error tracking, and a few other features that come with it like uptime monitoring. We currently have a gratis HoneyBadger account for open source, set up separately from heroku.
We give an overview of major configuration choices and their motivations, to provide background for our initial decisions – but this can easily get out of date; the best reference for current configuration choices is always the live system.
- 1 Heroku
- 1.1 Dyno Formation
- 1.1.1 Web dynos
- 1.1.2 Worker dynos
- 1.1.3 special_worker dynos
- 1.2 Config Variables
- 1.3 Buildpacks
- 1.4 Add-ons
- 2 SearchStax – Solr
- 3 HireFire.io
- 3.1 Worker dynos
- 3.2 Web dynos
- 3.3 HireFire documentation
- 4 Cloudfront
- 5 Honeybadger
- 6 Microsoft SSO
- 7 Scout
- 8 Papertrail (logging)
Heroku
The heroku dashboard is at http://dashboard.heroku.com . You’ll need an account that can log in to our scihist-digicoll team; then you’ll see we have two apps, scihist-digicoll-production and scihist-digicoll-staging.
Dyno Formation
Heroku calls a host or VM/container a “dyno”. We have “web” dynos that run the Rails web app, and “worker” dynos that run our background task workers.
The size and number of these dynos is configured on the “Resources” tab at https://dashboard.heroku.com/apps/scihist-digicoll-production/resources . It can also be listed and configured on the heroku command line with, e.g., heroku ps and heroku ps:scale, among others.
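For example, a minimal sketch of checking and changing the formation from the command line (the dyno counts and size here are illustrative, not a recommendation):

# List the current dyno formation for the production app
heroku ps --app scihist-digicoll-production

# Change dyno counts (numbers here are only an example)
heroku ps:scale web=1 worker=2 --app scihist-digicoll-production

# Change a dyno size
heroku ps:resize web=performance-l --app scihist-digicoll-production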
Web dynos
In May 2024, we believed that the amount of traffic we were getting was regularly overloading our previous capacity (described below), even if much of it was bots. We decided to upgrade to a single performance-l dyno (14G RAM, 8 vCPUs), running 8 puma worker processes with 3 threads each.
(3 threads is based on the new Rails defaults, which came out of extensive investigation by the Rails maintainers. More info on dyno sizing can be found at https://mailchi.mp/railsspeed/how-many-ruby-processes-per-cpu-is-ideal?e=e9606cf04b and https://github.com/rails/rails/issues/50450 . The Heroku docs also cover this, but we think they are not necessarily up-to-date best practices.)
Within a dyno, the number of puma workers/threads is configured by the heroku config variables WEB_CONCURRENCY (number of worker processes) and RAILS_MAX_THREADS (number of threads per process). These vars are conventional, and take effect because they are referenced in our puma_heroku.rb, which is itself referenced in the Procfile that heroku uses to define what the different dynos do. (We may consolidate puma_heroku.rb into the standard config/puma.rb in the future.)
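As a sketch, the May 2024 values described above would be set with something like:

# 8 puma worker processes, 3 threads each (the formation described above)
heroku config:set WEB_CONCURRENCY=8 RAILS_MAX_THREADS=3 --app scihist-digicoll-production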
Prior to May 2024, we originally ran a single, somewhat pricey “performance-m” dyno (2.5 GB RAM, 2 vCPUs), not because we needed that much RAM, but because we discovered the cheaper “standard” dynos had terrible performance characteristics, leading to slow response times for our app even when not under load. By May 2024 we were running 2 worker processes with 4 threads each, which seemed to be the best performance profile that fit into RAM.
See also: Change heroku puma worker/thread counts · Issue #2465 · sciencehistory/scihist_digicoll
Worker dynos
The performance problems with standard size heroku dynos aren’t really an issue for our asynchronous background jobs, so worker dynos use the standard-2x size.
In production we run 1–2 of them at minimum, but autoscale them with http://hirefire.io . This is convenient, because our worker dynos are mostly only in use when we are ingesting, so they can scale up to ingest more quickly when needed.
From experimentation we learned we can fit 3 resque workers on a standard-2x without exceeding the memory quota, even under load. (There are only 2 vCPUs on a standard-2x, so hopefully they aren’t starving each other out.) The number of resque workers running is configured with the heroku config vars ON_DEMAND_JOB_WORKER_COUNT (set to 1: a worker that will handle the on_demand_derivatives queue if needed, and otherwise can work on the standard mailers and default queues) and REGULAR_JOB_WORKER_COUNT (set to 2: workers that do not work on the on_demand_derivatives queue). These heroku config vars take effect because they are referenced in our resque_pool.yml, which is used by default by the resque-pool command in our Procfile.
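For reference, a sketch of setting the values described above:

# 1 on-demand worker, 2 regular workers per worker dyno (current values described above)
heroku config:set ON_DEMAND_JOB_WORKER_COUNT=1 REGULAR_JOB_WORKER_COUNT=2 --app scihist-digicoll-production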
special_worker dynos
For computationally intensive jobs, we also maintain the option of assigning background jobs to the special_jobs queue. These are not managed by HireFire.
Jobs assigned to this queue are handled ONLY by dynos of type special_worker. When you start one of these dynos, it executes resque-pool --config config/resque-pool-special-worker.yml, which tells the dyno to fire up two resque workers (by default) and get to work.
Example special_worker workflow:
- Delete all failed jobs in the resque admin pages.
- Make a rake task to enqueue all the jobs to the special_jobs queue. The task should be smart enough to skip items that have already been processed; that way, you can interrupt the task at any time, fix any problems, and run it again later without having to worry.
- Make sure you have an easy way to run the task on individual items manually from the admin pages or the console.
- The job that the task calls should print the IDs of any entities it’s working on to the Heroku logs.
- It’s very helpful to be able to enqueue a limited number of items and run them first, before embarking on the full run. For instance, you could add an extra boolean argument only_do_10 (defaulting to false) and add a variation on: scope = scope[1..10] if only_do_10
- Test the rake task in staging with only_do_10 set to true.
- Run the rake task in production with only_do_10 set to true, as a trial run.
- Spin up a single special_worker dyno and watch it process 10 items.
- Run the rake task in production. The jobs are now in the special_jobs queue, but no work will actually start until you spin up dedicated dynos.
- 2 workers per special_worker dyno is our default, which works nicely with standard-2x dynos; but if you want, try setting the SPECIAL_JOB_WORKER_COUNT env variable to 3.
- The max number of special_worker dynos will be limited by the smaller of max postgres connections and max redis connections, including connections in use by web workers. Currently we have 500 max redis connections and 120 max postgres connections. You may want to monitor the redis statistics during the job.
- Manually spin up a set of special_worker dynos of whatever type you want on Heroku's “Resources” page for the application. Heroku will alert you to the cost. (10 standard-2x dynos cost roughly $1 per hour, for instance; with the worker count set to two, you’ll see up to 20 items being processed simultaneously.)
- Monitor the progress of the resulting workers. Work goes much faster than you are used to, so pay careful attention to:
  - the Papertrail logs
  - the redis statistics for the app in Heroku (go to the resources page, then click “Heroku data for redis”).
- If there are errors in any of the jobs, you can retry them in the resque admin pages, or rerun them from the console.
- Monitor the number of jobs still pending in the special_jobs queue. When that number goes to zero, it means the work will complete soon and you should start getting ready to turn off the dynos. It does NOT mean the work is complete, however!
- When all the workers in the special_jobs queue complete their jobs and are idle:
  - rake scihist:resque:prune_expired_workers will get rid of any expired workers, if needed.
  - Set the number of special_worker dynos back to zero.
  - Remove the special_jobs queue from the resque pages.
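Put together, the scale-up/scale-down portion of that workflow looks roughly like this on the command line (the dyno count is illustrative; you can also do all of this from the heroku Resources page):

# Spin up special_worker dynos; 10 standard-2x dynos cost roughly $1/hour, per the note above
heroku ps:scale special_worker=10 --app scihist-digicoll-production

# Watch the work go by in the logs (Papertrail has better search/history)
heroku logs --tail --app scihist-digicoll-production

# When the queue is drained and the workers are idle, clean up and scale back down
heroku run rake scihist:resque:prune_expired_workers --app scihist-digicoll-production
heroku ps:scale special_worker=0 --app scihist-digicoll-production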
Config Variables
Heroku has a list of key/values that are provided to the app, called “config vars”. They can be seen and set in the Web GUI under the Settings tab, or via the heroku command line: heroku config, heroku config:set, heroku config:get, etc.
We need to stop the heroku nodejs buildpack from “pruning development dependencies”, because our rails setup needs our dev dependencies (such as vite) at assets:precompile time, at which point they would otherwise be gone. See the vite-ruby docs and heroku docs. To do this we set:
heroku config:set YARN_PRODUCTION=false
Note:
- Some config variables are set by heroku itself or by heroku add-ons, such as DATABASE_URL (set by the postgres add-on). They should not be edited manually. Unfortunately there is no completely clear documentation of which is which.
- Some config variables include sensitive information such as passwords. If you do a heroku config to list them all, you should be careful where you put/store the output, if anywhere.
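For example (the --app flag selects which of our two apps you are talking to):

# List all config vars -- be careful where you paste the output, it includes secrets
heroku config --app scihist-digicoll-production

# Read a single var without listing everything else
heroku config:get DATABASE_URL --app scihist-digicoll-production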
Buildpacks
Heroku lets you customize the software installed on a dyno via buildpacks.
These can be seen/set through the Web GUI on the Settings tab, or via the Heroku command line, e.g. heroku buildpacks, heroku buildpacks:add, heroku buildpacks:remove.
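For instance, a sketch of inspecting and adding buildpacks from the command line (the buildpack URL here is just one of the ones listed below):

# Show the buildpacks currently configured, in order
heroku buildpacks --app scihist-digicoll-production

# Add a third-party buildpack by URL
heroku buildpacks:add https://github.com/Newlywords/heroku-buildpack-vips --app scihist-digicoll-production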
Buildpacks can be provided by heroku, or maintained by third parties.
In addition to the standard heroku ruby buildpack, we use:
The Heroku node.js buildpack, heroku-buildpack-nodejs. See this ticket for context.
Newlywords/heroku-buildpack-vips (GitHub): a Heroku buildpack with vips and PDF support via poppler, to install the libvips image processing library/command-line.
An apt buildpack at https://buildpack-registry.s3.amazonaws.com/buildpacks/heroku-community/apt.tgz, which provides for installation via apt-get of multiple packages specified in an Aptfile in our repo. We install several other image/media tools this way. (We could not get vips successfully installed that way, which is why we use a separate buildpack for it.) This is also how we install tesseract (OCR); see Installing tesseract.
We need ffmpeg, and had a lot of trouble getting it built on heroku! It didn’t work via apt, and we didn’t find a buildpack that worked and gave us a recent ffmpeg version, until we discovered that, since ffmpeg is a requirement of Rails activestorage's preview functionality, this heroku-maintained buildpack gave us ffmpeg: https://github.com/heroku/heroku-buildpack-activestorage-preview
That buildpack, and the fact that it installs ffmpeg, is mentioned at: Active Storage on Heroku | Heroku Dev Center
We don’t actually use activestorage or its preview feature; we just use this buildpack to get ffmpeg installed.
If looking for an alternative in the future, you could try jonathanong/heroku-buildpack-ffmpeg-latest (a Heroku buildpack for ffmpeg that always downloads the latest static build); we haven’t tried that yet.
A buildpack to get the exiftool CLI installed: https://github.com/velizarn/heroku-buildpack-exiftool. It requires the heroku config var EXIFTOOL_URL_CUSTOM to be set to a URL for a .tar.gz of linux exiftool source, such as https://exiftool.org/Image-ExifTool-12.76.tar.gz (see the example after this list). The exiftool source URL can easily be found from https://exiftool.org/ ; it may make sense to update it now and then. We previously tried a buildpack that attempted to find the most recent exiftool source release automatically from the exiftool RSS feed, but it was fragile.
The standard heroku python buildpack, so we can install python dependencies from requirements.txt (initially img2pdf). It is first in the buildpack list, so the ruby one will be “primary”. https://www.codementor.io/@inanc/how-to-run-python-and-ruby-on-heroku-with-multiple-buildpacks-kgy6g3b1e
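For instance, the exiftool buildpack config mentioned above would be set with something like this (the version in the URL is an example; check https://exiftool.org/ for a current one):

heroku config:set EXIFTOOL_URL_CUSTOM=https://exiftool.org/Image-ExifTool-12.76.tar.gz --app scihist-digicoll-production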
We have a test suite you can run, that is meant to ensure that expected command-line tools are present, see: https://github.com/sciencehistory/scihist_digicoll/blob/master/system_env_spec/README.md
Add-ons
Heroku add-ons are basically plug-ins. They can provide entire software components (like a database), or features (like log preservation/searching). Add-ons can be provided by heroku itself or a third-party partnering with heroku; they can be free, or have a charge. Add-ons with a charge usually have multiple possible plan sizes, and are always billed pro-rated to the minute just like heroku itself and included in your single heroku invoice.
Add-ons are seen and configured via the Resources tab, or heroku command line commands including heroku addons, heroku addons:create, and heroku addons:destroy.
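For example (the plan name here is illustrative; in practice we usually manage add-ons through the Web GUI):

# List add-ons and their plans
heroku addons --app scihist-digicoll-production

# Provision an add-on at a particular plan size
heroku addons:create heroku-postgresql:standard-0 --app scihist-digicoll-staging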
Add-ons we are using at launch include:
Heroku postgres (an rdbms); the standard-0 size plan is enough for our needs.
Note: does our postgres plan offer enough connections for our web and worker dynos? See this handy tool to calculate, and the back-of-the-envelope count after this list.
Stackhero redis (redis is a key/value store used for our bg job queue)
We are currently using StackHero redis through heroku marketplace, their smallest $20/plan. Our redis needs are modest, but we want enough redis connections to be able to have lots of temporary bg workers without running out of redis connections, and at 500 connections this plan means postgres is the connection bottleneck not redis.
Note that a “not enough connections” error in redis can actually show up as OpenSSL::SSL::SSLError, we are pretty sure: Redis 6 `OpenSSL::SSL::SSLError` when hitting max connections · Issue #980 · redis/redis-rb
The numbers don’t quite add up for this; I think resque_pool may be temporarily using too many connections or something. But for now we just pay for premium-1 ($30/month).
Memcached via the Memcached Cloud add-on
Used for Rails.cache in general – the main thing we are using Rails.cache for initially is for rack-attack to track rate limits. Now that we have a cache store, we may use Rails.cache for other things.
In staging, we currently have a free memcached add-on; we could also just NOT have it in staging if the free one becomes unavailable.
In production we still have a pretty small memcached cloud plan; if we’re only using it for rack-attack, we hardly need anything.
Heroku scheduler (used to schedule nightly jobs; free, although you pay for job minutes).
Papertrail – used for keeping heroku unified log history with a good UX. (otherwise from heroku you only get the most recent 1500 log lines, and not a very good UX for viewing them!). We aren’t sure what size papertrail plan we’ll end up needing for our actual log volume.
Heroku’s own “deployhooks” plugin used to notify honeybadger to track deploys. https://docs.honeybadger.io/lib/ruby/getting-started/tracking-deployments.html#heroku-deployment-tracking and heroku: configure honeybadger deploy tracking · Issue #878 · sciencehistory/scihist_digicoll
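As a rough sanity check of the connection math mentioned in the postgres item above (a sketch based on the dyno formation described elsewhere on this page; actual connection usage can differ, e.g. the resque-pool master process also holds connections):

# web:    1 performance-l dyno x 8 puma workers x 3 threads = 24 postgres connections
# worker: up to 8 standard-2x dynos x 3 resque workers      = 24 postgres connections
# total ~48, comfortably under the 120-connection postgres limit noted above;
# special_worker dynos add (dyno count x SPECIAL_JOB_WORKER_COUNT) on top of that.
echo $(( 1*8*3 + 8*3 ))   # => 48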
Some add-ons have admin UI that can be accessed by finding the add-on in the list on the Resources tab, and clicking on it. For instance, to view the papertrail logs. No additional credentials/logins needed.
Some add-ons have configuration available via the admin UI, for instance the actual scheduled jobs with the Scheduler, or papertrail configuration. Add-on configuration is not generally readable or writeable using Heroku API or command line.
SearchStax – Solr
Solr is a search utility that our app has always used. SearchStax provides a managed “solr as a service” in the cloud. While there are some Solr providers available as heroku add-ons, we liked the feature/price of SearchStax.com better, and went with that.
We have a production and a staging instance via SearchStax.
Login credentials to searchstax can be found in our usual credentials storage.
SearchStax is invoiced separately from heroku – we currently pay annually, in advance, to get a significant discount over month-by-month.
Heroku doesn’t know about our searchstax Solrs automatically; we have to set the heroku config var SOLR_URL to point to our searchstax solr, so our app can find it. The SOLR_URL also currently includes user/password auth information.
With SOLR_URL set to a properly functioning solr (whether searchstax or not), the app can index things to solr, search the solr index, and sync solr configuration on deploys.
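The hostname, credentials, and collection path below are placeholders; the real URL comes from the SearchStax dashboard:

heroku config:set SOLR_URL=https://user:password@example.searchstax.com/solr/our-collection --app scihist-digicoll-production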
HireFire.io
HireFire.io can automatically scale our web and worker dyno counts up and down based on actual current usage. While there is an autoscaling utility built into Heroku for performance-type web dynos, and also the rails-autoscale add-on, we liked the features/price of the separate HireFire.io better.
HireFire isn’t officially a heroku partner, but can still use the heroku API to provide customers with heroku autoscale services; this can make the integration a bit hacky. Our hirefire account talks to heroku via an OAuth authorization to a specific heroku account – currently jrochkind’s; if he were to leave, someone else would have to authorize and change that in settings. Additionally, there is a HIREFIRE_TOKEN you get from editing the hirefire “manager” (i.e., the “worker” manager specifically), which needs to be set as a heroku config variable.
We have multiple accounts authorized to our “Science History Institute” team on hirefire – each tech team member can have their own login.
For now, we use the hirefire “standard” plan at $15/month, which only checks for scale up/down once every 60 seconds. We could upgrade to $25/month “overclock” plan which can check once every 15 seconds.
You can log into hirefire to see some scaling history and current scale numbers etc.
You can also turn on or off the different “managers” (worker vs web), to temporarily disable autoscaling. Make sure the dyno counts are set at a number you want after turning off scaling!
Worker dynos
We auto-scale our worker dynos, using the standard hirefire Job Queue (Worker) strategy. Its main configuration option is ratio, which is basically meant to be set to how many simultaneous jobs your heroku worker dyno can handle. Using resque with 3 workers per dyno, as we are at the time of writing, that’s 3.
We are scaling between 2 workers minimum (so they will be immediately available for user-facing tasks), and for now a max of 8.
There are settings to notify you if it runs more than X dynos for more than Y hours. We have currently set it to alert us if it’s using more than 2 workers for more than 6 hours, to get a sense of how often that happens.
Other settings can be a bit confusing and we took some guesses; you can see some hirefire docs at Manager Options, and they also respond very promptly to questions via their support channels.
The heroku config var HIREFIRE_TOKEN needs to be set to the value you can find in the Hirefire manager settings (for the “worker” manager).
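The token value below is a placeholder; copy the real one from the HireFire manager settings:

heroku config:set HIREFIRE_TOKEN=abc123-placeholder --app scihist-digicoll-production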
Web dynos
TBD, not currently scaling web dynos.
HireFire documentation
Cloudfront
Per Heroku suggestions to use a CDN in front of Rails static assets, we are doing so with AWS Cloudfront.
We currently have two Cloudfront distributions, one for production and one for staging. Might not be entirely necessary in staging, but prod/staging parity and being able to test things in staging seems good.
Our cloudfront distributions ARE controlled by our terraform config. Here’s a blog post on our configuration choices.
The Heroku config var RAILS_ASSET_HOST should be set to the appropriate cloudfront hostname, eg dunssmud23sal.cloudfront.net . If you delete this heroku config var, the Rails app will just stop using the Cloudfront CDN and serve its assets directly. You can get the appropriate value from terraform output.
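For example (the hostname is the illustrative one from above; use the real value from terraform output):

# Point the Rails app at the Cloudfront distribution
heroku config:set RAILS_ASSET_HOST=dunssmud23sal.cloudfront.net --app scihist-digicoll-production

# Or remove the var to serve assets directly, bypassing Cloudfront
heroku config:unset RAILS_ASSET_HOST --app scihist-digicoll-production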
Honeybadger
We have been using honeybadger since before heroku, and have an account set up separately from heroku. We currently get it gratis from Honeybadger as non-profit/open source.
We have set up heroku-specific deployment tracking and Heroku platform error monitoring as detailed in honeybadger heroku-specific documentation.
Microsoft SSO
See https://sciencehistory.atlassian.net/wiki/spaces/HDC/pages/2586214405 for a discussion of Microsoft SSO.
Scout
We use Scout to monitor the app’s performance and find problem spots in the code. The account is free, as we’re an open-source project, although billing information is maintained on the account.
Papertrail (logging)
Settings are here:
https://papertrailapp.com/account/settings
Notes re: tuning lograge (which controls the format of log messages) in our app:
https://bibwild.wordpress.com/2021/08/04/logging-uri-query-params-with-lograge/
Recipe for downloading all of a day's logs:
THE_DATE=$1 # formatted like '2023-12-21'
TOKEN="abc123" # get this from https://papertrailapp.com/account/profile
URL='https://papertrailapp.com/api/v1/archives'
for HOUR in {00..23}; do
DATE_AND_HOUR=$THE_DATE-$HOUR
curl --no-include \
-o $DATE_AND_HOUR.tsv.gz \
-L \
-H "X-Papertrail-Token: $TOKEN" \
$URL/$DATE_AND_HOUR/download;
done
# Remove files that aren't really compressed logs
rm `file * | grep XML | grep -o '.*.gz'`
# uncompress all the logs
gunzip *.gz
To separate logs into router and non-router files, resulting in smaller and more readable files:
mkdir router
mkdir nonrouter
ls *.tsv | gawk '{ print "grep -v \"heroku/router\" " $1 " > nonrouter/" $1 }' | bash
ls *.tsv | gawk '{ print "grep \"heroku/router\" " $1 " > router/" $1 }' | bash
History
We started out with the "Forsta" plan (~4.2¢/hour, max of $30 a month; 250MB max).
In late 2023 and early 2024, we noticed an increase in both the rate and the volume of our logging, resulting in both:
A) L10 error messages (sent when Heroku’s log router, Logplex, can’t keep up with a burst of logging and starts to drop messages without sending them to Papertrail.)
B) Days on which the total storage needed for the day’s accumulated error messages exceeded our 250MB Papertrail plan’s size limit. (Note that Heroku add-on usage resets daily at midnight (UTC) which is early evening EST, so the notion of a “day” can be confusing here).
A) and B) don’t always co-occur: high rates per second cause the first, while large storage requirements cause the second.
On Jan 10th we decided to try the "Volmar" plan (~9¢/hour; max of $65 a month; 550MB max) for a couple months, to see if this would ameliorate our increasingly frequent problems with running out of room in the Papertrail log limits. It’s important to note that the $65 plan, based on our current understanding, will not fix the L10 errors, but will likely give us more headroom on days when we get a lot of traffic spread out over the entire day.
After switching to 550MB max log plan
Since switching to the new high-capacity plan on Jan 10th we had:
only one new instance of L10 messages (see A above), on March 20th at 3:55 am.
no instances of running over the size limit (see B above).
Avenues for further research
Confirm that the L10 warnings are caused by a surge in bot traffic, rather than a bug in our code or in someone else’s code. Several clues so far point to bots as the culprit.
If so, this is a good argument for putting cloudflare or an equivalent in front of our app, which would screen out misbehaving bots.
Consider logging fewer bytes: either by making some or all log lines more concise, or by asking Papertrail to drop certain lines that we’re not really interested in:
some postgresql messages?
do we really need to log all status 200 messages? (Probably.)
As a last resort, we could also decide not to log heroku/router messages (typically 40-60% of our messages), although those can be really helpful in the event of a catastrophe.