
You might get an email from Heroku that begins:

Your database redis-word-number premium-1 (REDIS on scihist-digicoll-production) must undergo maintenance….

If you do nothing, the maintenance will happen automatically, usually about a week after you receive the email.

We use Redis mainly for holding the queue for our background jobs. When the maintenance happens, there will be a very brief Redis outage, during which the app can’t enqueue background jobs, and workers running background jobs can’t connect to Redis to pick up work. The workers will report an error when this happens, but can (we think) recover and reconnect to Redis when it’s back.

This isn’t a disaster; our app seems to recover from the Redis outage fine. But if you’d like to take more control of the situation and run the maintenance manually, doing so can minimize the chance of anything going wrong with ingest-related background jobs, and keep those errors out of our logs/error monitoring.

(See Heroku developer setup for instructions on setting up the Heroku command line, including the -r production remote configuration.)
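For example, a minimal sketch of pointing a git remote named production at our production app, so that -r production works in the commands below (this assumes the app name scihist-digicoll-production from the maintenance email; your local setup may differ):

  heroku login
  heroku git:remote -a scihist-digicoll-production -r production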

  1. Temporarily disable our HireFire auto-scaling manager, so it will allow our worker count to be scaled down to zero. Just switch off the “enabled” toggle for the worker at https://manager.hirefire.io/

  2. Scale down workers to zero: heroku ps:scale worker=0 -r production

    1. Wait until the worker is actually scaled down; you can check with heroku ps -r production

  3. Disable staff access, so staff can’t trigger ingests whose background jobs might not be able to be enqueued: heroku config:set LOGINS_DISABLED=true -r production

    1. Because we use Heroku preboot, it can take 2-3 minutes to take effect; check to make sure you are really locked out of the staff UI.

  4. Run the maintenance now on production, per Heroku’s instructions, e.g.: heroku redis:maintenance --run REDIS --force -r production

    1. When it’s finished, you should get an email; you can also check on the status with heroku redis:info -r production

  5. Enable staff logins again: heroku config:set LOGINS_DISABLED=false -r production

  6. Scale workers back up to their default, probably 2 (if you get the number wrong, HireFire will fix it): heroku ps:scale worker=2 -r production

  7. Turn the HireFire manager back on at https://manager.hirefire.io/

  8. Because of Heroku preboot, it could take 2-3 minutes for staff logins to be enabled again. Don’t leave until you confirm they are!
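For reference, here is the command-line portion of the steps above collected in one place (the HireFire toggling happens in its web UI and isn’t shown):

  heroku ps:scale worker=0 -r production
  heroku ps -r production                                        (confirm worker count is 0)
  heroku config:set LOGINS_DISABLED=true -r production
  heroku redis:maintenance --run REDIS --force -r production
  heroku redis:info -r production                                (check maintenance status)
  heroku config:set LOGINS_DISABLED=false -r production
  heroku ps:scale worker=2 -r production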

This process should avoid it, but if you wound up with any Redis-related errors in Honeybadger despite yourself, go “resolve” them.

Although we turn off staff logins to avoid ingest background job enqueues while the maintenance happens, some public user actions can still trigger background job enqueues, like asking for an “on-demand derivative”. If someone does this at just the wrong/right moment, there could be an error. This probably won’t happen at our current level of traffic; if we wanted to avoid the chance entirely, we’d have to disable the public-facing app too, which can be done with heroku maintenance:on -r production and heroku maintenance:off -r production.
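If we ever decide we want that, a minimal sketch of wrapping the Redis maintenance in app-wide maintenance mode might look like this (running heroku maintenance with no subcommand just reports the current on/off status):

  heroku maintenance:on -r production
  (run the Redis maintenance steps above)
  heroku maintenance:off -r production
  heroku maintenance -r production                               (confirm maintenance mode is off)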
