
Better known as: Backups without Ansible.

We are considering getting rid of our Ansible-managed servers as we move to Heroku. This means that we can no longer rely on certain backup mechanisms that were performed by cron jobs installed by Ansible. Backups and Recovery contains a summary of our pre-Heroku backup infrastructure.

Original files and derivatives

These are stored in S3, are backed up within S3 by an AWS-managed process, and are pulled from S3 to Dubnium. No Ansible cron jobs are used in this workflow, so there is no need to make any changes to our existing setup.

Database backups

A script on the production server, /home/ubuntu/bin/postgres-backup.sh, performed the following tasks nightly:

  • pg_dump the production database to /backups/pgsql-backup/.

  • aws s3 sync the contents of that directory to s3://chf-hydra-backup/PGSql.

The above script will need to be discarded.
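For reference, the discarded script amounted to something like the following. This is a sketch: the backup directory and S3 bucket come from the description above, but the database name, dump format, and filename are assumptions.

```shell
#!/bin/bash
# Sketch of the old nightly Ansible-installed backup script.
# Assumes pg_dump and the AWS CLI are installed and credentialed.
set -euo pipefail

BACKUP_DIR=/backups/pgsql-backup

# Dump the production database (database name and format are assumptions).
pg_dump --format=custom --file="$BACKUP_DIR/production.dump" production_db

# Mirror the backup directory to the S3 bucket named above.
aws s3 sync "$BACKUP_DIR" s3://chf-hydra-backup/PGSql
```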

A second cronjob running on our dubnium backup server then copies the S3 file to a local network storage mount. This then gets backed up to tape.

Heroku database backup commands

It’s easy to set up a regular database backup in Heroku, as follows:

heroku pg:backups:schedule DATABASE_URL --at '02:00 America/New_York'

You can check the metadata on the latest backups as follows:

$ heroku pg:backups
=== Backups
ID Created at Status Size Database
──── ───────────────────────── ─────────────────────────────────── ─────── ────────
a006 2020-12-14 07:30:18 +0000 Completed 2020-12-14 07:30:56 +0000 64.37MB DATABASE

heroku pg:backups:download a006 will produce a database dump in Postgres’s custom (binary) format. It can easily be converted to a garden-variety plain SQL file as follows: pg_restore -f mydatabase.sql latest.dump.

More simply, you can run curl -o latest.dump "$(heroku pg:backups:url)" to get the latest dump.

Heroku retains daily backups for 7 days, and weekly backups for 4 weeks.

Options

a) Rake task: Replace the Ansible-managed script postgres-backup.sh with a rake task run regularly on a one-off Heroku dyno. This would obtain the URL of the latest database backup and push the dump up to S3, where it can wait to be harvested by the Dubnium script.

Pro: minimal change from our existing workflow; easy to check on by confirming the date of the latest file in the appropriate S3 bucket.

Con: requires a part of our code to have S3 credentials that allow it to write to our backup directory; requires the Heroku CLI to be accessible to the rake task (so it can obtain the URL of the latest dump).
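In outline, the rake task under option (a) would shell out to something like the following. This is a sketch: the bucket path matches the existing workflow above, but the destination filename is an assumption.

```shell
#!/bin/bash
# Sketch of what the option (a) rake task would execute on a one-off dyno.
# Assumes the Heroku CLI and AWS CLI are both available and credentialed.
set -euo pipefail

# Fetch the latest scheduled Heroku backup.
curl -o latest.dump "$(heroku pg:backups:url)"

# Push it to the same S3 location the old script used,
# so the Dubnium harvest script needs no changes.
aws s3 cp latest.dump s3://chf-hydra-backup/PGSql/latest.dump
```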

b) cron job on Dubnium: Dispense with the S3 portion of the workflow entirely, and set up the cron job on Dubnium to obtain its database backup directly from Heroku.

Pro:

  • simpler;

  • does not require the scihist_digicoll code to know anything about the backup S3 setup, and is thus safer.

Con:

  • assumes we trust the Heroku database backup workflow;

  • less transparent: it’s more legwork to log into Dubnium and check that the database backed up there is current (Dubnium is only accessible by logging into Citrix Workspace);

  • Dubnium is not managed by Ansible, and needs to be updated manually;

  • one less copy: instead of having copies on the database server, in S3, on Dubnium, and on tape, we would only have copies in Heroku, on Dubnium, and on tape.
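Under option (b), the Dubnium cron job could look roughly like this. This is a sketch: the network-storage mount point is a hypothetical placeholder.

```shell
#!/bin/bash
# Sketch of an option (b) cron job running on Dubnium.
# Assumes the Heroku CLI is installed and authenticated on Dubnium.
set -euo pipefail

# Hypothetical local network-storage mount that gets backed up to tape.
DEST=/mnt/network-backup/pgsql

# Pull the latest Heroku backup straight to local storage,
# bypassing S3 entirely.
curl -o "$DEST/latest.dump" "$(heroku pg:backups:url)"
```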
