Backups of the Digital Collections
Overview
The Science History Institute's Digital Collections offer highlights from our library, archives, and museum collections. The purpose of our Digital Collections is to manage, preserve, and provide access to our digital assets all in one location. Although the Digital Collections include only a small portion of the Science History Institute’s entire collection, new material is added every day. (See our About and FAQ pages for more details.)
The Digital Collections consist of:
a set of digital representations, intended for a Web audience, of physical objects, which range from museum objects of all descriptions to books to taped audio interviews to VHS tapes. (For a good idea of the range of materials, go to our search results and limit your search by genre, format, or medium.) In the description below, when we talk about “original files” we are talking about these digital representations, which take the form of computer files. We store the original files in Amazon S3.
descriptions of the files above, which allow us to find them, keep them in order, search them, and describe them to the public. We store the descriptions in a PostgreSQL database hosted and managed by Heroku.
Backup summary
We store the backups for the original files in a separate S3 bucket that automatically mirrors the contents of the originals. On a nightly basis, these are copied to a local server, and our IT staff is responsible for making regular copies of these backups to local disk and then storing a series of tape copies of them offsite.
Heroku offers a service allowing us to “roll back” or revert the database to its state at any point in the past four days. In addition, we store nightly database backups in a dedicated S3 bucket. Our IT staff also makes nightly copies of the S3 bucket to local disk. From there, the backups join those of the original files in offsite tape storage.
Backup details
The details below are intended for an internal, technical Science History Institute audience, and discuss how we back up the “original files” (hereinafter “originals”) and the PostgreSQL database (hereinafter “database”).
Original files are stored in S3, and are backed up within S3 by a process managed by AWS. The backups are then copied to long-term storage by SyncBackPro, which is Windows software running on Promethium managed by Chuck and Ponce (see SyncBackPro backup software for Windows).
See more at Digital Collections S3 Bucket Setup and Architecture and https://sciencehistory.atlassian.net/wiki/pages/createpage.action?spaceKey=HDCSD&title=Backups%20and%20Recovery%20%28Historical%20notes%29
Heroku database backups
We have three backup/restore mechanisms under Heroku:
1. Nightly .dump backups
We use heroku’s built-in postgres backup functionality to make regular backups that are stored in heroku’s system. This is the most convenient backup to restore from, when it is available and meets your needs.
These backups end up stored in postgres -Fc or “dump” format, which postgres says is a compact, fast, and flexible recommended format for postgres backups; but it is not human-readable and may be less portable between postgres versions.
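For reference, both formats come from stock postgres tooling. A minimal sketch (the database name and file names here are placeholders, not anything from our infrastructure):

```shell
# Custom ("dump", -Fc) format: compact and compressed, readable only via pg_restore
pg_dump -Fc -d mydb -f backup.dump
pg_restore --list backup.dump   # inspect the archive's contents without restoring it

# Plain format: human-readable SQL you can feed straight to psql
pg_dump -d mydb -f backup.sql
```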
To verify that we have scheduled backups, run
heroku pg:backups:schedules --app scihist-digicoll-production
and confirm that we have a 2AM backup every night. List what backups exist by running
heroku pg:backups -a scihist-digicoll-production
Note the first section is “backups” (which may scroll off screen), and the first column is a backup ID, such as a189.
With the backup ID, you can restore production to a past backup (e.g. id a189) with
heroku pg:backups:restore a189 -a scihist-digicoll-production
Warning: this will overwrite current production data, with the restored backup!
Warning: see note below re: --extensions.
Maybe instead you want to restore a production backup to staging, to just look at the data, without actually (yet?) restoring to and overwriting current production? You can do this too:
heroku pg:backups:restore scihist-digicoll-production::a189 -a scihist-digicoll-staging
Warning: the above command may fail if the database you are restoring from has extensions installed in the public schema (due to some changes in how Heroku works with extensions). There is a workaround: using the --extensions flag as in the example below allows you to restore from a database that has extensions in public (like the production DB before Sept 2022):
heroku pg:backups:restore scihist-digicoll-production::a661 DATABASE_URL \
--extensions 'public.pg_stat_statements,public.pgcrypto' \
--app scihist-digicoll-staging
To find out what extensions are installed and in what schemas, just execute \dx at the psql prompt.
For our standard-0 heroku postgres plan, heroku will keep seven daily backups, and four weeks of one-per-week backups.
You can also download heroku backups to store them in your own location, and then load your local copies into heroku. See Heroku docs for more info.
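For example, to pull a copy of backup a189 (the ID used in the examples above) out of heroku's system with the standard heroku CLI subcommands:

```shell
# Download the backup archive to a local file (writes latest.dump by default)
heroku pg:backups:download a189 -a scihist-digicoll-production

# Or just print a signed URL you can fetch and store wherever you like
heroku pg:backups:url a189 -a scihist-digicoll-production
```

Check heroku pg:backups --help if your CLI version differs.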
2. Preservation (logical) backups to s3
We don’t want to rely solely on backups stored inside heroku’s system. We also would like a postgres backup in the more human-readable and transportable plain .sql format, instead of the postgres -Fc .dump format.
We have our own rake task, rake scihist:copy_database_to_s3, which we also run nightly via the heroku scheduler. This task connects to heroku postgres to make a human-readable .sql dump, then uploads it to our s3 chf-hydra-backup bucket, where SyncBackPro then syncs it to a local network storage mount (/media/SciHist_Digicoll_Backup), and from there to our tape backups. (SyncBackPro is managed by Chuck and Ponce.)
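In outline, the task does something like the following. This is a hedged shell sketch of the moving parts, not the rake task itself; DATABASE_URL, the pg_dump flags, and the exact object key are assumptions:

```shell
# Plain-format (human-readable) SQL dump of the heroku postgres database,
# compressed before upload. DATABASE_URL is assumed to be set in the environment.
pg_dump --no-owner --no-acl "$DATABASE_URL" | gzip > heroku-scihist-digicoll-backup.sql.gz

# Upload to the backup bucket; SyncBackPro mirrors it onward from there
aws s3 cp heroku-scihist-digicoll-backup.sql.gz \
  s3://chf-hydra-backup/PGSql/heroku-scihist-digicoll-backup.sql.gz
```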
You can log into the heroku scheduler add-on via the Heroku “Resources” tab to verify the copy_database_to_s3 task is scheduled nightly.
Given the size of the database in late 2020, the entire job (with the overhead of starting up and tearing down the dyno) takes a bit under a minute. However, if our database grows an order of magnitude larger and becomes slower to dump and transfer to S3, we may have to reconsider this approach.
The more portable .sql format, stored and backed up outside of heroku, is motivated primarily by preservation purposes, but it can also serve as a last-ditch or alternative disaster recovery. It can be restored to heroku using the heroku pg:psql command, which runs arbitrary psql commands on the heroku postgres database.
Restoring from a logical (.sql) database dump
In the unlikely event you have to restore from a logical backup:
Go to https://s3.console.aws.amazon.com/s3/buckets/chf-hydra-backup?prefix=PGSql%2F&region=us-west-2
Download the database file you want (note the “versions” tab if you want a past version still on S3)
Uncompress it from the .gz format. On a unix or macOS command line, that’s
gzip -d heroku-scihist-digicoll-backup.sql.gz
Load it into heroku database:
heroku pg:psql --app scihist-digicoll-production < heroku-scihist-digicoll-backup.sql
Note: This will overwrite your database, and won’t warn or prompt you about that fact first! It will run in your terminal and take a bit of time.
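Before loading a dump, a quick integrity check is cheap: gzip -t verifies the archive, and gzip -dc lets you peek at the first SQL statements. The stand-in file created below just lets the commands run anywhere; with a real backup, skip that first line:

```shell
# Stand-in dump so these commands can be exercised anywhere; skip with a real backup
printf 'CREATE TABLE works (id serial);\n' | gzip > heroku-scihist-digicoll-backup.sql.gz

gzip -t heroku-scihist-digicoll-backup.sql.gz                # exits non-zero if the archive is corrupt
gzip -dc heroku-scihist-digicoll-backup.sql.gz | head -n 5   # peek at the first statements
```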
3. Heroku postgres “rollback”
Heroku can roll back the postgres database to an arbitrary moment in time, based on postgres log files. For our current postgres standard-0 plan, four days of logs are kept. See Heroku Postgres Rollback | Heroku Dev Center, and the section “Common Use Case: Recovery After Critical Data Loss”.
This is a somewhat more complicated process, and requires some more care to get right, but it is very powerful to be able to go back to any moment in time in the last 4 days!
Doing this requires creating a new postgres “rollback” database, switching the app to use it, then deleting the old, no-longer-used database. From a terminal with the heroku CLI:
heroku addons:create heroku-postgresql:standard-0 --rollback DATABASE_URL --to '2021-06-02 20:20 America/New_York' --app scihist-digicoll-production
The site remains up. The new database’s name will be printed to the terminal, and you can see it in the Resources section of the Heroku admin. It might be something like
postgresql-curly-07169
It might take a few minutes or more for the newly restored database to be ready; you can follow the instructions the command gives you to check progress, such as
heroku pg:wait
Once the rollback database – which has been restored to a past moment in time – is ready, you can switch the app to use that new restored database by using the database name:
heroku pg:promote postgresql-curly-07169 --app scihist-digicoll-production
Make sure you have successfully fixed the problem.
Once all is well, don’t forget to get rid of the extra database(s) you are no longer using. Consider leaving this step for the next day; it will only cost a couple dollars over 24 hours.
How do you know which db is the “old” one? Run
heroku addons
to see all your heroku-postgresql databases; the one currently used by the app is marked as DATABASE. So the other one is the old, no-longer-used one, which also has an AS name. To remove it, run e.g.
heroku addons:destroy HEROKU_POSTGRESQL_YELLOW --app scihist-digicoll-production
Be careful you are removing the correct one!
NOTE: Is it possible to rollback to a past production snapshot, but do it in the staging app first, to see what it looks like without touching production? We need to look into that, it could be a safer way to do it.
Historical notes
Prior to moving off our Ansible-managed servers, our backups were performed by cron jobs installed by Ansible. https://sciencehistory.atlassian.net/wiki/pages/createpage.action?spaceKey=HDCSD&title=Backups%20and%20Recovery%20%28Historical%20notes%29 contains a summary of our pre-Heroku backup infrastructure.
A script on the production server, home/ubuntu/bin/postgres-backup.sh, used to perform the following tasks nightly:
pg_dump the production database to /backups/pgsql-backup/.
aws s3 sync the contents of that directory to s3://chf-hydra-backup/PGSql.