Digital Collections S3 Bucket Setup and Architecture

The Digital Collections app divides its content across a fairly large number of S3 buckets. Different content has different needs with regard to lifecycle management, storage class, and permissions, and we use separate buckets to keep those concerns separate. Separate buckets are also helpful for AWS cost reporting. The exact division of buckets we currently have may not be optimal; we were learning as we went. It might be nice to combine some buckets, but copying/moving lots of S3 keys can be cumbersome and expensive.

Digital Collections app S3 buckets are controlled by terraform infrastructure-as-code, with the configuration in this repository: https://github.com/sciencehistory/terraform_scihist_digicoll/

The terraform configuration currently controls the buckets and all of their configuration, including replication, but does not (as of this writing) control the IAM roles/policies related to bucket access. But see https://github.com/sciencehistory/terraform_scihist_digicoll/issues/6

Do not make manual changes to S3 configuration controlled by terraform, except for a test/spike that will be immediately reflected back in the terraform config. If the terraform gets out of sync with what's actually deployed, it winds up a mess.

Buckets

The terraform configuration should be considered the ultimate authority on our S3 buckets in any conflict with this documentation. But here is an overview of all of our buckets.


scihist-digicoll-production-originals

Contains all original files (with some exceptions – e.g. video originals, which are in their own bucket, below). Fronted by a restricted CloudFront distribution requiring signed URLs.

=> scihist-digicoll-production-originals-backup

A second copy of originals in another AWS region, kept synchronized with AWS replication rules

scihist-digicoll-production-originals-video

We keep video original files in a separate bucket for tracking purposes. Fronted by a restricted CloudFront distribution requiring signed URLs.

=> scihist-digicoll-production-originals-video-backups

Similar to originals-backup, but for the separate videos bucket.

scihist-digicoll-production-derivatives

Standard location for derivative files – contains most derivative files, but derivatives marked "restricted" are kept in a separate prefix in the originals bucket. (For most derivatives, we want public URLs so we can use cacheable URLs for thumbnails and such.) Fronted by an unrestricted CloudFront distro.

=> scihist-digicoll-production-derivatives-backup

A backup of derivative files in another region for quicker recovery. Kept synchronized by AWS replication rules. 

scihist-digicoll-production-ondemand-derivatives

a "cache" location for our "large" derivatives that are created on-demand in background jobs (multi-image PDFs and zips). It has a lifecycle rule that deletes old things, and functions as a "cache". Could probably be put somewhere else.  fronted by unrestricted cloudfront distro.

scihist-digicoll-production-dzi

"DZI" (Deep Zoom) tiles used for pan-and-zoom. fronted by unrestricted cloudfront distro. 
=> scihist-digicoll-production-dzi-backupA backup of dzi files in another region for quicker recovery. Kept synchronized by AWS replication rules.
scihist-digicoll-production-ingest-mountBucket we use for mounting on Windows desktops, which the app "choose from cloud" function (powered by browse-everything gem) then lets you choose from for ingest.  not user-facing. 
scihist-digicoll-production-uploadsUsed as shrine "cache" location – if someone does upload a file through the browser, it goes here first, as a sort of temporary holding location until shrine "promotes" it to the/an originals bucket.  not user-facing. 
scihist-digicoll-production-public

A newer bucket we created with the intention of holding multiple things intended to be public. It doesn't hold much at present.

A folder called maintenance_page contains a page used when the system is planned to be down.

A folder called IT_Department contains an SHI logo file that is used by the Teams desktop app.


The staging environment has all(?) of these same buckets, with the word production replaced with staging in the bucket name – except staging doesn't have any -backup buckets.
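For illustration, the "cache" behavior on the ondemand-derivatives bucket mentioned above is just an S3 lifecycle rule that expires objects after a while. A minimal sketch of such a rule, with hypothetical resource names and a made-up retention period (the real rule is in the terraform repo):

```hcl
# Sketch only – actual configuration lives in terraform_scihist_digicoll.
resource "aws_s3_bucket_lifecycle_configuration" "ondemand_derivatives" {
  bucket = aws_s3_bucket.ondemand_derivatives.id

  rule {
    id     = "expire-stale-ondemand-derivatives"
    status = "Enabled"
    filter {}  # apply to every object in the bucket

    expiration {
      days = 30  # hypothetical: delete cached PDFs/zips after 30 days
    }
  }
}
```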

Cloudfront CDN

We send files to users directly from AWS, to avoid the extra traffic of having our app proxy them. For major high-use public-facing buckets, we put an AWS CloudFront CDN distribution in front of the S3 bucket – for performance, in some cases to save money, and so we can put AWS WAF in front in the future if needed. We use separate CloudFront distributions per bucket because of differing config needs in some cases, and for ease of cost tracking.

Putting Cloudfront in front of S3 should be a simple use case, but there are some tricks. 

  • We want the CloudFront distribution to forward `response-content-disposition` query params so we can dynamically request Content-Disposition response headers from S3. Some notes on setting that up properly (with proper cache keys!) and other setup tasks are in this blog post from jrochkind. (And in our rails app, we created a custom Shrine storage sub-class to generate the urls we want.) A sketch of the relevant cache policy appears after this list.
  • We need CORS http response headers for some assets used from Javascript (DZI, video, etc), which we ensure are set by using a CloudFront response headers policy.
  • We have our S3 buckets closed to public access to ensure all traffic goes through CloudFront, and set up CloudFront for auth'd access to our buckets. 
  • Note CloudFront will generally cache internally for the duration the S3 http response cache-control headers say to – which requires us to set cache-control metadata on S3 objects individually (which we try to do on ingest/upload), typically to one year. If S3 does not return cache-control response headers for an object, CloudFront will use its "default TTL", which we generally have set at 1 week.
  • We have both CloudFront and S3 saving access logs to the chf-logs bucket under cloudfront_access_logs/ and s3_access_logs/ prefixes, with a bucket lifecycle policy that keeps logs only for a limited time and cleans out old ones.
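As a rough sketch of the cache-key and CORS pieces above – resource names, whitelisted params, TTLs, and CORS values here are illustrative, not the authoritative config (that's in the terraform repo):

```hcl
# Sketch only – see terraform_scihist_digicoll for the real configuration.

# Cache policy: include the S3 response-content-disposition override param
# in the cache key, so it is forwarded to S3 and cached correctly.
resource "aws_cloudfront_cache_policy" "derivatives" {
  name        = "derivatives-cache-policy"  # hypothetical name
  min_ttl     = 0
  default_ttl = 604800    # 1 week default TTL when S3 returns no cache-control
  max_ttl     = 31536000  # 1 year

  parameters_in_cache_key_and_forwarded_to_origin {
    cookies_config { cookie_behavior = "none" }
    headers_config { header_behavior = "none" }
    query_strings_config {
      query_string_behavior = "whitelist"
      query_strings {
        items = ["response-content-disposition", "response-content-type"]
      }
    }
    enable_accept_encoding_gzip   = true
    enable_accept_encoding_brotli = true
  }
}

# Response headers policy: add CORS headers for assets used from Javascript.
resource "aws_cloudfront_response_headers_policy" "cors" {
  name = "allow-cors"  # hypothetical name

  cors_config {
    access_control_allow_credentials = false
    access_control_allow_origins { items = ["*"] }
    access_control_allow_headers { items = ["*"] }
    access_control_allow_methods { items = ["GET", "HEAD"] }
    origin_override = true
  }
}
```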

Cloudfront public/private key

Some buckets contain public-to-everyone content so have unrestricted Cloudfront distributions in front. 

But for certain non-public-to-everyone content (mainly "originals") we only want to allow expiring-signed-url access, and we need to set up our app to use CloudFront's restricted-access URL signing protocol (which is different from S3's).

This uses an RSA private/public keypair, with the public key set up in AWS through terraform. But the private key is not in AWS or terraform. We keep it in our team 1password account, items named scihist-digicoll-production_private_key.pem and scihist-digicoll-staging_private_key.pem

  • The app config CLOUDFRONT_KEY_PAIR_ID needs to be set to the AWS key pair id, which looks like eg `K2JCJMDEHXQW5F`
  • The app config CLOUDFRONT_PRIVATE_KEY needs to be set to the private key in PKCS#8 format; it should begin with -----BEGIN PRIVATE KEY----- (NOT OPENSSH PRIVATE KEY – that's the wrong format!)
  • The public key is in the terraform config/repo (and visible in the AWS console) – see the sketch after this list. The 1Password export of the public key may not be in the right format – it should also be PKCS#8 and start with BEGIN PUBLIC KEY – you can use the openssl command line to convert it if needed.
  • You can generate a new private key with `openssl genrsa -out private_key.pem 2048` and then extract the public key from it with `openssl rsa -pubout -in private_key.pem -out public_key.pem`
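On the terraform side, the public key and key group look roughly like the following sketch (hypothetical resource names; the real config is in the terraform repo). The id of the aws_cloudfront_public_key resource is what goes in CLOUDFRONT_KEY_PAIR_ID.

```hcl
# Sketch only – see terraform_scihist_digicoll for the real configuration.

# Public key (PKCS#8 "BEGIN PUBLIC KEY" format) registered with CloudFront.
resource "aws_cloudfront_public_key" "signing" {
  name        = "scihist-digicoll-production-signing-key"  # hypothetical name
  encoded_key = file("${path.module}/cloudfront_public_key.pem")
}

# Key group that the restricted distribution trusts for signed URLs.
resource "aws_cloudfront_key_group" "signing" {
  name  = "scihist-digicoll-production-signing-key-group"  # hypothetical name
  items = [aws_cloudfront_public_key.signing.id]
}

# In the restricted distribution's cache behavior, require signed URLs
# from that key group:
#
#   default_cache_behavior {
#     ...
#     trusted_key_groups = [aws_cloudfront_key_group.signing.id]
#   }
```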

Terraform

S3 buckets and CloudFront distros are configured via terraform, and app config variables related to them are generally available from terraform output – including CLOUDFRONT_KEY_PAIR_ID, and various S3_BUCKET_*_HOST values with CloudFront domain names.
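For example, such outputs look something like the following sketch (output and resource names here are hypothetical – check the repo for the real ones):

```hcl
# Sketch of terraform outputs consumed as app config.
output "CLOUDFRONT_KEY_PAIR_ID" {
  value = aws_cloudfront_public_key.signing.id
}

output "S3_BUCKET_DERIVATIVES_HOST" {
  value = aws_cloudfront_distribution.derivatives.domain_name
}
```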

CLOUDFRONT_PRIVATE_KEY is not available from terraform; if you need it (to re-set it in heroku config), you need to get it from 1Password – see above.




Backup Buckets – motivation/use

There are many different hypothetical uses for a second copy of data. Some of them may have different and even conflicting requirements. We haven't necessarily fully tested and fully spec'd out our backups for which uses they might be suitable – more work could be done here. Some original historical thoughts can be found at Backups and Recovery (Historical notes).

Some notes on our backups:

  • Our backup buckets are populated by AWS replication rules that should automatically keep them sync'd (see the sketch after this list)
  • Our backup buckets are intentionally in a separate AWS region, for AWS best practices. (One region may go down). But this does increase costs of bandwidth for making copies or for recovery. 
  • The backup buckets have versioning turned on, but (I think?) only keep old versions for 30 days
  • The originals bucket(s) content is irreplaceable, so the backup copy may be especially important. (Note the separate videos originals backup.) But derivatives and DZIs can both be re-created from originals if lost; a backup/redundant copy here can serve to get us back up faster or cheaper than re-creating them.
  • There is an additional copy of some material (only originals?) in an on-site institute backup. 
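A minimal sketch of what a replication rule plus the 30-day old-version expiration might look like in terraform (hypothetical resource names and provider alias; the replication IAM role is shown as a variable since IAM isn't terraform-managed yet; the authoritative config is in the terraform repo):

```hcl
# Sketch only – see terraform_scihist_digicoll for the real configuration.

# Backup bucket in a second AWS region via a provider alias (hypothetical alias).
resource "aws_s3_bucket" "originals_backup" {
  provider = aws.backup_region
  bucket   = "scihist-digicoll-production-originals-backup"
}

# Replication requires versioning enabled on both source and destination buckets.
resource "aws_s3_bucket_replication_configuration" "originals" {
  bucket = aws_s3_bucket.originals.id
  role   = var.replication_role_arn  # IAM role is not managed in this terraform (yet)

  rule {
    id     = "replicate-originals-to-backup"
    status = "Enabled"

    destination {
      bucket = aws_s3_bucket.originals_backup.arn
    }
  }
}

# Keep old (noncurrent) versions on the backup bucket for only 30 days.
resource "aws_s3_bucket_lifecycle_configuration" "originals_backup" {
  provider = aws.backup_region
  bucket   = aws_s3_bucket.originals_backup.id

  rule {
    id     = "expire-old-versions"
    status = "Enabled"
    filter {}

    noncurrent_version_expiration {
      noncurrent_days = 30
    }
  }
}
```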

Possible hypothetical uses for backup/redundancy copies might include:

  • User error that corrupts or deletes data, want to retrieve it from the second location (note our backups may only keep 30 days of history, so would have to be caught within that time frame)
  • data layer corruption at primary copy, want to restore. (Extremely unlikely on S3 architecture, although anything is possible)
  • temporary outage at present location, want to keep application up through outage of unknown length. (Definitely happens on S3 now and then)
  • permanent disappearance of the primary storage location (unlikely but could happen on S3, probably more likely than corruption)
  • meeting professional standards of some kinds for preservation


Possible mechanisms for recovery from S3 backup copies

We haven't really tested most of these. 

For individual file data loss or corruption

Whether from human error or the storage layer, individual files could be copied back over from backup locations. Note that backup locations may only keep old versions for 30 days.

For a temporary outage

We could point a running app at the backup buckets using heroku config variables, and put it in READ-ONLY mode (by locking out staff logins), to keep the app up through temporary outage of our S3 us-east-1 buckets. 

This might incur additional expense because of cross-region bandwidth, if our live app is still running in us-east-1 but is pointed at us-west buckets. Not sure how significant. 

After temporary outage is over, app would be pointed back at normal buckets. 

For a permanent outage or loss

Or it could be human error – accidental deletion of entire buckets or something.

Hypothetically we could create new buckets and copy all the data over, but the cost of cross-region data transfer could be significant. Unless we permanently move our AWS infrastructure to the us-west region!

Additional "local" copies

In addition to the extra copy on S3, we make another set of copies in on-site Science History Institute storage, that is also backed up to tape on-site. 

  • SyncBackPro runs 3 nightly mirror jobs from AWS S3 to the local server Promethium at 6pm
    • scihist-digicoll-production-originals -> D:\Backup Folders\AWS S3 - Digital Collections - Images
    • chf-hydra-backup\PGSql -> D:\Backup Folders\AWS S3 - Digital Collections - SQL
    • chf-hydra-backup\Aspace -> D:\Backup Folders\AWS S3 - ArchivesSpace - SQL
  • Promethium data is backed up to LTO tape daily. 
  • Weekly and monthly tapes are in an off-site rotation.  Annual tapes are kept off-site for 5 years.