...

The Postgres database is backed up to S3 (the backup job is currently unscheduled; fix this). A scheduling sketch follows below.
The original binary files are replicated via S3 replication to a second location in US-WEST rather than US-EAST in case of outages. When we actually switch over, these will also be backed up to local on-site storage.
The derivative files will also be replicated via S3 replication to a US-WEST location. They can also be regenerated by the application, though this takes hours.
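As a starting point for scheduling the Postgres backup, something along these lines could be run nightly from cron. This is only a sketch: the script path, database name, and bucket name are assumptions, not our actual configuration.

    # crontab entry (assumed schedule): run the backup script nightly at 3am
    0 3 * * * /usr/local/bin/backup_postgres_to_s3.sh

    #!/bin/bash
    # Hypothetical backup script; the database and bucket names are placeholders.
    set -euo pipefail

    DB_NAME="kithe_production"               # assumed database name
    BUCKET="s3://example-postgres-backups"   # assumed backup bucket
    STAMP=$(date +%Y-%m-%d)
    DUMP="/tmp/${DB_NAME}-${STAMP}.dump"

    # Dump the database in custom format, copy it to S3, then clean up.
    pg_dump --format=custom --file="$DUMP" "$DB_NAME"
    aws s3 cp "$DUMP" "${BUCKET}/${DB_NAME}-${STAMP}.dump"
    rm -f "$DUMP"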

Recovery levels:

When we look at recovery, it is useful to distinguish between full and partial recoveries. The following classifications may prove helpful:

  • Raw data only: We have access only to the raw data. This level applies only in the event of a massive outage that destroys all of AWS.
  • Partial public, no staff recovery: The public has access to limited functionality, but features like derived images may not be fully restored. Staff can access the public functions but cannot do additional work.
  • Partial public, partial staff recovery: The public has access to limited functionality as above; staff have access to a limited set of functions, but certain functions may not be available.
  • Full public, no staff recovery: The public can use the site normally; no staff-specific functions can be performed.
  • Full public, partial staff recovery: The public can use the site normally; staff have access to a limited set of functions, but certain functions may not be available.
  • Full public, full staff recovery: A total recovery.

Generally speaking, public recovery is the higher initial priority since it impacts more people, contacting staff about outages is easy, and public outages affect the perception of the institute.

Public recovery requires the following data:

  • Postgres database
  • Original binaries for downloading TIFFs
  • Derivative files

If we have an AWS outage affecting our region, the fastest recovery option is to (if needed) rebuild the servers in another region and edit the local_env.yml file, either through Ansible or by hand, to point to the backup S3 buckets for the original and derivative files. The Postgres database will need to be downloaded from S3 and installed onto the new kithe machine, if there is one. After that point all current public-facing aspects will be restored. Since the backup buckets do not sync back to the original data buckets, staff should not upload new files, though they can edit metadata on existing works. Once the original S3 buckets have had service restored or their data copied back, set the application to use the original buckets again in local_env.yml, and users can resume adding items. A sketch of these steps follows below.
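A rough sketch of the failover steps, assuming a custom-format pg_dump file in the backup bucket; the bucket names, local_env.yml keys, and file names here are illustrative assumptions, not our actual settings:

    # 1. Point the application at the backup buckets by editing local_env.yml
    #    (by hand or via the Ansible variables that template it), for example:
    #      original_bucket: example-originals-backup-us-west
    #      derivatives_bucket: example-derivatives-backup-us-west

    # 2. Fetch the most recent database dump and restore it on the new kithe machine.
    aws s3 cp s3://example-postgres-backups/kithe_production-latest.dump /tmp/restore.dump
    pg_restore --clean --no-owner --dbname=kithe_production /tmp/restore.dump

    # 3. Restart the application so it picks up the new bucket configuration.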

In the case of smaller issues, like single-file corruption or deletion, the simplest method for original files is to locate them on S3 and look at previous versions. We keep 30 days' worth of versions, so if the error was found within a month you should be able to revert to an earlier file (a sketch follows below). For derivatives it is easier to simply regenerate them via the command line.
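For a single-file restore from S3 versioning, something like the following AWS CLI calls would work; the bucket name and object key are placeholders:

    # List the stored versions of the affected object to find the last good one.
    aws s3api list-object-versions \
        --bucket example-originals-bucket \
        --prefix path/to/original.tiff

    # Copy the chosen earlier version back over the current object.
    aws s3api copy-object \
        --bucket example-originals-bucket \
        --key path/to/original.tiff \
        --copy-source "example-originals-bucket/path/to/original.tiff?versionId=VERSION_ID_FROM_LISTING"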