...

A quick cost analysis puts restoration at $30-35 as of 6/11/2018, with approximately 1 TB of data. Approximately 66% of the cost was due to inter-region transfer fees (moving data from US-WEST to US-EAST); the rest is standard LIST, GET, and related request fees.
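As a rough sanity check on that estimate, assuming the then-current inter-region transfer rate of roughly $0.02/GB (an assumption, not a figure from our billing), the transfer fee alone comes out in the right ballpark:

```python
# Rough sanity check on the restoration cost estimate.
# Assumption: inter-region S3 transfer priced at ~$0.02/GB (approximate 2018 rate).
data_gb = 1024                 # ~1 TB of data
transfer_rate = 0.02           # $/GB, US-WEST -> US-EAST (assumed)
transfer_cost = data_gb * transfer_rate

total_low, total_high = 30.0, 35.0
share_low = transfer_cost / total_high   # transfer share if total was $35
share_high = transfer_cost / total_low   # transfer share if total was $30

print(f"transfer: ${transfer_cost:.2f}")
print(f"share of total: {share_low:.0%}-{share_high:.0%}")
```

That gives a transfer cost of about $20, or roughly 59-68% of the $30-35 total, consistent with the ~66% figure above.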


Kithe recovery

Kithe currently (March 2019) has a small set of data to be handled for recovery.

  • A postgres database which contains user data and item metadata
  • Original binary files
  • Derivative Files

The first two are the ones that are required; the derivatives are merely backed up because the cost to back them up is low relative to the amount of time saved on a recovery by having a copy ready.

The postgres database is backed up to S3 (currently unscheduled; fix this).
The binary files are replicated via S3 replication to a second location, in US-WEST rather than US-EAST, in case of outages. When we actually switch over, these will also be backed up to local on-site storage.
The derivative files will also be replicated via S3 replication to a US-WEST location. They can also be regenerated by the application, though this takes hours.
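Since the database backup is currently unscheduled, a minimal sketch of what a scheduled dump-to-S3 job could look like follows. The database name, bucket, and paths here are all hypothetical placeholders, not our actual configuration:

```python
# Hedged sketch: a nightly pg_dump uploaded to S3.
# All names below are assumed placeholders, not the real database or bucket.
import datetime

DB_NAME = "digcol_production"        # assumed database name
BUCKET = "s3://example-pg-backups"   # hypothetical backup bucket

stamp = datetime.date.today().isoformat()
key = f"{stamp}-{DB_NAME}.dump"      # date-stamped object name, e.g. 2019-03-01-digcol_production.dump

# The job itself would shell out to pg_dump (custom format) and the AWS CLI:
#   pg_dump -Fc DB_NAME > /tmp/KEY  &&  aws s3 cp /tmp/KEY BUCKET/KEY
print(key)
```

Scheduling (cron or a Rails job runner) and retention pruning of old dumps would still need to be decided.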
Recovery levels: [diagram]

Digital Collection Recovery Overview:


Our new digital collection application (scihist_digcol) has slightly different preservation needs than the current Samvera application; however, our institutional goals remain the same. While we will be making technical changes to preservation methods, our backup and recovery goals are still linked to the following key points, in order of importance:

  1. Provide backups of all original files so they can be recovered in case of natural disaster, data corruption, user error, or malicious computer activity.

  2. Allow the Institute to recover from data loss or outages in a reasonable amount of time.

  3. Adhere to OAIS model rules when possible.

In cases of small scale data loss, such as corrupted files or user error, the application will be working fine but a limited set of data will have a problem. In these cases we can locate the problematic data and use a backup copy to restore any damaged original files or use versioning to restore an earlier version of the file. Derived files can either be regenerated or copied from backups as well. This is the most common expected use case.

In cases of broad AWS failure or regional disasters, it is possible that much or all of the data is rendered unavailable. In these cases we will suffer a loss of service until we can recover the data and provision new servers to run the software. This can be thought of as two recoveries: the first is to get the digital collection site back as soon as possible for the public, and the second is to restore all functionality. Getting the site back for the public is our primary concern, so we have a few methods to speed up recovery. Both the derivative and original files are backed up to an S3 region on the West Coast, in the same form as the files we actually serve. This means that in case of a disaster we can point our application at the backup S3 storage and use it in production. Due to current setups we would not want staff to add new works, but this allows us to rapidly restore public-facing access to our site should the normal data sources be unavailable. A longer process then restores the data back to its original locations while leaving public access up; once the data is restored to its original place, full staff functionality will likewise be restored.
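As a sketch, "pointing the application at the backup S3 storage" could amount to an environment-level configuration change; the variable and bucket names below are hypothetical, not our actual settings:

```
# Hypothetical failover configuration: swap the app's bucket settings from
# the primary (US-EAST) buckets to the replicated US-WEST copies.
S3_ORIGINALS_BUCKET=example-originals-backup-us-west-2
S3_DERIVATIVES_BUCKET=example-derivatives-backup-us-west-2
AWS_REGION=us-west-2
```

Whatever the exact mechanism, keeping the switch to a small, documented configuration change is what makes the rapid public-facing recovery described above feasible.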

In cases of issues affecting all on-line storage systems, another copy of the data is held on our in-house storage system, so that we can potentially recover data even in case of a full loss of AWS. This only holds the original files, and all other aspects will need to be rebuilt, a process that can take days in addition to the time taken to upload the files again. Using the local backup means recovery could take a week or more.

The files which are considered key, the original item files and the postgres database storing their metadata, are the only files which require backup. The other files being saved, derivative and index files, are only saved in order to reduce our downtime during different accidents. When we look at recovery it will be useful to distinguish between full and partial recoveries. The following classifications may prove helpful:

  • Partial public, no staff recovery: The public has access to limited functionality, but features like derived images may not be fully restored; staff can access public functions but cannot do additional work.

  • Partial public, partial staff recovery: The public has access to limited functionality as above; staff has access to a limited set of functions, but certain functions may not be used.

  • Full public, no staff recovery: The public can use the site normally; no staff-specific functions can be done.

  • Full public, partial staff recovery: The public can use the site normally; staff has access to a limited set of functions, but certain functions may not be used.

  • Full public, full staff recovery: A total recovery.

Generally speaking, public recovery is the higher initial priority, since it impacts more people, contacting staff about outages is easy, and public outages affect the perception of the Institute. That is why we’ve taken on the extra cost of storing files that are not strictly needed, to speed up public recovery.

Technical Notes

See the Kithe recovery section above for the current (March 2019) set of data to be handled for recovery and how each part is backed up.

Public recovery requires the following data:

...