...

Our Analysis and Motivations: How we thought through what backups to make, and why

The 2019 version of our digital collection application (scihist_digicoll) has slightly different preservation needs than the old Samvera application it was based on. However, our institutional goals remain the same. While we will be making technical changes to preservation methods, our backup and recovery goals are still linked to the following key points, in order of importance:

  1. Provide backups of all original ingested files so they can be recovered in case of natural disaster, data corruption, user error, or malicious computer activity.

  2. Allow the Institute to recover from data loss or outages in a reasonable amount of time.

  3. Adhere to OAIS model rules when possible.

The following classifications of recovery may prove helpful:

  • Partial public, partial staff recovery: The public has access to limited functionality, staff has access to a limited set of functions but certain functions may not be used. This is considered an incomplete recovery.

  • Full public, partial staff recovery: The public can use the site normally, staff has access to a limited set of functions but certain functions may not be used. This is also an incomplete recovery, but time sensitivity is reduced as public users are not affected. Staff recovery times should be minimized, but public use takes priority.

  • Full public, full staff recovery: A total recovery.

Then there are levels of disaster/data loss:

  • Minimal Data Inaccessibility: A small subset of data is temporarily inaccessible; general site functionality is unaffected.

  • Minimal Data Loss: In this case a small subset of data is lost; general site functionality is unaffected.

  • Major Data Inaccessibility: Data is not lost permanently, but our ability to access the data is compromised. In this case general site functionality is affected.

  • Major Data Loss: A large amount of data is lost, and general site functionality is affected.

Generally speaking, public recovery is the higher initial priority: it affects more people, contacting staff about outages is easy, and public outages affect the perception of the Institute.

...

Our data can be broken into two categories. The first is data that is potentially irrecoverable: our original binary files (images, audio, other) and the metadata about them (a SQL database). The second is data that can be regenerated and is needed for normal site operation, but takes significant time to restore, such as the derived download files and viewer files. This second set of data may be worth backing up to shorten recovery times for public users when data is lost.

As an estimate, holding extra backups for our current scihist_digicoll staging environment costs less than $2 a month, out of a total of $70.07 spent on data storage and transfer. While a production environment will have slightly higher cost ratios, they should not be massively higher. Thus, by spending an additional 2-3% on S3, we can mount a full public recovery in an afternoon from a massive failure of our entire infrastructure. While we are not currently backing up our viewer tiles, an examination of our old application shows that the production cost averages around $5 for storage. Adding a second copy of the viewer files should roughly double that cost, with a slight reduction for less use, so it will add about another $5. For about $7-12 a month, then, we can be widely covered for data inaccessibility or other failures of S3 in a specific region. While it is hard to get specific details, there have been multiple outages or issues in a region that lasted over an hour, and at least one major outage in the last two years lasting around 6 hours. Assuming about 8 hours of problems every two years, we can estimate a rough cost of up to $36 per hour of outage avoided ($12 a month over 24 months, divided by 8 hours). Shorter outages may not be worth the difficulty of switching over.
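The arithmetic behind these estimates can be sketched as follows. This is only a rough illustration using the figures quoted in the paragraph above; the function and variable names are hypothetical, and the 24-month window and 8 hours of outage are the same assumptions made in the text.

```python
# Rough cost arithmetic using the figures quoted in the text above.
# All names here are illustrative, not part of the application.

STAGING_BACKUP_COST = 2.00   # upper bound on monthly backup cost (staging), USD
STAGING_TOTAL_COST = 70.07   # total monthly storage and transfer cost (staging), USD


def backup_share(backup_cost: float, total_cost: float) -> float:
    """Fraction of the monthly storage bill spent on extra backups."""
    return backup_cost / total_cost


def cost_per_outage_hour(monthly_cost: float,
                         months: int = 24,
                         outage_hours: float = 8) -> float:
    """Total spend over the period divided by the outage hours avoided."""
    return monthly_cost * months / outage_hours


share = backup_share(STAGING_BACKUP_COST, STAGING_TOTAL_COST)
low = cost_per_outage_hour(7)    # lower end of the $7-12 monthly range
high = cost_per_outage_hour(12)  # upper end of the $7-12 monthly range

print(f"Backup share of storage bill: {share:.1%}")         # 2.9%
print(f"Cost per outage hour: ${low:.0f} to ${high:.0f}")   # $21 to $36
```

The $36/hour figure in the text corresponds to the upper end of the monthly range; at the lower end the same assumptions give about $21 per hour.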

In cases of small-scale data loss, such as corrupted files or user error, the application will be working fine but a limited set of data will have a problem. In these cases we can locate the problematic data and use a backup copy to restore any damaged original files, or use versioning to restore an earlier version of the file. Derived files can either be regenerated or copied from backups as well. This is the most common expected use case, and it requires only that we keep versions of our original files and backups of files in different locations (another S3 bucket and an on-site copy).

...

Finally, in cases of issues affecting all online storage systems, another copy of the data is held on our in-house storage system so that we can potentially recover data even after a full loss of AWS. This copy holds only the original files; all other aspects will need to be rebuilt, a process that can take days in addition to the time taken to upload the files again. Using the local backup means recovery could take a week or more. This storage is not paid for or managed by our team, as Institute IT handles these systems.

Technical Notes

As of March 2019, the Digital Collections has a small set of data to be handled for recovery.

...