Our new digital collection application (scihist_digcol) has slightly different preservation needs than the current Samvera application. However, our institutional goals remain the same. While we will be making technical changes to preservation methods, our backup and recovery goals are still linked to the following key points in order of importance:
Provide backups of all original files so they can be recovered in case of natural disaster, data corruption, user error, or malicious computer activity.
Allow the Institute to recover from data loss or outages in a reasonable amount of time.
Adhere to OAIS model rules when possible.
The following classifications of recovery may prove helpful:
Partial public, partial staff recovery: The public has access to limited functionality and staff have access to only a limited set of functions. This is considered an incomplete recovery.
Full public, partial staff recovery: The public can use the site normally, while staff have access to only a limited set of functions. This is also an incomplete recovery, but time sensitivity is reduced because public users are not impacted. Staff recovery times should still be minimized, but public use takes priority.
Full public, full staff recovery: A total recovery.
Then there are levels of disaster/data loss:
Minimal Data Loss: A small subset of data is lost; general site functionality is unaffected.
Minimal Data Inaccessibility: A small subset of data is temporarily inaccessible; general site functionality is unaffected.
Major Data Inaccessibility: Data is not lost permanently, but our ability to access it is compromised, and general site functionality is affected.
Major Data Loss: A large amount of data is lost, and general site functionality is impacted.
Generally speaking, public recovery is the higher initial priority since it impacts more people, contacting staff about outages is easy, and public outages affect the perception of the Institute.
Inaccessibility is also different from data loss, though they share certain characteristics. In both cases the solution is to have additional copies of the data, but for inaccessibility those copies serve as a temporary data source until the outage is resolved. Handling data inaccessibility therefore requires that the secondary source of data be similarly structured, to minimize the time needed to switch over. With data loss, the backups can be in any format, since the intent is that a new source of data will be built from them.
Our data can be broken into two categories. The first is data that is potentially irrecoverable: our original binary files (images, audio, other) and the metadata about them (a SQL database). The second is data that is restorable and needed for normal site operation but takes significant time to regenerate, such as the derived download files and viewer files. The second set may be worth backing up anyway to shorten recovery times for public users when data is lost.
As an estimate, holding extra backups for our current scihistcoll staging environment costs less than $2 a month out of a total of $70.07 spent, inclusive of data transfer and storage. A production environment will have somewhat higher cost ratios, but not massively so. Thus, by spending an additional 2-3% on S3, we can mount a full public recovery in an afternoon after a massive failure of our entire infrastructure. We are not currently backing up our viewer tiles, but an examination of our old application shows production storage for them averages around $5 a month. Adding a second copy of the viewer files should roughly double that cost (slightly less, given lower usage), adding another $5 or so. In total, for about $7-12 a month we can be broadly covered against data inaccessibility or other failures of S3 in a specific region. While it is hard to get specific details, there have been multiple regional outages or issues lasting over an hour, and at least one major outage in the last two years lasting around 6 hours. Assuming about 8 hours of problems every two years, roughly $12 a month for 24 months is about $288, spread over 8 hours of outage, or a rough cost of $36 per hour of outage spent to avoid being down. Shorter outages may not be worth the difficulty of switching over.
In cases of small-scale data loss, such as corrupted files or user error, the application will be working fine but a limited set of data will have a problem. In these cases we can locate the problematic data and use a backup copy to restore any damaged original files, or use versioning to restore an earlier version of the file. Derivative files can either be regenerated or copied from backups as well. This is the most common expected use case and requires only that we keep versions of our original files and backups of files in different locations (another S3 bucket and an on-site copy).
In cases of broad data loss, most or all of the data is rendered unavailable and we will suffer a loss of service until we can recover it. This can be thought of as two recoveries: the first is to get the digital collections site back as soon as possible for the public, and the second is to restore all functionality. Getting the site back for the public is our primary concern, so as noted above for outages we have a few methods to speed up recovery at a small addition to our backup costs. Both the derivative and original files are backed up to a West Coast S3 region with the same configuration details as our primary originals and derivatives in US-EAST. We can recover public access by using these backup files directly while we spend more time working on a full recovery of staff functionality. Given the current setup we would not want staff to add new works during this period, but this approach lets us rapidly restore public-facing access to the site should the normal data sources be unavailable. A longer process then restores the data back to its original locations while leaving public access up; once the data is back in its original place, full staff functionality will likewise be restored.
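Before relying on this in an emergency, it is worth spot-checking that the West Coast copies are actually there and reasonably current. A minimal sketch with the AWS CLI (bucket names, the asset prefix, and regions are taken from elsewhere in this document; adjust if they differ):

# List a sample of the originals backup bucket in US West (Oregon)
aws s3 ls s3://scihist-digicoll-production-originals-backup/asset/ --region us-west-2 | head
# Compare rough object counts between primary and backup (can be slow on large buckets)
aws s3 ls s3://scihist-digicoll-production-originals/asset/ --recursive --summarize --region us-east-1 | tail -n 2
aws s3 ls s3://scihist-digicoll-production-originals-backup/asset/ --recursive --summarize --region us-west-2 | tail -n 2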
Finally, in case of issues affecting all online storage systems, another copy of the data is held on our in-house storage system so that we can potentially recover data even after a full loss of AWS. This holds only the original files; all other aspects will need to be rebuilt, a process that can take days in addition to the time taken to upload the files again. Using the local backup means recovery could take a week or more. This storage is not paid for or managed by our team, as Institute IT handles these systems.
Digital Collections currently (March 2019) has a small set of data to be handled for recovery.
A postgres database which contains user data and item metadata
Original binary files
Derivative Files
The first two are the ones that are truly needed. The derivatives could be recreated from them if lost, but that would take over a day, so we keep a backup of the derivatives anyway because it is cheap and minimizes downtime.
For each of these, we have TWO levels of backups: 1. in an S3 bucket, 2. on local on-premises Institute storage, which is also backed up to tape.
The current backups are:
Postgres database: scihist-digicoll-backup.sh, running on the database server, backs it up to S3 at s3://chf-hydra-backup/PGSql/digcol_backup.sql
Original binary files: copied to the scihist-digicoll-production-originals-backup bucket in US West (Oregon). scihist-digicoll-backup.sh, running on the "dubnium" backup server, then uses rsync-like functionality to sync to an on-premises network mount (which is backed up to tape), using scihist-digicoll-production-originals-backup as its source of files.
If the application is pointed at a backup bucket directly during a recovery, note that cross-region data transfer will be more expensive than the usual same-region transfer, and you may want to turn off ingest to prevent new data from being written to the backup bucket. Minimal public recovery requires the following data:
Postgres database
Original binaries (needed for TIFF downloads)
Derivative files
If we have an AWS outage affecting our region, the fastest recovery option is to rebuild the servers in another region (if needed) and edit the local_env.yml file, either in ansible or by hand, to point to the backup S3 buckets for the original and derivative files. If new servers were needed, the postgres database will need to be downloaded from S3 and installed onto them. After that point all current public-facing aspects will be restored. Since the backup buckets do not sync back to the original data buckets, staff should not upload new files, though they can edit metadata on existing works. Once the original S3 buckets have had service restored or their data copied back, set the application to use the original buckets in local_env.yml and users can again add items. The postgres database may need to be copied back to the original server(s) or region if a new server setup was used.
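As a sketch of the database piece of that recovery (the S3 key comes from the backup list above; the local download path is a placeholder to adjust):

# Download the most recent Postgres dump from the backup location
aws s3 cp s3://chf-hydra-backup/PGSql/digcol_backup.sql /tmp/digcol_backup.sql
# Then load it on the new database server following the restore steps later in this document, e.g.:
# psql -U postgres < /tmp/digcol_backup.sql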
In the case of smaller issues, like single-file corruption or deletion, the simplest method for original files is to locate them on S3 and look at previous versions. We keep 30 days' worth of versions, so if the error was found within a month you should be able to revert to an earlier version of the file. For derivatives it is easier to simply regenerate them via the command line.
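If you prefer the command line to the S3 console for this, here is a hedged example of reverting a single original to an earlier version (the bucket and asset/UUID key layout are described later in this document; the version ID is whatever list-object-versions reports for the last known-good copy):

# List the stored versions of one object
aws s3api list-object-versions --bucket scihist-digicoll-production-originals --prefix asset/UUIDHERE --region us-east-1
# Copy a chosen earlier version back over the current one (this creates a new current version)
aws s3api copy-object --bucket scihist-digicoll-production-originals --key asset/UUIDHERE --copy-source "scihist-digicoll-production-originals/asset/UUIDHERE?versionId=VERSION_ID_HERE" --region us-east-1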
If a single file (or a small number of files) is missing or corrupted, recovery should be handled at the individual level. For derivative images, simply regenerate them. For originals, use the following recovery procedure.
Steps:
1. Locate the file on S3 and use the "Show versions" option to look for an earlier, intact version you can revert to. If the fixity check does not show an error or the link does not work:
2. Check the backup bucket, again using the "Show versions" button. If the file is there, you may sync it to the scihist-digicoll-production-originals bucket. Make sure that scihist_digicoll thinks there is a file there already; syncing a file that is not in the postgres database will not add it to the application. You may use any preferred sync method; here is an example via the AWS CLI.
aws s3 sync s3://scihist-digicoll-production-originals-backup/asset/UUIDHERE s3://scihist-digicoll-production-originals/asset/UUIDHERE --source-region us-west-2 --region us-east-1
This sample copies the file from the backup bucket to the production originals bucket, inside the asset key (part of our application) and at the UUID location. If the UUID key is missing, this will create the needed key. If you're unsure of a command, the --dryrun option allows for a safe test.
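For example, the same sync with a dry run first (no changes are made; it only reports what would be copied):

aws s3 sync s3://scihist-digicoll-production-originals-backup/asset/UUIDHERE s3://scihist-digicoll-production-originals/asset/UUIDHERE --source-region us-west-2 --region us-east-1 --dryrun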
For a broader recovery that points the application at the backup buckets: in local_env.yml, change s3_bucket_originals, s3_bucket_derivatives, and s3_bucket_dzi to add -backup to the end of their names (ex. scihist-digicoll-production-originals-backup), change aws_region to us-west-2, and set logins_disabled: true to lock staff users out so they are not making changes. Currently you can, in ansible, run the playbook restore_kithe.yml, which will automatically handle these steps.
ansible-playbook --ask-vault-pass restore_kithe.yml --private-key=~/.ssh/chf_prod.pem
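If you want to preview what the playbook would change before running it for real, ansible's check mode can help (this assumes restore_kithe.yml behaves sensibly in check mode, which has not been verified here):

ansible-playbook --ask-vault-pass restore_kithe.yml --private-key=~/.ssh/chf_prod.pem --check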
On the affected server (logged in as the default user, ubuntu), stop the application and restart postgres:
sudo systemctl stop passenger
sudo systemctl restart postgresql.service
Drop the existing database:
dropdb -U postgres digcol
-or-
psql -U postgres
DROP DATABASE digcol;
Import the backup database with:
psql -U postgres < BACKUP_LOCATION
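Before reindexing, it can be worth a quick sanity check that the import actually created the database and its tables (digcol is the database name from the drop step above):

# List databases and confirm digcol exists
psql -U postgres -l
# List the tables inside digcol
psql -U postgres -d digcol -c '\dt'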
You will then need to reindex solr, which can be done remotely and will run on the jobs server:
bundle exec cap production invoke:rake TASK="scihist:solr:reindex scihist:solr:delete_orphans"
You'll need to move over any original files that are missing with an S3 sync command (using AWS credentials that have the necessary access; developer accounts such as jrochkind and eddie should).
aws s3 sync s3://scihist-digicoll-production-originals-backup/ s3://scihist-digicoll-production-originals/ --source-region us-west-2 --region us-east-1
Then either do the same for the derivative files or regenerate them with a rake task.
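If you go the sync route for derivatives, a sketch of that copy (the derivative bucket names here are assumed from the -backup naming convention; confirm the real names in local_env.yml before running):

aws s3 sync s3://scihist-digicoll-production-derivatives-backup/ s3://scihist-digicoll-production-derivatives/ --source-region us-west-2 --region us-east-1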
If you run the rake task, ssh into the jobs server and move to the current deployed application directory (/opt/scihist_digicoll/current).
Switch to the application's user (digcol) and then run the commands. Since they take a long time to run, they are best left running in a screen or tmux session.
./bin/rake kithe:create_derivatives:lazy_defaults
./bin/rake scihist:lazy_create_dzi
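One way to keep these running after you disconnect, sketched with tmux (the session name is arbitrary):

tmux new -s derivatives      # start a named session, then run the rake tasks above inside it
# detach with Ctrl-b d; reattach later to check progress with:
tmux attach -t derivatives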
This has not been tested. The instructions here are general guides and do not have the step-by-step quality of our other instructions. This procedure is to be run when we have lost all access to data in S3. Note that it will likely take a long time, so expectations should be set with the project owner and library on how to notify people of the outage.
Access to the local backups must be done in the building or via Citrix (https://sciencehistory.org/citrix) using the PuTTY app.
The local backup server (Dubnium) can be found at 65.170.7.86.
The scripts for backup are not in ansible, so they are copied here for S3 backup and here for scihist-digicoll backup for emergency needs.
Right now the account holders are Daniel, Jonathan, and Eddie, each of whom uses a password to access the server.
To copy the original files from the local backup back up to S3, run:
aws s3 sync /media/scihist_digicoll_backups/asset s3://scihist-digicoll-production-originals/asset --region us-east-1
Crontab notes
The crontab looks like
AWS_CONFIG_FILE="/home/dsanford/.aws/config"
# m h dom mon dow command
30 18 * * * /home/dsanford/bin/s3-backup.sh
00 18 * * * /home/dsanford/bin/scihist-digicoll-backup.sh
The AWS_CONFIG_FILE line must be updated with any rotated keys; it is vital for passing the configuration and permissions along to the scripts. If it is missing, they will not work.
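A quick way to confirm the keys in that config file still work before the next scheduled run, using the same environment variable the crontab sets and the database backup destination noted earlier (this assumes the config file carries the credentials, as the note above implies):

AWS_CONFIG_FILE="/home/dsanford/.aws/config" aws s3 ls s3://chf-hydra-backup/PGSql/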