[Diagram: current backup strategy]

Recovery Options

Fedora:

S3:

Currently we use the aws s3 sync tool (akin to rsync for S3) to copy key Fedora data into the chf-hydra-backup bucket (https://s3.console.aws.amazon.com/s3/buckets/chf-hydra-backup). The bucket name is now a slight misnomer, since it also holds ArchivesSpace backups; Fedora data is synced into the S3 key FedoraBackup (https://s3.console.aws.amazon.com/s3/buckets/chf-hydra-backup/FedoraBackup/?region=us-east-1&tab=overview), which contains all Fedora binary data.
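
The backup itself is just a sync in the other direction. A minimal sketch, assuming the Fedora binary data lives at /opt/fedora-data as described in the restore steps below:

Code Block
languagebash
titleFedora backup sync (sketch)
# Copy Fedora binary data up to the backup bucket (reverse of the restore sync)
aws s3 sync /opt/fedora-data/ s3://chf-hydra-backup/FedoraBackup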

PGSql (https://s3.console.aws.amazon.com/s3/buckets/chf-hydra-backup/PGSql/) contains the Fedora Postgres database dump (fcrepo_backup.sql).

Both https://s3.console.aws.amazon.com/s3/buckets/chf-hydra-backup/FedoraBackup/ and https://s3.console.aws.amazon.com/s3/object/chf-hydra-backup/PGSql/fcrepo_backup.sql are needed to do a full restore.

Note: As a reminder, while S3's visual interface shows folders, those locations are actually just key prefixes on individually stored objects. Folders do not exist in S3.
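
For example, listing a prefix with the AWS CLI shows the objects stored under it:

Code Block
languagebash
titleListing a prefix
# "FedoraBackup/" is a key prefix, not a real folder
aws s3 ls s3://chf-hydra-backup/FedoraBackup/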

How to restore Fedora:

  1. Stop Tomcat: sudo service tomcat8 stop
  2. Download the Postgres database fcrepo_backup.sql to an arbitrary location on the Fedora machine.
  3. Fedora might still have active connections to postgres, so run a postgres restart to kill them: sudo service postgresql restart

  4. Import the database: psql fcrepo < fcrepo_backup.sql
    1. If the database already exists, such as when you are running a sync to an existing instance, you will want to drop the existing database first and then run the import.
  5. Check that the user trilby has permissions to access and use the newly made fcrepo database.
  6. Delete the existing folder(s) inside /opt/fedora-data (This step is not always required but makes it simpler)
  7. Using screen or tmux, start an aws s3 sync to copy all the data in the FedoraBackup "folder" over to /opt/fedora-data: aws s3 sync s3://chf-hydra-backup/FedoraBackup /opt/fedora-data/
  8. Wait a while for all the data (>800 GB) to copy over.
  9. Run chown -R tomcat8:tomcat8 /opt/fedora-data to give ownership on the new files to the tomcat user so Fedora can access them.
  10. Restart Tomcat: sudo service tomcat8 restart OR sudo systemctl restart tomcat8
  11. This completes the Fedora restore. Current cost estimates (2/18) are about $0.10 to do this restore.
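
Taken together, the restore amounts to roughly the following shell session. This is a sketch only; service names, paths, and the database name are taken from the steps above, and the database should be dropped and recreated first if it already exists.

Code Block
languagebash
titleFedora restore (sketch)
# Stop Tomcat so nothing writes to Fedora during the restore
sudo service tomcat8 stop
# Restart Postgres to kill any lingering Fedora connections
sudo service postgresql restart
# Import the database dump
psql fcrepo < fcrepo_backup.sql
# Copy the binary data back from S3 (>800 GB; run inside screen or tmux)
aws s3 sync s3://chf-hydra-backup/FedoraBackup /opt/fedora-data/
# Give ownership to the Tomcat user and bring Tomcat back up
sudo chown -R tomcat8:tomcat8 /opt/fedora-data
sudo service tomcat8 restart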

How to restore users:

  1. Go to S3 and download the postgres backup files to an arbitrary location on the app server.
  2. Stop Apache
  3. Restart the postgres service (see above). This clears the connections Sufia holds to its database while running, so the database can be replaced.
  4. In Postgres, delete the automatically generated chf_hydra database as follows:
    1. Log in via psql -U postgres
      1. The postgres account password is in ansible-vault (groupvars/all)
    2. Run: DROP DATABASE chf_hydra;
    3. Run: CREATE DATABASE chf_hydra;
  5. Then import the downloaded database
    1. Either:
      1. pg_restore -d chf_hydra -U postgres chf_hydra.dump
      2. psql chf_hydra < chf_hydra_dump.sql
  6. Then set permissions
    1. psql -U postgres
    2. GRANT Create,Connect,Temporary ON DATABASE chf_hydra TO chf_pg_hydra;
  7. You may now restart postgres and Apache: sudo systemctl restart apache2

Note: the minter is now stored in postgres, so no extra steps are needed; restoring the chf_hydra database restores both the users and the minter.
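
As a consolidated sketch of the steps above (assuming the custom-format dump was downloaded as chf_hydra.dump; use the psql variant instead if you have a plain .sql dump):

Code Block
languagebash
titleUser database restore (sketch)
# Stop Apache and clear lingering connections to the chf_hydra database
sudo systemctl stop apache2
sudo service postgresql restart
# Drop and recreate the database
psql -U postgres -c "DROP DATABASE chf_hydra;"
psql -U postgres -c "CREATE DATABASE chf_hydra;"
# Import the downloaded backup
pg_restore -d chf_hydra -U postgres chf_hydra.dump
# Grant the application user access, then bring Apache back up
psql -U postgres -c "GRANT CREATE, CONNECT, TEMPORARY ON DATABASE chf_hydra TO chf_pg_hydra;"
sudo systemctl restart apache2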

How to restore redis:

Redis keeps a database in memory which handles the transaction record data such as the history of edits on a record. It does not contain the actual data, simply the timeline of changes. Losing this causes the history of object edits to be lost, but the objects themselves will be fine.

  1. Download redis-dump/dump.rdb to an arbitrary location on the app server.
  2. It must be changed to be owned by the redis user as follows:
    1. sudo chown redis:redis filename
  3. Stop the redis server as follows:
    1. sudo service redis-server stop
  4. Move the downloaded dump file to /var/lib/redis/dump.rdb. This will overwrite the existing dump.rdb there.
  5. Restart redis
    1. sudo service redis-server start
  6. When starting, redis will read the .rdb dump file and copy that data back into the in-memory database.
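
A minimal sketch of the whole sequence, assuming the downloaded dump file is at ~/dump.rdb:

Code Block
languagebash
titleRedis restore (sketch)
# Give the redis user ownership of the downloaded dump
sudo chown redis:redis ~/dump.rdb
# Stop redis, swap in the dump file, then start redis again
sudo service redis-server stop
sudo mv ~/dump.rdb /var/lib/redis/dump.rdb
sudo service redis-server start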

Indexing:

The index is backed up to speed up time to recovery for DR or migrations. If the backup is unavailable, a manual reindex can be done with the instructions in Application administration, but that process takes at least one business day, so restoring from the backup is preferred.

  1. From the chf-hydra-backup bucket, pull down the solr-backup.tar.gz file under the Solr key to the Solr server.
  2. Extract the archive
  3. Use the solr restore commands at Application administration 
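
Steps 1 and 2 amount to something like the following; the exact S3 key is assumed here to be Solr/solr-backup.tar.gz:

Code Block
languagebash
titleSolr backup retrieval (sketch)
# Pull the index backup down to the Solr server and unpack it
aws s3 cp s3://chf-hydra-backup/Solr/solr-backup.tar.gz .
tar -xzf solr-backup.tar.gz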

Costs

A quick cost analysis has restoration costing $30-35; this is as of 6/11/2018 with approximately 1 TB of data. Approximately 66% of the cost was due to inter-region transfer fees (moving data from US-WEST to US-EAST). The rest is standard LIST, GET, and related request fees.


These notes are historical, and represent our preparatory thinking before creating backup/redundancy copies for digital collections. They are left here to help understand what our motivations were, and the complexities of determining an appropriate backup/redundancy strategy. The details may not represent the present architecture. See also Backups of the Digital Collections and Digital Collections S3 Bucket Setup and Architecture.

Scihist_digicoll Backup and Recovery

[Diagram: scihist_digicoll backup and recovery]

Our Analysis and Motivations: How we thought through what backups to make, and why

The 2019 version of our digital collection application (scihist_digicoll) has slightly different preservation needs than the old Samvera application it was based on. However, our institutional goals remain the same. While we will be making technical changes to preservation methods, our backup and recovery goals are still linked to the following key points, in order of importance:

  1. Provide backups of all original ingested files so they can be recovered in case of natural disaster, data corruption, user error, or malicious computer activity.

  2. Allow the Institute to recover from data loss or outages in a reasonable amount of time.

  3. Adhere to OAIS model rules when possible.

The following classifications of recovery may prove helpful:

  • Partial public, partial staff recovery: The public has access to limited functionality, staff has access to a limited set of functions but certain functions may not be used. This is considered an incomplete recovery.

  • Full public, partial staff recovery: The public can use the site normally, staff has access to a limited set of functions but certain functions may not be used. This is also an incomplete recovery, but time sensitivity is reduced as public users are not affected. Staff recovery times should be minimized, but public use takes priority.

  • Full public, full staff recovery: A total recovery.

Then there are levels of disaster/data loss:

  • Minimal Data Inaccessibility: A small subset of data is temporarily inaccessible; general site functionality is unaffected.

  • Minimal Data Loss: In this case a small subset of data is lost, general site functionality is unaffected.

  • Major Data Inaccessibility: Data is not lost permanently, but our ability to access the data is compromised. In this case general site functionality is affected.

  • Major Data Loss: A large amount of data is lost, and general site functionality is affected.

Generally speaking, public recovery is the higher initial priority since it impacts more people, contacting staff about outages is easy, and public outages affect the perception of the institute.

Inaccessibility is also different from data loss, though they share certain characteristics. In both cases a solution is to have additional copies of data, but for inaccessibility it is so they can be used as a temporary source of data until the outage is resolved. Handling data inaccessibility requires that the secondary source of data be similarly structured to minimize the time to switch over. With data loss the backups can be in any format as the intent is that a new source of data will be built from the backups.

Our data can be broken into two categories. One is data that is potentially irrecoverable: our original binary files (images, audio, other) and the metadata about them (a SQL database). The other is data that can be rebuilt but is needed for normal site operation and takes significant time to restore, such as the derived download files and viewer files. The second set of data may be worth backing up to shorten recovery times for public users when data is lost.

As an estimate, holding the extra backups for our current scihist_digicoll staging environment costs less than $2 a month out of a total of $70.07 spent on data storage and transfer. While a production environment will have somewhat higher cost ratios, it should not be massively higher. Thus, by spending an additional 2-3% on S3, we can mount a full public recovery in an afternoon after a massive failure of our entire infrastructure. We are not currently backing up our viewer tiles, but an examination of our old application shows their production storage cost averages around $5 a month; adding a second copy should roughly double that (with a slight reduction for lower use), so for about $7-12 a month we are broadly covered against data inaccessibility or other S3 failures in a specific region. While it is hard to get specific figures, there have been multiple regional outages or issues lasting over an hour, and at least one major outage in the last two years lasting around 6 hours. Assuming about 8 hours of problems every two years, roughly $12/month x 24 months / 8 hours gives a rough cost of about $36 per hour of outage avoided. Shorter outages may not be worth the difficulty of switching over.

In cases of small-scale data loss, such as corrupted files or user error, the application will be working fine but a limited set of data will have a problem. In these cases we can locate the problematic data and use a backup copy to restore any damaged original files, or use versioning to restore an earlier version of the file. Derived files can either be regenerated or copied from backups as well. This is the most common expected use case and requires only that we keep versions of our original files and backup copies in different locations (another S3 bucket and an on-site copy).

In cases of broad data loss, most or all of the data is rendered unavailable and we will suffer a loss of service until we can recover it. This can be thought of as two recoveries: one to get the digital collection site back for the public as soon as possible, and a second to restore all functionality. Getting the site back for the public is our primary concern, so, as noted above for outages, we have a few methods to speed up recovery at some additional backup cost. Both the derivative and original files are backed up to a West Coast S3 region with the same configuration details as our files in US-EAST (the original originals and original derivatives). We can restore public access by using these backup files directly while we spend more time on a full recovery of staff functionality. Due to current setups we would not want staff to add new works, but this lets us rapidly restore public-facing access to our site should the normal data sources be unavailable. A longer process then restores the data to its original locations while leaving public access up; once the data is back in its original place, full staff functionality will likewise be restored.

Finally, in case of issues affecting all online storage systems, another copy of the data is held on our in-house storage system so that we can potentially recover data even after a full loss of AWS. This copy holds only the original files; all other aspects will need to be rebuilt, a process that can take days in addition to the time taken to upload the files again. Using the local backup means recovery could take a week or more. This is not paid for or managed by our team, as Institute IT handles these systems.

Technical Notes

As of March 2019, the Digital Collections application has a small set of data to be handled for recovery.

  • A postgres database which contains user data and item metadata

  • Original binary files

  • Derivative Files

Only the first two are strictly required: if the derivatives were lost, they could be recreated from the first two, but that would take over a day, so we back them up anyway because it is cheap and minimizes downtime.

For each of these, we have TWO levels of backups: 1. in an S3 bucket, 2. on local on-premises Institute storage, which is also backed up to tape. 

What Backups Exist

  • postgres database
    • is backed up by a cronjob scihist-digicoll-backup.sh running on the database server, to S3 at: s3://chf-hydra-backup/PGSql/digcol_backup.sql
      • This S3 bucket is "versioned" and keeps 30 days worth of past versions.
    • A cronjob running on the "dubnium" backup server then copies that file nightly from S3 to the local network storage mount, so the latest copy is always at a standard location to be backed up to tape. (A rough sketch of the nightly database backup job appears after this list.)
  • Original ingested assets (binary files)
    • use S3 replication rules to replicate to a bucket scihist-digicoll-production-originals-backup
      • This is intentionally in a different AWS region, US West (Oregon)
      • The replication rules do not replicate deletes, so deleted files should still exist in the backup bucket
      • Both the live production bucket and this backup bucket are versioned, and keep 30 days worth of past versions. We do not keep complete version history; the backups are intended mostly for handling corruption and disaster, not recovering from user or software error, although 30 days of version history allows some limited recovery from either.
      • A cronjob scihist-digicoll-backup.sh running on the "dubnium" backup server then uses rsync-like functionality to sync to an on-premises network mount, which is backed up to tape.
      • In a disaster recovery scenario, the live app could temporarily be pointed at the bucket scihist-digicoll-production-originals-backup as its source of files. However, cross-region data transfer is more expensive than the usual same-region transfer. You may want to turn off ingest to prevent new data from being written to the backup bucket.
  • Derivatives
    • Both the "ordinary" derivatives bucket and the "dzi" bucket (used for "deep zoom" viewer) use S3 replication rules to replicate to backup buckets in US West (Oregon). 
    • In a disaster recovery scenario, the live app could be switched over to use these buckets as a source of data, but cross-region data transfer will be more expensive than same-region. You may want to turn off ingest to prevent new data from being written to the backup bucket. 
    • All of these buckets also keep 30 days of version history. 
    • scihist-digicoll-production-dzi => scihist-digicoll-production-dzi-backup
    • scihist-digicoll-production-originals => scihist-digicoll-production-originals-backup
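
The nightly database backup script is not reproduced here, but conceptually it does something like the following (a sketch only; the real scihist-digicoll-backup.sh on the database server may differ in paths and options):

Code Block
languagebash
titleNightly postgres backup (sketch)
# Dump the digcol database and push it to the backup bucket
pg_dump -U postgres digcol > /tmp/digcol_backup.sql
aws s3 cp /tmp/digcol_backup.sql s3://chf-hydra-backup/PGSql/digcol_backup.sql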

Recovery Overview

Minimal public recovery requires the following data:

  • Postgres database

  • Original binaries for downloading tiffs

  • Derivative files

If we have an AWS outage affecting our region, the fastest recovery option is to rebuild the servers in another region (if needed) and edit the local_env.yml file, either in ansible or by hand, to point to the backup S3 buckets for the original and derivative files. If new servers were needed, the postgres database will also have to be downloaded from S3 and installed onto them. After that point all current public-facing aspects will be restored. Since the backup buckets do not sync back to the original data buckets, staff should not upload new files, though they can edit metadata on existing works. Once the original S3 buckets have had service restored or their data copied back, set the application to use the original buckets in local_env.yml and users can again add items. The postgres database may need to be copied back to the original server(s) or region if a new server setup was used.

In the case of smaller issues, like single file corruption or deletion, the simplest method for original files is to locate them on S3 and look at previous versions. We keep 30 days worth of versions so if the error was found within a month you should be able to revert back to an earlier file. For derivatives it is easier to simply regenerate them via the command line.

Recovery Notes

Single File Recovery

If a single file (or small number of files) is missing or corrupted, recovery should be handled at the individual level. For derivative images, simply regenerate them. For originals, use the following recovery procedure.
Steps:

  1. Go to the Work with the damaged files
  2. Select the Members tab and click on the file name
  3. If the Fixity Check shows an error it should have a link to the file in S3, if not you will need to get the UUID. This can be found via the rails console or by stripping it out from the Download Original or Derivatives links.
  4. Log onto AWS.
  5. If the Fixity check shows an error, simply click on the link. In the S3 Web console select the Versions Show option. If the fixity check does not show an error or the link does not work:
    1. Log into S3 and go to the bucket scihist-digicoll-production-originals. In the S3 web console, select the Show button for versions.
    2. Search for the UUID in the prefix
    3. Select the UUID "Folder"
  6. If the file has changed, been deleted, or corrupted within the last 30 days you should see prior versions. If the fixity check has a date and time for when the file changed, you can simply select all the newer versions by clicking the check box next to them and then use the Actions button to Delete them. The old version will become the current version. Run a fixity check to confirm the fix. If the file was deleted, you may see a "Delete Marker"; simply delete it like a file and the old file becomes the current version.
  7. If the file version you need is not available from version history, it may be available in the backup bucket. This is called scihist-digicoll-production-originals-backup and is in the US-WEST region. Deleted files should remain in the backup bucket indefinitely. However, changes are replicated to the backup bucket, so a version prior to a change made more than 30 days ago will generally not be available in the backup bucket either.
  8. Confirm the damaged file is in scihist-digicoll-production-originals-backup. You may either use the S3 web console, going to the bucket and searching for the UUID to confirm the file is in that "folder", or use the AWS CLI or some other tool to make a head request (see the head-request example after these steps).
  9. If the file is there, you may sync it to the scihist-digicoll-production-originals bucket. Make sure scihist_digicoll already expects a file there: syncing a file that is not in the postgres database will not add it to the application. You may use any preferred sync method; here is an example via the AWS CLI.

    Code Block
    languagebash
    firstline1
    titleSync
    aws s3 sync s3://scihist-digicoll-production-originals-backup/asset/UUIDHERE  s3://scihist-digicoll-production-originals/asset/UUIDHERE --source-region us-west-2 --region us-east-1

    This sample copies the file from the backup bucket to the originals production bucket, inside the asset key (part of our application) and under the UUID location. If the UUID key is missing, the sync will create it. If you're unsure of a command, the --dryrun option allows for a safe test.

  10. Run the fixity check to confirm the file is fixed.
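
For the head request mentioned in step 8, the AWS CLI can be used along these lines (the backup bucket is in us-west-2, as noted above; replace UUIDHERE with the asset's UUID):

Code Block
languagebash
titleHead request (example)
# Returns object metadata if the key exists; errors with a 404 if it does not
aws s3api head-object --bucket scihist-digicoll-production-originals-backup --key asset/UUIDHERE --region us-west-2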

S3 Outage or temporary use of backups

  1. If US-EAST-1 S3 is down, or we have some issue where our normal S3 buckets are missing or empty we can temporarily use the backup buckets as a source for files.
    1. Be aware that once this switch is made, staff users must not add any new works or files. New works added will be missing when we stop using the backups. Editing metadata is fine.
    2. While we use the backup buckets, we will be charged for inter-region data transfer. This can quickly add up, so the duration of this switch should be kept to a minimum.
  2. Go to the ansible codebase (ansible-inventory).
  3. In the roles/kithe/templates/local_env.yml.j2 file
    1. edit s3_bucket_originals, s3_bucket_derivatives, and s3_bucket_dzi to add -backup to the end of their name (ex. scihist-digicoll-production-originals-backup)
    2. edit the aws_region to us-west-2
      1. This might break access to other buckets that we don't have an alternate for in us-west-2; this process is not completely polished.
    3. (optional but recommended) edit or add logins_disabled: true to lock staff users out so they are not making changes.
    4. and then commit the change and merge to master so that the edit is pushed out to all the servers – you may have to wait 10-15 minutes for the changes to be deployed. 
    5. NOTE: Staging, which lacks buckets named e.g. "scihist-digicoll-staging-originals-backup", will be broken if deployed with these changes.
  4. Email all users about the need to avoid adding new works or files.
  5. If the entire US-EAST region is down, the on_demand_derivates and upload buckets may also be unavailable. This would prevent uploads and pdf/zip generation. If the outage is expected to last a long time, you may wish to create those buckets in US-WEST-2 and point the local_env file at them (see the sketch below).
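
A minimal sketch of creating such temporary stand-in buckets with the AWS CLI; the bucket names below are placeholders, not our real bucket names:

Code Block
languagebash
titleTemporary buckets in us-west-2 (sketch)
# Create stand-in buckets in us-west-2; names are illustrative only
aws s3 mb s3://scihist-digicoll-production-uploads-west --region us-west-2
aws s3 mb s3://scihist-digicoll-production-ondemand-derivatives-west --region us-west-2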

Full Recovery from S3

  1. A full recovery from S3 rolls the entire system back to the latest backup, which is usually from the prior business day. All changes made since that backup will be lost.
  2. As noted in the prior documentation, this involves a postgres database change, reindex of solr, and then moving files from backup to production.
    1. Only the original files are required, but it may be faster to copy derivative files rather than regenerate them. (This requires testing.)
  3. Currently you can, in ansible, run the playbook restore_kithe.yml which will automatically handle these steps.

    Code Block
    titleAnsible Playbook
    ansible-playbook --ask-vault-pass restore_kithe.yml --private-key=~/.ssh/chf_prod.pem
    


  4. Do not run this unless you are prepared to lose all changes made to the system during the past 24 hours.
  5. If you cannot or do not want to use the playbook, or it does not work, you may manually carry out the following steps (logged in as ubuntu).
    1. Stop passenger on the web server; this should end connections to the postgres database and avoids Honeybadger errors while the database is being replaced.
      1. sudo systemctl stop passenger
    2. On the database server, restart the postgres service. This will terminate any hanging connections.
      1. sudo systemctl restart postgresql.service
    3. Download the latest postgres backup to the database server; it is in the s3://chf-hydra-backup bucket under the PGSql key as digcol_backup.sql.
    4. On the database server drop the existing postgres digcol database.
      1. dropdb -U postgres digcol
        -or-
      2. psql -U postgres
        1. DROP DATABASE digcol;
    5. Import the backup database with

      1. psql -U postgres < BACKUP_LOCATION
    6. You will then need to reindex Solr, which can be done remotely; the task will run on the jobs server.

      Code Block
      languageruby
      titleRake tasks
      bundle exec cap production invoke:rake TASK="scihist:solr:reindex scihist:solr:delete_orphans"


    7. You'll need to move over any original files that are missing with an S3 sync command

      1. (using AWS credentials that have necessary access; developer accounts such as jrochkind and eddie should). 

        Code Block
        languagebash
        firstline1
        titleSync
        aws s3 sync s3://scihist-digicoll-production-originals-backup/  s3://scihist-digicoll-production-originals/ --source-region us-west-2 --region us-east-1


    8. Then you need to restore/sync derivatives too. You can do the same aws s3 sync for the derivative files/bucket. 

      1. Or you can run lazy creation scripts, but they may miss some derivatives recorded in the postgres DB but missing on S3. (Not sure if this is an issue for DZI)

      2. If you run the rake task, ssh into the jobs server and move to the current deployed application directory (/opt/scihist_digicoll/current).

      3. Switch to the application's user (digcol) and then run the commands. Since they take a long time to run, it is best left in a screen or tmux session.

        Code Block
        languageruby
        firstline1
        titleDerivative creation
        ./bin/rake kithe:create_derivatives:lazy_defaults
        ./bin/rake scihist:lazy_create_dzi


    9. Start passenger on the web server again to bring the app back up!
      1. sudo systemctl start passenger

Full Recovery from Local Backups

This has not been tested. Instructions here are general guides and do not have the step-by-step quality of our other instructions. This is to be run when we have lost all access to data in S3. Note this will likely take a long time, so expectations should be set with the project owner and library on how to notify people of the outage.

Access to the local backups must be done in the building or via Citrix (https://sciencehistory.org/citrix) using the PuTTY app

The local backup server (Dubnium) can be found at 10.20.30.86

The scripts for backup are not in ansible, so they are copied here for S3 backup and here for scihist-digicoll backup for emergency needs.

Right now the account holders are Daniel, Jonathan, and Eddie, each of whom uses a password to access the server.

  • The original files and postgres database exist on-site on the shared drive, and also offline in tape backups made from that shared drive. Ideally we do not need the tape backups. If they are needed, work with IT (Chuck and Ponce) to follow their procedures for recovering and loading the tapes.
  • An aws s3 sync will need to be run targeting the local backup directory: aws s3 sync /media/scihist_digicoll_backups/asset s3://scihist-digicoll-production-originals/asset --region us-east-1
    • This may take multiple days to run
  • While that runs, copy the postgres database from its location at /media/scihist_digicoll_backups to the database server (see the example after this list).
  • Follow step 5e in the Full Recovery from S3 section to load the database.
  • Follow step 5f to reindex Solr
  • Once all of the original files are moved you can regenerate the derivatives with the rake task. Derivative files are not backed up locally so syncing is not an option.
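
Copying the database dump over to the database server could look something like this (the dump filename, the ubuntu account, and the db-server hostname are placeholders):

Code Block
languagebash
titleCopy postgres backup to the database server (sketch)
# Run from the backup server (Dubnium); replace db-server with the database server's address
scp /media/scihist_digicoll_backups/digcol_backup.sql ubuntu@db-server:/tmp/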


Crontab notes

The crontab looks like

Code Block
languagebash
titleCrontab
AWS_CONFIG_FILE="/home/dsanford/.aws/config"

# m h dom mon dow command
30 18 * * * /home/dsanford/bin/s3-backup.sh
00 18 * * * /home/dsanford/bin/scihist-digicoll-backup.sh


The AWS_CONFIG_FILE line must reference a config containing any rotated keys; it is vital for passing configuration and permissions along to the scripts. If it is missing, they will not work.