
For smaller issues, like single-file corruption or deletion, the simplest method for original files is to locate them in S3 and look at previous versions. We keep 30 days' worth of versions, so if the error was found within a month you should be able to revert to an earlier version of the file. For derivatives it is easier to simply regenerate them via the command line.

Recovery Notes

Single File Recovery

If a single file (or a small number of files) is missing or corrupted, recovery should be handled at the individual level. For derivative images, simply regenerate them. For originals, use the following recovery procedure.
Steps:

  1. Go to the Work with the damaged files
  2. Select the Members tab and click on the file name
  3. If the Fixity Check shows an error, it should include a link to the file in S3; if not, you will need to get the UUID. This can be found via the rails console or by stripping it out of the Download Original or Derivatives links.
  4. Log onto AWS.
  5. If the Fixity check shows an error, simply click on the link, then in the S3 web console select the Show option for Versions. If the fixity check does not show an error, or the link does not work:
    1. Log into S3 and go to the bucket scihist-digicoll-production-originals. In the S3 web console, select the Show button for versions.
    2. Search for the UUID in the prefix
    3. Select the UUID "Folder"
  6. If the file has been changed, deleted, or corrupted within the last 30 days, you should see prior versions. If the fixity check has a date and time for when the file changed, you can select all the newer versions by clicking the check box next to them and then use the Actions button to Delete them; the old version will become the current version. Run a fixity check to confirm the fix. If the file was deleted, you may see a "Delete Marker"; simply delete it like a file and the old file becomes the current version. A hedged CLI equivalent is sketched below.
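
    If you prefer the command line, the same cleanup can be done with the AWS CLI. This is only a hedged sketch, not part of the original procedure: the asset/UUID prefix layout is taken from the sync example further below, and UUIDHERE, FILENAMEHERE, and VERSIONIDHERE are placeholders to substitute.

    Code Block
    languagebash
    titleList and delete newer versions (CLI sketch)
    # List all versions and delete markers stored under the asset's UUID prefix
    aws s3api list-object-versions \
      --bucket scihist-digicoll-production-originals \
      --prefix asset/UUIDHERE

    # Remove one unwanted newer version (or a delete marker) by its VersionId;
    # once the newer versions are gone, the older good version becomes current again
    aws s3api delete-object \
      --bucket scihist-digicoll-production-originals \
      --key asset/UUIDHERE/FILENAMEHERE \
      --version-id VERSIONIDHERE
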
  7. If the file is missing because it has been in error for more than 30 days, or something else has gone wrong, you will need to use the backup bucket. This is called scihist-digicoll-production-originals-backup and is in the US-WEST-2 region.
  8. Confirm the damaged file is in scihist-digicoll-production-originals-backup. You may either use the S3 web console (go to the bucket, then search for the UUID to confirm the file is in that "folder"), or use the AWS CLI or another tool to make a head request, as in the sketch below.
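
    A hedged sketch of this check via the AWS CLI; the bucket and asset/UUID prefix follow the sync example below, and UUIDHERE/FILENAMEHERE are placeholders.

    Code Block
    languagebash
    titleConfirm the file is in the backup bucket (CLI sketch)
    # List whatever is stored under the asset's UUID prefix in the backup bucket (us-west-2)
    aws s3 ls s3://scihist-digicoll-production-originals-backup/asset/UUIDHERE/ --region us-west-2

    # Or make a head request against the exact key to check existence and size
    aws s3api head-object \
      --bucket scihist-digicoll-production-originals-backup \
      --key asset/UUIDHERE/FILENAMEHERE \
      --region us-west-2
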
  9. If the file is there, you may sync it to the scihist-digicoll-production-originals bucket. Make sure that scihist_digicoll already thinks there is a file there: syncing a file that is not in the postgres database will not add it to the application. You may use any preferred sync method; here is an example via the AWS CLI.

    Code Block
    languagebash
    firstline1
    titleSync
    aws s3 sync s3://scihist-digicoll-production-originals-backup/asset/UUIDHERE  s3://scihist-digicoll-production-originals/asset/UUIDHERE --source-region us-west-2 --region us-east-1

    This sample will copy the file from backups to the originals production bucket, inside the asset key (part of our application) and to the UUID location. If the UUID key is missing, this will create the needed key. If you're unsure of a command, the --dryrun option allows for a safe test.

  10. Run the fixity check to confirm the file is fixed.

S3 Outage or temporary use of backups

  1. If US-EAST-1 S3 is down, or we have some issue where our normal S3 buckets are missing or empty, we can temporarily use the backup buckets as a source for files.
    1. Before this is done, make sure staff users do not add any new works or files; new works added while we are on the backups will be missing when we stop using them. Editing metadata is fine.
    2. While we use the backup buckets, we will be charged for inter-region data transfer. This can add up quickly, so the duration of this switch should be kept to a minimum.
  2. Go to the ansible codebase (ansible-inventory).
  3. In the kithe/templates/local_env.yml.j2 file, edit s3_bucket_originals, s3_bucket_derivatives, and s3_bucket_dzi to add -backup to the end of their names (e.g. scihist-digicoll-production-originals-backup), as sketched below, then commit the change and merge to master so that the edit is pushed out to all the servers. Staging, which lacks these backup buckets, will be missing all files.
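
    A hedged sketch of what the edited settings might look like. The exact key names and surrounding Jinja templating in local_env.yml.j2 may differ, and the derivatives and dzi backup bucket names are assumed to follow the same -backup naming pattern.

    Code Block
    languageyaml
    titlelocal_env.yml.j2 (sketch)
    # Point the app at the backup buckets by appending -backup to each bucket name
    s3_bucket_originals: scihist-digicoll-production-originals-backup
    s3_bucket_derivatives: scihist-digicoll-production-derivatives-backup
    s3_bucket_dzi: scihist-digicoll-production-dzi-backup
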
  4. Email all users about the need to avoid adding new works or files.
  5. If the entire US-EAST region is down, the on_demand_derivates and upload buckets may also be unavailable. This would prevent uploads and PDF/ZIP generation. If this is expected to last a long time, you may wish to create those buckets in US-WEST-2 and point the local_env file at them; a sketch of creating such a bucket is below.
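
    A hedged sketch of creating a replacement bucket in US-WEST-2 with the AWS CLI; the bucket name here is a hypothetical placeholder, not one of our real bucket names.

    Code Block
    languagebash
    titleCreate a temporary us-west-2 bucket (sketch)
    # Buckets outside us-east-1 need an explicit LocationConstraint
    aws s3api create-bucket \
      --bucket scihist-digicoll-temporary-uploads-usw2 \
      --region us-west-2 \
      --create-bucket-configuration LocationConstraint=us-west-2
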

Full Recovery from S3

  1. A full recovery from S3 rolls the entire system back to the latest backup, which is usually from the prior business day. All changes made since that backup will be lost.
  2. As noted in the prior documentation, this involves a postgres database change, reindex of solr, and then moving files from backup to production.
    1. Only the originals files are required, but it may be faster to copy derivative files rather than regenerate them. (This requires testing)
  3. Currently you can, in ansible, run the restore_kithe.yml playbook, which will automatically handle these steps:

    Code Block
    languagebash
    titleAnsible Playbook
    # Standard ansible-playbook invocation (an assumption; adjust inventory and options to the local setup)
    ansible-playbook restore_kithe.yml


  4. This will ask if you are sure you want to run it before it starts. If you type Y, it will start the restoration process and quickly discard the existing data. Do not run this unless you are sure you want to lose any existing data in production.
  5. If you cannot or do not want to use the playbook, or if it does not work, you may manually undertake the following steps.
    1. Stop passenger on the web server; this should end connections to the postgres database. It also allows you to avoid Honeybadger errors when the database is lost.
    2. On the database server, restart the postgres service. This will terminate any hanging connections.
    3. Download the last postgres backup to the database server; it is found in the s3://chf-hydra-backup bucket under the PGSql key, as digcol_backup.sql.
    4. On the database server drop the existing postgres digcol database.
    5. Import the backup database with psql -U postgres < BACKUP_LOCATION. A consolidated sketch of steps 3 through 5 follows.
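
      A hedged, consolidated sketch of steps 3 through 5. The bucket, key, and database name come from the steps above; the download path and the exact drop command are assumptions and may need adjusting to how postgres is set up on the database server.

      Code Block
      languagebash
      titleRestore postgres from backup (sketch)
      # Step 3: download the latest backup from S3 to the database server
      aws s3 cp s3://chf-hydra-backup/PGSql/digcol_backup.sql /tmp/digcol_backup.sql

      # Step 4: drop the existing digcol database
      dropdb -U postgres digcol

      # Step 5: import the backup (BACKUP_LOCATION filled in with the downloaded file)
      psql -U postgres < /tmp/digcol_backup.sql
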
    6. You will then need to reindex solr, which can be done remotely

      Code Block
      languagebash
      titleRake tasks
      bundle exec cap production invoke:rake TASK="scihist:solr:reindex scihist:solr:delete_orphans"