ArchivesSpace is “an open source archives information management application for managing and providing web access to archives, manuscripts and digital objects”.
Hosting information
In August 2022 we switched from hosting our own ASpace server on EC2 to a third-party-hosted instance at LibraryHost. Hosting and support costs are paid out of Project 1520 - Born Digital. Our annual plan renews in September. Our original contract was for a Light Plan with our database on a shared server; however, from August 2022 until August 2023 our instance was hosted on a standalone server. This was due to physical memory issues in August 2022 that brought down the entire database and required an emergency move from our shared server to a dedicated one. The Light Plan pricing was locked in through the end of the ‘22-’23 annual contract, but in September 2023 we switched to a Plus Plan.
Support: support@libraryhost.com
PUI: https://archives.sciencehistory.org
SUI: https://archives.sciencehistory.org/admin
API: https://sciencehistory-api.libraryhost.com/
IP: 50.116.19.60
Our wildcard SSL certificates expire annually in October.
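Since the wildcard certificate renews annually in October, a quick way to confirm the current expiry date from any workstation is a standard OpenSSL check (a sketch; it only reads the certificate the public site serves, no server access required):

```shell
# Print the expiry date of the certificate served at the public hostname.
echo | openssl s_client -connect archives.sciencehistory.org:443 \
    -servername archives.sciencehistory.org 2>/dev/null \
  | openssl x509 -noout -enddate
```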
Background
We store digital descriptions of our archival collections in the following places:

| Location | Type of technology | Number of collections described | Source | Example | Who can see it? |
|---|---|---|---|---|---|
| P:\ drive | Word documents | Roughly 270, dates 1997 – present | This is the initial description we create upon accessioning a collection. | | Institute staff |
| ArchivesSpace public user interface (PUI) | MySQL-backed website | 504 as of 2/28/2023 | | | Public |
| ArchivesSpace staff user interface (SUI) | Same as above | 544 as of 2/28/2023 (includes unpublished and in progress) | Entered manually based on the P drive Word files. | https://archives.sciencehistory.org/resources/81#tree::resource_81 | Logged in ArchivesSpace users |
| Public EAD bucket | EAD (XML format) | 504 as of 2/28/2023 | Generated nightly from ArchivesSpace database | https://archives.sciencehistory.org/ead/scihist-2012-021.xml | Public |
| ArchivesSpace Apache front end | HTML | Roughly 45 as of 2020 | Generated weekly from ArchivesSpace database | | Public |
| OPAC | ? | ? | Exported manually as PDF from the ArchivesSpace site, then attached to the OPAC record for the collection | https://othmerlib.sciencehistory.org/articles/1065801.15134/1.PDF | Public |
| WorldCat | | | Librarians manually update OCLC master records based on the metadata in ArchivesSpace. This is provided in the form of a MARCXML file by Kent and sent to Caroline. | | Public |
Workflow
For newly processed collections, finding aids can first be written up as Word documents, ultimately stored at
Shared/P/Othmer Library/Archives/Collections Inventories/Archival Finding Aids and Box Lists
. Kent enters the data from them, one by one, into ArchivesSpace, revising them in the process. As of summer 2020, approximately 45 have been entered.
Once they are in ArchivesSpace:
Kent exports them to a PDF, which he then sends to Victoria. These are entered into the OPAC. (see e.g. https://othmerlib.sciencehistory.org/articles/1065801.15134/1.PDF )
They are also automatically exported, via a nightly cron job described below, to the public EAD bucket.
For legacy finding aids (finding aids created before ArchivesSpace was in use at SHI), the Word document is revised and the revised finding aid data is entered into ArchivesSpace as a resource record. A list of legacy finding aids may be found at P:\Othmer Library\Archives\Legacy Finding Aid Docs
Processing archivist enters the data into ArchivesSpace. If this data is from a legacy finding aid, the Word document finding aid is revised in the process.
Once the collections are described in ArchivesSpace as resource records:
Our EAD export app in Heroku (see EAD export app ) retrieves public EAD files from ArchivesSpace’s API and posts them to the Science History Institute EAD bucket where they are harvested by PACSCL and CHSTM (see below).
If the processing archivist entered data directly into ArchivesSpace (there is no Word doc version), then a PDF is exported from the SUI and saved to the finding aid folder on the P:\ drive.
Note: the PDF or the Word doc has to be manually updated every time the resource record in ArchivesSpace changes.
The processing archivist exports a MARC XML version of the resource record and sends it to a cataloging librarian (usually Caroline), who creates a record in OCLC and the OPAC. The OPAC also points to a PUI URL at https://archives.sciencehistory.org/ead/ .
They are also converted to HTML, and most of them are available to the public. Examples: Wotiz; Simon; Fenn; Carbogel; Brody.
Note: because of a bug in the Apache config, not all HTML files in
/var/www/html
are actually served. (For example, https://archives.sciencehistory.org/GB00-16.GB01.09.html is a broken link, even though the same information is public at UPenn.)
Alternately, the cataloging librarian could use an ASpace account to export the MARC XML themselves.
Previously, there was a PDF version of the finding aid attached to the OPAC record. This practice was discontinued with the launch of the ASpace PUI in summer 2022.
Certain works in the Digital Collections also point to the PUI. Example: https://digital.sciencehistory.org/works/81jkowj.
Finally, the exported EAD files in the Science History Institute EAD bucket are also ingested by University of Penn Libraries Special Collections and the Center for the History of Science, Technology, and Medicine (CHSTM).
Penn, in turn, processes these EAD files (possibly on a nightly basis) and adds them to the Philadelphia Area Archives Research Portal (PAARP) search portal, a service funded by PACSCL.
Example: https://dlafindingaids.library.upenn.edu/dla/pacscl/detail.html?id=PACSCL_SCIHIST_2012021
Likewise, CHSTM ingests these EADs and makes them searchable at its search portal.
Finding aids/resource records may also be entered directly into ArchivesSpace or created using the bulk ingest spreadsheet for box and folder inventories or digital objects.
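The EAD-export step in the workflow above uses ArchivesSpace's standard EAD export endpoint (`GET /repositories/:repo_id/resource_descriptions/:id.xml`). A minimal sketch of how a script can build those URLs; the host is our LibraryHost API endpoint listed under Hosting information, and the repository and resource ids below are placeholders:

```shell
# Build the EAD export URL for a given repository id and resource id.
ead_url() {
  # $1 = repository id, $2 = resource id (both placeholders here)
  echo "https://sciencehistory-api.libraryhost.com/repositories/$1/resource_descriptions/$2.xml"
}

# With a session token (obtained via POST /users/admin/login), an export
# might look like:
#   curl -H "X-ArchivesSpace-Session: $SESSION" "$(ead_url 2 81)" > out.xml
```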
Technical details about the server
Note: this section describes our self-hosted EC2 setup, which predates the August 2022 move to LibraryHost described above.
ArchivesSpace lives on an AWS EC2 server, ArchivesSpace-prod, at https://50.16.132.240/ (also found at https://archives.sciencehistory.org).
The current production version of ASpace is 2.7.1.
Terminal access: ssh -i /path/to/production/pem_file.pem ubuntu@50.16.132.240
The ubuntu user owns all the admin scripts.
The relevant Ansible role is /roles/archivesspace/ in the ansible-inventory codebase.
SSL is based on the following: http://www.rubydoc.info/github/archivesspace/archivesspace
The executables are at /opt/archivesspace/
The configuration file is /opt/archivesspace/config/config.rb
Logs are at: logs/archivesspace.out
Apache logs are at /var/log/apache2/
Configuration for the Apache site is at /etc/apache2/sites-available/000-default.conf
. It would be a good idea to spend some time drastically simplifying this configuration.
Main users
Kent does a majority of the encoding
Hillary Kativa
Patrick Shea
Startup
To start ArchivesSpace:
sudo service archivesspace start
You may need to run this several times (just wait 30 seconds between attempts). You can troubleshoot startup by looking at the start script (invoked by the above):
/opt/archivesspace/archivesspace.sh start
There may be a short delay as the server re-indexes data.
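Since the start command sometimes needs more than one attempt, a small retry helper can save the manual repetition (a sketch; the attempt count and delay values are arbitrary choices, not part of our scripts):

```shell
# Retry a command up to N times, sleeping between attempts.
retry() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    echo "attempt $i failed; retrying in ${delay}s" >&2
    sleep "$delay"
  done
  return 1
}

# Example (on the server): retry 3 30 sudo service archivesspace start
```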
Restarting the server to fix Tomcat memory leak
Note: as of August 2020, the below procedure has been rendered obsolete. We are now simply restarting the server every Sunday at 2 am, which appears to solve the problem before it occurs.
ArchivesSpace has a memory leak that causes it to use more CPU time than it should. This will slowly drain all the burst credits, at which point the server slows down.
Another clue: in the AWS console for the server, under the Monitoring tab, the CPU Utilization graph shows anything over about 15%.
Procedure:
Contact all the main users listed above (especially Kent), and make sure they’re not actively working on the server.
Once given the go-ahead:
Log in to the server.
Throughout the process, keep in mind you can run
sudo service archivesspace status
for the service status at any point. If it’s running, you’ll see a variation on:
[...] Loaded: loaded (/etc/systemd/system/archivesspace.service; enabled; [...])
Active: active (running) since Tue [...]
Run top in a separate window to monitor the CPU usage. The goal is to see a dramatic reduction in usage after this process.
sudo systemctl stop archivesspace
sudo systemctl start archivesspace
(You may have to run this two or three times – the start script is finicky.) If all else fails, you can also go into the AWS console and reboot the EC2 instance.
Once everything is properly restarted:
the https://archives.sciencehistory.org/ front-end is available again
After a few minutes, you should see the CPU use go down dramatically in top. The AWS monitoring graph for CPU Utilization should also drop. (See figure below.)
Once you’re done, notify all involved that the server is available again.
Backups
A nightly backup is uploaded by LibraryHost to s3://chf-hydra-backup/Aspace/aspace-backup.sql.
LibraryHost has a login to access our s3 bucket. Credentials are maintained by SHI library application developers and IT staff.
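To pull down the most recent dump for inspection or a restore (a sketch; assumes AWS CLI credentials with read access to the bucket):

```shell
# Copy last night's dump from the backup bucket to the current directory.
aws s3 cp s3://chf-hydra-backup/Aspace/aspace-backup.sql .
```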
Export
The ArchivesSpace EADs are harvested by:
Institution | Liaison | Contact |
Center for the History of Science, Technology, and Medicine (CHSTM) | Richard Shrake | |
University of Penn Libraries Special Collections | Holly Mengel |
Both institutions harvest the EADs by automatically scraping https://archives.sciencehistory.org/ead/ . Once harvested, the EADs are added to their aggregated Philly-area EAD search interfaces.
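Since both harvesters simply scrape the directory index at https://archives.sciencehistory.org/ead/, the harvesting step amounts to listing the `.xml` hrefs on that page. A sketch with standard grep/sed (the helper name is ours, not from either harvester):

```shell
# Read an Apache autoindex page on stdin; print the .xml file names it links to.
list_eads() {
  grep -oE 'href="[^"]*\.xml"' | sed -e 's/^href="//' -e 's/"$//'
}

# Usage: curl -s https://archives.sciencehistory.org/ead/ | list_eads
```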
The main export files are located at: /home/ubuntu/archivesspace_scripts
. They are checked into code at https://github.com/sciencehistory/archivesspace_scripts .
Important files:

| File | Purpose |
|---|---|
| complete_export.sh | Runs the nightly export (called by cron every night at 9 PM). This calls as_export.py and generate.sh, below. |
| local_settings.cfg | Settings |
| as_export.py | Extracts XML from ArchivesSpace and saves a series of EADs into /exports/data/ead/*/*.xml. It exports EADs that contain links to the actual digital objects. |
| generate.sh | Transforms the EADs in /exports/data/ead into HTML and saves them into /var/www/html. See for instance https://archives.sciencehistory.org/beckman. It relies on files (stylesheets, transformations) in finding-aid-files and fa-files. |
| xml-validator.sh | Checks that the publicly accessible files in /var/www/html/ead/ are valid. |
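For reference, the crontab entry driving the nightly complete_export.sh run presumably looks something like this (the 9 PM time is stated above; the exact entry lives in the ubuntu user's crontab, so treat this as a sketch):

```shell
# m h dom mon dow command
0 21 * * * /home/ubuntu/archivesspace_scripts/complete_export.sh
```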
Once processed by generate.sh, the XML files are publicly accessible at https://archives.sciencehistory.org/ead/ via an Apache web server.
Details about the as_export.py script:
This code was adapted from https://github.com/RockefellerArchiveCenter/as_export
It is buggy, and calling it via complete_export.sh is the best way to run it reliably.
Building the server
The server is not yet fully Ansible-ized.
What is missing from the ansible build:
It doesn’t copy the scripts over correctly.
More technical documentation
http://archivesspace.github.io/archivesspace/
...
at http://ead.sciencehistory.org/.
Documentation
https://archivesspace.atlassian.net/wiki/
...
Backups
These consist of backups of the MySQL database used by ArchivesSpace.

Place the MySQL database dump in /backup:
mysql-backup.sh dumps the MySQL database to /backup/aspace-backup.sql. This script is run as a crontab by user ubuntu: 30 17 * * 1-5 /home/ubuntu/archivesspace_scripts/mysql-backup.sh

Sync /backup to an s3 bucket:
s3-backup.sh runs an aws s3 sync command to place the contents of /backup at https://s3.console.aws.amazon.com/s3/object/chf-hydra-backup/Aspace/aspace-backup.sql?region=us-west-2&tab=overview. This script is run as a crontab by user ubuntu: 45 17 * * 1-5 /home/ubuntu/archivesspace_scripts/s3-backup.sh
See Backups and Recovery for a discussion of how the chf-hydra-backup s3 bucket is then copied to Dubnium and in-house storage.
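For reference, the two backup scripts boil down to something like the following. This is a hedged sketch: the mysqldump flags, database name, and credentials are assumptions rather than copies from the scripts; the s3 destination is the bucket named above:

```shell
# mysql-backup.sh (sketch): dump the database where s3-backup.sh expects it.
# --single-transaction gives a consistent snapshot without locking tables.
mysqldump --single-transaction --user=the_user --password='...' \
  archivesspace > /backup/aspace-backup.sql

# s3-backup.sh (sketch): mirror /backup to the chf-hydra-backup bucket.
aws s3 sync /backup s3://chf-hydra-backup/Aspace/
```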
Restoring from backup
You can get a recent backup of the database at https://s3.console.aws.amazon.com/s3/object/chf-hydra-backup/Aspace/aspace-backup.sql
Note that the create_aspace.yml playbook creates a minimal, basically empty aspace database with no actual archival data in it.
To restore from such a backup onto a freshly-created ArchivesSpace server,
...
copy your backup database to an arbitrary location on the new server
...
ssh in to the new server
...
Log into the empty archivesspace database:
mysql archivesspace --password='the_archivesspace_database_password' --user=the_user
Once at the mysql command prompt, load the database:
...
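The load step itself is typically a `source` at the mysql prompt or, equivalently, a shell redirect. A sketch, with the dump path standing in for wherever you copied the backup in the first step:

```shell
# From the shell (equivalent to running `source /path/to/aspace-backup.sql;`
# at the mysql> prompt); credentials as above.
mysql archivesspace --user=the_user --password='the_archivesspace_database_password' \
  < /path/to/aspace-backup.sql
```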
The ArchivesSpace wiki home contains comprehensive documentation. If you have a sciencehistory.org address, you can get access to it by filling out a form.