ArchivesSpace (or ASpace for short) is a server whose main purpose is to host a software program also named… ArchivesSpace. The program is “an open source archives information management application for managing and providing web access to archives, manuscripts and digital objects”. The server also hosts a few auxiliary programs that take the output from ArchivesSpace and convert it into various other formats, which are then made available via an Apache web server on the same machine.

Hosting information

In August 2022 we switched from hosting our own ASpace server on EC2 to a third-party-hosted instance at LibraryHost. Hosting and support costs are paid out of Project 1520 - Born Digital. Our annual plan renews in September. Our original contract was for a Light Plan with our database on a shared server; however, from August 2022 until August 2023 our instance was hosted on a standalone server. This was due to physical memory issues in August 2022 that brought down the entire database and required an emergency move from our shared server to a dedicated server. The Light Plan pricing was locked in through the end of the ‘22-’23 annual contract, but in September 2023 we switched to a Plus Plan.

Support: support@libraryhost.com

PUI: https://archives.sciencehistory.org

SUI: https://archives.sciencehistory.org/admin

API: https://sciencehistory-api.libraryhost.com/   

IP: 50.116.19.60

Our wildcard SSL certificates expire annually in October.


We store digital descriptions of our archival collections in the following places:

| Location | Format / type of technology | Number of collections described | Source | Example | Who can see it? |
|---|---|---|---|---|---|
| Shared/P/Othmer Library/Archives/Collections Inventories/Archival Finding Aids and Box Lists | Word documents | Roughly 270, dates 1997 – present | The initial description we create upon accessioning a collection | P/Othmer Library/Archives/Collections Inventories/Archival Finding Aids and Box Lists/Labovsky Collection Finding Aid.doc | Institute staff |
| ArchivesSpace public user interface (PUI) | MySQL-backed website | 504 as of 2/28/2023 | | Tischler papers | Public |
| ArchivesSpace staff user interface (SUI) | Same as above | 544 as of 2/28/2023 (includes unpublished and in progress) | Entered manually based on the P drive Word files | Tischler papers (https://archives.sciencehistory.org/resources/81#tree::resource_81) | Logged-in ArchivesSpace users |
| Public EAD bucket | EAD (XML format) | 504 as of 2/28/2023 | Generated hourly from ArchivesSpace database | https://archives.sciencehistory.org/ead/scihist-2012-021.xml | Public |
| ArchivesSpace Apache front end | HTML | Roughly 45 as of 2/28/2023 | Generated weekly from ArchivesSpace database | http://ead.sciencehistory.org/2012-021.html | Public |
| OPAC | PDF | 460; see complete list | Exported manually as PDF from the ArchivesSpace site, then attached to the OPAC record for the collection | https://othmerlib.sciencehistory.org/articles/1065801.15134/1.PDF | Public |
| https://guides.othmerlibrary.sciencehistory.org/friendly.php?s=CHFArchives | LibGuide | Most collections, categorized by subject | ? | Subject: nuclear chemistry | Technically public, but does not appear to be linked from anywhere |
| WorldCat | MARC records | | Librarians manually update OCLC master records based on the metadata in ArchivesSpace. This is provided in the form of a MARCXML file by Kent and sent to Caroline. | Bredig collection in WorldCat | Public |

Workflow

  • For newly processed collections, finding aids can be first written up as Word documents ultimately stored at Shared/P/Othmer Library/Archives/Collections Inventories/Archival Finding Aids and Box Lists. Finding aids/resource records may also be entered directly into ArchivesSpace or created using the bulk ingest spreadsheet for box and folder inventories or digital objects.

  • For legacy finding aids (finding aids created before ArchivesSpace was in use at SHI), the Word document is revised and the revised finding aid data is entered into ArchivesSpace as a resource record. A list of legacy finding aids may be found at P:\Othmer Library\Archives\Legacy Finding Aid Docs.

  • Processing archivist enters the data into ArchivesSpace. If this data is from a legacy finding aid, the Word document finding aid is revised in the process.

  • Once the collections are described in ArchivesSpace as resource records:

    • Our EAD export app in Heroku (see EAD export app) retrieves public EAD files from ArchivesSpace’s API and posts them to the Science History Institute EAD bucket, where they are harvested by PACSCL and CHSTM (see below).

    • If the processing archivist entered data directly into ArchivesSpace (there is no Word doc version), then a PDF is exported from the SUI and saved to the finding aid folder on the P:\ drive.

      • Note: the PDF or the Word doc has to be manually updated every time the resource record in ArchivesSpace changes.

    • The processing archivist exports a MARC XML version of the resource record and sends it to a cataloging librarian (usually Caroline), who creates a record in OCLC and the OPAC. The OPAC also points to a PUI URL at https://archives.sciencehistory.org/ .

      • Alternately, the cataloging librarian could use an ASpace account to export the MARC XML themselves.

      • Previously, there was a PDF version of the finding aid attached to the OPAC record. This practice has been discontinued with the launch of the ASpace PUI in Summer 2022.

    • Certain works in the Digital Collections also point to the PUI. Example: https://digital.sciencehistory.org/works/81jkowj.

  • Finally, the exported EAD files in the Science History Institute EAD bucket are also ingested by University of Penn Libraries Special Collections and the Center for the History of Science, Technology, and Medicine (CHSTM).

    • Penn, in turn, processes these EAD files on a nightly basis and adds them to the Philadelphia Area Archives Research Portal (PAARP), a search portal funded by PACSCL.

      • Example: https://dlafindingaids.library.upenn.edu/dla/pacscl/detail.html?id=PACSCL_SCIHIST_2012021USpaphchf

      • A conversation with Holly Mengel, the archivist responsible for the process, reassured us that the only thing required for this export to work is for valid EAD files to be publicly accessible in the directory at https://archives.sciencehistory.org/ead/ . This URL could be changed as long as we give Holly plenty of notice and coordinate with her, which raises the possibility of us posting them to e.g. an S3 bucket.

      • Notably, Holly assures us that the apparatus at PAARP / PACSCL does not link back to archival descriptions hosted on any of our domains.

      • records/SCIHIST_2015.003

    • Likewise, CHSTM ingests these EADs and makes them searchable at its search portal.

    • Attempts to contact our liaison at CHSTM, Richard Shrake, have failed.

  • Note that external links to our HTML finding aids are rare and can be disregarded. There should be no need to provide redirects to these URLs when we eliminate them.

Technical details about the server

ArchivesSpace lives on an AWS EC2 server, ArchivesSpace-prod, at https://50.16.132.240/ (also found at https://archives.sciencehistory.org)

The current production version of ASpace is 2.7.1.

Terminal access: ssh -i /path/to/production/pem_file.pem ubuntu@50.16.132.240

The ubuntu user owns all the admin scripts.

The relevant Ansible role is: /roles/archivesspace/ in the ansible-inventory codebase.

SSL is based on the following: http://www.rubydoc.info/github/archivesspace/archivesspace

The executables are at /opt/archivesspace/

The configuration file is /opt/archivesspace/config/config.rb
Logs are at: logs/archivesspace.out

Apache logs are at /var/log/apache2/.

Configuration for the Apache site is at /etc/apache2/sites-available/000-default.conf. It would be a good idea to spend some time drastically simplifying this configuration.

Main users

  • Kenton Jaenig

  • Sarah Newhouse

  • Patrick Shea

Startup

  • To start ArchivesSpace: sudo systemctl start archivesspace. You may need to run this several times (wait 30 seconds between attempts).

    • Alternatively, run /opt/archivesspace/archivesspace.sh start as user ubuntu.

  • You can troubleshoot startup by looking at the start script (invoked by the above): /opt/archivesspace/archivesspace.sh start

  • There may be a short delay as the server re-indexes data.
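The retry-and-wait loop described above can be sketched as a small helper. This is hypothetical (no such script exists on the server) and assumes systemctl reports failure through its exit code; the `runner` parameter is only there so the loop can be exercised without root:

```python
import subprocess
import time

def start_archivesspace(attempts=5, pause=30, runner=subprocess.run):
    """Try `systemctl start archivesspace` up to `attempts` times.

    Returns True as soon as one attempt exits 0, pausing `pause`
    seconds between failed attempts; returns False if all fail.
    """
    for i in range(attempts):
        result = runner(["sudo", "systemctl", "start", "archivesspace"])
        if result.returncode == 0:
            return True
        if i < attempts - 1:
            time.sleep(pause)
    return False
```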

Restarting the server to fix Tomcat memory leak

We restart the ArchivesSpace program (not the server) using a cronjob that runs /opt/archivesspace/archivesspace.sh restart every night at 2 am. This prevents a chronic memory leak from eating up all the CPU credits for the machine.
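The corresponding crontab entry for the ubuntu user would look something like this (a reconstruction of the 2 am schedule described above, not a copy of the live crontab):

```
0 2 * * * /opt/archivesspace/archivesspace.sh restart
```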

When the server is restarted, Jetty creates a set of temporary files in /tmp that look like this:

hsperfdata_ubuntu
jetty-0.0.0.0-8089-backend.war-_-any-3200460420275417425
jetty-0.0.0.0-8090-solr.war--any-_1669707332158985985
jetty-0.0.0.0-8091-indexer.war-_aspace-indexer-any-3026688914663148716
jetty-0.0.0.0-8080-frontend.war--any-3028692540497613460
jetty-0.0.0.0-8081-public.war--any-268053434795494538
jetty-0.0.0.0-8082-oai.war--any-_243630232179303838

Only the most recent set are used by Jetty, but the old ones accumulate rapidly if the server is restarted nightly.

A system to clean these up will be needed: some variation on find /tmp -maxdepth 1 -type d -mtime +20 | grep 'jetty.*war' .

Backups

A nightly backup is uploaded by LibraryHost to s3://chf-hydra-backup/Aspace/aspace-backup.sql.
LibraryHost has a login to access our s3 bucket. Credentials are maintained by SHI library application developers and IT staff.

Export

The ArchivesSpace EADs are harvested by:

| Institution | Liaison | Contact |
|---|---|---|
| Center for the History of Science, Technology, and Medicine (CHSTM) | Richard Shrake | shraker13@gmail.com |
| University of Penn Libraries Special Collections | Holly Mengel | hmengel@pobox.upenn.edu |

Both institutions harvest the EADs by automatically scraping https://archives.sciencehistory.org/ead/ . Once harvested, the EADs are added to their aggregated Philly-area EAD search interfaces.

The main export files are located at /home/ubuntu/archivesspace_scripts and are checked into version control at https://github.com/sciencehistory/archivesspace_scripts .

Important files:

  • complete_export.sh: Runs the nightly export (called by cron every night at 9 PM). This calls as_export.py and generate.sh below.

  • local_settings.cfg: Settings.

  • as_export.py: Extracts XML from ArchivesSpace and saves a series of EADs into /exports/data/ead/*/*.xml . It exports EADs that contain links to the actual digital objects.

  • generate.sh: Transforms the EADs in /exports/data/ead into HTML and saves them into /var/www/html (see for instance https://archives.sciencehistory.org/beckman). It relies on files (stylesheets, transformations) in fa-files.

  • xml-validator.sh: Checks that the publicly accessible files in /var/www/html/ead/ are valid.

Once processed by generate.sh, the xml files are publicly accessible at https://archives.sciencehistory.org/ead/ via an Apache web server.
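For orientation, here is a minimal sketch of how a script like as_export.py can pull EAD out of the ArchivesSpace REST API. The routes are the standard ArchivesSpace API endpoints; the host constant and function names are illustrative, not the script's actual code:

```python
import urllib.parse
import urllib.request

API = "https://sciencehistory-api.libraryhost.com"  # our API base, per the hosting info above

def login_url(user):
    # POST this URL with a 'password' form field; the JSON response
    # carries a 'session' token.
    return f"{API}/users/{urllib.parse.quote(user)}/login"

def ead_url(repo_id, resource_id):
    # Standard ArchivesSpace route for the EAD serialization of a resource record.
    return f"{API}/repositories/{repo_id}/resource_descriptions/{resource_id}.xml"

def fetch_ead(session_token, repo_id, resource_id):
    # Authenticated GET; the session token travels in this header.
    req = urllib.request.Request(
        ead_url(repo_id, resource_id),
        headers={"X-ArchivesSpace-Session": session_token},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

A script built this way would call fetch_ead once per published resource record and write each result out under /exports/data/ead/.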


Building the server

The server is not yet fully Ansible-ized.

What is missing from the ansible build:

  • The build doesn’t copy the scripts in /home/ubuntu over correctly. Passwords for the scripts also need to be provided.

  • All these directories under /var/www/html/ are also missing: css; ead; font-awesome-4.7.0; fonts; img; js.

  • The ubuntu user needs to be added to the www-data group

  • SSL keys are not loaded into /etc/ssl/private/

  • The archivesspace server is not actually started (sudo systemctl start archivesspace).

Backups

These consist of backups of the SQL database used by the ArchivesSpace program.

  • mysql-backup.sh: Places the MySQL database in /backup by dumping it to /backup/aspace-backup.sql. This script is run from user ubuntu’s crontab: 30 17 * * 1-5 /home/ubuntu/archivesspace_scripts/mysql-backup.sh

  • s3-backup.sh: Syncs /backup to an s3 bucket by running an aws s3 sync command to place the contents of /backup at https://s3.console.aws.amazon.com/s3/object/chf-hydra-backup/Aspace/aspace-backup.sql?region=us-west-2&tab=overview. This script is run from user ubuntu’s crontab: 45 17 * * 1-5 /home/ubuntu/archivesspace_scripts/s3-backup.sh

See Backups and Recovery (Needs updating) for a discussion of how the chf-hydra-backup s3 bucket is then copied to Dubnium and in-house storage.

Restoring from backup

You can get a recent backup of the database at https://s3.console.aws.amazon.com/s3/object/chf-hydra-backup/Aspace/aspace-backup.sql

Note that the create_aspace.yml playbook creates a minimal, basically empty aspace database with no actual archival data in it.

To restore from such a backup onto a freshly-created ArchivesSpace server,

  • copy your backup database to an arbitrary location on the new server

  • ssh in to the new server

  • Log into the empty archivesspace database:

    • mysql archivesspace --password='the_archivesspace_database_password' --user=the_user

  • Once at the mysql command prompt, load the database:

    • mysql> \. /path/to/your/aspace-backup.sql


Documentation

https://archivesspace.atlassian.net/wiki/home contains comprehensive documentation.

If you have a sciencehistory.org address, you can get access to it by filling out a form. See also https://github.com/sciencehistory/ansible