Internet Archive BookReader

Overview

The Internet Archive BookReader is a candidate to replace the digital collections' custom-built viewer. See https://sciencehistory.atlassian.net/wiki/spaces/HDC/pages/2206203905/OCR+planning+notes#Search-within--Work for a discussion in the context of other candidates.

A working example: the bird book. Note

the ease of navigation within the book;
the right and left side navigation that allows you to interact with the book in much the same way as a codex;
the mature search-within-book interface;
the scalable, copyable text overlay.

A custom-built PHP script serves the images.

Important links and resources

GitHub site: https://github.com/internetarchive/bookreader/
Pre-GitHub site: https://openlibrary.org/dev/docs/bookreader

Search within the book

In December 2023, we ran some experiments to see if we could integrate a modified version of the BookReader demo code with image and text metadata from the digital collections. Interestingly enough, we were able to get the BookReader to consume not only our images but also our HOCR content, which allowed us to demo a simple version of “search inside the book”.

The code for the demo is here.
The item from the digital collections used is hardcoded, as this was just a proof of concept.
The serializers that we use to serve our metadata to the bookreader are at this PR. (Don’t merge it!)
Note that this demo uses single JPG images, so it is not appropriate for delivering full-resolution pan-and-zoom. That would need a tiling solution (such as IIIF, see below), which we have not yet been able to make work.
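To illustrate what "search inside the book" needs from HOCR, here is a minimal Python sketch. It is not the serializer code from the PR above: the class and function names and the sample markup are made up for illustration, and it assumes each word is a flat `ocrx_word` span whose `title` attribute carries a `bbox x0 y0 x1 y1` property, as the hOCR format specifies. How matches are then serialized for the BookReader is not shown.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Extract 'bbox x0 y0 x1 y1' from an hOCR title attribute, or None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs from hOCR `ocrx_word` spans.

    Assumption: word spans contain plain text only (no nested spans).
    """

    def __init__(self):
        super().__init__()
        self.words = []      # list of (text, (x0, y0, x1, y1))
        self._bbox = None    # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def search(hocr, query):
    """Return the words (with page coordinates) that contain `query`."""
    parser = HocrWords()
    parser.feed(hocr)
    needle = query.casefold()
    return [(text, bbox) for text, bbox in parser.words
            if needle in text.casefold()]


# Hypothetical sample page, not taken from the digital collections.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 1000 1500'>
  <span class='ocr_line' title='bbox 100 200 900 240'>
    <span class='ocrx_word' title='bbox 100 200 220 240; x_wconf 96'>Scarlet</span>
    <span class='ocrx_word' title='bbox 230 200 380 240; x_wconf 93'>Tanager</span>
  </span>
</div>
"""

print(search(SAMPLE, "tanager"))
```

The bounding boxes are what let a viewer draw highlights over the page image, which is why the word-level HOCR matters and plain text alone would not be enough.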

IIIF

IIIF is a standard that could theoretically allow us to use the BookReader to consume our images and metadata. (The APIs that interest us are the Image API and the Content Search API.) This would notably allow us to offer pan-and-zoom functionality, among other useful features. For this to work:

we would need to serve our metadata according to some version of those APIs, presumably the latest, and
the BookReader would have to work against those APIs.
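
As a rough illustration of why the Image API enables pan-and-zoom, here is a minimal Python sketch of how its request URLs are laid out: `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The server base URL and the identifier are hypothetical placeholders, not endpoints we actually run, and the helper names are invented for this example. The size keyword `max` is the Image API 3.0 spelling; earlier versions used `full`.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL.

    Path layout: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    """
    # Percent-encode the identifier so any "/" in it stays in one path segment.
    ident = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{ident}/{region}/{size}/{rotation}/{quality}.{fmt}"


def tile_url(base, identifier, x, y, width, height, scaled_width):
    """URL for one rectangular region, scaled to `scaled_width` pixels wide."""
    region = f"{x},{y},{width},{height}"
    size = f"{scaled_width},"  # "w," means: scale to this width, keep aspect ratio
    return iiif_image_url(base, identifier, region=region, size=size)


BASE = "https://images.example.org/iiif"  # hypothetical server

# Whole page at the largest size the server offers.
print(iiif_image_url(BASE, "page-0001"))

# One 1024x1024 region of the page, delivered at half resolution.
print(tile_url(BASE, "page-0001", 1024, 0, 1024, 1024, 512))
```

A tiling viewer issues many small requests of the second kind, fetching only the regions currently on screen at the current zoom level, instead of one full-resolution image per page.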

The Internet Archive and IIIF

This blog post describes the history of the Internet Archive and IIIF. The key sentence seems to be: “By making Internet Archive images and texts IIIF-compatible, they may be opened using any number of compatible IIIF viewer apps, each offering their own advantages and unique features”. Tellingly, the post makes no mention of the BookReader.

The Internet Archive does in fact maintain a IIIF server, but its front end is actually Mirador (which itself includes the OpenSeadragon viewer).

The BookReader and IIIF

Originally we wanted to look at the BookReader’s IIIF demo/plugin because we thought it was a proven, working path to integrating with the BookReader, and one that would also give us tiled images at the appropriate zoom level (instead of fetching entire full-page, full-resolution images). We put some time into trying this path, but:
- It appears neither of those assumptions was true: the IIIF plugin was actually still fetching whole-page graphics, and the plugin/demo appears not to be working and to need a lot of work.
- We aren’t actually wedded to IIIF (we don’t currently even use it), so this isn’t necessarily a disaster; it just means this was the wrong path to investigate.
- What follows are notes about the issues with the IIIF plugin.

The BookReader is not listed in the table of viewers that have been recently tested against the latest version of IIIF.
There is an active IIIF Slack channel, but mentions of BookReader are rare. Apparently barmintor at Columbia showed some interest in 2023 in getting the BookReader to work with IIIF, but I was unable to get him (or anyone else on the channel) to discuss it.
The BookReader demo code contains a broken page that used to work with IIIF 1.0. It is described as a “rapid proof of concept” and consists of two short files:
https://github.com/internetarchive/bookreader/blob/master/BookReaderDemo/IIIFBookReader.js and https://github.com/internetarchive/bookreader/blob/master/BookReaderDemo/demo-iiif.js.
An open issue describes the problem with the above demo. As far as I can tell, the latest attempt to fix the broken demo code dates back to 2020. nynaalekhya posted a PR which went unapproved and unmerged. (Interestingly, as far as I can tell, the PR does not actually fix the demo.) However, the three comments on the issue are interesting:
- the first hypothesizes that the problem is a page number not being set before making “the call”.
- the second mentions the last working commit to the IIIF demo; if this is true, the demo has been broken since 2019.
- the third states that the JS client is expecting a IIIF 1.0 manifest.
Based on that second comment, I was able to get the IIIF demo running at its last working commit.
- It’s worth noting that the IIIF manifest that’s hard-coded into the demo (and/or the files that it lists) is also problematic, regardless of the demo’s other bugs:
  - The manifest lists pages in the form https://iiif.archivelab.org/iiif/platowithenglish04platuoft$3/full/full/0/default.jpg
  - which the BookReader code then modifies to the following URL, which it GETs: https://iiif.archivelab.org/iiif/platowithenglish04platuoft$3/full/800,/0/native.jpg
  - The $3 denotes the page number; replacing that 3 with any integer except [3, 4, 5, 8, 9, 11, 13] results in a 404.
  - You can confirm this by attempting to load the manifest into e.g. Mirador.
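
The URL rewrite described above can be sketched in Python as follows. This is an illustration of the observed before/after URLs only, not the BookReader's actual code; the function names are invented, and the sketch makes no network requests, so it does not itself check which pages return a 404.

```python
import re

# Matches the trailing /{region}/{size}/{rotation}/{quality}.{format}
# of an IIIF image URL.
_IIIF_TAIL = re.compile(
    r"/(?P<region>[^/]+)/(?P<size>[^/]+)/(?P<rotation>[^/]+)"
    r"/(?P<quality>[^/.]+)\.(?P<fmt>\w+)$"
)


def rewrite_for_legacy_client(url, width=800):
    """Swap the size for 'WIDTH,' and the quality for the 1.x-era 'native'."""
    m = _IIIF_TAIL.search(url)
    if m is None:
        raise ValueError(f"not an IIIF image URL: {url}")
    base = url[:m.start()]
    return f"{base}/{m['region']}/{width},/{m['rotation']}/native.{m['fmt']}"


def page_url(template_url, page):
    """Replace the page number that follows '$' in the identifier."""
    return re.sub(r"\$\d+(?=/)", f"${page}", template_url, count=1)


listed = ("https://iiif.archivelab.org/iiif/"
          "platowithenglish04platuoft$3/full/full/0/default.jpg")

print(rewrite_for_legacy_client(listed))
print(rewrite_for_legacy_client(page_url(listed, 7)))
```

The `native` quality belongs to early versions of the Image API and was replaced by `default` in 2.0, which is consistent with the comment that the client expects a IIIF 1.0 manifest.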

Even if I had been able to fix the open issue, the work would have been of little help to us, since both the BookReader and the IIIF standard have evolved too much over the intervening five years of development.

Conclusion

I have to conclude that the BookReader is not worth pursuing as a component of the digital collections. While impressive in its current form, it depends on a complex and poorly documented set of interfaces with the Internet Archive’s image and metadata servers, and relies in particular on a home-grown PHP image server script that looks difficult to maintain.

I certainly understand the IA’s desire (which can be inferred from their blog post) to move to a more interoperable standard for serving images, and to get out of the business of maintaining an image viewer altogether.