Internet Archive BookReader: Development notebook

We DID decide to go forward with an attempt at implementing BookReader integration. A development notebook.

In an iFrame

The BookReader is written to support either embedding directly on a page or displaying in an iFrame. Although an iFrame makes communication with the parent page more complicated (including updating URL fragments to represent the current page or search), we’ve decided to go that way for the isolation it provides.

  1. It keeps the JavaScript for the book reader completely separate from the rest of our app.

    1. The BookReader JS is fairly complicated and heavyweight; we don't want to include it in all, or even any, of our pages.

    2. Even worse, the book reader JS conflicts with some of our JS.

  2. We were having trouble getting the book reader to display in a Bootstrap modal. We probably could have worked around this, but it seems to work out better in an iFrame: it is easier to style the reader to take up the “whole page” (the whole iframe), and then we can size the iframe ourselves.
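
The parent-page communication mentioned above could probably be handled with postMessage. This is only a minimal sketch: the message shape (type, page, search), the 'bookreader:state' name, and both function names are hypothetical and not part of BookReader. It assumes the iframe and the parent page are served from the same origin.

```javascript
// Hypothetical message posted by the reader iframe whenever its state changes:
//   { type: 'bookreader:state', page: 12, search: 'whale' }

// Turn reader state into a URL fragment such as "#page=12&search=whale".
function fragmentFromState(state) {
  const params = new URLSearchParams();
  if (state.page != null) params.set('page', String(state.page));
  if (state.search) params.set('search', state.search);
  const encoded = params.toString();
  return encoded ? `#${encoded}` : '';
}

// Parent-page side: ignore messages from unexpected origins or of
// unexpected types; otherwise return the fragment to put in the URL.
function fragmentForMessage(event, expectedOrigin) {
  if (event.origin !== expectedOrigin) return null;
  const data = event.data;
  if (!data || data.type !== 'bookreader:state') return null;
  return fragmentFromState(data);
}

// Browser-only wiring, shown as comments so the sketch also runs in Node.
// In the parent page:
//   window.addEventListener('message', (event) => {
//     const fragment = fragmentForMessage(event, window.location.origin);
//     if (fragment) history.replaceState(null, '', fragment);
//   });
// In the iframe:
//   window.parent.postMessage(
//     { type: 'bookreader:state', page: 12 }, window.location.origin);
```

Checking event.origin before trusting the message is the main point; the fragment format itself is arbitrary.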

With our JS bundled

The BookReader’s own updated examples show getting all JS from various CDNs as separate files. This is slow and unreliable; we want to bundle as normal.

We did get this to work, although with some errors in the console (that don’t seem to cause any problems).

We are NOT for now loading any “custom element polyfills”, even though the example included them – we do not think they are needed for any browsers we target! https://caniuse.com/custom-elementsv1

Known Problems

We are proceeding with the proof-of-concept demo, but we’ll probably have to fix all of these before we go live.

  • [ ] My console does have an error for a 404 on trying to load e.g. http://localhost:3000/BookReader/images/loading.gif, which of course is not there. Not sure if there's a good recommended way to deal with images -- or if there are OTHER image references that might wind up missing too?

  • [ ] In macOS Safari I'm having more trouble.

  • [ ] Error in console

    • This does NOT reproduce with the original README example, but does seem to be a result of our refactoring the JS into a separate JS file that imports its dependencies.

    • It does not seem to cause any known problems, just an error in the console.

    • It is hard to debug the error, because the versions of the JS src I am using from the npm package are already minified, so kind of inscrutable!

    • Uncaught TypeError: Cannot read properties of undefined (reading 'identifier')

    • Full stack trace:
      @internetarchive_bookreader_BookReader_ia-bookreader-bundle__js.js?v=9a269131:5357 Uncaught TypeError: Cannot read properties of undefined (reading 'identifier')
          at new e2 (@internetarchive_bookreader_BookReader_ia-bookreader-bundle__js.js?v=9a269131:5357:57)
          at l2.value (@internetarchive_bookreader_BookReader_ia-bookreader-bundle__js.js?v=9a269131:5861:36)
          at l2.value (@internetarchive_bookreader_BookReader_ia-bookreader-bundle__js.js?v=9a269131:5969:120)
          at @internetarchive_bookreader_BookReader_ia-bookreader-bundle__js.js?v=9a269131:5998:18
          at Set.forEach (<anonymous>)
          at @internetarchive_bookreader_BookReader_ia-bookreader-bundle__js.js?v=9a269131:5997:30

    • The top line of the stack trace, in the inscrutable minified JS, is: var a2 = null == n2 ? void 0 : n2.metadata, s2 = a2.identifier, l2 = a2.creator, c2 = a2.title, u2 = Array.isArray(l2) ? l2[0] : l2, d2 = i2.options.subPrefix || ""; Note that a2 is undefined whenever n2 is null or has no metadata, so reading a2.identifier throws.

    • Tried to report at https://github.com/internetarchive/bookreader/pull/1321#issuecomment-1984605074

 

JSON response

The bookreader JS needs info on all the pages. It can use the existing JSON response we have for the existing reader. It doesn't need JSON that exactly matches the argument format of BookReader; we can easily write code to read the existing JSON and prepare the argument.

BUT the existing JSON doesn't have quite all the info we need, although it's close, at least for what I need right now. We also need DPI, and all the scaled derivatives we have (the existing JSON only has the downloadable ones, for the download menu).
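
A rough sketch of the "read the existing JSON and prepare the argument" step. The input shape here (title, dpi, and pages with width, height, image_url) is hypothetical and is not our real response. The output follows the nested data array of { width, height, uri } objects shown in the BookReader README example; treating ppi as the option that carries DPI is my understanding and should be checked against the BookReader options. Scaled derivatives are left out of this sketch.

```javascript
// Hypothetical shape of our existing viewer JSON (not the real response):
//   { title: "...", dpi: 300, pages: [{ width, height, image_url }, ...] }

// Group pages into the nested "spreads" array used by BookReader's `data`
// option in its README example: the first page alone, then pairs.
function toBookReaderOptions(viewerJson) {
  const pages = viewerJson.pages.map((page) => ({
    width: page.width,
    height: page.height,
    uri: page.image_url,
  }));

  const data = [];
  if (pages.length > 0) data.push([pages[0]]);
  for (let i = 1; i < pages.length; i += 2) {
    data.push(pages.slice(i, i + 2));
  }

  return {
    bookTitle: viewerJson.title,
    ppi: viewerJson.dpi, // assumption: one DPI value for the whole work
    data,
  };
}
```

With four pages this produces three spreads: one single page, one pair, and a trailing single page.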

  1. I could ADD the info we need to the existing JSON. Con: this is already an enormous response that takes a long time to return, and adding more bytes makes it worse. Pro: it may be easier to maintain only ONE JSON serializer; if we later do something to optimize it for being so huge, there is one place to do it.

  2. I could write a NEW serializer just for the bookreader, kind of like what you already did in your quick and dirty proof of concept (but without all that faked metadata we don't need!). Pro: each response is slightly smaller (though they're still probably within 10% of each other), and maybe it's EASIER to maintain two serializers custom fit for their use than one doing double duty. The JSON _can_ be exactly the format the bookreader needs, which is _slightly_ easier. Con: two things to maintain, any future optimizations need to be applied to both, etc.

We say:

2 absolutely

If we are still using both viewers in a year, then of course we can spend some time seeing if we can get them to work in some kind of harmony.

I have no problem with having two things to maintain - that’s implicit in this whole project anyway. It’s easier to maintain two things that each perform one function than one thing that performs two.

 

Back-end-provided search results with coordinates

27-March-2024

I know that some others in the Samvera community have done highlighted search results in the browser, from OCR. They return results in the “IIIF Search” API format, but we don’t really care about using the somewhat convoluted IIIF API format. They use the blacklight_iiif_search plugin – but that only formats the results in IIIF Search API format; the actual logic of producing the results is left to the implementer and is not included in the plugin!

I had assumed that implementers were doing something to get Solr to return results with coordinates, so you could have Solr stemming, phrase searching, and other features, and get results back with coordinates. But it turns out I don’t believe other Samvera community people are doing this!

Here is Princeton’s initial implementation. It uses Solr only to identify relevant pages, then loads the HOCR for those pages into memory, and Ruby code working on the in-memory HOCR identifies the matching words and their coordinates. This means it doesn’t handle stemming or phrase searching!
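
To make the in-memory step concrete, here is a minimal sketch of pulling matching words and their boxes out of hOCR. It is written in JavaScript purely for illustration (the implementation described above is Ruby), the function name is hypothetical, and it is not Princeton's code. It assumes Tesseract-style word spans with class ocrx_word before a title that starts with "bbox x0 y0 x1 y1". A real implementation would use an HTML parser rather than a regex.

```javascript
// Matches word spans such as:
//   <span class='ocrx_word' id='w1' title='bbox 10 10 60 40; x_wconf 96'>The</span>
// Assumes the class attribute comes before title and the word has no nested tags.
const WORD_SPAN =
  /<span[^>]*class=['"]ocrx_word['"][^>]*title=['"]bbox (\d+) (\d+) (\d+) (\d+)[^'"]*['"][^>]*>([^<]*)<\/span>/g;

// Return every OCR word that exactly equals one of the search terms,
// with its bounding box as [x0, y0, x1, y1].
function findWordMatches(hocr, terms) {
  const wanted = new Set(terms.map((term) => term.toLowerCase()));
  const matches = [];
  for (const match of hocr.matchAll(WORD_SPAN)) {
    const [, x0, y0, x1, y1, text] = match;
    // Compare on letters and digits only, so "Whale," matches "whale".
    const normalized = text.toLowerCase().replace(/[^\p{L}\p{N}]/gu, '');
    if (wanted.has(normalized)) {
      matches.push({ text, bbox: [x0, y0, x1, y1].map(Number) });
    }
  }
  return matches;
}
```

Because this is exact word comparison, a search for "whale" will not find "whales", which is the stemming limitation noted above.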

It turns out that to do what I initially thought, actually getting results with coordinates out of Solr, you probably need a custom Solr plugin written in Java. There are such plugins from other universities (not in the Samvera community): for instance, an old solr-ocrpayload-plugin, which is no longer supported, and a newer, still-maintained solr-ocrhighlighting plugin. But our current hosted Solr provider, SearchStax, probably doesn’t let us add custom jars like this. And this goes above and beyond what our Samvera peers are doing; it is really a possible future enhancement.

So for now, the basic approach taken by Eddie in his demo is probably right, but with some enhancements:

  • We want a multi-word entry to do an “or” search, highlighting all words. (Phrase search is not supported, and quotes and probably other punctuation should be stripped from the search entry.)

  • We can probably go to Postgres to identify the initial page hits of interest, instead of loading all pages into memory. Once we have the matching pages, we still use the same in-memory HOCR technique to get the match coordinates.

    • The Postgres regexp searching feature could be handy for making sure we have full-word matches! Note the \m and \M special codes for matching word boundaries, in POSIX regexp searching.

    • It doesn’t use indexes and isn’t an efficient search, but we’re only searching over the pages in ONE book, so it’ll be fine, and more efficient than doing it in memory in Ruby.
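
The query handling in the list above can be sketched as follows. It is in JavaScript only to show the logic (the back end would do this in Ruby), and both function names are hypothetical. The pattern is meant for Postgres's case-insensitive regexp operator (~*), passed as a bound parameter.

```javascript
// Split a raw search entry into lowercase terms. Anything that is not a
// letter or digit (quotes, punctuation, whitespace) separates terms.
function searchTerms(query) {
  return query.toLowerCase().match(/[\p{L}\p{N}]+/gu) || [];
}

// Build an "or" pattern with Postgres word-boundary escapes:
// \m matches only at the start of a word, \M only at the end.
// Terms contain only letters and digits, so no regexp escaping is needed.
function postgresWordPattern(terms) {
  if (terms.length === 0) return null;
  return `\\m(${terms.join('|')})\\M`;
}

// Hypothetical usage; the table and column names are made up:
//   SELECT page_id FROM pages WHERE work_id = $1 AND ocr_text ~* $2
// with $2 bound to postgresWordPattern(searchTerms(userEntry)).
```

One consequence of splitting on punctuation is that an entry like "don't" becomes the two terms "don" and "t", which may or may not be what we want.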

 

Can’t get search plugin to load

27 March 2024

When trying to load the search plugin, I get an error that looks to me like a bug in the search plugin.

This reproduces with my npm/import setup for bundling the source, and also if I take the simple example at https://github.com/internetarchive/bookreader/pull/1321 and add a <script> tag for plugin.search.js from unpkg.com to it.

The search plugin DOES seem to work in the example served from a bookreader source checkout by npm run serve, an example based on demo-iiif.html – not totally sure what it does to get things working.

After loading search plugin, error on initializing a BookReader element:

TypeError: Cannot read properties of undefined (reading 'dom')

On the line:

                if (this.searchView.dom.toolbarSearch) {

 

That line is in the search plugin's override of buildToolbarElement.

So the BookReader object does not have the `searchView` property set.

The search plugin TRIED to set it in an overridden init.

But the overridden init FIRST calls the “super” init, and the original super init calls initToolbar().

That in turn calls buildToolbarElement, which the plugin has overridden to access this.searchView – which is not set yet, since the plugin's init called super first, before setting up this.searchView.
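
The ordering problem can be reproduced in miniature. This is a simplified illustration, not the BookReader source: plain classes stand in for however the plugin actually installs its overrides, and the class names and toolbar contents are made up. It also shows one possible fix, setting up searchView before delegating to the parent init.

```javascript
// Simplified stand-in for BookReader: init() builds the toolbar.
class Reader {
  init() {
    this.initToolbar();
  }
  initToolbar() {
    this.toolbar = this.buildToolbarElement();
  }
  buildToolbarElement() {
    return ['nav'];
  }
}

// Shared toolbar override, modeled on the failing line quoted above.
function addSearchToToolbar(reader, toolbar) {
  if (reader.searchView.dom.toolbarSearch) {
    toolbar.push(reader.searchView.dom.toolbarSearch);
  }
  return toolbar;
}

// Mirrors the order described above: super init first, searchView second.
class BrokenSearchReader extends Reader {
  init() {
    super.init(); // buildToolbarElement() runs in here...
    this.searchView = { dom: { toolbarSearch: 'search-box' } }; // ...too late
  }
  buildToolbarElement() {
    return addSearchToToolbar(this, super.buildToolbarElement());
  }
}

// One possible fix: set up searchView before delegating to super init.
class FixedSearchReader extends Reader {
  init() {
    this.searchView = { dom: { toolbarSearch: 'search-box' } };
    super.init();
  }
  buildToolbarElement() {
    return addSearchToToolbar(this, super.buildToolbarElement());
  }
}
```

Calling new BrokenSearchReader().init() throws a TypeError from reading dom on an undefined searchView, while FixedSearchReader initializes normally. Whether reordering is safe in the real plugin depends on what its searchView setup needs from the base init, which I have not checked.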

Believe search plugin does not work

I believe, as above, I actually found a bug in the search plugin that ships with the BookReader – I don’t think it actually works, or is currently being used by anyone.

So how is Internet Archive itself doing search? Not sure how they’re doing it in production. One of the demo files that comes with the source is called demo-internetarchive, and it is the only demo that includes a search example – but this demo does not actually use the search plugin. The demo fetches a live JS logic file from archive.org, which I think has fairly complex custom logic for implementing search?

My guess is maybe they tried to extract this as the search plugin… I'm curious whether anyone is actually using the search plugin live (maybe previous versions of it worked even though the latest version has a bug).

This bug we have diagnosed and could probably PR to fix (although might have to figure out some apparatus around how to add tests etc). But if the code isn’t actually currently being used anywhere, there might be other bugs and in general the level of polish of this feature may not be what we thought?

We are going to shift gears and stop doing the R&D work with IA BookReader for now, and instead look at adding a search-inside feature to our custom local viewer. BookReader is amazing code, very impressive, and much thanks to Internet Archive for making it open source! However, we know from our own experience how hard and time-consuming it is to take code that was written bespoke for you and make it easily re-usable by third parties. It’s very hard and very expensive! At this point we think the cost/benefit of adding the feature to our local viewer makes more sense.