After exploring several options for providing in-web-browser search-inside-the-work, including some R&D experimentation with the Internet Archive BookReader, we have decided to add a “search-inside” feature to our current in-house-developed “Viewer” JavaScript after all.
Use case
You want to be able to:
search within a work for a word or phrase, and see what pages have hits. (Should we specify case sensitivity, i.e., that search will or will not provide case-insensitive results? -GZ)
Gabriela: I’m still working out exactly how the search will work balancing feasibility and needs. It will be case-insensitive. It may or may not be “whole word”. This is something we have to work out by seeing what we can do and how well it works.
get highlighted results on the page image (hOCR makes this possible)
Do we need to specify that initial phase is not for mobile?-GZ
I’m not sure if that’s true or not, we need to decide!
My vote is not mobile right now (does “mobile” include iPads?)
We have a prototype demonstrating we can do this, although it will be a lot of work to polish it.
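As a rough illustration of why hOCR makes highlighting possible: each recognized word in an hOCR file carries a `bbox` with its pixel coordinates on the page image. The sketch below is not our prototype code. It assumes Tesseract-style hOCR, where words are `span` elements with class `ocrx_word` and a `title` attribute such as `bbox 100 200 180 230; x_wconf 96`. The names `HocrWordParser` and `bbox_as_fractions` are made up for this example.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for each ocrx_word span in an hOCR page."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span we are currently inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(n) for n in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def bbox_as_fractions(bbox, page_width, page_height):
    """Express a pixel bbox as fractions of the page, so that a highlight
    can be drawn on the image at whatever size it is displayed."""
    x0, y0, x1, y1 = bbox
    return (x0 / page_width, y0 / page_height, x1 / page_width, y1 / page_height)


# Hand-written sample in the assumed Tesseract hOCR shape.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 1000 1500'>
  <span class='ocr_line' title='bbox 90 190 400 240'>
    <span class='ocrx_word' title='bbox 100 200 180 230; x_wconf 96'>Search</span>
    <span class='ocrx_word' title='bbox 190 200 260 230; x_wconf 93'>inside</span>
  </span>
</div>
"""

parser = HocrWordParser()
parser.feed(SAMPLE)
parser.close()
for text, bbox in parser.words:
    print(text, bbox_as_fractions(bbox, 1000, 1500))
```

Storing coordinates as fractions of the page is one possible design choice. It keeps the highlight independent of the resolution the OCR was run at.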
Will not include
The viewer images will be highlighted with results, but these images do not allow copy-paste of text.
The only tool I know of that does that is the Internet Archive BookReader. UniversalViewer, which many Samvera peers are using, does not have this feature either, to my knowledge. If there are other examples of this feature, please link them for reference.
Adding this to our local viewer is probably prohibitively complex. At any rate, it’s out of scope for the initial implementation.
Challenges – actual search matching
Actually matching queries to OCR text, with coordinates, is kind of technically complex and not well supported by our current toolchain.
Note: even if we had gone with IA BookReader, we would still have been responsible for this part of the implementation and would have had the same challenges.
We tried to look for Samvera peers who could share their implementation. We’ve only found one so far, from Princeton, which used the same simple approach we had started with and basically only finds exact matches. Well, it can control for case-insensitivity, but that’s about the only thing it can do.
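To make the limits of the simple approach concrete, here is a minimal sketch of exact, case-insensitive matching against OCR words. It is an illustration only, not our code and not Princeton’s. The function name `search_pages` and the data shape (page number mapped to a list of `(text, bbox)` pairs) are assumptions made for this example.

```python
def search_pages(pages, query):
    """Return {page_number: [bbox, ...]} for words equal to the query, ignoring case.

    pages: {page_number: [(word_text, (x0, y0, x1, y1)), ...]}  (assumed shape)
    """
    needle = query.casefold()
    hits = {}
    for page_number, words in pages.items():
        matches = [bbox for text, bbox in words if text.casefold() == needle]
        if matches:
            hits[page_number] = matches
    return hits


# Made-up sample data.
pages = {
    1: [("Middle", (10, 10, 80, 30)), ("schön", (90, 10, 150, 30))],
    2: [("middles", (10, 10, 90, 30))],
}

print(search_pages(pages, "middle"))  # finds page 1 only, not "middles" on page 2
print(search_pages(pages, "schon"))   # finds nothing: "schön" is not an exact match
```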
Problematic false negatives include
it will not find alternate singular/pluralized/other endings of words, as our Solr search can
it will not find non-diacritic versions of words: it’s not going to find “schön” when you enter “schon”.
Worse, it might have problems with different Unicode normalization forms, although we can hypothetically control for this a bit (we have to figure out which normalization the Tesseract hOCR output uses, or ensure it ends up in a known one!)
If a word has been divided between two lines and hyphenated, it will show up in the source as e.g.
mid-
dle
and a search for “middle” won’t find it
Alternate versions of punctuation: Tesseract OCR often uses curly quotes for apostrophes, as in |isn‘t|, which won’t be matched by the straight quote entered with a US keyboard, as in
isn't
Obviously typos or noise in the OCR are also an issue
There is no phrase searching (we can make multi-word searches just search for them each individually)
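For the hyphenation case in the list above, one hypothetical workaround is to rejoin a line-final word ending in a hyphen with the first word of the next line before matching. The sketch below keeps both bounding boxes so both halves could be highlighted. The function name `join_hyphenated` and the input shape (a list of lines, each a list of `(text, bbox)` pairs) are assumptions for illustration, not part of our toolchain.

```python
def join_hyphenated(lines):
    """Merge words split across lines with a trailing hyphen.

    lines: list of lines, each a list of (text, bbox) words in reading order.
    Returns a flat list of (text, [bbox, ...]) entries; a rejoined word keeps
    the bounding boxes of both halves.
    """
    merged = []
    pending = None  # (text_without_hyphen, [bboxes]) carried over from the previous line
    for line in lines:
        for index, (text, bbox) in enumerate(line):
            if pending is not None and index == 0:
                text, boxes = pending[0] + text, pending[1] + [bbox]
                pending = None
            else:
                boxes = [bbox]
            is_last = index == len(line) - 1
            if is_last and len(text) > 1 and text.endswith("-"):
                pending = (text[:-1], boxes)
            else:
                merged.append((text, boxes))
    if pending is not None:
        # Trailing hyphen on the final line: nothing to join with, keep it as-is.
        merged.append((pending[0] + "-", pending[1]))
    return merged


lines = [
    [("the", (0, 0, 10, 10)), ("mid-", (20, 0, 50, 10))],
    [("dle", (0, 20, 20, 30)), ("part", (30, 20, 60, 30))],
]
for text, boxes in join_hyphenated(lines):
    print(text, boxes)
```

A known weakness of this approach: a genuinely hyphenated compound that happens to break at its hyphen (for example “well-” / “known”) would be joined into “wellknown”, so it trades one kind of false negative for another.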
For some of these issues there are possible workarounds, if we choose to spend the time on them; others are intractable unless we change our approach. For now, we are delivering an initial version that will have most of these problems, which are shared by the one peer example we have been able to investigate.
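As an example of the kind of workaround that seems feasible, the diacritic, Unicode normalization, and curly-quote problems could all be reduced by normalizing both the query and the OCR words the same way before comparing them. The sketch below uses Python’s standard `unicodedata` module. The function name `normalize` and the specific choices (NFKD, stripping combining marks, mapping four quote characters) are assumptions for illustration, not a decision we have made.

```python
import unicodedata

# Map common "smart" punctuation to the ASCII characters typed on a US keyboard.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def normalize(text):
    """Fold punctuation variants, diacritics, and case for comparison purposes."""
    text = text.translate(PUNCTUATION_MAP)
    # NFKD splits "ö" into "o" plus a combining diaeresis, regardless of
    # which normalization form the input was in.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


print(normalize("schön") == normalize("schon"))
print(normalize("sch\u00f6n") == normalize("scho\u0308n"))
print(normalize("isn\u2019t") == normalize("isn't"))
```

Note that folding diacritics cuts both ways: a search for “schön” would then also match “schon”, which may or may not be what users expect.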
The BEST way to solve this is to change our approach, and use a custom Solr plugin that lets us use the same Solr search technology we use for our general search, and still get image coordinates for highlighting in our search results. There is such a plugin: https://github.com/dbmdz/solr-ocrhighlighting
But our current cloud-hosted Solr provider does not let us install custom JAR plugins like this (at least not without paying for a plan that’s too expensive for us).
In the future, we may look into switching our Solr to be more “self-hosted”, perhaps via fly.io.
Some search things that it looks like we could tweak one way or the other
Should we require queries to match whole words only, or allow matches to “starts with”?
We will NOT support phrase queries with this implementation. So when a user enters multiple words, they will be highlighted separately. But:
If you enter two words, should a match require both to be on the same page, or does it not matter, and we just highlight both words wherever they appear, even one at a time on a page?
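The two open questions above amount to two switches in the matching logic. The sketch below shows them side by side so the behaviors can be compared. The names `word_matches`, `matching_pages`, `whole_word`, and `require_all` are hypothetical, and the data shape (page number mapped to a list of word strings) is simplified for the example.

```python
def word_matches(word, term, whole_word=True):
    """Compare one OCR word to one query term, ignoring case."""
    word, term = word.casefold(), term.casefold()
    return word == term if whole_word else word.startswith(term)


def matching_pages(pages, query, whole_word=True, require_all=False):
    """Return the sorted page numbers that count as hits for a multi-word query.

    require_all=True: every query term must appear somewhere on the page.
    require_all=False: any single term is enough.
    """
    terms = query.split()
    result = []
    for page_number, words in sorted(pages.items()):
        found = [any(word_matches(w, t, whole_word) for w in words) for t in terms]
        if found and (all(found) if require_all else any(found)):
            result.append(page_number)
    return result


# Made-up sample data.
pages = {1: ["copper", "mining"], 2: ["copper"], 3: ["miners"]}

print(matching_pages(pages, "copper mining"))                    # any term
print(matching_pages(pages, "copper mining", require_all=True))  # all terms on one page
print(matching_pages(pages, "min", whole_word=False))            # "starts with"
```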
UX Polish to add
Few of these will be in the first draft, but we will probably add some as part of this project phase, depending on how expensive they end up being.
In the list of page thumbnails, highlight pages that have a match
Let user open/close search area, to have full screen for page if they aren’t searching
Provide page/image numbers on search results
We don’t have true pagination-as-labelled-in-work metadata, so we can supply at best image sequence numbers at present.
When going to result, don’t just go to the page, but “zoom” to the area of the page with the match at a readable size.
A search box on the main work page that, when you submit it, opens up the viewer with your search
In addition to saving clicks, this will better advertise that the search-inside feature exists
Provide next/previous to step through search results
We will let you show/hide the search results bar
We have to figure out what happens on small/mobile screens; currently we have broken them
Bookmarkable search – record current search in URL, so if you copy and paste it you can go back to particular search results (just like now you can go back to particular page in viewer)
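The bookmarkable-search idea boils down to a round trip: write the current query into the URL, and read it back when the viewer loads. The sketch below shows that round trip with Python’s standard `urllib.parse`. The parameter name `q`, the example URL, and the function names are all hypothetical; in the viewer itself this logic would live in the browser-side JavaScript.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def url_with_search(url, query):
    """Return url with the search query stored in a (hypothetical) "q" parameter.

    An empty query removes the parameter, so clearing the search restores
    the plain page URL.
    """
    parts = urlsplit(url)
    params = parse_qs(parts.query, keep_blank_values=True)
    if query:
        params["q"] = [query]
    else:
        params.pop("q", None)
    return urlunsplit(parts._replace(query=urlencode(params, doseq=True)))


def search_from_url(url):
    """Return the search query stored in the URL, or None if there is none."""
    return parse_qs(urlsplit(url).query).get("q", [None])[0]


# Made-up viewer URL pointing at a particular page image.
bookmark = url_with_search("https://example.org/works/abc123/viewer/7", "copper mining")
print(bookmark)
print(search_from_url(bookmark))
```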