OCR planning notes

Note: It turns out there actually isn't anything built into Hyrax for storing/using OCR/hOCR, so even if we were using Hyrax we'd have to implement it. (Although we could potentially copy other Hyrax users' implementations more wholesale.) But even Hyrax implementers have to implement this stuff as add-ons to Hyrax.

Actual OCR process


Storage of OCR data

The simplest thing to do would be just to make another attr_json text attribute on the Asset for hocr or what have you – similar to the ones we have now for (manual) transcription and translation.

I was worried that putting this much data directly on the record in postgres would cause performance issues – it's going to be fetched, sent across the wire, and instantiated in Ruby objects every time you fetch these Assets. But tpendragon at Princeton says this is what he does; he was initially worried too, but it works out fine for him. (By way of Valkyrie; granted, he may not be using ActiveRecord, so it could be worse for us.)

So we'll probably at least start like that.
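A minimal sketch of what that might look like – the attribute name (hocr) and the exact base class are assumptions, but it's the same attr_json pattern as our existing transcription/translation attributes:

    # Sketch only: attribute name and base class are assumptions.
    class Asset < Kithe::Asset
      attr_json :hocr, :string   # whole hOCR document stored as one big string
    end

    # asset.hocr = File.read("page0001.hocr")
    # asset.save!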

There will be some uses where we need the plain text extracted from the hOCR markup (eg for indexing). Shall we extract it at ingest time and store the extracted text separately (possibly making the storage/performance issues doubly worse)? Or extract it each time we need it (causing a separate performance problem)? Probably initially do the simplest thing and extract it when we need it – but let's watch for indexing performance degradation, which we might not be able to tell for sure until we have hOCR for a large part of the corpus and can run a bulk index, since the numbers are so small per-doc either way.
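If we extract on demand, it's mostly just stripping the markup; a rough sketch with Nokogiri (the element class we key on depends on what the OCR engine actually emits):

    require "nokogiri"

    # Sketch: reduce hOCR markup to plain text, one output line per ocr_line.
    def hocr_to_text(hocr)
      doc = Nokogiri::HTML(hocr)
      doc.css(".ocr_line").map { |line| line.text.gsub(/\s+/, " ").strip }.join("\n")
    end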

Use of OCR data

OK, what are we going to actually do with it?

PDF with text layer

This is the initial obvious thing to do. Enhance our existing multi-page PDF so it has a text layer based on hOCR (and make sure it's available even for single-page works, which I'm not sure our current PDF generation covers).

While OCR engines can create PDFs with a text layer themselves, we will probably create it ourselves, to our specs, from our images combined with hOCR.
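A very rough sketch of what "create it ourselves" could look like, assuming Prawn (just one candidate tool, not a decision) and word boxes already scaled to the PDF page size:

    require "prawn"

    # Sketch: one PDF page = the page image plus invisible text drawn at the
    # hOCR word positions. hOCR uses a top-left origin while PDF uses
    # bottom-left, hence the y-coordinate flip.
    def page_pdf_with_text_layer(image_path, words, page_w, page_h, out_path)
      Prawn::Document.generate(out_path, page_size: [page_w, page_h], margin: 0) do |pdf|
        pdf.image image_path, at: [0, page_h], width: page_w, height: page_h
        pdf.text_rendering_mode(:invisible) do
          words.each do |w|   # e.g. { text: "bird", x: 120, y: 340, height: 14 }
            pdf.draw_text w[:text], at: [w[:x], page_h - w[:y] - w[:height]], size: w[:height]
          end
        end
      end
    end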

There are also existing tools that can combine an image with an hOCR file to make a PDF with a text layer.


Since our hOCR coordinates will be based on the full-resolution TIFF, this may require scaling the hOCR. See https://github.com/ocropus/hocr-tools/issues/39#issuecomment-249316429
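The scaling itself is just multiplying every bbox by the ratio between the derivative size and the original TIFF size; a sketch:

    require "nokogiri"

    # Sketch: scale every hOCR bounding box by a factor (e.g. derivative width
    # divided by original TIFF width), leaving the rest of the markup alone.
    def scale_hocr(hocr, factor)
      doc = Nokogiri::HTML(hocr)
      doc.css("[title*='bbox']").each do |node|
        node["title"] = node["title"].gsub(/bbox (\d+) (\d+) (\d+) (\d+)/) do
          "bbox " + [$1, $2, $3, $4].map { |n| (n.to_i * factor).round }.join(" ")
        end
      end
      doc.to_html
    end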

We will have to see how much this slows down our on-demand PDF creation. The difficulty with the alternative of pre-rendering PDFs is that we'd still need to catch every time a work has an asset added/removed/changed/published in order to re-generate! But we could use the same on-demand caching system, just keep generated things around longer, or pre-trigger their creation.

We might want to indicate somehow which PDFs have text layers? Even if they all do, let people know in the UI?

We may want to revisit the resolution of the images in our PDFs; it may still be excessive, and we could perhaps create a smaller PDF for download.

Get Google to index…

It would be nice if a Google search for a passage in our text would at least hypothetically hit on us, if our OCR'd text were indexed by Google.

Not really sure how we'd do this; we should ask around and see if anyone's tried.

Might require that we have a UI that lets you look at the text of (eg) page 43 as HTML, and press next/prev to go through the pages, just to have something for Google to index.

Or we try to get Google to index all our generated PDFs with text layers? But get them to have links back to the DC works too?

Or "Download OCR text as text" below might do it, if done right and included in our SiteMap (and with links back to Digital Collections if someone does find it!)…

Searchability in Digital Collections search

Using the existing infrastructure we use for Bredig and Oral History "full text", it will be fairly easy to Solr-index the OCR fulltext, provide search results in Digital Collections, and offer "search in context" snippets/highlights.
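Assuming our indexing stays traject-style, this is probably just another to_field that gathers extracted text from each member asset; the field name, indexer base class, and accessors below are assumptions:

    # Sketch only -- not our actual indexer.
    class WorkIndexer < Kithe::Indexer
      to_field "searchable_fulltext" do |work, accumulator|
        work.members.each do |asset|
          next unless asset.respond_to?(:hocr) && asset.hocr.present?
          accumulator << Nokogiri::HTML(asset.hocr).text   # or the hocr_to_text sketch above
        end
      end
    end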

The potential UI concern is that, okay, now you click on one of these results… and how do you find where in the (eg) book your terms matched?

So we may need some kind of search-in-work functionality to make this not frustrating, possibly as a pre-req, and that is much harder.

Search within Work

You want to be able to:

  • search within a work, and see what pages have hits – there are various UIs for this, sometimes integrated into a reader/viewer

  • Ideally you also get highlighted results on the page image (hOCR makes this possible)

This – especially the second bullet – will require a very significant development effort one way or another; it is the most challenging thing on this list.

There are two main packages I find people using for this.

  • UniversalViewer: Used by many of our peers

    • Example: NCSU: https://d.lib.ncsu.edu/collections/catalog/mc00026-001-bx0001-009-000

    • Using UniversalViewer probably requires us building out a IIIF infrastructure we don't have now – this is a significant investment. Including providing our images via a IIIF image server (which can be static files with IIIF "Level 0", probably what we'd want to investigate); IIIF Manifests for works composed of multiple images; and specifically the "IIIF Content Search" API (I think that's what it's called!) for the searching we're talking about

      • "IIIF Content Search" is not built into Hyrax; even if we were using Hyrax we'd have to build it out

      • But some people in the Samvera community have built it out, either for use with Hyrax or not. I think the implementations use Solr to search the text, but produce results based on hOCR, including page coordinates etc. (A rough sketch of the kind of response involved follows this list.)

      • tpendragon shared that Princeton has an implementation; here is the initial PR of their implementation: https://github.com/pulibrary/figgy/commit/66726be508e85d5e4a822115b52d41ca9d6d48a9

      • Looks like an NCSU implementation providing the IIIF Content Search API using hOCR is here? https://github.com/NCSU-Libraries/ocracoke

      • There may be other Samvera implementations to look at

  • Internet Archive viewer

    • It's a really nice UI; I like it maybe better than UV – maybe even better than our custom one? It lets you scroll through all pages and also look at all thumbnails – it could maybe replace much of our current UI, while also offering search within the "book".

    • Eg: The bird book : illustrating in natural colors more than seven hundred North American birds, also several hundred photographs of their nests and eggs : Reed, Chester A. (Chester Albert), 1876-1912

      • Search results down the side PLUS marked in the scroll bar, also highlighted over the image

      • This is the only in-browser non-PDF interface I've seen where the text is also selectable/copy-pasteable!!!

      • It's not clear how a non-IA host supplies search (and selectable text!); it's not necessarily standard APIs like IIIF, and it's not documented super well. But it could end up being easier to implement than IIIF?

    • This is a tool the IA made for their own use, and they have released it open source – but the docs aren't great for how to use it. There are definitely other libraries using it – but not sure about using the search and selectable-text features! We'll have to do research, maybe talk to an insider.

  • Hypothetically, it might be possible to add "highlight results in image" functionality to our current custom viewer. After all, UV is doing it on top of OpenSeadragon, which we also use. Could we add our own highlighting layers? Probably. Adding selectable text a la Internet Archive is harder, but maybe possible. Unclear if these would be easier than switching to a third-party viewer – especially over long-term maintenance, as our own local staff expertise may change.
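For the IIIF Content Search idea above, the heart of any implementation is turning text matches into annotations whose targets carry xywh fragments taken from the hOCR word bounding boxes. A very rough sketch, ignoring Solr and just matching against one asset's hOCR (method name and URIs are placeholders):

    require "nokogiri"

    # Sketch: return IIIF Content Search-style annotations for words matching
    # the query, with xywh canvas fragments from the hOCR word bounding boxes.
    # A real implementation would query Solr first, then consult the hOCR.
    def content_search_annotations(hocr, query, canvas_uri)
      doc = Nokogiri::HTML(hocr)
      doc.css("span.ocrx_word").filter_map do |word|
        next unless word.text.downcase.include?(query.downcase)
        x0, y0, x1, y1 = word["title"].match(/bbox (\d+) (\d+) (\d+) (\d+)/).captures.map(&:to_i)
        {
          "@type"      => "oa:Annotation",
          "motivation" => "sc:painting",
          "resource"   => { "@type" => "cnt:ContentAsText", "chars" => word.text },
          "on"         => "#{canvas_uri}#xywh=#{x0},#{y0},#{x1 - x0},#{y1 - y0}"
        }
      end
    end

The actual search service response wraps these annotations in an sc:AnnotationList with a resources array, per the IIIF Content Search 1.0 spec.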

None of these options will be easy.

Selectable/copy/pasteable text overlaid on image

I've only seen this in the Internet Archive viewer, as above, but it sure is neat!

The bird book : illustrating in natural colors more than seven hundred North American birds, also several hundred photographs of their nests and eggs : Reed, Chester A. (Chester Albert), 1876-1912

Viewable online as HTML text?

Not sure how useful this is, but it does give a hook for Google indexing.

Internet Archive does it as one giant HTML page in a kind of pre-formatted, fixed-width ASCII font, not sure why… (see the bird book example above).

With OCR errors and all.

One could also imagine a more HTML-ish ordinary text display (although it would still have errors), and/or a one-page-at-a-time display.

Download OCR text as text

Both the NCSU UniversalViewer example and the Internet Archive book reader have integrated "download" buttons.

NCSU offers a "Download as OCR Text" option (that's literally how it's labelled) – in plain ASCII text. It definitely has lots of errors, but they offer it as an option.

The IA bird book example above doesn't have the OCR text in the embedded viewer's "download" tab, but it definitely offers "Full Text" as a "Download Option" on the work page, which seems to be the OCR text, errors and all.
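If we want the same thing, it's probably just a controller action that concatenates the extracted text for every page and streams it as a .txt file; a sketch (controller, route, and the hocr_to_text helper are hypothetical):

    # Sketch only -- controller, route, and helper are hypothetical.
    class WorkOcrTextController < ApplicationController
      def show
        work = Work.find(params[:id])
        text = work.members.filter_map do |asset|
          hocr_to_text(asset.hocr) if asset.respond_to?(:hocr) && asset.hocr.present?
        end.join("\n\n")
        send_data text, filename: "#{work.title.parameterize}-ocr.txt", type: "text/plain"
      end
    end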


Other UI/UX?


Anything else anyone can think of?