OCR planning notes

Note: It turns out there isn’t actually anything built into Hyrax for storing or using OCR/hOCR, so even if we were using Hyrax we’d have to implement it ourselves (although we could potentially copy other Hyrax users’ implementations more wholesale). Even Hyrax implementers have to build this as add-ons to Hyrax.

Actual OCR process

 

Storage of OCR data

The simplest thing to do would be just to make another attr_json text attribute on the Asset for hocr or what have you – similar to the ones we have now for (manual) transcription and translation.

I was worried that putting this much data directly on the record in Postgres would cause performance issues: it’s going to be fetched, sent across the wire, and instantiated in Ruby objects every time you fetch these Assets. But tpendragon at Princeton says this is what he does; he was initially worried too, but it works out fine for him. (That’s by way of Valkyrie; granted, he may not be using ActiveRecord, so it could be worse for us.)

So we’ll probably at least start like that.

There will be some uses where we need the plain text extracted from the hOCR markup (e.g. for indexing). Shall we extract it at ingest time and store the extracted text separately (possibly making any storage performance issues twice as bad)? Or extract it each time we need it (causing a separate performance problem)? Probably we initially do the simplest thing and extract it when we need it. But let’s be aware of indexing performance degradation, which we might not be able to measure for sure until we have hOCR for a large part of the corpus and can run a bulk index, since the per-doc numbers are so small either way.
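For the extract-when-needed option, a minimal sketch of pulling plain text out of hOCR might look like the following. It assumes Tesseract-style markup (lines marked `ocr_line`, words marked `ocrx_word`) that parses as well-formed XML. The method name `hocr_to_plain_text` is made up for illustration, and REXML is used only to keep the sketch dependency-free; in our app Nokogiri would probably be the more natural choice.

```ruby
require "rexml/document"

# Hypothetical helper: returns the text of one hOCR page, with one output line
# per ocr_line element. Assumes Tesseract-style class names and well-formed XML;
# other line-like classes (captions, headers) are ignored in this sketch.
def hocr_to_plain_text(hocr)
  doc = REXML::Document.new(hocr)
  REXML::XPath.match(doc, "//*[@class='ocr_line']").map { |line|
    REXML::XPath.match(line, ".//*[@class='ocrx_word']").map { |word|
      # Join all descendant text nodes, since a word may wrap <strong>/<em>.
      REXML::XPath.match(word, ".//text()").map(&:value).join.strip
    }.reject(&:empty?).join(" ")
  }.join("\n")
end

sample = <<~HOCR
  <div class='ocr_page' title='bbox 0 0 2000 3000'>
    <span class='ocr_line' title='bbox 100 200 900 260'>
      <span class='ocrx_word' title='bbox 100 200 300 260; x_wconf 95'>Barn</span>
      <span class='ocrx_word' title='bbox 320 200 500 260; x_wconf 91'><em>Owl</em></span>
    </span>
    <span class='ocr_line' title='bbox 100 300 900 360'>
      <span class='ocrx_word' title='bbox 100 300 400 360; x_wconf 88'>nests</span>
    </span>
  </div>
HOCR

puts hocr_to_plain_text(sample)
```

If this turns out to be too slow in a bulk reindex, the same method could be called once at ingest and its result stored, without changing the callers.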

Use of OCR data

OK, what are we going to actually do with it?

PDF with text layer

This is the obvious initial thing to do. Enhance our existing multi-page PDF so it has a text layer based on the hOCR (and make sure a PDF is available even for single-page works, which I’m not sure our current PDF generation handles).

While OCR engines can create PDFs with a text layer themselves, we will probably create it ourselves, to our specs, from our images combined with the hOCR.

There are tools that can combine an image with hOCR to make a PDF with a text layer.

 

Since our hOCR coordinates will be based on the full-size TIFF, this may require some hOCR scaling. See https://github.com/ocropus/hocr-tools/issues/39#issuecomment-249316429
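If we do end up scaling, the core of it is just multiplying the `bbox` numbers by the ratio between the derivative and the full TIFF. A rough sketch, assuming the usual `bbox x0 y0 x1 y1` form in hOCR `title` attributes; the method name and the pixel widths are made up for illustration, and whether a given PDF tool needs this at all (rather than taking a DPI setting) is something we’d have to check:

```ruby
# Hypothetical helper: multiplies every hOCR "bbox x0 y0 x1 y1" by a scale
# factor, e.g. when the PDF uses a derivative smaller than the full TIFF.
# Other pixel-based properties (baseline, x_size, ...) are left untouched here.
def scale_hocr_bboxes(hocr, factor)
  hocr.gsub(/\bbbox (\d+) (\d+) (\d+) (\d+)/) do
    scaled = Regexp.last_match.captures.map { |n| (n.to_i * factor).round }
    "bbox #{scaled.join(' ')}"
  end
end

# Assumed example numbers: a 6000px-wide TIFF and a 1500px-wide derivative.
factor = 1500.0 / 6000
word = "<span class='ocrx_word' title='bbox 400 800 1200 1000; x_wconf 93'>owl</span>"
puts scale_hocr_bboxes(word, factor)
# prints: <span class='ocrx_word' title='bbox 100 200 300 250; x_wconf 93'>owl</span>
```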

We will have to see how much this slows down our on-demand PDF creation. The difficulty with the alternative of pre-rendering PDFs is that we’d still need to catch every time a work has an asset added, removed, changed, or published in order to regenerate. But we could use the same on-demand caching system and either keep generated things around longer or pre-trigger their creation.

We might want to indicate somehow which PDFs will have text layers? Even if they all do, let people know in UI?

We may want to revisit the resolution of images in our PDFs. They may still be excessive, and we could perhaps create a smaller PDF to download.

Get Google to index…

It would be nice if a Google search for a passage in our text would at least hypothetically hit on us, i.e. if our OCR’d text were indexed by Google.

Not really sure how we’d do this, we should ask around and see if anyone’s tried.

Might require that we have a UI that lets you look at text of (eg) page 43 as HTML, and press next/prev to go through the pages, just to have something for Google to index.

Or do we try to get Google to index all our generated PDFs with text layers? But then get them to have links back to the DC works too?

Or “Download OCR text as text” below might do it, if done right and included in our SiteMap (and with links back to Digital Collections if someone does find it!)….

Searchability in Digital Collections search

Using the existing infrastructure that we use for Bredig and Oral History “full text”, it will be fairly easy to index the OCR full text in Solr, and to provide search results in Dig Coll with “search in context” snippets/highlights.

The potential UI concern is: okay, now you click on one of these results… and how do you find where in the (e.g.) book your terms matched?

So we may need some kind of search-in-work functionality to make this not frustrating, possibly as a pre-req, and that is much harder.

Search within Work

You want to be able to:

  • search within a work, and see what pages have hits – there are various UIs for this, sometimes integrated into a reader/viewer

  • Ideally you also get highlighted results on the page image (hOCR makes this possible)

This, especially the second bullet, will require a very significant development effort one way or another. It is the most challenging thing on this list.
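To make the two bullets concrete, here is a brute-force sketch of the data we’d need to produce: which pages have hits, and the pixel box of each matching word so a viewer could draw a highlight. It assumes one Tesseract-style hOCR string per page, with words marked `ocrx_word`; `Hit` and `search_in_work` are made-up names, it only matches single words, and a real implementation would more likely go through Solr than re-parse the hOCR on every search.

```ruby
require "rexml/document"

# Hypothetical result record: 1-based page number, the matched word, and its
# [x0, y0, x1, y1] pixel box, which a viewer could draw as a highlight.
Hit = Struct.new(:page, :word, :bbox)

# pages_hocr: array of hOCR strings, one per page, in page order.
def search_in_work(pages_hocr, query)
  needle = query.downcase
  pages_hocr.each_with_index.flat_map do |hocr, index|
    doc = REXML::Document.new(hocr)
    REXML::XPath.match(doc, "//*[@class='ocrx_word']").filter_map do |el|
      word = REXML::XPath.match(el, ".//text()").map(&:value).join
      next unless word.downcase.include?(needle)
      m = el.attributes["title"].to_s.match(/\bbbox (\d+) (\d+) (\d+) (\d+)/)
      Hit.new(index + 1, word, m && m.captures.map(&:to_i))
    end
  end
end

pages = [
  "<div class='ocr_page'><span class='ocrx_word' title='bbox 50 20 90 40; x_wconf 90'>owl</span></div>",
  "<div class='ocr_page'><span class='ocrx_word' title='bbox 1 2 3 4; x_wconf 80'>nests</span></div>"
]
search_in_work(pages, "owl").each do |hit|
  puts "page #{hit.page}: #{hit.word} at #{hit.bbox.inspect}"
end
```

The page numbers cover the first bullet; the boxes are what the second bullet needs, and they would have to be scaled to whatever image size the viewer is showing.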

There are two main packages I find people using to do this.

  • UniversalViewer: Used by many of our peers

  • InternetArchive viewer

    • It’s a really nice UI. I like it maybe better than UV, maybe even better than our custom one? It lets you scroll through all pages and also look at all thumbnails, so it could maybe replace much of our current UI, while also offering search within the “book”.

    • Eg https://archive.org/details/birdbookillustra00reedrich/page/229/mode/1up?view=theater&q=owl

      • Search results down side PLUS marked in scroll bar, also highlighted over image

      • This is the only in-browser non-PDF interface I’ve seen where the text is also selectable/copy-pasteable!!!

      • It’s not clear how a non-IA host supplies search (and selectable text!); it doesn’t necessarily use standard APIs like IIIF, and it’s not documented very well. But it could end up being easier to implement than IIIF?

    • This is a tool the IA made for their own use, and they have released it as open source, but the docs aren’t great on how to use it. There are definitely other libraries using it, but I’m not sure about them using the search and selectable-text features! We’ll have to do research, maybe talk to an insider.

  • Hypothetically, it might be possible to add “highlight results in image” functionality to our current custom viewer. After all, UV is doing it on top of OpenSeadragon, which we use too. Could we add our own highlighting layers? Probably. Adding selectable text à la Internet Archive is harder, but maybe possible. It’s unclear whether these would be easier than switching to a third-party viewer, especially for long-term maintenance, as our own local staff expertise may change.

None of these options will be easy.

Selectable/copy/pasteable text overlaid on image

Only seen in Internet Archive viewer as above, but it sure is neat!

https://archive.org/details/birdbookillustra00reedrich/page/226/mode/2up

Viewable online as HTML text?

Not sure how useful this is, but it does give a hook for google indexing.

Internet Archive does it as one giant HTML page in a kind of pre-formatted fixed-width ASCII font, not sure why…. https://archive.org/stream/birdbookillustra00reedrich/birdbookillustra00reedrich_djvu.txt

With OCR errors and all.

One could also imagine a more HTML-ish ordinary text display (although it would still have errors), and/or one-page-at-a-time display.

Download OCR text as text

Both the NCSU UniversalViewer example and the Internet Archive book reader have integrated “download” buttons.

NCSU offers a “Download as OCR Text” (that’s literally how it’s labelled) download option, in ASCII text. It definitely has lots of errors, but they offer it as an option.

The IA “owl” example above doesn’t have the OCR text in the embedded viewer’s “download” tab, but the work page definitely offers “Full Text” as a “Download Option”, which seems to be the OCR text, errors and all.

 

Other UI/UX?

 

Anything else anyone can think of?