OCR planning notes
Note: It turns out there isn't actually anything built into Hyrax for storing/using OCR/hOCR, so even if we were using Hyrax we'd have to implement it ourselves (although we could potentially copy other Hyrax users' implementations more wholesale). Even Hyrax implementers have to build this stuff as add-ons to Hyrax.
Actual OCR process
We definitely want to capture "position" information necessary to do things like create a text layer in PDF, not just plain text extraction.
As far as I can tell, the most common standard format used for this is hOCR.
There is also an older format, "ALTO", created by the Library of Congress, but as far as I can tell it didn't get traction. https://groups.google.com/g/hocr/c/S6MC53lA5-o
What software to use to do the OCR and create hOCR? I identify two major possibilities:
tesseract is open source, very popular, and used by some of our peers.
Considerations:
Many of our peers use it
Lots of control: LOTS of language packs for different languages; plus third-party ones; plus you can train your own. (Includes Latin and Early Modern English.) Choice of "FAST" or "BEST" models.
Not necessarily very good at handwriting
Can install version 4.x on Heroku; harder to install newer version 5.x (not in the Ubuntu 22 apt repo). But the peers I reached out to are all using 4.x too.
We'd run it ourselves on Heroku worker dynos, which means we need to manage and pay for dynos, but we have lots of architecture for that, and are comfortable with our bg job dyno autoscale.
Cost estimate: In order to estimate cost, we have to run it on some Heroku dynos on some of our images, and see how many seconds a page takes.
We roughly estimate 10 seconds a page. So 1,000 pages = 10,000 seconds = ~2.8 hours.
A Heroku standard-2X dyno is approximately $0.07/hour, so ~$0.19 per 1,000 pages -- orders of magnitude cheaper than AWS Textract. (Are we doing the math right?) So for the estimated 40K back-fill, that's about $7.80?!?!
Here's some Princeton code example (via escowles) for running tesseract in a bg job: figgy/app/derivative_services/hocr_derivative_service.rb at main · pulibrary/figgy
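For concreteness, a hedged sketch (not our actual job code) of both pieces: the tesseract invocation that produces hOCR, and the back-of-envelope cost arithmetic from the estimate above. File names and the helper method are assumptions.

```ruby
# Hypothetical sketch: build the tesseract command that writes "#{out_base}.hocr".
# The "hocr" config file at the end tells tesseract to emit hOCR output.
def tesseract_hocr_command(tiff_path, out_base, language: "eng")
  ["tesseract", tiff_path, out_base, "-l", language, "hocr"]
end
# In a bg job we'd shell out something like:
#   system(*tesseract_hocr_command("page0001.tiff", "page0001")) or raise "OCR failed"

# Back-of-envelope cost math: 10 sec/page on a standard-2X dyno at ~$0.07/hour.
seconds_per_page = 10.0
dyno_per_hour    = 0.07
cost_per_1000    = 1000 * seconds_per_page / 3600 * dyno_per_hour  # ~$0.19 per 1,000 pages
backfill_cost    = 40_000 / 1000.0 * cost_per_1000                 # ~$7.80 for the 40K back-fill
```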
AWS has a service, Textract
Considerations
Using a cloud service without having to run our own worker dynos can be an advantage, but can also actually be more complicated for us to wire up (like how we need to poll for async HLS creation to be done with video!). Also ties us to AWS a bit -- source material maybe needs to be on S3, etc.
Unclear if it would work better or worse than tesseract -- for all we know, it's tesseract under the hood!
Only does six languages: English, Spanish, German, Italian, French, and Portuguese. But those include the languages we have most need/use of.
Claims to be able to do hand-writing? In all those languages?
We don't know any peers that use it.
Cost estimate: Actually kind of expensive! $15.00 per 1000 pages.
Need better estimates (how many Assets actually attached to Works with โtextโ format? How many created a month on average)
But if we just OCR all existing Assets to back-fill (this is an overestimate; some are audio/video, some we may know are images without text, etc.), we estimate maybe 40K, so ~$600.
Do we want to do manual QA/correction of OCR?
I'm not aware of any peers that do this, but I haven't totally investigated.
I actually can't find any good tools with UIs for doing manual corrections to hOCR!
Specs for what to run OCR on
We donโt want to run it on photographs of museum items for instance.
We donโt want to run it on hand-written manuscripts
We prob do want to run it on graphical advertisements that have text in them
We might want to START with English, and then consider other languages.
Do we have a flag that Annabel et al have to set that's just "do OCR", or do we try to automatically identify from existing metadata, like "Format: text", etc.?
At what point does the metadata know to do it?
Let's see what other software (like Islandora?) does. Does it just OCR everything?
Storage of OCR data
The simplest thing to do would be just to make another attr_json text attribute on the Asset for hocr or what have you -- similar to the ones we have now for (manual) transcription and translation.
I was worried that putting this much data directly on the record in postgres would cause performance issues -- it's going to be fetched, sent across the wire, and instantiated in ruby objects every time you fetch these Assets. But tpendragon says this is what Princeton does; he was initially worried too, but it works out fine for him. (By way of valkyrie; granted, he may not be using ActiveRecord, so it could be worse for us.)
So we'll probably at least start like that.
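For reference, a minimal stand-in sketch of the data shape (no Rails here; in the real app this would just be an `attr_json :hocr, :string` declaration on Asset, next to the existing transcription/translation attributes -- the attribute name `hocr` is an assumption):

```ruby
require "json"

# Plain-ruby stand-in illustrating what attr_json effectively gives us:
# a typed accessor over a single serialized JSON column. Purely illustrative.
class FakeAsset
  attr_reader :json_attributes

  def initialize
    @json_attributes = {}
  end

  def hocr
    json_attributes["hocr"]
  end

  def hocr=(value)
    json_attributes["hocr"] = value
  end
end

asset = FakeAsset.new
asset.hocr = "<div class='ocr_page'>…</div>"
JSON.generate(asset.json_attributes)  # the whole hOCR string rides along in the JSON column
```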
There will be some uses where we need the plain text extracted from the hOCR markup (eg for indexing). Shall we extract it at ingest time and store the extracted text separately? (Possibly making performance/storage issues doubly worse?) Or extract it each time we need it (causing a separate performance problem)? Probably initially do the simplest thing, extract it when we need it -- but let's watch for indexing performance degradation, which we might not be able to tell for sure until we have hOCR for a large part of the corpus and can run a bulk index; the numbers are so small per-doc either way.
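A minimal sketch of the extract-on-demand approach, using stdlib REXML to pull word text out of a hand-made, simplified hOCR fragment (real tesseract output is richer):

```ruby
require "rexml/document"

# Hand-made, simplified hOCR fragment for illustration only.
HOCR_SAMPLE = <<~XML
  <div class="ocr_page">
    <span class="ocr_line">
      <span class="ocrx_word" title="bbox 10 10 50 30">Hello</span>
      <span class="ocrx_word" title="bbox 60 10 110 30">world</span>
    </span>
  </div>
XML

# Collect the text of every ocrx_word span, in document order,
# and join into a plain-text string suitable for Solr indexing.
def hocr_to_text(hocr_xml)
  doc = REXML::Document.new(hocr_xml)
  words = []
  REXML::XPath.each(doc, "//span[@class='ocrx_word']") { |el| words << el.text.strip }
  words.join(" ")
end

hocr_to_text(HOCR_SAMPLE)  # => "Hello world"
```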
Use of OCR data
OK, what are we going to actually do with it?
PDF with text layer
This is the initial obvious thing to do. Enhance our existing multi-page PDF so it has a text layer based on hOCR (and make sure it's available even for single-page works, which I'm not sure our current PDF generation does).
While OCR engines can create PDFs with text layers themselves, we will probably create it ourselves, to our specs, from our images combined with hOCR.
There are tools that can combine an image with an hOCR to make PDF with text layer
GitHub - ocropus/hocr-tools: Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML. But it assumes certain locations on the file system and naming of files, not necessarily convenient for us.
Another tool from the Internet Archive; not sure I understand what it's doing: https://git.archive.org/merlijn/archive-pdf-tools
hocr2pdf looks maybe more convenient, and is available via ExactImage toolkit which is in apt-get and brew. https://manpages.ubuntu.com/manpages/trusty/man1/hocr2pdf.1.html
Hypothetically we could write our own code to do this using ruby PDF toolkits or something, but it's kind of a pain, and ruby PDF tools are limited (hexapdf has a license that may not work for us).
But here's a ruby tool which isn't really maintained, also does things we don't want it to do, and has an ImageMagick dependency we don't really love… but it does include hOCR-to-text-layer functionality, demoing how you might do it. https://github.com/ifad/pdfbeads
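If we went the hocr2pdf route, the invocation is simple; a hedged sketch of building the command (per its man page, hocr2pdf reads hOCR on stdin and merges it with the image into a PDF with a text layer; file names here are placeholders):

```ruby
# Hypothetical helper: build the hocr2pdf (ExactImage) shell command line.
# hocr2pdf takes the image via -i, the output PDF via -o, and hOCR on stdin.
def hocr2pdf_command(image_path:, hocr_path:, pdf_path:)
  "hocr2pdf -i #{image_path} -o #{pdf_path} < #{hocr_path}"
end

hocr2pdf_command(image_path: "page1.jpg", hocr_path: "page1.hocr", pdf_path: "page1.pdf")
# => "hocr2pdf -i page1.jpg -o page1.pdf < page1.hocr"
```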
Since our hOCR coords will be based on full TIFF, this may require doing some hOCR scaling. See https://github.com/ocropus/hocr-tools/issues/39#issuecomment-249316429
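The scaling itself is a simple transform; a sketch of scaling hOCR bbox coordinates (measured against the full-size TIFF) down to the smaller image we embed in the PDF, assuming the standard `bbox x0 y0 x1 y1` form inside title attributes:

```ruby
# Scale every "bbox x0 y0 x1 y1" occurrence in an hOCR string by a factor.
# Assumes standard hOCR bbox properties in title attributes.
def scale_hocr_bboxes(hocr, factor)
  hocr.gsub(/bbox (\d+) (\d+) (\d+) (\d+)/) do
    coords = [$1, $2, $3, $4].map { |c| (c.to_i * factor).round }
    "bbox #{coords.join(' ')}"
  end
end

scale_hocr_bboxes(%(<span class="ocrx_word" title="bbox 100 200 300 400">hi</span>), 0.5)
# => "<span class=\"ocrx_word\" title=\"bbox 50 100 150 200\">hi</span>"
```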
We will have to see how much this slows down our on-demand PDF creation. The difficulty with the alternative of pre-rendering PDFs is that we'd still need to catch every time a work has an asset added/removed/changed/published and re-generate! But we could use the same on-demand caching system, keep generated things around longer, or pre-trigger their creation.
We might want to indicate somehow which PDFs will have text layers? Even if they all do, let people know in UI?
We may want to revisit resolution of images in our PDFs, they may still be excessive, we can perhaps create a smaller PDF to download.
Get Google to index…
It would be nice if a google search for a passage in our text would at least hypothetically hit on us, if our OCR'd text were indexed by google.
Not really sure how we'd do this; we should ask around and see if anyone's tried.
Might require that we have a UI that lets you look at text of (eg) page 43 as HTML, and press next/prev to go through the pages, just to have something for Google to index.
Or we try to get Google to index all our generated PDFs with text layers? But get them to have links back to DC works too?
Or "Download OCR text as text" below might do it, if done right and included in our SiteMap (and with links back to Digital Collections if someone does find it!)…
Searchability in Digital Collections search
Using our existing infrastructure that we use for Bredig and Oral History "full text", it will be fairly easy to Solr index the OCR fulltext, and provide search results in Dig Coll, with "search in context" snippets/highlights.
The potential UI concern is that okay, now you click on one of these results… and how do you find where in the (eg) book your terms matched?
So we may need some kind of search-in-work functionality to make this not frustrating, possibly as a pre-req, and that is much harder.
Search within Work
You want to be able to:
search within a work, and see what pages have hits -- there are various UIs for this, sometimes integrated into a reader/viewer
Ideally you also get highlighted results on the page image (hOCR makes this possible)
This -- especially the second bullet -- will require a very significant development effort one way or another; it is the most challenging thing on this list.
There are two main packages I find people using to do this.
UniversalViewer: Used by many of our peers
Example: NCSU: https://d.lib.ncsu.edu/collections/catalog/mc00026-001-bx0001-009-000
See more NCSU examples at https://d.lib.ncsu.edu/collections/catalog?f%5Bfulltext_bs%5D%5B%5D=true
Note you can search within the book, results are indicated as markers on the page scroll bar, you go there and the hits are highlighted on the page image
Using UniversalViewer probably requires us building out IIIF infrastructure we don't have now -- a significant investment. Including providing our images via a IIIF image server (which can be static files with IIIF "Level 0", prob what we'd want to investigate); IIIF Manifests for works composed of multiple images; and specifically the "IIIF Content Search" API (I think that's what it's called!) for the searching we're talking about.
"IIIF Content Search" is not built into hyrax; even if we were using hyrax we'd have to build it out.
But some people in the samvera community have built it out, for use with or without hyrax. I think the implementations use solr to search the text, but produce results based on hOCR, including page coordinates etc.
tpendragon shared that princeton has an implementation; here is their initial PR: https://github.com/pulibrary/figgy/commit/66726be508e85d5e4a822115b52d41ca9d6d48a9
Looks like an NCSU implementation providing the IIIF Content Search API using hOCR is here? https://github.com/NCSU-Libraries/ocracoke
There may be other samvera implementations to look at
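To make the shape of this concrete, a hedged sketch of what a IIIF Content Search 1.0 response roughly looks like: each hit is an annotation whose `on` value targets a canvas region, with the xywh region derived from the matched word's hOCR bbox. URLs and ids here are placeholders, and this simplifies the spec.

```ruby
require "json"

# Hypothetical builder for a minimal IIIF Content Search 1.0 style response.
# Each hit hash is assumed to have :text, :canvas (canvas URI), and
# :xywh ([x, y, width, height] derived from the hOCR bbox).
def content_search_response(query:, hits:)
  {
    "@context"  => "http://iiif.io/api/search/1/context.json",
    "@id"       => "https://example.org/iiif/search?q=#{query}",
    "@type"     => "sc:AnnotationList",
    "resources" => hits.map do |hit|
      {
        "@type"      => "oa:Annotation",
        "motivation" => "sc:painting",
        "resource"   => { "@type" => "cnt:ContentAsText", "chars" => hit[:text] },
        # the xywh fragment is what lets a viewer highlight the hit on the page image
        "on"         => "#{hit[:canvas]}#xywh=#{hit[:xywh].join(',')}"
      }
    end
  }
end

response = content_search_response(
  query: "alchemy",
  hits: [{ text: "alchemy", canvas: "https://example.org/canvas/p43", xywh: [100, 200, 80, 24] }]
)
JSON.generate(response)
```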
InternetArchive viewer
It's a really nice UI; I like it maybe better than UV -- maybe even better than our custom one? It lets you scroll through all pages and also look at all thumbnails -- it could maybe replace much of our current UI, while also offering search within the "book".
Search results are shown down the side, PLUS marked in the scroll bar, and also highlighted over the image.
This is the only in-browser non-PDF interface I've seen where the text is also selectable/copy-pasteable!!!
It's not clear how a non-IA host supplies search (and selectable text!); it's not necessarily standard APIs like IIIF, and it's not documented super well. But it could end up being easier to implement than IIIF?
This is a tool the IA made for their own use, and they have released it open source -- but the docs aren't great for how to use it. There are definitely other libraries using it -- but not sure about using the search and selectable-text features! We'll have to do research, maybe talk to an insider.
Hypothetically, it might be possible to add "highlight results in image" functionality to our current custom viewer. After all, UV is doing it on top of OpenSeadragon, which we use too. Could we add our own highlighting layers? Probably. Adding selectable text a la Internet Archive is harder, but maybe possible. Unclear if these would be easier than switching to a third-party viewer -- especially over long-term maintenance, as our own local staff expertise may change.
Some info on openseadragon overlays… https://github.com/openseadragon/openseadragon/issues/1726
None of these options will be easy.
Selectable/copy/pasteable text overlaid on image
Only seen in Internet Archive viewer as above, but it sure is neat!
Viewable online as HTML text?
Not sure how useful this is, but it does give a hook for google indexing.
Internet Archive does it as one giant HTML page in a kind of pre-formatted ascii fixed-width font, not sure why…: The bird book : illustrating in natural colors more than seven hundred North American birds, also several hundred photographs of their nests and eggs : Reed, Chester A. (Chester Albert), 1876-1912
With OCR errors and all.
One could also imagine a more HTML-ish ordinary text display (although it would still have errors), and/or one-page-at-a-time display.
Download OCR text as text
Both the NCSU UniversalViewer example and the Internet Archive book reader have integrated "download" buttons.
NCSU offers a "Download as OCR Text" (that's literally how it's labelled) download option -- in ascii text. It definitely has lots of errors, but they offer it as an option.
The IA "owl" example above doesn't have the OCR text in the embedded viewer's "download" tab, but definitely offers "Full Text" as a "Download Option" on the work page, which seems to be the OCR text, errors and all.
Other UI/UX?
Anything else anyone can think of?