Report on Transkribus

Transkribus (trans-CRIB-uss) is a tool that has a computer "read" an image of handwritten text and produce a machine-readable transcription of it.

Transkribus is a Java desktop tool that submits jobs to a server and presents the results in a GUI. The basic workflow is:

  • ingest the documents (PDFs, images) into the tool;

  • submit a job request for the server to analyze the layout of the documents;

  • submit a job request to transcribe the documents using a model (see definition below);

  • edit the transcriptions, which will have a certain number of errors in them.

Transkribus models

Models are user-contributed, interchangeable machine-learning “engines” that are trained to recognize particular types of handwriting or print. Example: The Sunnhordland Partition Protocols model is based on two Danish-Norwegian partition protocols for Sunnhordland, Norway. As of early 2023, the software comes with 117 models. The tool has been used mostly by scholars in the German-speaking world: there are 27 models for German, but only 5 for French, 4 for English, and 7 for Latin. We could train our own model if we wanted, but this would be a big undertaking (see below).

What I did:

I installed the client on my laptop and created an account for myself, which came with 500 free credits. I spent a long time figuring out the basics of the user interface, which is badly designed.

I tested a couple of different German handwriting models on some letters in Kurrent (illegible to me), and got results that ranged from utter gibberish to somewhat legible (definitely better than gibberish and potentially helpful).

I did not end up training the machine on a full training set (roughly 100 pages of handwriting, ideally written by the same person). The data entry alone is very labor-intensive; I’d estimate a week at least, ON TOP of the work of an expert transcribing the documents.

Note that in some cases one of the “canned” models was able to provide us with quite a readable transcription. One example is https://digital.sciencehistory.org/works/8i8i1w3/viewer/gj36653, which is very tidy and written on ruled paper.

Now that I have already spent two weeks getting familiar with the product, I would not hesitate to at least try the tool (with one of the "canned" models) if I were stuck transcribing Kurrent.

Cost:

The pricing is pretty cheap. A new subscription comes with 500 free credits; I only ended up using roughly 20 of these during my experiments. 500 more credits would cost us 66 euros.
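As a rough illustration of these figures, the per-credit price can be worked out from the two numbers quoted above. This is only a sketch: it assumes the price scales linearly with the number of credits, which I did not verify, and the function names are made up for illustration.

```python
# Figures taken from this report: 500 additional credits cost 66 euros.
TOP_UP_CREDITS = 500
TOP_UP_PRICE_EUR = 66.0


def price_per_credit():
    """Price of one credit in euros, assuming linear pricing (an assumption)."""
    return TOP_UP_PRICE_EUR / TOP_UP_CREDITS


def cost_of_credits(credits):
    """Estimated cost in euros of a given number of paid credits."""
    return credits * price_per_credit()


print(f"{price_per_credit():.3f} euros per credit")
print(f"{cost_of_credits(20):.2f} euros for roughly 20 credits")
```

At that rate, the roughly 20 credits I used in my experiments would correspond to about 2.64 euros' worth of paid credits.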

Conclusions

  • The user interface of the desktop tool is VERY unfriendly. There is also a somewhat simpler online interface that might be more friendly.

  • Segmenting the text (detecting the edges of letters, words, and paragraphs) is done automatically and quite well by Transkribus.

  • Learning how to enter “ground truth” (expert transcriptions) using Transkribus will likely require at least a week or so of training for someone who has no tech background and has never done it before. A fair amount of tech support will be needed at first.

  • Note that in many cases the conventions for "ground truth" diplomatic transcriptions actually conflict with providing metadata for search. In training a model, it's vital to transcribe the text letter for letter, including typos, spelling errors and so on. This contrasts notably with our current rules for transcribing the Bredig collection, which include frequent editorial interventions intended to aid access and retrieval: corrections, notes indicating e.g. the meaning of abbreviations, or explanations of which parts of the transcription go with which parts of the image.

Possible applications at Science History Institute

“Canned” models:

If the text corresponds neatly to an existing user-contributed model and is in fairly legible handwriting, we could probably use an existing canned model for small collections, as long as someone is willing to correct the resulting transcriptions. This would involve exporting a bunch of PDFs, creating a “collection” in Transkribus, and running layout-analysis and HTR jobs.

Customizing a canned model:

A promising hybrid approach might be to train an existing model on some of our letters. We could start with one of the several Kurrent models, for example, and train it on a somewhat smaller training set than the one needed to create a new model from scratch.

Creating a new model:

For us to get real value out of a new model, we would want to try a text:

  1. That has not already been transcribed or published;

  2. That's of high value to scholars;

  3. Where we have an in-house expert who can easily decipher the handwriting and edit / correct the machine-transcribed results from the model;

  4. That has at LEAST 300 pages (roughly) written in the same handwriting. 100 would be used to train the model.

e.g.

  • if we acquire and mass-digitize the papers of one person, especially in English, I can see training a new model being useful.

  • a rare [handwritten] manuscript of high scholarly value that has not already been studied much, written out in the hand of a single copyist. (Note that a lot of the high-value rare manuscripts we own have already been published at some point in their history, e.g. https://digital.sciencehistory.org/works/82ul4ik/, and that even an early-modern printed edition is a lot easier to OCR than a handwritten manuscript with abbreviations, ragged margins, etc.)
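The four criteria above can be written down as a simple checklist. The sketch below is only an illustration of the reasoning: the Candidate record and the function names are hypothetical, and the page thresholds are the rough figures from this report (about 300 pages in one hand, 100 of them used for training).

```python
from dataclasses import dataclass

# Rough figures from this report, not hard limits.
MIN_TOTAL_PAGES = 300  # pages needed in the same handwriting
TRAINING_PAGES = 100   # pages of ground truth used to train the model


@dataclass
class Candidate:
    """Hypothetical description of a text we might train a model on."""
    unpublished: bool           # criterion 1: not already transcribed or published
    high_scholarly_value: bool  # criterion 2
    in_house_expert: bool       # criterion 3: someone can read and correct the hand
    pages_same_hand: int        # criterion 4


def worth_training(c):
    """True only if all four criteria from the list are met."""
    return (
        c.unpublished
        and c.high_scholarly_value
        and c.in_house_expert
        and c.pages_same_hand >= MIN_TOTAL_PAGES
    )


def pages_left_for_model(c):
    """Pages the trained model would transcribe after the training set is set aside."""
    return max(c.pages_same_hand - TRAINING_PAGES, 0)


papers = Candidate(True, True, True, 300)
print(worth_training(papers), pages_left_for_model(papers))
```

The point of the 300-page floor is visible here: with only 100 pages used for training, a 300-page text still leaves about 200 pages for the model to transcribe, which is where the effort pays off.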

Resources

Machine learning lingo: https://readcoop.eu/glossary/

  • HTR = handwritten text recognition;

  • segmentation = detecting the edges of blocks of text;

  • models = pre-trained recognition engines that work for particular types of handwriting (e.g. 19th-century German Kurrent).

How peer institutions have used this software

Particularly interesting: one of the available models is based on the autograph letters of the Austrian composer Anton Bruckner from 1852 to 1896. The ground truth was created by musicology students at the University of Vienna in 2022.

Alternatives to Transkribus

A quick survey yielded:

There’s a sense that for now Transkribus is ahead of all of these. I also think that recent advances in AI mean that we can expect a lot of progress in the next 5 years in this area, and that our best policy may just be to wait until the latest generation of machine learning is used to tackle this problem.