Transkribus R&D proposal

Background

Transkribus (https://readcoop.eu/transkribus/) is a “comprehensive platform for the digitization, AI-powered text recognition, transcription and searching of historical documents.” I’ve heard it mentioned repeatedly at recent conferences, and I’m intrigued by its apparent possibilities. Could we use it, in the next few years, to make transcribing handwritten documents cheaper and more efficient?

I propose to spend a bit of time tinkering with it to find out.

Timeline

Roughly three weeks in mid-to-late January 2023 (subject to interruptions, of course; in the short term, this ranks low among our priorities).

General approach

Install Transkribus and create an account that I can use
Read the docs
Become familiar with the Transkribus interface
Learn the lingo (and there is a lot of lingo)
Explore the user community and other online resources
Research current usage patterns: how are our peers commonly using the tool?
Gain an intuitive sense of how we might realistically use it

Specific tasks

Test the tool using Jocelyn's transcriptions of Bredig handwritten letters as "ground truth"
Attempt to train the software to transcribe letters it hasn't seen yet
Evaluate the automatic transcription results against the expert transcriptions
Take a look at other collections of handwritten letters too (Pasteur? Booth?)
See if I can come up with practical recipes that could get us more mileage out of human experts in the future by partially automating the transcription process
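
For the evaluation step, one common metric is the character error rate (CER): the number of single-character insertions, deletions, and substitutions needed to turn the automatic transcription into the expert one, divided by the length of the expert transcription. Below is a minimal Python sketch of how I might compute it myself. The function names and the sample strings are my own illustrations, not part of Transkribus and not taken from the Bredig letters.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions between two strings."""
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / length of the reference (expert) text."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical example strings, invented for illustration only.
    expert = "Sehr geehrter Herr Kollege"
    automatic = "Sehr geehrtcr Herr Kollcge"
    print(f"CER: {character_error_rate(expert, automatic):.3f}")
```

A lower CER means the automatic transcription is closer to the expert one. Computing it only on letters that were held back from training would give a fairer picture of how the software handles material it hasn't seen yet.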

Deliverables

I’d like to write up my results here in the wiki as a short, non-technical report containing:

a description of what the software is
…and what it is not
a glossary of machine-learning technical terms
a list of online resources (websites, listservs, videos, etc.)
examples of how peer institutions have used the software
some suggestions about how I think it might come in handy in the future.