Background

Transkribus (https://readcoop.eu/transkribus/) is a “comprehensive platform for the digitization, AI-powered text recognition, transcription and searching of historical documents – from any place, any time, and in any language.” I’ve heard it mentioned repeatedly at recent conferences, and I’m intrigued by the apparent possibilities. I’d like to get some practical knowledge about what it actually does. Could we make use of it, in the next few years, to make transcribing handwritten documents cheaper and more efficient?

I’m proposing to spend a bit of time tinkering with it in order to find out.

Scope and assumptions

  • High-quality translation of text files (as opposed to images of handwriting) is likely to become cheap, easy and convenient in the next 10 years or so, due to progress in machine learning and the economic incentives involved. This will be especially true for translations between the major modern languages.

  • I am focusing strictly on transcription here, not on translation.

  • An expert human transcriber and translator, given enough time, is always going to do a better job than an automatic transcription engine, because the expert can infer the best reading of a handwritten text from deep familiarity with the linguistic, historical and cultural context within which a document was produced.

  • Experts’ time and effort are precious commodities; we don’t want an expert to waste any time on mere data-entry.

  • It’s at least possible to imagine that a decent machine transcription of an image into a text file might be a real time-saver to an expert, even if the copy includes a certain number of errors.

Timeline

Roughly 3 weeks in mid-to-late January 2023 (subject to interruptions, of course; in the short term, this ranks low among our priorities).

General approach

  • Install Transkribus and create an account that I can use

  • Read the docs

  • Get familiar with the Transkribus user interface

  • Learn the lingo (and there is a lot of lingo)

  • Get an intuitive sense of what it's really for, and how we might realistically use it

  • Explore the user community and other helpful online resources

  • Research current use patterns: what are common ways in which our peers are using the tool to save time and effort?

  • Explore how an institution like ours might make use of the tool

Tasks

  • Formulate proposal

  • Install Transkribus and create an account that I can use

  • Read the docs, watch the videos, and go through the tutorials

  • Test the tool using Jocelyn's transcriptions of Bredig handwritten letters as "ground truth"

  • Attempt to train the software to transcribe letters it hasn't seen yet

  • Evaluate automatic transcription results against expert transcriptions

  • Take a look at other collections of handwritten letters too (Pasteur? Booth?)

  • See if we can imagine ways to get more mileage out of human experts in the future by partially automating the transcription process

  • Write everything up (see below)
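
To make the evaluation task concrete, one measure I could use is a character error rate (CER): the number of single-character edits needed to turn the machine transcription into the expert's "ground truth", divided by the length of the ground truth. The sketch below is only an illustration under my own assumptions. The function names are hypothetical, the sample strings are invented rather than taken from the Bredig letters, and it does not call Transkribus in any way.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn one string into the other."""
    # previous[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # drop a reference character
                current[j - 1] + 1,      # add a hypothesis character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the expert (reference) text."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Invented sample strings, for illustration only.
    expert = "Lieber Herr Kollege"
    machine = "Lieber Hcrr Kolege"
    print(f"CER: {character_error_rate(expert, machine):.3f}")
```

A lower number means the expert has fewer corrections to make. Whether a given error rate actually saves an expert time is exactly the kind of thing I would want to find out by trying it.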

Deliverables

I’d like to write up my results here in the wiki; I’m thinking I want to produce a short, non-technical 2-page report containing:

  • a description of what the software is, and what it is not

  • a glossary of machine-learning technical terms that I’ve encountered, so we all speak the same lingo

  • a list of online resources I found helpful in teaching myself how to use the software (websites, listservs, videos, etc.)

  • examples of how peer institutions have used the software

  • practical recipes that could get us more mileage out of human experts in the future, by partially automating the transcription process.