UK

How AI could reveal secrets of thousands of handwritten documents – from medieval manuscripts to hieroglyphics

script of Beowulf

Have you ever struggled to read what that scrawl between “carrots” and “potatoes” is on your shopping list? Soon, artificial intelligence (AI) may be able to help.

Over the last ten years, researchers have gradually been working out how to teach computers to read handwritten documents. As in most machine learning, a computer is fed training data: in this case, images of handwriting and details of what it says. It then learns how the marks on each page correspond to letters. It learns that that half circle is a “c”, that that short vertical stroke is an “i” and that it might therefore be “rice” that you wrote on your shopping list, for example.

How it does this no one is quite sure – machine learning is often a black box. But it seems likely it is at least partly learning which characters are likely to occur in sequence, thus determining that you are unlikely to want to be shopping for “qvjx”, however much the word might look like that.

This technology has been applied to handwriting from many countries and periods, from medieval manuscripts to 19th-century diaries (if not yet 21st-century shopping lists), in languages from Latin to Old French to Hebrew.

Because the technology works on the basis of image analysis, it is in theory applicable to any writing whatsoever, from Egyptian hieroglyphs to copperplate. Ten years after its initial development, some truly exciting consequences of the development of handwritten text recognition (HTR) techniques are becoming clear.

AI’s archive applications

One is that it democratises access to knowledge. The digitisation of manuscripts has made many libraries’ collections accessible at the click of a button (cybercriminality notwithstanding). But lengthy training, only available in select universities, is still needed to read what they say (and some scripts, like Beneventan, have the power to make even postgraduates gnash their teeth).

HTR has the power to generate a tolerably accurate, machine-readable version of a manuscript at more or less the click of a button. If language is still a barrier for the user, that transcript can be subjected to machine translation and a workable English (or French, or Chinese) version given, side by side with the manuscript.

The sheer quantity of data these processes will make available has significant ramifications for scholarship. Many medieval manuscripts haven’t been read since the middle ages. In the past, major questions (like the date of composition of foundational works like Beowulf) have often been resolved with the tiniest fragments of data, such as a single spelling. We are now starting to look at answering such questions with data sets of tens of thousands of spellings: with HTR it will be hundreds of thousands, if not millions. And the answers we get will be different.

Beyond qwerty

The data HTR can generate is also richer. Over the past half millennium, the representation of medieval texts has been fundamentally constrained by the printing press and the computer keyboard.

The first folio of the heroic epic poem Beowulf, one of the texts we may gain better understanding of through AI.
British Library

Some medieval scribes use three different forms of “s”, but all have been transcribed as the familiar, snake-like “s” on a keyboard. Marks of punctuation, like the poor punctus elevatus (which looks something like an inverted semi-colon) have had to be modernised out of sight.

Because HTR is based on visual recognition technology, it can recognise any number of letter forms, not just the hundred or so on a qwerty keyboard, and reproduce them more accurately than a human who has become accustomed to copying all four forms of “s” as “s”.

Realising these potential applications for the earliest written English, from the period before 1150, is the goal of my new pilot project, Ansund, at Trinity College Dublin.

Ansund aims to use HTR to build an exhaustive, open-access digital corpus of Old English texts, that transcribes all surviving Old English for the first time, and in an unparalleled level of detail. We’re particularly excited to see how many new letter forms we discover and to gather the first, substantial data on word division in Old English (scribes did not always put spaces where we might expect).

Ansund is one of a number of initiatives at Trinity that aims to harness new technologies to increase access to manuscripts, including the Trinity Centre for the Book, which focuses on the history of writing and sharing of the book. Virtual Trinity has digitised over 60 manuscripts and launches this week with the Many Lives of Medieval Manuscripts Symposium.

The ethics and dangers of AI have received important attention over the past year, but its power to make legible and navigable our cultural heritage also deserves attention. Some day soon, it may even ensure you can decode your muddled shopping lists.


Looking for something good? Cut through the noise with a carefully curated selection of the latest releases, live events and exhibitions, straight to your inbox every fortnight, on Fridays. Sign up here.


The Conversation

Ansund is funded by the Trinity Long Room Hub Arts and Humanities Research Institute, a collaboration with Elisabetta Magnanti (Vienna).

Manuscripts for Medieval Studies is part of Virtual Trinity and is funded by Carnegie Corporation of New York.