Tuesday, 1 May 2018

Project leveraging AI to scan and translate Vatican texts dating back to the 8th century

The Vatican Secret Archives sound like a joint that'd give Illuminati conspiracy junkies feverish wet dreams. The site, which sits kitty-corner to the Vatican's Apostolic Library, is a treasure trove of Catholic Church documents: over 50 linear miles of letters, books and papal bulls, some of which date back to the eighth century.

Too bad that the amount of scholastically accessible information in the VSA could be jammed up a gnat's ass and it'd look like a BB in a boxcar.

Y'see, most of what's there is priceless. You'd be a nut to allow folks in to view it on a regular basis, for fear of it being damaged. Those responsible for the VSA have, in the past, made half-assed attempts to scan and translate a small number of the Archive's documents. But remember, we're talking OVER 50 MILES of shelves chockablock with missives, notes and tomes. It'd take a fortune (which the Vatican totally has, I suppose) and an unknowable amount of time to collate, translate and scan everything into a usable format.

According to The Atlantic, computer scientists love challenges like this. A new project called In Codice Ratio is working towards using artificial intelligence and OCR to understand and translate the Archive's contents, so that the information can be plopped into text documents for humanities scholars to use in their studies. It's tough to do! OCR is notoriously bad at reading handwriting, let alone script which, in some cases, was written in a dead language. But the In Codice Ratio team thinks they have some tricks up their sleeves that'll sort it all out... eventually.

From The Atlantic:

In Codice Ratio sidesteps these problems through a new approach to handwritten OCR. The four main scientists behind the project—Paolo Merialdo, Donatella Firmani, and Elena Nieddu at Roma Tre University, and Marco Maiorino at the VSA—skirt Sayre’s Paradox with an innovation called “jigsaw segmentation.” This process, as the team recently outlined in a paper, breaks words down not into letters but something closer to individual pen strokes. The OCR does this by dividing each word into a series of vertical and horizontal bands and looking for local minimums—the thinner portions, where there’s less ink (or really, fewer pixels). The software then carves the letters at these joints. The end result is a series of jigsaw pieces.
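For the curious, here's a rough idea of what "looking for local minimums" can mean in practice. This is a toy sketch in Python, not the In Codice Ratio team's actual code: it assumes a word has already been turned into a grid of 0s and 1s (1 = ink), it only handles vertical bands that are one pixel wide, and the function names `ink_profile`, `local_minima` and `carve` are made up for illustration.

```python
from typing import List

# Assumed representation: a word image as rows of pixels, 1 = ink, 0 = blank.
Image = List[List[int]]


def ink_profile(image: Image) -> List[int]:
    """Count the ink pixels in each one-pixel-wide vertical band (column)."""
    return [sum(column) for column in zip(*image)]


def local_minima(profile: List[int]) -> List[int]:
    """Return column indices where the ink count dips.

    A column qualifies when it has strictly less ink than its left
    neighbour and no more ink than its right neighbour, so a flat
    low stretch produces one cut at its first column.
    """
    return [
        i
        for i in range(1, len(profile) - 1)
        if profile[i] < profile[i - 1] and profile[i] <= profile[i + 1]
    ]


def carve(image: Image, cuts: List[int]) -> List[Image]:
    """Slice the image into pieces at the given column indices."""
    bounds = [0] + cuts + [len(image[0])]
    return [[row[a:b] for row in image] for a, b in zip(bounds, bounds[1:])]


# Two blobs of ink joined by a thin one-pixel bridge in the middle column.
word = [
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
]

profile = ink_profile(word)
cuts = local_minima(profile)
pieces = carve(word, cuts)
```

On the tiny example, the middle column has the least ink, so that's where the cut lands and the "word" falls into two pieces. Running the same thing on the transposed image would give the horizontal bands. Real manuscript scans are a lot messier than this, which is where the rest of the team's pipeline comes in.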

In order to teach their AI what the letter strokes were supposed to look like, the In Codice Ratio team turned to a group of high school students. The students were enlisted to compare perfect examples of individual letters that might be in the texts the AI will be poring over with samples from the texts, written in various people's hands. Identifying a letter, pen stroke by pen stroke, the students slowly built a database for the AI to draw upon. Crazy.

If you're looking for a short, fascinating read on a project that could have a large impact on the Humanities and Computer Science in the near future, head on over to The Atlantic. It's definitely worth your time.

Image via Pxhere