Workflow

MASS DIGITIZATION WITH AUTOMATED TEXT RECOGNITION

 
SCANNING. Paper documents are turned into high-resolution digital images by scanning machines. The type of document constrains both which scanning machines can be used and how fast the document can be scanned. In partnership with industry, EPFL is working on a semi-automatic, robotic scanning unit capable of digitizing about 1000 pages per hour. Multiple units of this kind will be built to create an efficient digitization pipeline adapted to ancient documents. Another solution currently being explored at EPFL involves scanning books without turning the pages at all. This technique uses X-ray synchrotron radiation produced by a particle accelerator.

TRANSCRIPTION. The graphical complexity and diversity of handwritten documents make transcription a daunting task. For the Venice Time Machine, scientists are currently developing novel algorithms that can transform images into probable words. The images are automatically broken down into sub-images that potentially represent words. Each sub-image is compared to other sub-images and classified according to the shape of the word it contains. Each time a new word is transcribed, similarly shaped word images elsewhere in the database, potentially millions of them, can be recognized as well.
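
The idea of grouping word images by shape and then spreading one transcription to the whole group can be illustrated with a minimal sketch. It is not the project's actual algorithm: the two-dimensional feature vectors, the distance threshold and the function names `cluster_by_shape` and `propagate_labels` are all assumptions made for illustration, and a real system would derive much richer shape descriptors from the sub-images.

```python
from math import dist


def cluster_by_shape(features, threshold):
    """Greedy clustering of shape descriptors (hypothetical toy features).

    Each vector joins the first cluster whose representative lies within
    `threshold` (Euclidean distance); otherwise it starts a new cluster.
    Returns the cluster index assigned to each input vector.
    """
    representatives = []  # first vector seen for each cluster
    assignment = []
    for vec in features:
        for idx, rep in enumerate(representatives):
            if dist(vec, rep) <= threshold:
                assignment.append(idx)
                break
        else:
            representatives.append(vec)
            assignment.append(len(representatives) - 1)
    return assignment


def propagate_labels(assignment, known):
    """Spread known transcriptions to every sub-image in the same cluster.

    `known` maps a sub-image index to its transcription. Sub-images in
    clusters without any transcribed member stay None.
    """
    cluster_label = {assignment[i]: word for i, word in known.items()}
    return [cluster_label.get(cluster) for cluster in assignment]


# Toy descriptors for five word sub-images (made-up values).
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.05, 0.1), (5.1, 4.9)]
assignment = cluster_by_shape(features, threshold=0.5)

# Transcribing sub-image 0 labels every sub-image with a similar shape.
labels = propagate_labels(assignment, {0: "Venise"})
print(assignment)
print(labels)
```

In this toy run, a single manual transcription labels three of the five sub-images; the remaining two stay unlabelled until one member of their cluster is transcribed.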


TEXT PROCESSING. The strings of probable words are then turned into possible sentences by a text processor. This step is accomplished by using, among other tools, algorithms inspired by protein structure analysis that can identify recurring patterns.
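
The source does not specify which pattern-finding algorithms are used. As a simple stand-in, the sketch below counts word n-grams that recur across transcribed word strings, which is loosely analogous to looking for recurring motifs in biological sequences. The function name `recurring_ngrams`, its parameters and the sample word strings are hypothetical.

```python
from collections import Counter


def recurring_ngrams(sequences, n, min_count=2):
    """Return the word n-grams that occur at least `min_count` times.

    `sequences` is a list of word lists, one per transcribed string.
    """
    counts = Counter()
    for words in sequences:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Made-up word strings standing in for transcription output.
sequences = [
    ["in", "the", "name", "of", "god"],
    ["signed", "in", "the", "name", "of", "the", "doge"],
]
patterns = recurring_ngrams(sequences, n=3)
print(patterns)
```

Recurring formulas found this way could help a text processor prefer likely word sequences when several candidate words compete for the same position.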

CONNECTING DATA. The real wealth of the Venetian archives lies in the connectedness of its documentation. Keywords link different types of documents, which makes the data searchable. Linking the keywords found in sentences organizes the information into giant graphs of interconnected data, making it possible to cross-reference vast amounts of data and allowing new aspects of information to emerge.
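
A minimal sketch of such keyword-based linking follows, assuming each document has already been reduced to a set of keywords. The document identifiers, the keywords and the function name `build_keyword_graph` are invented for illustration; the project's real data model is not described in the source.

```python
from collections import defaultdict
from itertools import combinations


def build_keyword_graph(documents):
    """Link documents that share at least one keyword.

    `documents` maps a document id to its set of keywords. The result
    maps each pair of linked document ids (in sorted order) to the set
    of keywords the two documents have in common.
    """
    # Inverted index: keyword -> ids of the documents that mention it.
    index = defaultdict(set)
    for doc_id, keywords in documents.items():
        for keyword in keywords:
            index[keyword].add(doc_id)

    # One edge per document pair, annotated with the shared keywords.
    edges = defaultdict(set)
    for keyword, doc_ids in index.items():
        for a, b in combinations(sorted(doc_ids), 2):
            edges[(a, b)].add(keyword)
    return dict(edges)


# Made-up documents of different types with extracted keywords.
documents = {
    "tax_record": {"Contarini", "Rialto"},
    "will": {"Contarini", "Murano"},
    "ship_log": {"Rialto"},
}
graph = build_keyword_graph(documents)
print(graph)
```

Here the tax record is connected to the will through a family name and to the ship log through a place name, so a search starting from any one of them can reach the other two.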