Workflows for Mass Digitisation
Author: Claus Gravenhorst
at Colloquium of Library Information Employees of the V4+ Countries
Accessible information is a basic need of society, or to put it another way, of everyone. Often the original can be accessed only in print or on microfilm or microfiche, which makes searching, using and distributing the information time-consuming and costly, and puts it out of reach for many people. Until recently, digitising printed items and converting them into electronic formats was complex and expensive. Insufficient budgets and resources prevented large-scale conversion into digital repositories, and reliable methods for the long-term preservation and storage of these enormous data sets were virtually unavailable.
As a result of the METAe project (http://meta-e.uibk.ac.at), funded by the European Commission through the 5th Framework Research Program, CCS Content Conversion Specialists GmbH, Germany, developed a comprehensive software solution that has been on the market since 2003 under the brand name docWORKS. It is a production tool that offers an integrated workflow for the automated, structured conversion of printed documents into digital objects. These objects describe the physical and logical document structure through consistent use of international XML standards. In quality and structure, the resulting XML documents are equivalent to born-digital documents, and they can be transferred to digital library systems, portals, document, content and knowledge management systems, as well as to virtually any media output device.
The main goal achieved through the project was the automatic generation of administrative, descriptive and structural metadata. The advantages of highly structured documents:
As “digital originals”, they meet the requirements for long-term digital storage in repositories
Because open XML metadata standards are used, the data can be transformed and migrated to meet current and future requirements
Logical structures improve search results (chapter- or article-based) and make content easier to reach (chapter titles, headlines, pictures with captions, footnotes, etc.)
Continuity between born-digital and digitised documents
The generic, rule-based document analysis technology covers a wide range of document types, such as books, journals and newspapers, but also scientific documents like theses, dissertations and reports. The workflow of the conversion process has been automated and simplified to make digitisation more cost-effective.
The conversion process depends on the document type and can run automatically or semi-automatically. Interactive user interfaces are available to monitor conversion progress and to verify and correct conversion results. Conversion uses a rule-based, object-oriented engine in combination with text recognition technology (OCR), supplemented by manual and/or semi-automatic interaction capabilities.
The conversion workflow integrates closely with the library's infrastructure. It depends on the document type and the application, and its conditions can be varied. The goal is to make processing as efficient and automated as possible. The status of each document is controlled through a unique identifier that is linked to the library catalogue. Existing metadata is ingested automatically from the catalogue. If documents have to be scanned from the original, various scanning devices, up to and including automated scanning robots, are supported. With client-server processing, the throughput of the digitisation and conversion process can be scaled to meet mass digitisation requirements. Central, server-based conversion, combined with automated quality assurance procedures and manual quality assurance spread across internal or external resources (near-shore or offshore), enables distributed and efficient production workflows.
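The identifier-based status control described above can be pictured with a small sketch. It is not docWORKS code: the stage names, the `Tracker` class and the dictionary standing in for the library catalogue are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    # Hypothetical stage names; a real workflow may define others.
    REGISTERED = 1
    SCANNED = 2
    CONVERTED = 3
    QA_PASSED = 4
    EXPORTED = 5


@dataclass
class Job:
    identifier: str                       # unique ID shared with the catalogue
    metadata: dict = field(default_factory=dict)
    stage: Stage = Stage.REGISTERED


class Tracker:
    """Tracks each document's status by its catalogue identifier."""

    def __init__(self, catalogue):
        # Assumption: the catalogue is a plain mapping of identifier -> metadata.
        self.catalogue = catalogue
        self.jobs = {}

    def register(self, identifier):
        # Existing catalogue metadata is copied into the job on creation.
        job = Job(identifier, dict(self.catalogue.get(identifier, {})))
        self.jobs[identifier] = job
        return job

    def advance(self, identifier):
        # Move the document to the next stage; the last stage is final.
        job = self.jobs[identifier]
        if job.stage is Stage.EXPORTED:
            raise ValueError(f"{identifier} is already exported")
        job.stage = Stage(job.stage.value + 1)
        return job.stage


tracker = Tracker({"bib-001": {"title": "Example Gazette"}})
job = tracker.register("bib-001")
tracker.advance("bib-001")  # the job is now at Stage.SCANNED
```

Keying every job on the catalogue identifier means that scanning stations, the conversion server and quality assurance clients can all report against the same record, which is what makes a distributed workflow traceable.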
During the conversion process, physical page objects such as text zones, pictures, tables and advertisements are determined, including their characteristics. In addition, logical structures such as chapters, captions, authors and articles are determined, together with their associated metadata. Text zones are converted to electronic text with integrated OCR technology. The rich, standardised XML-based output increases the added value of digitised collections and opens up new dimensions of access and usability. docWORKS supports open metadata standards such as METS, Dublin Core, MODS, NISO MIX and ALTO for storage in repositories. The documents converted by docWORKS are exported in different standard formats. The most important are image formats (e.g. TIFF, JPEG, JPEG 2000), PDF (optionally with bookmarks and hidden text) and XML, where the international metadata standard METS, hosted by the Library of Congress, is used first and foremost. Among other things, the “METS structure map” defines the logical document structures, e.g. chapters and articles. To store additional information about the physical layout of document pages, the ALTO schema was developed in the context of the METAe research project. ALTO has since been chosen by many other digitisation projects worldwide, including the Library of Congress, which adopted it as a standard for the NDNP project (National Digital Newspaper Program, http://www.loc.gov/ndnp/).
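To give an idea of what the physical layout information looks like in practice, here is a minimal sketch that reads text blocks and their page coordinates from an ALTO file using Python's standard library. The sample document is invented, and the ALTO version 2 namespace is an assumption. Real exports may use another ALTO version and contain many more elements.

```python
import xml.etree.ElementTree as ET

# Assumption: ALTO version 2 namespace; other versions use a different URI.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# Invented sample page with one text block of two lines.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1" HPOS="100" VPOS="120" WIDTH="800" HEIGHT="90">
          <TextLine>
            <String CONTENT="Workflows" HPOS="100" VPOS="120" WIDTH="210" HEIGHT="40"/>
            <SP WIDTH="12"/>
            <String CONTENT="for" HPOS="322" VPOS="120" WIDTH="60" HEIGHT="40"/>
          </TextLine>
          <TextLine>
            <String CONTENT="Mass" HPOS="100" VPOS="170" WIDTH="110" HEIGHT="40"/>
            <SP WIDTH="12"/>
            <String CONTENT="Digitisation" HPOS="222" VPOS="170" WIDTH="260" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def _number(element, name):
    # Coordinates are optional here; return None when an attribute is absent.
    value = element.get(name)
    return float(value) if value is not None else None


def text_blocks(alto_xml):
    """Return each TextBlock's ID, position and recognised text."""
    root = ET.fromstring(alto_xml)
    blocks = []
    for block in root.iterfind(".//alto:TextBlock", ALTO_NS):
        lines = [
            " ".join(s.get("CONTENT", "") for s in line.iterfind("alto:String", ALTO_NS))
            for line in block.iterfind("alto:TextLine", ALTO_NS)
        ]
        blocks.append({
            "id": block.get("ID"),
            "hpos": _number(block, "HPOS"),
            "vpos": _number(block, "VPOS"),
            "text": "\n".join(lines),
        })
    return blocks


for block in text_blocks(SAMPLE_ALTO):
    print(block["id"], block["hpos"], block["vpos"], repr(block["text"]))
```

Because every word carries its position on the page, a viewer can highlight search hits directly on the scanned image, which is one reason layout data is stored alongside the OCR text.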
Through XSLT transformation virtually any format can be derived for presentation and distribution purposes. The highly structured “digital originals” created by docWORKS provide the source for those transformations.
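As an illustration of deriving a presentation format from the structured source, the sketch below turns a METS logical structure map into a nested HTML list. It is a plain Python stand-in for the kind of transformation an XSLT stylesheet would perform, not XSLT itself. The sample METS fragment, its labels and the use of TYPE="LOGICAL" on the structure map are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Invented sample: a newspaper issue containing two articles.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="LOGICAL">
    <mets:div TYPE="Newspaper" LABEL="Example Gazette">
      <mets:div TYPE="Article" LABEL="Harbour opens"/>
      <mets:div TYPE="Article" LABEL="Weather"/>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def div_to_html(div):
    """Render one mets:div and its child divs as an HTML list item."""
    label = escape(div.get("LABEL", ""))
    kind = escape(div.get("TYPE", ""))
    children = div.findall("mets:div", METS_NS)
    inner = ""
    if children:
        inner = "<ul>" + "".join(div_to_html(c) for c in children) + "</ul>"
    return f"<li>{label} ({kind}){inner}</li>"


def toc_html(mets_xml):
    """Build a table of contents from the first logical structure map."""
    root = ET.fromstring(mets_xml)
    for struct_map in root.findall("mets:structMap", METS_NS):
        # Assumption: the logical map is marked with TYPE="LOGICAL".
        if struct_map.get("TYPE") == "LOGICAL":
            tops = struct_map.findall("mets:div", METS_NS)
            return "<ul>" + "".join(div_to_html(d) for d in tops) + "</ul>"
    return "<ul></ul>"


print(toc_html(SAMPLE_METS))
```

The same structure map could equally feed a PDF bookmark tree or an article-level search index; only the rendering function changes, while the “digital original” stays untouched.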
Today docWORKS is in use at the in-house digitisation centres of, among others, Harvard University Library, Stanford University Library, The British Library, the Royal Danish Library, the National Library of Norway and the National Library of Finland, as well as at several commercial service vendors.

This work, unless otherwise expressly stated, is licensed under a Creative Commons Attribution-Share Alike 2.0 Germany License.
July 17th, 2008 at 11:20
This Gravenhorst must be something like a digitisation guru. You simply cannot get around him on this topic. He also seems to be everywhere at once.
Mr Gravenhorst, warm greetings from the www. My respects.
July 17th, 2008 at 11:22
How true, how true!