docWORKS/METAe, a tool for converting documents into structured digital objects
Source: 2nd European eAccessibility Forum
Making published content accessible does not consist only of creating electronic text for documents like books and journals. Accessibility means recovering the full structure and the links between the different chunks of the content. This is what docWORKS/METAe has been designed for, as a result of the EU-funded project METAe (IST FP 5).
This software solution, available on the market since 2003, focuses specifically on the needs of digital libraries and archives. docWORKS/METAe is a production tool that offers an integrated digitisation workflow for the automated, structured conversion of printed documents into digital objects. These objects describe the physical and logical document structure through consistent use of international XML standards.
This technology is currently applied to large-scale digitisation projects at Ivy League cultural heritage institutions as well as at commercial content owners and providers.
Accessible information is a basic need of society or, to put it another way, of everyone. Usually the original can be accessed only in printed form or on microfilm/microfiche, which means that searching, using and distributing the information is time-consuming and cost-intensive, and is not possible for everyone. Until recently, the digitization and conversion of printed items into electronic formats was complex and cost-intensive. Insufficient budgets and/or resources prevented extensive transformations to digital repositories. Reliable methods for long-term security and the storage of these enormous data sets were virtually unavailable.
As the result of the METAe project (meta-e), funded by the European Commission through the 5th Framework Research Program, CCS GmbH (Content Conversion Specialists), Germany, developed a comprehensive software solution, available on the market since 2003 under the brand name docWORKS/METAe. It is a production tool that offers an integrated workflow for the automated, structured conversion of printed documents into digital objects, which describe the physical and logical document structure through consistent use of international XML standards. These XML documents are equivalent in quality and structure to born-digital documents and can be transferred to digital library systems, portals, document, content and knowledge management systems, as well as virtually any media output device.
The main goal achieved through the project was the automatic generation of administrative, descriptive and structural metadata. The advantages of highly structured documents are:
- As “digital originals”, they meet the requirements for long-term digital storage in repositories;
- With the use of open XML metadata standards, the data can be transformed and migrated to meet current and future requirements;
- With logical structures, search results are improved (chapter- or article-based) and content is more easily accessed (chapter titles, headlines, pictures with captions, footnotes, etc.);
- They provide continuity between born-digital and digitized documents.
The generic, rule-based document analysis technology covers a wide range of document types such as books, journals and newspapers, as well as scientific documents such as theses, dissertations and reports. The workflow of the conversion process has been automated and simplified to make digitisation more cost-effective.
The conversion process depends on the document type and can be completed automatically, semi-automatically or manually. Interactive user interfaces are available to monitor conversion progress and to verify and correct conversion results. Conversion uses a rule-based, object-oriented engine in connection with text recognition technology (OCR), supplemented by manual and/or semi-automatic interaction capabilities.
The conversion workflow depends on the document and the application, and its conditions can be varied. The goal is to make processing as efficient and automated as possible. With client-server-based processing, the throughput can be scaled to meet even mass digitization requirements. Central, server-based conversion, together with quality assurance spread over internal or external resources, enables distributed production workflows.
During the conversion process, physical objects such as text zones, pictures, tables and advertisements are identified, together with their characteristics. In addition, logical structures such as chapters, captions, authors and articles are determined, along with their associated metadata. Text zones are converted to electronic text with integrated OCR technology. The rich, standardized XML-based output increases the added value of digitised collections and opens up new dimensions of access and usability.
docWORKS supports open metadata standards such as METS, Dublin Core, MODS, NISO MIX and ALTO for storage in repositories. The documents converted by docWORKS are exported in different standard formats. The most important are image formats (e.g. TIFF, JPEG, JPEG 2000), PDF (optionally with bookmarks and hidden text) and XML, where the international metadata standard METS, hosted by The Library of Congress, is used first and foremost. Among other things, the “METS structure map” defines the logical document structures, e.g. chapters and articles. To store additional information about the physical layout of document pages, the ALTO schema was developed in the context of the METAe research project. ALTO has since been chosen by many other digitisation projects worldwide, including The Library of Congress, which adopted ALTO as a standard for the National Digital Newspaper Program (NDNP).
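To make the role of ALTO more concrete, the following is a minimal Python sketch, not part of docWORKS, that reads a hand-written ALTO-like fragment and recovers the text lines together with their vertical positions on the page. The sample page, its IDs and coordinates, and the function name `extract_lines` are illustrative assumptions. The namespace shown is that of ALTO version 2; real exports may use a different schema version and carry many more elements.

```python
import xml.etree.ElementTree as ET

# Namespace of ALTO version 2 (assumption: real files may use another version).
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

# Hand-written sample for illustration only; not actual docWORKS output.
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" PHYSICAL_IMG_NR="1" WIDTH="2480" HEIGHT="3508">
      <PrintSpace>
        <TextBlock ID="TB1" HPOS="200" VPOS="300" WIDTH="900" HEIGHT="60">
          <TextLine HPOS="200" VPOS="300" WIDTH="900" HEIGHT="60">
            <String CONTENT="Chapter" HPOS="200" VPOS="300" WIDTH="400" HEIGHT="60"/>
            <SP/>
            <String CONTENT="One" HPOS="620" VPOS="300" WIDTH="200" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
        <TextBlock ID="TB2" HPOS="200" VPOS="450" WIDTH="900" HEIGHT="40">
          <TextLine HPOS="200" VPOS="450" WIDTH="900" HEIGHT="40">
            <String CONTENT="Printed" HPOS="200" VPOS="450" WIDTH="300" HEIGHT="40"/>
            <SP/>
            <String CONTENT="text" HPOS="520" VPOS="450" WIDTH="150" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
""".strip()


def extract_lines(alto_xml):
    """Return one dict per TextLine: owning block ID, vertical position, text."""
    root = ET.fromstring(alto_xml)
    lines = []
    for block in root.iterfind(".//alto:TextBlock", ALTO_NS):
        for line in block.iterfind("alto:TextLine", ALTO_NS):
            # Each String element holds one recognised word in CONTENT.
            words = [s.get("CONTENT", "") for s in line.iterfind("alto:String", ALTO_NS)]
            lines.append({
                "block": block.get("ID"),
                "vpos": float(line.get("VPOS", "0")),
                "text": " ".join(words),
            })
    return lines


if __name__ == "__main__":
    for entry in extract_lines(SAMPLE_ALTO):
        print(entry["block"], entry["vpos"], entry["text"])
```

Because each word keeps its coordinates, the same data could be used to highlight search hits on the page image; the sketch only collects the text.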
Through XSLT transformation, virtually any format can be derived for presentation and distribution purposes. This also applies to eBooks, created in various electronic formats ranging from the current ones from Microsoft and Adobe to the XML-based open eBook format EPUB, hosted by the International Digital Publishing Forum (IDPF). Since people with specific learning difficulties or other kinds of disabilities need structured information for navigation as well as an adapted visual or audio-visual presentation via various output devices, the highly structured “digital originals” created by docWORKS provide the source for the required transformations to those formats as well.
Today docWORKS is in use at the in-house digitisation centres of, for example, Harvard University Library, Library of Congress, Stanford University Library, The British Library, Royal Danish Library, National Library of Norway and National Library of Finland, as well as at several commercial service vendors.
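As an illustration of how a navigable presentation can be derived from the logical structure, the sketch below turns a small METS structure map into a nested HTML table of contents. It uses plain Python with the standard library rather than XSLT, purely to keep the example self-contained; it is not the docWORKS export path. The sample METS fragment, its labels, and the function names `div_to_html` and `structmap_to_nav` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from html import escape

METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Hand-written sample for illustration only; not actual docWORKS output.
SAMPLE_METS = """
<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="LOGICAL">
    <mets:div TYPE="book" LABEL="A Sample Book">
      <mets:div TYPE="chapter" LABEL="Chapter 1: Origins">
        <mets:div TYPE="section" LABEL="Early years"/>
      </mets:div>
      <mets:div TYPE="chapter" LABEL="Chapter 2: Methods"/>
    </mets:div>
  </mets:structMap>
</mets:mets>
""".strip()


def div_to_html(div, depth=0):
    """Render one structure-map div, and its children, as HTML list lines."""
    indent = "  " * depth
    # Fall back to the TYPE attribute when a div carries no LABEL.
    label = escape(div.get("LABEL", div.get("TYPE", "")))
    children = div.findall("mets:div", METS_NS)
    if not children:
        return [f"{indent}<li>{label}</li>"]
    lines = [f"{indent}<li>{label}", f"{indent}  <ul>"]
    for child in children:
        lines.extend(div_to_html(child, depth + 2))
    lines += [f"{indent}  </ul>", f"{indent}</li>"]
    return lines


def structmap_to_nav(mets_xml):
    """Build a nested HTML navigation list from the logical structure map."""
    root = ET.fromstring(mets_xml)
    struct_map = root.find("mets:structMap[@TYPE='LOGICAL']", METS_NS)
    if struct_map is None:
        raise ValueError("no logical structure map found")
    lines = ['<nav aria-label="Table of contents">', "  <ul>"]
    for top in struct_map.findall("mets:div", METS_NS):
        lines.extend(div_to_html(top, 2))
    lines += ["  </ul>", "</nav>"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(structmap_to_nav(SAMPLE_METS))
```

A nested list inside a labelled `nav` element is one structure that screen readers can announce and traverse level by level, which is why the logical hierarchy is carried over rather than flattened.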

This work, unless otherwise expressly stated, is licensed under a Creative Commons Attribution-Share Alike 2.0 Germany License.