Archive for January, 2008

docWORKS/METAe, a tool for converting documents into structured digital objects

Monday, January 28th, 2008

Claus Gravenhorst

Source: 2nd European eAccessibility Forum

Making published content accessible does not consist only of creating electronic text for documents like books and journals. Accessibility means recovering the full structure and the links between the different chunks of the content. This is what docWORKS/METAe has been designed for, as a result of the EU-funded project METAe (IST FP 5).

This software solution, available on the market since 2003, focuses specifically on the needs of digital libraries and archives. docWORKS/METAe is a production tool that offers an integrated digitisation workflow for the automated, structured conversion of printed documents into digital objects. These objects describe the physical and logical document structure through consistent use of international XML standards.

This technology is currently applied to large-scale digitisation projects at Ivy League cultural heritage institutions as well as at commercial content owners and providers.


Accessible information is a basic need of society or, to put it another way, of everyone. Usually the original can only be accessed in printed form or on microfilm/microfiche, which means that searching, using and distributing the information is time-consuming, cost-intensive and not available to everyone. Until recently, the digitization and conversion of printed items into electronic formats were complex and cost-intensive. Insufficient budgets and/or resources prevented extensive transformations to digital repositories. Reliable methods for long-term security and the storage of these enormous data sets were virtually unavailable.

As the result of the METAe project (meta-e), funded by the European Commission through the 5th Framework Research Program, CCS GmbH (Content Conversion Specialists), Germany, developed a comprehensive software solution, available on the market since 2003 under the brand name docWORKS/METAe. It is a production tool that offers an integrated workflow for the automated, structured conversion of printed documents into digital objects, which describe the physical and logical document structure through consistent use of international XML standards. These XML documents are equivalent in quality and structure to born-digital documents and can be transferred to digital library systems, portals, document, content and knowledge management systems as well as virtually any media output device.

The main goal achieved through the project was the automatic generation of administrative, descriptive and structural metadata. The advantages of highly structured documents:

  • As “digital originals”, they meet the requirements for long-term digital storage in repositories;
  • With the use of XML open metadata standards, the data can be transformed and migrated to meet current and future requirements;
  • With logical structures search results are improved (chapter-, article-based) and more easily accessed (chapter titles, headlines, pictures with captions, footnotes, etc.);
  • Continuity between digitally created and digitized documents.

The generic, rule-based document analysis technology covers a wide range of document types such as books, journals and newspapers, but also scientific documents like theses, dissertations and reports. The workflow of the conversion process has been automated and simplified to make digitisation more cost-effective.

The conversion process depends on the document type and can be completed automatically, semi-automatically or manually. Interactive user interfaces are available to monitor conversion progress and to verify and correct conversion results. For conversion, a rule-based, object-oriented engine is used in connection with text recognition technology (OCR), supplemented by manual and/or semi-automatic interaction capabilities.

The conversion workflow depends on the document and the application, and its conditions can be varied. The goal is to make processing as efficient and automated as possible. With client-server-based processing, throughput can be scaled to meet even mass digitization requirements. Central, server-based conversion, combined with quality assurance spread over internal or external resources, enables distributed production workflows.

During the conversion process, physical objects such as text zones, pictures, tables and advertisements are determined, including their characteristics. In addition, logical structures such as chapters, captions, authors and articles, as well as associated metadata, are determined. Text zones are converted to electronic text with integrated OCR technology. The rich, standardized XML-based output increases the added value of digitised collections and opens up new dimensions of access and usability. docWORKS supports open metadata standards such as METS, Dublin Core, MODS, NISO MIX and ALTO for storage in repositories.

The documents converted by docWORKS are exported in different standard formats. The most important are image (e.g. TIFF, JPEG, JPEG 2000), PDF (optionally with bookmarks and hidden text) and XML, where the international metadata standard METS, hosted by the Library of Congress, is the primary format. Among other things, the “METS structure map” defines the logical document structures, e.g. chapters and articles. To store additional information about the physical layout of document pages, the ALTO schema was developed in the context of the METAe research project. It has since been chosen by many other digitisation projects worldwide, including the Library of Congress, which adopted ALTO as a standard for the National Digital Newspaper Program (NDNP).
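To make the idea of a logical structure map more concrete, here is a minimal sketch in Python that builds a METS `structMap` with one book and its chapters. The element and attribute names (`structMap`, `div`, `TYPE`, `ORDER`, `LABEL`) come from the METS schema; the helper names `build_struct_map` and `chapter_labels` and the sample labels are assumptions for illustration only and say nothing about how docWORKS itself produces its output.

```python
import xml.etree.ElementTree as ET

# Namespace of the METS schema hosted by the Library of Congress.
METS_NS = "http://www.loc.gov/METS/"
ET.register_namespace("mets", METS_NS)


def build_struct_map(title, chapters):
    """Return a <structMap TYPE="LOGICAL"> element (hypothetical helper)."""
    struct_map = ET.Element(f"{{{METS_NS}}}structMap", {"TYPE": "LOGICAL"})
    book = ET.SubElement(
        struct_map, f"{{{METS_NS}}}div", {"TYPE": "book", "LABEL": title}
    )
    # One nested <div> per chapter, numbered in reading order.
    for order, label in enumerate(chapters, start=1):
        ET.SubElement(
            book,
            f"{{{METS_NS}}}div",
            {"TYPE": "chapter", "ORDER": str(order), "LABEL": label},
        )
    return struct_map


def chapter_labels(struct_map):
    """Collect the LABEL of every chapter-level <div>."""
    return [
        div.get("LABEL")
        for div in struct_map.iter(f"{{{METS_NS}}}div")
        if div.get("TYPE") == "chapter"
    ]


# Made-up sample data, purely for illustration.
example = build_struct_map("Sample Book", ["Chapter One", "Chapter Two"])
print(ET.tostring(example, encoding="unicode"))
```

A real METS document would also link each `div` to page images and ALTO files through file pointers; that part is left out here to keep the sketch short.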

Through XSLT transformation, virtually any format can be derived for presentation and distribution purposes. This also applies to eBooks, created in various electronic formats, from the current ones from Microsoft and Adobe to the XML-based open eBook format EPUB, hosted by the International Digital Publishing Forum (IDPF). Since people with specific learning difficulties or other kinds of disabilities need structured information for navigation as well as an adapted visual or audio-visual presentation via various output devices, the highly structured “digital originals” created by docWORKS provide the source for the required transformations to those formats as well.
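As a rough illustration of deriving a navigation aid from structured output, the sketch below walks a small METS structure map and prints an indented outline. It uses plain Python tree traversal as a stand-in for XSLT, because the Python standard library has no XSLT processor. The sample XML, its labels and the function name `outline` are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"

# Hypothetical sample structure map, not taken from a real docWORKS export.
SAMPLE = """<mets:structMap xmlns:mets="http://www.loc.gov/METS/" TYPE="LOGICAL">
  <mets:div TYPE="book" LABEL="Sample Book">
    <mets:div TYPE="chapter" LABEL="Chapter One">
      <mets:div TYPE="section" LABEL="Section 1.1"/>
    </mets:div>
    <mets:div TYPE="chapter" LABEL="Chapter Two"/>
  </mets:div>
</mets:structMap>"""


def outline(element, depth=0):
    """Return indented outline lines for all nested <div> elements."""
    lines = []
    for div in element.findall(f"{{{METS_NS}}}div"):
        lines.append("  " * depth + f"{div.get('LABEL')} ({div.get('TYPE')})")
        # Recurse into child divisions, one indentation level deeper.
        lines.extend(outline(div, depth + 1))
    return lines


root = ET.fromstring(SAMPLE)
print("\n".join(outline(root)))
```

The same traversal could just as well emit a table of contents for an eBook or navigation points for an audio presentation; only the formatting of each line would change.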

Today docWORKS is in use at in-house digitisation centres at, for example, Harvard University Library, Library of Congress, Stanford University Library, The British Library, Royal Danish Library, National Library of Norway and National Library of Finland, as well as at several commercial service vendors.

Who saw it?

Sunday, January 27th, 2008

sonntags, with Gert Scobel.
A versatile friend.
Studio guest, CD and book
Unfortunately I only caught half of it, but the media scholar Stephan Füssel of Mainz held forth on the half-life of CD-ROMs as carriers of digital books and explained that the book, even though it is “quite old” (I think he said 16th century), is still fresh.

I found everything online except the studio interview. Does anyone know anything about it?

In-house mass digitisation: does it make sense?

Wednesday, January 23rd, 2008

The question is whether and how large libraries can provide the premises and staff, as well as the production environment, for mass digitisation without having to sell treasures from the collection to finance it.

The most elegant solution for libraries is probably to open the gates and let in external funding and service providers, who get the work done like good spirits, with virtually no burden on the “library” workflow. And one, two, three or five years later the books are digital.

It is pretty crazy when you consider everything that money is spent on in the world, or how much the stock market burns on a bad day. If that were put into digitisation, I could now take a quick look at the first edition of The Celtic Twilight by W.B. Yeats.

A pity, a pity.

Time drops in decay
Like a candle burnt out.
And the mountains and woods
Have their day, have their day;
But, kindly old rout
Of the fire-born moods,
You pass not away.

NEWTON: ORF magazine programme looks at digital libraries

Saturday, January 12th, 2008


Missed it?
The programme aired today, 12 January, at 2 pm on 3sat.

(Pictured: the Kindle, © Amazon.com) This is what it could look like: the new electronic age of reading. Even though replacing the book with a paperless alternative has never worked so far, this inconspicuous reading device is supposed to change that. No bigger than a paperback, the e-book reader still has room for an entire library. 200 books and current magazines can be read on it in digital form.

Zeutschel and Quidenus scanners are also shown.

The segment is available as text here

A rerun of the programme? None found.

American history goes VISTA

Saturday, January 12th, 2008

 

Library of Congress, Microsoft Announce Agreement to Support New Interactive Experience for Visitors: Innovations will bring historical collections to life onsite and online.