PDF/A Competence Center Newsletter: Issue 21

Topics include PDF/A and its relevance for digital long-term archiving in libraries and archives.

May 2011

Table of Contents:
Overview
PDF/A and its relevance for digital long-term archiving in libraries and archives
PDF/A Competence Center Members Present Themselves:
Satz-Rechen-Zentrum Hartmann+Heenemann

Hans-Joachim Hübner

Dear Readers,

This year, from June 7th to 10th, the 100th German Librarians’ Day conference will take place in Berlin, and the PDF/A Competence Center will be there. More than 3000 participants from public, research and business libraries and archives are expected to attend, taking advantage of this unique opportunity to catch up on the latest information and knowledge, and to take new ideas and challenges back with them.

An event organized by the German Digital Library Project (Deutsche Digitale Bibliothek – DDB) will be held on the first day. The DDB project aims to network the available digital information from over 30,000 cultural and academic organizations, and to make it available to the public through a common national portal. The project will be completed in several stages and will be integrated into EUROPEANA (the European digital archive).

This is a good example of how digitization and the public, digital availability of cultural heritage from libraries, archives and museums will continue to be prominent themes. The digital offering of cultural and academic goods has virtually exploded since the founding of the first two competence centers for digitization at the Lower Saxon State and University Library (NSUB) Göttingen and the Bavarian State Library in Munich in 1997.

Whereas microfilm was the established standard for long-term preservation up to the late 1990s, digital formats have since established themselves as a suitable means of securing cultural heritage. In addition, more and more cultural and academic information is being created in digital format, so it should also be retained and presented in digital form.

Anyone offering digitization and digital content also has to think about the long-term archiving of the digital information. And anyone supporting long-term archiving must also consider how access to the information can be ensured in the long term. Whoever does this can no longer overlook the PDF/A format.

When it comes to presenting academic works and dissertations, PDF/A has broadly established itself as the preferred, or in some cases the required, submission and long-term archiving format in the German library and archival worlds. PDF/A is, however, not quite as successful in retro-digitization. Here you still frequently find highly complex conglomerates of image and metadata files, whose integrity is secured by attached hash values and whose format often depends on the operating system.

The advantages of long-term archiving with PDF/A, both for “digital born” and for digitized documents, are however unquestionable. A single, simple and easy-to-handle format is created from a virtual jungle of countless different file formats, and it can also incorporate the corresponding metadata.

This format fulfills to a great extent the requirements established for long-term archiving. It is:

  • Device and operating system independent – it can be reliably reproduced on different systems and machines
  • Self-contained – it contains all of the components necessary for its correct display
  • Self-documented – it contains a description of the integrated data
  • Openly accessible – there is no technical access protection
  • Openly available – the complete authorized format description is publicly available
  • Widespread – its worldwide use is possibly the best protection for guaranteeing the legibility of long-term archives

In the German-speaking library and archival worlds it has taken digitized information almost 20 years to at least be accepted as comparable to, if not better than, microfilm. It even took JPEG2000 a long time to establish itself in practice. I am convinced that PDF/A will now quickly gain importance for digital long-term archiving and the preservation of data.

Speaking of long-term archiving: the PDF/A Competence Center last year became a partner of Nestor, the German expertise network for long-term archiving, and I represent the PDF/A Competence Center in this organization. This is further proof that awareness of PDF/A has established itself amongst the important German knowledge and cultural institutions.

Hans-Joachim Hübner
Satz-Rechen-Zentrum (SRZ)

PDF/A and its relevance for digital long-term archiving in libraries and archives

When we speak about digital long-term archiving in libraries and archives today, we are dealing with two main groups of topics:

  • The presentation and retention of digitally created documents
    e.g. dissertations, conference reports and, increasingly, academic publications in general, in academic and public libraries
    the transfer of electronic documents and dossiers to archives
  • The preservation of digitized cultural heritage of any kind, as presented today by almost all large libraries as a result of digitization projects and ongoing activities. Archives and museums are becoming more prominent here too.

“Digital Born” and PDF/A

A look at the websites of different national and international cultural and governmental institutions and authorities will confirm that PDF/A has widely established itself for the long-term preservation of digitally created documents.

The German National Library and the Austrian National Library both name PDF/A as their preferred submission format. The Library of Congress in the USA also provides a lot of information about PDF/A, identifying it as a format that completely fulfills all of their strict requirements for the long-term archiving of digital documents.

Practically all academic libraries today offer the possibility of publishing dissertations and other scholarly publications in electronic format on their document servers. PDF/A is widely required as the submission format here too, since the long-term preservation of academic information is already taken into account at the document server stage.

Here are some examples of university libraries (including some less known ones) supporting PDF/A: TU Cottbus, Uni Düsseldorf, Uni Marburg, Uni Potsdam, Uni Erfurt, TU Berlin, TU Chemnitz, Uni Weimar, TU Munich, Uni Duisburg…

On the international scene we find, for example, the medical universities of Vienna and Graz, the German Historical Institute in Rome, and the state archives of Lucerne and St. Gallen.

PDF/A plays a growing role at the Federal Archives in Koblenz for the transfer of digital documents from different federal agencies and their long-term retention. A considerable increase in volume is expected in the coming years, since the electronic processing of transactions is becoming more and more prevalent within the public authorities.

This tendency certainly also applies to regional and local archives. The Regional Archive of Baden-Württemberg determined back in 2006 that the number of documents being submitted in electronic format was increasing, and decided to change the submission requirement from PDF to PDF/A. This made it one of the pioneers of the long-term archiving format, which had only become a standard in 2005.

Here PDF/A can play to its strengths to the fullest: it is a widely available and completely open format that packages all of the information belonging to a document in one single file.

The Preservation of Digitized Cultural Heritage

Many cultural organizations, including academic and public libraries as well as state, private and church archives, digitize valuable cultural heritage like books, prints and maps. The aim is to make it widely available to the public or to academic scholars, while at the same time limiting access to the valuable originals, which are kept under optimal storage conditions.

The usual approach is to digitize at a high quality and resolution, though not necessarily the highest that the state of the art allows, in accordance with the guidelines of the DFG (German Research Foundation).

With extremely valuable artwork in particular, the aim is to achieve the highest resolution that is technically possible, so that the digitized information is suitable for the widest possible range of future uses. The Beethoven House in Bonn is an example of this. When the documents there were scanned on site, a particularly high-quality, high-resolution scanner was used. The resulting digital masters were saved as uncompressed TIFF and may have file sizes of several hundred megabytes. They are now used as the source for different purposes like prints, published articles and web images.

A lot of information is created…

In addition to the image files, a whole range of further information is gathered when digitizing documents. This begins with the bibliographical metadata, i.e. data that describes the document, like author, copyright, date and place of publishing, publishing house, printer, circulation etc.

It then continues with the content and structural metadata. Content metadata consist of, for example, an existing summary of the content or a newly created one. It is quite common today to run OCR on all documents that are suitable for it, and to save the raw results. This provides the basis for performing a general search in the text and for marking the search results in the facsimile.

Structural metadata are generated by creating a table of contents and linking it to the beginning of the chapters in the document, or to other items in the file like indexes, lists of places or persons, illustrations, volumes and such.

It is also quite common to create technical metadata that record the physical characteristics of the digitized goods, in order to document the history of the digital documents. This includes, amongst other things, resolution, bit depth, compression, date of creation, scanning organization / company, legal owner, scan software, scanner hardware etc.

All of these descriptive, content and structural metadata are today created and stored in specific XML schemas. The schemas most widely used worldwide today are the Metadata Encoding and Transmission Standard (METS) in the library sector and the Encoded Archival Description (EAD) in the world of archives.
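To illustrate how such a schema ties descriptive metadata, image files and structure together, here is a minimal Python sketch of a METS-like document. It is an assumption-laden illustration and not taken from any of the projects mentioned: the function name build_mets, the IDs, the file names and the choice of a single Dublin Core title are hypothetical, and the output has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("mets", METS)
ET.register_namespace("xlink", XLINK)
ET.register_namespace("dc", DC)


def build_mets(title, pages):
    """Build a small METS-like document (hypothetical helper).

    pages: list of (file_id, href) tuples, one per scanned page image.
    """
    mets = ET.Element(f"{{{METS}}}mets")

    # Descriptive metadata: a Dublin Core title wrapped in a dmdSec.
    dmd = ET.SubElement(mets, f"{{{METS}}}dmdSec", ID="DMD1")
    wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", MDTYPE="DC")
    xml_data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    ET.SubElement(xml_data, f"{{{DC}}}title").text = title

    # File section: one entry per image file.
    file_sec = ET.SubElement(mets, f"{{{METS}}}fileSec")
    group = ET.SubElement(file_sec, f"{{{METS}}}fileGrp", USE="MASTER")

    # Structural map: one div per page, each pointing at its file.
    struct = ET.SubElement(mets, f"{{{METS}}}structMap", TYPE="PHYSICAL")
    book = ET.SubElement(struct, f"{{{METS}}}div", TYPE="book", DMDID="DMD1")

    for order, (file_id, href) in enumerate(pages, start=1):
        entry = ET.SubElement(
            group, f"{{{METS}}}file", ID=file_id, MIMETYPE="image/tiff"
        )
        ET.SubElement(
            entry,
            f"{{{METS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK}}}href": href},
        )
        page = ET.SubElement(
            book, f"{{{METS}}}div", TYPE="page", ORDER=str(order)
        )
        ET.SubElement(page, f"{{{METS}}}fptr", FILEID=file_id)

    return ET.tostring(mets, encoding="unicode")


if __name__ == "__main__":
    print(build_mets("Sample volume", [("F1", "0001.tif"), ("F2", "0002.tif")]))
```

The point of the sketch is the cross-referencing: the structural map refers to the files by ID, and the top-level division refers to the descriptive metadata section, so the pieces can be stored separately yet resolved together.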

Different storage formats and true colors

Different compression processes are commonly used with images. Bi-tonal images are usually losslessly compressed with CCITT Group 4 fax compression and saved in TIFF format. TIFF is also often used for the digital masters of grayscale and color images, which are saved uncompressed. Formats like JPEG, GIF and PNG, at various resolutions, are regularly used for the derivatives that are created for different uses and purposes.

JPEG2000, the compression method that was published as an ISO standard at the beginning of the 2000s, is becoming more and more prevalent. It offers a notably higher compression ratio with considerably higher quality than traditional JPEG. JPEG2000 even offers a ‘lossless’ variant, in other words compression without loss of quality.

Color illustrations of valuable originals should not only be archived with the highest possible resolution; the colors displayed on a monitor and on printouts should also be reproduced such that the human eye can recognize the original in them. Color authenticity is achieved through color management with color profiles. The specific deviations of output devices like printers and monitors from the reference standard can also be measured and saved. Any deviations can then be compensated for when the output is generated.

A multi-colored basket

Let us summarize which data is created during digitization and must be considered for long-term archiving:

  • Digital master, image file in high or highest possible quality, losslessly compressed or uncompressed
  • Color profile for colored illustrations of higher quality
  • Derivatives of the digital master which are created for different purposes like print, web view etc.
  • Descriptive, technical, content and structural metadata in different XML and / or text formats.

These data, which are saved in different formats, are often consolidated into one technical data object (e.g. a TAR archive) for each library or archival unit, and stored on a suitable archive medium.

In order to facilitate a future verification of the data integrity, a checksum file created with a suitable checksum algorithm is also often generated and stored.
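The packaging and integrity check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SHA-256 is chosen here as the checksum algorithm, and the function names (sha256_of, package, verify) and the ".sha256" file naming are hypothetical, not a convention taken from any particular library or archive.

```python
import hashlib
import tarfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package(source_dir: Path, archive: Path) -> Path:
    """Bundle source_dir into an uncompressed TAR archive and write a
    checksum file next to it. Returns the path of the checksum file."""
    with tarfile.open(archive, "w") as tar:
        tar.add(source_dir, arcname=source_dir.name)
    checksum_file = archive.with_name(archive.name + ".sha256")
    checksum_file.write_text(
        f"{sha256_of(archive)}  {archive.name}\n", encoding="utf-8"
    )
    return checksum_file


def verify(archive: Path, checksum_file: Path) -> bool:
    """Recompute the digest and compare it with the stored value."""
    stored = checksum_file.read_text(encoding="utf-8").split()[0]
    return stored == sha256_of(archive)
```

Note that the result consists of two files, the archive and its checksum, which have to be kept together for the verification to work later.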

We are therefore dealing with an information package that is obviously a very complicated entity: it contains different formats, must incorporate two information elements, and cannot necessarily be read by every TAR program, especially in the Windows world.

How can it work with PDF/A?

In contrast to the above-mentioned formats, PDF/A is a completely open and documented format, and is defined as an ISO standard (ISO 19005). A PDF/A file is a normal PDF file that can be viewed with any application that displays PDF files. PDF/A is operating system independent, since a PDF reader is available for virtually every operating system environment.

How would our multi-colored basket of information, as previously described, now look as a digitized PDF/A file?

  • Image files are in no way compromised during the conversion. They retain their original quality, resolution and size, and can be reproduced at any time
    PDF/A requires that information about the colors used is saved in the file, and is capable of integrating color profiles
  • The created derivatives can be integrated in the PDF/A file. These also remain unchanged.
  • PDF/A has two fully documented and openly accessible areas for metadata. One is for document description fields (title, creator, subject, keywords). The other is the area of XMP data, which is XML-based and offers the possibility of bringing user-defined XML descriptions into this area. All of the XML schemas used in the library- and archive-oriented environments can be included here.
  • The searchable full text that is generated by an OCR process can be placed as a hidden layer behind the page image of scanned files. This makes searching in facsimiles much more user friendly since the search hits can be highlighted.
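To make the XMP metadata area more concrete, the following Python sketch builds a small XMP packet body containing the PDF/A identification properties (pdfaid:part and pdfaid:conformance) and a Dublin Core title. It is a minimal illustration, not a complete implementation: the function name build_xmp and the sample title are hypothetical, and a real file additionally needs the surrounding xpacket wrapper and the embedding of the packet in the PDF itself, which are out of scope here.

```python
import xml.etree.ElementTree as ET

NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "pdfaid": "http://www.aiim.org/pdfa/ns/id/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, local):
    """Return the qualified tag name for a registered prefix."""
    return f"{{{NS[prefix]}}}{local}"


def build_xmp(title, part="1", conformance="B"):
    """Build the XML body of an XMP packet (hypothetical helper)."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    # PDF/A identification: which part and conformance level the file claims.
    ET.SubElement(desc, q("pdfaid", "part")).text = part
    ET.SubElement(desc, q("pdfaid", "conformance")).text = conformance

    # Dublin Core title, stored as a language alternative.
    title_el = ET.SubElement(desc, q("dc", "title"))
    alt = ET.SubElement(title_el, q("rdf", "Alt"))
    ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"}).text = title

    return ET.tostring(xmpmeta, encoding="unicode")


if __name__ == "__main__":
    print(build_xmp("Sample dissertation"))
```

Library- or archive-specific descriptions would be added to the same rdf:Description structure under their own namespaces; the rules a given PDF/A part imposes on such custom schemas should be checked against the standard.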

Sample uses for PDF/A digital files

  • The Technical Information Library in Hanover uses PDF/A for retro-digitizing research reports requested by the Ministry of Education and Research and to ensure their availability for the future
  • The library of the Swiss Federal Institute of Technology in Zurich (ETH Zurich) uses PDF/A for the retro-digitizing of dissertations, from the very first one up to the present day
  • The German Broadcasting Archive has conducted several projects, like the digitization of documents about TV programs in the GDR including ‘Der Schwarze Kanal’ (The Black Channel), the ‘Aktuelle Kamera’ (Current Camera), the TV guide ‘FF Dabei’ and construction drawings of the comprehensive pool of vehicles used in the children’s show ‘Sandmännchen’ (Little Sandman).
  • Catalog enhancement at the TIB Hannover, ETH and ZB Zurich.

Summary

The advantages of long-term archiving with PDF/A, for both ‘digital born’ and digitized documents, are obvious. A simple, easy-to-handle format can be created out of a multicolored zoo of different file formats, and can also include all of the metadata belonging to the digital information.

The PDF/A format fulfills to a large extent, with few exceptions, all of the requirements placed on long-term archiving of electronic information. It is:

  • Device and operating system independent – it can be reliably reproduced on different systems and machines
  • Self-contained – it contains all of the components necessary for its correct display
  • Self-documented – it contains a description of the integrated data
  • Openly accessible – there is no technical access protection
  • Openly available – the complete authorized format description is publicly available
  • Widespread – its worldwide use is possibly the best protection for guaranteeing the legibility of long-term archives

PDF/A will continue to gain importance for long-term digital archiving and the retention of digitized cultural heritage.

PDF/A COMPETENCE CENTER MEMBERS PRESENT THEMSELVES
Satz-Rechen-Zentrum (SRZ)

The Satz-Rechen-Zentrum (SRZ) is a solution and service provider in Enterprise Content Management, specializing in electronic archiving and digital document creation.

The company’s key strength is the development of software solutions, on its own or with partners, for efficient document creation and document management. Its newest product is the software solution “CROSSCAP”, designed for scanning and optionally signing documents in an integrated workflow. The application is extremely simple and intuitive to use; the installation does not require a server and is easy and inexpensive to integrate.

With ProScan V3, the SRZ offers a further convenient solution for reproducing complex scan and creation scenarios up to the automatic handover to ECM systems. These can generally be configured for many requirements without needing additional customizing.

All creation solutions from the SRZ support output in PDF/A format, whether searchable with the associated text saved or as a pure facsimile PDF.

The SRZ’s solutions are built on years of experience: the company has been successfully developing and marketing its own software solutions for scan and creation services for mass records creation, book and large-format scanning, and digitizing microfilm since 1986.

The spectrum of customers is wide, ranging from retail and residential trade, industrial businesses, pharmaceutical companies and financial institutions to the public sector, libraries and archives.

The Satz-Rechen-Zentrum was founded in 1969 and also has long-standing, in-depth experience from numerous successful projects in the area of cross media publishing. The company has around 90 employees located in two service centers in Berlin as well as sales support offices in Frankfurt am Main and Stuttgart.

www.srz.de, www.crosscap.de, www.pdfkorrektor.de

About PDF/A Competence Center

The first of the PDF Association's Competence Centers.
