Digital Preservation – Retrospective Digitization and PDF/A

These days, many cultural institutions (academic and public libraries as well as state, private and ecclesiastical archives) are digitizing valuable cultural assets such as books, prints and maps. Besides enabling broad public or scholarly use and protecting valuable originals from direct handling, digitization serves to preserve the historic originals, which can then be stored securely under optimal environmental conditions.

In addition, the aim is to digitize these originals at high quality and resolution, ideally at the highest quality and resolution that the current state of technology allows. In Germany this means that, in accordance with the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) guidelines for retrospective digitization, black-and-white originals must be scanned at a minimum resolution of 600 ppi and grayscale and colour originals at a minimum resolution of 300 ppi.
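
As a small illustration of such a requirement, and assuming the Pillow imaging library and a placeholder file name, the resolution recorded in a scanned TIFF could be checked against a minimum value roughly like this:

    from PIL import Image

    MIN_PPI = 300   # minimum for grayscale/colour scans per the DFG guidelines

    # TIFF files record their resolution in the image metadata;
    # Pillow exposes it as an (x, y) tuple under the "dpi" key.
    with Image.open("master_0001.tif") as scan:
        dpi = scan.info.get("dpi")

    if dpi is None:
        print("No resolution recorded in the file.")
    elif min(dpi) < MIN_PPI:
        print(f"Scan resolution {dpi} ppi is below the {MIN_PPI} ppi minimum.")
    else:
        print(f"Scan resolution {dpi} ppi meets the requirement.")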

Particularly in the case of very valuable originals, the goal is to reach the highest resolution the current state of technology allows and thus support the broadest possible range of uses. The Beethoven-Haus in Bonn serves as an example here. When digitizing documents on site, SRZ used a particularly high-quality, high-resolution scanner whose capture head exploits the camera's full resolution capacity relative to the size of the original. The resulting digital masters were saved as uncompressed TIFF files and can be several hundred megabytes in size. They serve as the source format for derivatives for various applications, such as print output and web display.

A lot of different information…

But what kind of data is created when documents from libraries and archives are digitized? Put simply, the primary data created is image data, which is to be presented in its proper context: users should also be given information about the work's origin, its temporal and contextual relationships, its creators and its current location.

In addition to the image data, a whole range of other information is therefore collected. This begins with the bibliographic metadata, in other words the descriptive data for the document, such as the author or creator, the place and date of publication, publisher, printer, edition and so on.

Then there is the metadata relating to content and structure. This involves, for example, recording an existing abstract or creating a new one. Today it is also common to run OCR on all documents suitable for it and to save the results in uncorrected form. This provides the basis for fuzzy full-text searching and for highlighting search hits in the facsimile, whether on the Web or in other applications. However, highlighting only works if the positions of the recognized words on the page are saved as well; a small sketch of this follows below.
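
As a rough illustration, assuming the open-source Tesseract engine via the pytesseract wrapper and a placeholder file name, recognized words and their positions on a scanned page could be extracted like this:

    from PIL import Image          # Pillow, for loading the scanned page
    import pytesseract             # wrapper around the Tesseract OCR engine

    # Run OCR on one page image and request per-word data
    # (text plus bounding boxes) instead of plain text.
    # German language data ("deu") is assumed to be installed.
    page = Image.open("page_0001.tif")
    data = pytesseract.image_to_data(page, lang="deu",
                                     output_type=pytesseract.Output.DICT)

    # Keep each recognized word together with its position so that an
    # application can later highlight search hits on the facsimile.
    words = [
        {"text": data["text"][i],
         "left": data["left"][i], "top": data["top"][i],
         "width": data["width"][i], "height": data["height"][i]}
        for i in range(len(data["text"]))
        if data["text"][i].strip()
    ]
    print(words[:5])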

Structural metadata is created, for example, by recording the table of contents and linking it to the physical start of each chapter in the image data, or to other parts of the work such as the index, the index of places or persons, lists of illustrations, individual volumes and so on. For this, the pagination of the work must be mapped to the physical files, and structural elements and content such as titles, headings and similar elements must be identified. This structural description can go all the way down to page regions such as margins, images or footnotes.

It is also standard practice to record technical metadata about the creation and the physical attributes of the digital representation, so that the provenance of the digital documents can be documented. This metadata includes, for example, resolution, bit depth, compression, capture date, the capturing and owning institution, scanning software, scanner hardware and similar information.

Today, all of this descriptive, content-related and structural metadata is gathered and saved in a dedicated XML schema. The most widely used schemas worldwide are the Metadata Encoding and Transmission Standard (METS) in the library world and the Encoded Archival Description (EAD) in the world of archives.

(see http://www.loc.gov/standards/mets/ and http://www.loc.gov/ead/)
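
As a rough, hypothetical illustration of such structural metadata, the following snippet assembles a minimal METS-like structMap that links one chapter to two page images via file pointers; the element names follow the METS schema, while the identifiers and labels are invented:

    import xml.etree.ElementTree as ET

    METS_NS = "http://www.loc.gov/METS/"
    ET.register_namespace("mets", METS_NS)

    # A minimal, illustrative structMap: one chapter division containing
    # two page divisions, each pointing to a file in the (omitted) fileSec.
    mets = ET.Element(f"{{{METS_NS}}}mets")
    struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap", TYPE="PHYSICAL")
    chapter = ET.SubElement(struct_map, f"{{{METS_NS}}}div",
                            TYPE="chapter", LABEL="Chapter 1", ORDER="1")
    for file_id in ("FILE_0001", "FILE_0002"):
        page = ET.SubElement(chapter, f"{{{METS_NS}}}div", TYPE="page")
        ET.SubElement(page, f"{{{METS_NS}}}fptr", FILEID=file_id)

    print(ET.tostring(mets, encoding="unicode"))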

Various storage formats and true colours

Various compression methods are used for the image data. Bitonal images are usually saved as TIFF with lossless fax Group IV compression. TIFF is also used for grayscale and colour digital masters, which are stored either uncompressed or with lossless LZW compression. For derivatives serving different purposes, formats such as JPEG, GIF and PNG are used at various resolutions. The JPEG 2000 format, approved as an ISO standard at the start of this decade, is becoming increasingly popular as a compression method that achieves considerably higher compression at noticeably better quality than traditional JPEG; a lossless variant of JPEG 2000 is also available.
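
As a minimal sketch using the Pillow library and placeholder file names, a bitonal version of a master could be written as Group IV-compressed TIFF and a smaller JPEG derivative produced alongside it:

    from PIL import Image

    # Open an uncompressed digital master (placeholder file name).
    master = Image.open("master_0001.tif")

    # Bitonal copy saved as TIFF with lossless fax Group IV compression;
    # Group IV requires a 1-bit image, hence the conversion to mode "1".
    master.convert("1").save("master_0001_bitonal.tif", compression="group4")

    # A reduced-size JPEG derivative for web display.
    derivative = master.convert("RGB")
    derivative.thumbnail((1200, 1200))   # shrink in place, keeping the aspect ratio
    derivative.save("derivative_0001.jpg", quality=85)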

You do not just want to archive colour photographs of valuable originals at the highest possible resolution; you also want to reproduce the colours on screen and in print so faithfully that the human eye perceives them as matching the original. This so-called colour fidelity is achieved through colour management with colour profiles. Using colour charts, in which reference RGB or CMYK colours are stored as numerical values, the deviations of a capture device from the standard are measured and saved; the result is the so-called colour profile. This profile is then attached to the associated image, either embedded in the image file itself or as a separate file. The deviations of output devices such as printers and monitors from the standard are determined and saved in the same way, so that their output can be adjusted by comparison against the reference.
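
Assuming the Pillow library and placeholder file names, an embedded ICC colour profile could be read from a master, saved alongside it and re-embedded in a derivative roughly like this:

    from PIL import Image

    master = Image.open("master_0001.tif")

    # An embedded ICC colour profile, if present, is exposed as raw bytes.
    icc_profile = master.info.get("icc_profile")
    if icc_profile is None:
        print("No embedded colour profile found.")
    else:
        # Keep a separate copy of the profile alongside the master ...
        with open("master_0001.icc", "wb") as f:
            f.write(icc_profile)
        # ... and embed the same profile into a derivative.
        master.convert("RGB").save("derivative_0001.jpg",
                                   quality=90, icc_profile=icc_profile)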

A colourful mix

Let’s first summarize which data is produced during digitization and must be taken into account in long-term archiving:

  • Digital masters: image data in high or the highest possible quality, either uncompressed or losslessly compressed
  • Colour profiles for high-quality colour photographs
  • Derivatives of the digital masters created for various uses, such as printing, web display etc.
  • Descriptive, technical, content-related and structural metadata in various XML and/or text formats

For long-term archiving, the data belonging to each library or archive unit, stored in its various formats, is combined into a single technical unit, for example a TAR archive, and then written to a suitable archival medium.

So that the integrity of the data can be checked later, an additional checksum file (computed with a suitable checksum algorithm) is usually stored alongside it, as sketched below.
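
A minimal sketch of such packaging, assuming Python's standard tarfile and hashlib modules and invented file names: the files of one unit are bundled into a TAR archive and a SHA-256 checksum file is written next to it.

    import hashlib
    import tarfile
    from pathlib import Path

    # Invented file names for one archival unit.
    unit_files = ["master_0001.tif", "derivative_0001.jpg",
                  "master_0001.icc", "mets.xml"]

    # Bundle all files of the unit into a single TAR archive.
    with tarfile.open("unit_0001.tar", "w") as tar:
        for name in unit_files:
            tar.add(name)

    # Write a checksum file so the archive's integrity can be verified later.
    digest = hashlib.sha256(Path("unit_0001.tar").read_bytes()).hexdigest()
    Path("unit_0001.tar.sha256").write_text(f"{digest}  unit_0001.tar\n")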

We are clearly dealing with a rather complex information package: it contains widely differing formats, consists of two separate information units (the archive itself and its checksum file), and cannot necessarily be read again by every TAR program, particularly in the Windows world.

And what about PDF/A?

In contrast to all of the formats mentioned above, PDF/A is fully disclosed and is a defined ISO standard that, as a first among ISO standards, is not subject to any time restriction. A PDF/A file is a normal PDF that can be opened and read correctly by any program that can display PDF. PDF/A does not depend on any operating system, because PDF readers exist for almost every operating system environment.

How does PDF/A handle the mixed bag of information in a digital representation, as described above?

  • During conversion, PDF/A does not alter the image data at all; it retains its original quality, resolution and size and can be extracted again at any time.
  • PDF/A requires that information about the colours used be stored, and existing colour profiles can be embedded.
  • If desired, the derivatives that have been created can also be integrated into the same PDF/A file; they too can be exported again unaltered.
  • PDF/A provides two fully documented and disclosed areas for metadata. One comprises the document description fields (title, author, subject, keywords). The other is the XMP area, which consists of XML data and allows user-defined XML descriptions to be incorporated; all XML schemas used in the library and archival world can be included here (see the sketch after this list).
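
As a rough sketch, assuming the open-source pikepdf library and invented file names and values, document-description fields and XMP metadata could be written like this:

    import pikepdf

    with pikepdf.open("unit_0001.pdf", allow_overwriting_input=True) as pdf:
        # The XMP metadata lives in an XML packet inside the PDF;
        # pikepdf exposes it as a dictionary-like object and keeps the
        # classic document-description fields in sync by default.
        with pdf.open_metadata() as meta:
            meta["dc:title"] = "Example title of the digitized work"
            meta["dc:creator"] = ["Example Author"]
            meta["dc:description"] = "Retrospectively digitized volume"
        pdf.save()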

An additional advantage is that the full text obtained through OCR is not only stored in the PDF/A file but can even be placed as a searchable layer behind the page image, so that search hits can be highlighted directly in the facsimile in a user-friendly way.
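
One possible way to produce such a file, assuming the open-source OCRmyPDF tool (which drives Tesseract) and invented file names, is to add a hidden text layer and request PDF/A output in a single call:

    import ocrmypdf

    # Add an invisible, searchable OCR text layer behind the scanned pages
    # and write the result as PDF/A (German language data assumed installed).
    ocrmypdf.ocr("scans_0001.pdf", "scans_0001_pdfa.pdf",
                 language="deu", output_type="pdfa")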

Verification of data integrity can be built directly into the PDF/A file using digital signatures of varying evidential strength (from simple to qualified), rather than being kept as a separate piece of information.
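
As an assumption-laden sketch of what this could look like in practice, using the open-source pyHanko library with placeholder key, certificate and file names, a signature might be appended to a PDF/A file like this:

    from pyhanko.sign import signers
    from pyhanko.pdf_utils.incremental_writer import IncrementalPdfFileWriter

    # Load a signing key and certificate (placeholder paths).
    signer = signers.SimpleSigner.load("signer_key.pem", "signer_cert.pem",
                                       key_passphrase=None)

    with open("unit_0001_pdfa.pdf", "rb") as inf:
        writer = IncrementalPdfFileWriter(inf)
        with open("unit_0001_pdfa_signed.pdf", "wb") as outf:
            # Append a digital signature so the file's integrity
            # can be verified later.
            signers.sign_pdf(
                writer,
                signers.PdfSignatureMetadata(field_name="Signature1"),
                signer=signer,
                output=outf,
            )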

What the ISO committee says

  • Independent of any device or operating system: Can be reliably displayed on various systems and devices
  • Self-contained: Contains all of the components that are required to display the data
  • Self-documenting: Contains descriptions for the integrated data
  • Freely accessible: Does not contain any technical access protection
  • Openly specified: The authoritative format definition is completely available
  • Widely used: Broad adoption is perhaps the best protection for the readability of long-term archives (see http://www.aiim.org/documents/standards/19005-1_FAQ.pdf)

Examples of use

My talk on retrospective digitization using PDF/A mentions some examples:

  • The German National Library of Science and Technology and University Library Hanover: retrospective digitization of the research reports funded by the (German) Federal Ministry of Education and Research, and the preparation of the digitized material for long-term archiving.
  • The library of the Swiss Federal Institute of Technology Zurich: retrospective digitization of dissertations.
  • The German Broadcasting Archive, with various projects such as the digitization of documents relating to television programmes from the GDR:
    • Scripts from the programme “Der Schwarze Kanal” (The Black Channel)
    • Broadcasting schedules for “Aktuelle Kamera” (Current Camera)
    • The programme guide “FF Dabei”
    • Design drawings for the extensive pool of vehicles in “Sandmännchen” (Little Sandman)

Conclusion

Everything suggests that the considerable advantages PDF/A offers over other data formats for long-term archiving will lead to even wider adoption of the standard. More and more applications whose data must be securely archived over the long term are already using PDF/A.
