What document format should we use for the electronic archiving of critical business documents going forward? How do you select a format that has all the necessary functionality? What questions do we need to ask? What are the options?
Drivers for Long Term Archive Formats
The chief driver in choosing an electronic document format as a standard archive technology is finding reasonable answers to the questions surrounding media lifetime and reader lifetime. Paper, the standard medium, is known to last hundreds of years and microfiche perhaps dozens of years, while magnetic media degrade after a decade or so; as for optical formats, well, who knows. The ability to read archived documents on paper depends on the quality of your eyewear and your linguistic capability, while microfiche and the electronic formats depend on the availability of appropriate hardware and software platforms.
Driving the lifetime question are the legal and regulatory requirements for document retention. Retention periods vary from industry to industry and by country, but it is not uncommon for retention periods to be up to a hundred years, as is the typical case with life insurance; 100 years is deemed to be the lifetime of the person plus the lifetime of the court suit afterwards. A court can require you to produce an ‘original’ of the archived document; you must be able to do so for the length of time required. Once organisations are clear on how long they need to hold electronic documents, the question becomes how to manage them in a cost-effective and secure manner.
To meet the court requirement, the documents must be highly accessible. This means that the documents are not encrypted, are not held in a proprietary format, and can be displayed or printed to reproduce the original document as provided to the customer. The documents must be stored so that they are platform, OS, and device independent; it must be possible to read, understand and display them on common computers running common hardware and software platforms.
The documents must also be self-contained. No external resources can be required, and all fonts and graphics must be embedded in the document files themselves. What this means for the organisation is that these documents are ‘transparent’ – they can be easily read, distributed, and parsed if required using a broad range of freely available tools.
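The accessibility and self-containment requirements above can be read as a simple checklist. The following is a minimal sketch only: the `ArchiveCandidate` class, its field names and the `violations` function are hypothetical names chosen for illustration, and the sketch assumes that some other tool has already inspected the file and filled in the flags.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ArchiveCandidate:
    """Facts about one document, as reported by some inspection tool (assumed)."""
    encrypted: bool
    proprietary_format: bool
    fonts_embedded: bool
    external_resources: bool


def violations(doc: ArchiveCandidate) -> List[str]:
    """Return the archive requirements that the document fails to meet."""
    problems = []
    if doc.encrypted:
        problems.append("document is encrypted")
    if doc.proprietary_format:
        problems.append("document is held in a proprietary format")
    if not doc.fonts_embedded:
        problems.append("fonts are not embedded")
    if doc.external_resources:
        problems.append("document depends on external resources")
    return problems


if __name__ == "__main__":
    candidate = ArchiveCandidate(
        encrypted=True,
        proprietary_format=False,
        fonts_embedded=False,
        external_resources=False,
    )
    for problem in violations(candidate):
        print(problem)
```

An empty list means the document passes this checklist; it does not by itself show that the file conforms to any archive standard.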
To meet these requirements, the international archive community created PDF/A, an ISO specification that is a subset of PDF, the file format created in 1993 by Adobe Systems. In 2002, a PDF/A initiative was established to create a global standard for long-term document archiving.
The PDF/A initiative was kicked off in 2002 by AIIM (Association for Information and Image Management), NPES (National Printing Equipment and Supply Association) and the Administrative Office of the U.S. Courts. By 2005, PDF/A had been published as ISO 19005-1, which is the cornerstone standard for electronic document file formats for long-term archiving and preservation. Today, AIIM provides the lead on the PDF/A ISO Standard and the PDF/A Competence Center is the major industry association supporting PDF/A, especially in Europe, where adoption rates are higher than in North America. With all this in mind, it is easy to understand why the PDF/A standard is increasingly being required by governments and implemented by industries around the world.
What were the prior archive format options? In the past, and still today in many organisations, Raster/TIFF imaging of documents was the document archiving strategy of choice. However, this is an obsolete technology for most companies.
Archiving documents as images loses information: no text, structure or individual graphics remain available. In effect, the ‘rasterisation’ of output throws away valuable information when an archive document is created using images alone.
Other organisations choose proprietary vendor formats, which have the drawback of a future that is dependent upon that vendor’s viability, or continued support for that particular solution. These document archiving solutions are not designed to be self-contained, but as an ongoing revenue stream for the vendor.
More recently, XML has been used to archive documents as pure data. Difficulties with this approach remain; for example, it is challenging to duplicate exactly the look-and-feel of the original document. Also, the sheer number of Document Type Definitions (DTDs) and schemas required to implement a long-term document archiving strategy with XML is daunting for most organisations.
Given these issues with Raster, proprietary formats and XML, PDF is the next-generation choice to assume a key role in a standardised document archive strategy, thanks to its wide acceptance as a common file format and its ability to retain exactly the same document look-and-feel.
The PDF/A standard is “a file format based on PDF which provides a mechanism for representing electronic documents in a manner that preserves their visual appearance over time, independent of the tools and systems used for creating, storing or rendering the files”.
The current PDF/A specification, PDF/A-1, is based on the PDF 1.4 specification and has two levels of compliance. Adoption of the first level (PDF/A-1a) ensures the preservation of a document’s logical structure and content text stream in natural reading order. This is critical when the document is displayed on a mobile device (for example a PDA) or on other devices. This feature is commonly known as “Tagged PDF”. Some PDFs are created with sufficient information to meet this requirement; many PDFs created by production business processes do not contain this information and so fall into the second level.
The second level of compliance is referred to as PDF/A-1b. This level is the minimal standard that ensures the rendered visual appearance of the file is reproducible over the long term. Specifically, PDF/A-1b ensures that the text (and additional content) can be correctly displayed (e.g. on a computer monitor or in hardcopy), but does not guarantee that extracted text will maintain the same structure as presented in the original document.
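A PDF/A file states which part and level it claims in its embedded XMP metadata, through the `pdfaid:part` and `pdfaid:conformance` properties. The sketch below shows one way to read that declaration using only the Python standard library. It is illustrative rather than definitive: the function names are hypothetical, and it assumes the metadata is stored uncompressed, is ASCII-compatible, and uses the conventional `pdfaid` prefix. A declared level is only a claim by the file's producer; confirming real conformance requires a dedicated PDF/A validator.

```python
import re
from typing import Optional


def _xmp_value(name: str, text: str) -> Optional[str]:
    """Find a pdfaid property written as an XML attribute or as an element."""
    patterns = (
        rf'pdfaid:{name}\s*=\s*"([^"]+)"',                     # attribute, double quotes
        rf"pdfaid:{name}\s*=\s*'([^']+)'",                     # attribute, single quotes
        rf"<pdfaid:{name}>\s*([^<\s]+)\s*</pdfaid:{name}>",    # element form
    )
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:
            return match.group(1)
    return None


def declared_pdfa_level(pdf_bytes: bytes) -> Optional[str]:
    """Return the declared level, e.g. 'PDF/A-1b', or None if none is declared."""
    # latin-1 maps every byte to a character, so decoding never fails.
    text = pdf_bytes.decode("latin-1")
    part = _xmp_value("part", text)
    conformance = _xmp_value("conformance", text)
    if part is None or conformance is None:
        return None
    return f"PDF/A-{part}{conformance.lower()}"


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        print(declared_pdfa_level(handle.read()))
```

Run against a file path given on the command line, the script prints the declared level, or `None` when the file makes no PDF/A claim.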
PDF/A continues to evolve with Part 2. The PDF/A-2 project was approved in January 2008. It is based on selected functionality from PDF specifications 1.5, 1.6, and 1.7. Part 2 is in development, and the current timetable has it being approved as a specification by the end of 2010. It will be backwards compatible with PDF/A Part 1. Part 3 is currently in the early design stages.
Enterprise Archive Strategies
It is important to understand that PDF/A is only one component of a complete enterprise archive strategy. A complete strategy involves a comprehensive system design and implementation process that considers archive systems, corporate processes and procedures, and legacy data and documents. Detailed knowledge of what is to be archived, as well as awareness of current and future production processes, is invaluable in executing a successful long-term document archive strategy.