Pros and Cons of PDF/A for Long-Term Archiving

3rd International PDF/A Conference • Proceedings • PDF/A up to date • Long-Term Archiving with PDF

How the times have changed! There are hardly any more silly discussions about why TIFF is audit-proof (it actually never was) but not PDF. Or why PDF files can be manipulated (correct, as with every other document format) and TIFF presumably not (although due to its simple bitmap structure, it’s actually easier to edit a TIFF than a PDF). Some three years after PDF/A was published as an ISO standard, opinion about it has completely changed, thanks in part to the countless publications and presentations about PDF/A. PDF and PDF/A are the preferred formats in the DMS / archiving market.

TIFF is still acceptable for black & white scanned documents, but it is past its prime (although some vendors haven’t realized this yet). Unlike TIFF, PDF is suitable not only for input documents but also as a multi-format container for a variety of visual document content on different platforms. Anyone who reads a document on a PDA, mobile phone or iPod Touch without having previously purchased and downloaded a special application is most likely looking at a PDF file. In other words: PDF is the new “top dog”, PDF/A its “ISO-disciplined” offspring, allowing for less creative freedom but offering greater reliability in long-term reproducibility. The question in many current projects is no longer PDF/A or TIFF, but rather PDF/A or PDF.

We believe that the direction of long-term archiving formats has turned towards PDF/A. TIFF will continue to exist in parallel for a while, if only because of the enormous volumes of TIFF files in existing archiving installations. But many new solutions are not even starting with TIFF, and many old systems that have to be migrated for whatever reason are migrating from TIFF to PDF or PDF/A. XPS (Microsoft’s XML Paper Specification) isn’t really out of the starting blocks yet, and may never get started as a viable document archiving format. In the meantime, PDF/A’s head start is increasing. Issues such as the availability of creation and viewing components on different platforms, as well as support from a variety of manufacturers, play an important role for long-term archiving formats. Here PDF/A is the clear winner.

The most common misconceptions about PDF/A

Do you still remember PCText/4, Wang Office, an EBCDIC character set or WordPro? Or how about tiled and banded images for scanned documents? TIFFs with annotations? Even if you cannot remember having ever heard these terms, your “migration advisor” knows the problems associated with them. (Just kidding. But who knows, maybe we will soon need such a profession, if the number of formats continues to grow). These formats are not that old, but are already causing a lot of problems when users try to access their contents (e.g. external users working through a browser).

The longer the retention period, the greater the concern with proprietary formats or so-called standard formats (TIFF) with proprietary enhancements (e.g. annotations). The retention itself is not the problem. But “retention obligation” actually means “reproducibility obligation over the entire retention period”. A small but important difference. The company must ensure that the information worth retaining or that legally must be retained can be reliably reproduced. The more regulated the industry and the more sensitive the documents, the greater the requirements for reliably demonstrating the authenticity of the reproduced document. Anyone who has to archive accident and pension dossiers, patient or life insurance files, power plant documentation or similar documents for decades has a problem with the continual conversion to other formats and with the documented proof that these documents have the same content, and in some cases also the same look, as the originals did. It involves more than just linking a growing variety of content sources (scanner, e-mail, ERP/CRM applications, print factories etc.) with the ECM repository. It is increasingly recognized that a growing number of sources has to be covered while at the same time the looming explosion of different formats and format conversions has to be avoided.

Users had found an alternative to the numerous vendor formats even before PDF/A: the rapid spread of PDF as a multi-format container that can accommodate text, bitmaps and other content was already a substantial improvement. With the publication of PDF/A as an ISO standard for archiving formats, users now have an explicit guarantee that the visual content of documents that meet the standard can be reproduced in their original look in the future, e.g. for several decades to come. However, as with many topics, the devil is in the details. The following typical misconceptions, taken from different projects, illustrate this.

TIFF is a Standard

Yes and no. TIFF is an industry standard because the specification from Aldus (which was acquired by Adobe years ago) is so widely distributed that one can assume that TIFF files conforming to the current TIFF specification 6.0 can actually be viewed with all TIFF viewers. TIFF, however, has never been approved or standardized by an international standardization organization (ISO, DIN etc.). TIFF also contains a number of proprietary aspects, because these are explicitly permitted. In addition to the “public tags” (these are included in TIFF version 6) there are also so-called “private tags”. With these, vendors can introduce their own specifications, for example for creating annotations, with the consequence that, to stay with this example, the annotations from different vendors are not compatible with each other. There are many examples of such proprietary functions (another example is OCR extracts from scanned documents), and the more often they are used, the more often users will notice problems in their daily business, when migrating or when accessing the files from different viewers.
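To make the public/private distinction concrete, here is a minimal Python sketch that lists the tag numbers in the first image file directory (IFD) of a classic TIFF file and flags the private ones. It relies on the TIFF 6.0 convention that private tags are numbered 32768 or higher. The function names and the demo file are illustrative assumptions, and the tag number 40000 is made up for the demonstration; this is not a full TIFF reader.

```python
import struct

PRIVATE_TAG_START = 32768  # TIFF 6.0: tags numbered 32768 or higher are private


def list_tiff_tags(data: bytes) -> list[int]:
    """Return the tag numbers found in the first IFD of a classic TIFF file."""
    byte_order = data[:2]
    if byte_order == b"II":
        endian = "<"  # little-endian
    elif byte_order == b"MM":
        endian = ">"  # big-endian
    else:
        raise ValueError("not a TIFF file: bad byte-order mark")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file: magic number is not 42")
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    tags = []
    for i in range(entry_count):
        # Each IFD entry is 12 bytes; the tag number is its first 2 bytes.
        entry_offset = ifd_offset + 2 + 12 * i
        (tag,) = struct.unpack_from(endian + "H", data, entry_offset)
        tags.append(tag)
    return tags


def private_tags(data: bytes) -> list[int]:
    """Return only the tags in the private (vendor-defined) range."""
    return [tag for tag in list_tiff_tags(data) if tag >= PRIVATE_TAG_START]


def build_demo_tiff() -> bytes:
    """Build a tiny header-plus-IFD byte string for demonstration only."""
    header = struct.pack("<2sHI", b"II", 42, 8)
    entries = [
        struct.pack("<HHII", 256, 3, 1, 100),   # ImageWidth, a public tag
        struct.pack("<HHII", 40000, 3, 1, 0),   # made-up private tag
    ]
    next_ifd = struct.pack("<I", 0)  # 0 means there is no further IFD
    return header + struct.pack("<H", len(entries)) + b"".join(entries) + next_ifd


if __name__ == "__main__":
    print(private_tags(build_demo_tiff()))
```

A viewer that does not know what a given private tag means can only ignore it, which is why vendor-specific annotations stored this way do not travel between products.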

PDF/A is an Audit-Proof Format

False. Not per se. PDF/A documents can be manipulated just like any other document. If you consider the requirements for audit-proof retention (i.e. the effective and documented protection against intentional or accidental manipulation) then you need a system or process that protects the documents. There are no “audit-proof” document formats. It is amazing how this incorrect requirement (because technically it is impossible to implement) hangs around. What can lead to the goal of “audit-proof” with the help of PDF/A together with a DMS is a retention that is safe, true to original and therefore unchanged with respect to content and look. Also, the problem of accidental changes of the page layout doesn’t exist with PDF/A. For example, whether or not a 20-page Word document in 2009 will still have 20 pages in 2019 depends less on the version of Word, and much more on the printer driver that is available then. This may have a different printable area and force Word to use different page breaks than are used today. This is only one example of the advantages of PDF/A when it comes to achieving an audit-proof archive.

PDF/A is a Legal Requirement

No. In Germany, the retention requirements for electronic documents in accordance with trade and tax laws are more or less neutrally formulated with respect to technology. If you must be able to reproduce your documents over long periods of time, PDF/A is really worth looking at, but it is NOT A REQUIREMENT. There are however regulations and process descriptions in some industries (e.g. in social security or documentation requirements for power plants), where organisations are provided with a list of specific document formats that are acceptable for long-term retention. In other cases the electronic document exchange is based on PDF/A from the beginning, for example the DALE records in public insurance organizations, which are submitted as XML and as PDF/A. PDF/A eases the fulfillment of legal requirements, because the user doesn’t have to think about all the details pertaining to long-term reproducibility. No worrying about links to resources, no more versioning of resources, no pondering about backwards compatibility, no concerns about proprietary page logic (e.g. single page TIFF, where only the vendor’s database knows about previous or subsequent pages). And what if password-protected documents or JavaScript animated PDFs manage to creep in through external e-mail? No problem, the PDF/A validator will identify these unwanted intruders and weed them out. Therefore, although it isn’t a categorical obligation, the user will make things a lot easier for himself in fulfilling the requirements when he converts the relevant documents to PDF/A.
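As a rough illustration of what such a check looks for, the following Python sketch scans the raw bytes of a file for the PDF name tokens that mark an encryption dictionary (/Encrypt) and JavaScript actions (/JavaScript, /JS), both of which are prohibited in PDF/A-1. The function name and the message texts are assumptions made for this example. It is a crude pre-screen, not a PDF/A validator: tokens inside compressed object streams are missed, and binary stream data can produce false matches.

```python
def prescreen_pdf(data: bytes) -> list[str]:
    """Return reasons to hold a file back; an empty list means nothing obvious was found."""
    reasons = []
    # An /Encrypt entry in the trailer indicates a password-protected or encrypted file.
    if b"/Encrypt" in data:
        reasons.append("encrypted (/Encrypt found)")
    # JavaScript actions are announced by /JavaScript and carry their code under /JS.
    if b"/JavaScript" in data or b"/JS" in data:
        reasons.append("JavaScript action (/JavaScript or /JS found)")
    return reasons


if __name__ == "__main__":
    sample = b"%PDF-1.4\n1 0 obj << /S /JavaScript /JS (app.alert(1)) >> endobj"
    print(prescreen_pdf(sample))
```

In a real intake process the files that pass such a pre-screen would still go through a proper PDF/A validator, since only that can confirm conformance.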

All Documents can be Converted to PDF/A

No. There is a need for documents that have other content than what is allowed by the ISO standard 19005-1, for example video, audio, transparent annotations etc. In addition, not every document has a printable view, for example a Microsoft Project file. Such documents must still be retained in a different format – typically their original format – because there simply isn’t a suitable standard format that they can be converted into. The guideline is then: convert all documents to PDF/A where it is possible and practical. Future versions of PDF/A with enhanced functionality may make some document and information types convertible, even though such conversion isn’t possible today. PDF/A is not a static standard; further parts of the standard will be developed to respond to new requirements from the market. A further variation is worth mentioning: some users (for example the regional archive in Baden-Württemberg) save specific documents both as PDF/A and in their original format. With this, they have both the valid visualization at the time the document was created and the original format version with non-visible content.

A Bit of PDF/A is not Permitted (e.g. PDF/A but without Embedded Fonts)

Of course it is. Validators can only work with binary logic: “conforming” or “not conforming” – there is no “almost conforming”. The user, however, can decide for himself in which format his documents should be retained, as long as he respects the legal requirements with respect to content and visual identification. A frequent subject of discussion is the requirement in ISO 19005-1 for fonts to be embedded. Without these fonts, a validator will identify a document as being non-conforming to PDF/A. Almost always an auditor, the regulating authority or the person acting as the gatekeeper to the archive will reject such non-conforming files – simply because the validation results show a large red X next to the conformity tag. And a reviewer – quite understandably – cannot accept the files, since he doesn’t necessarily know (or can’t know) what the cause and consequences are. The reviewer likewise adopts the binary logic “conforming vs. non-conforming” presented by the validation software, and rejects the files.

Here are two examples to illustrate the predicament that a user who wants to archive ALL of his documents as 100% PDF/A conforming files now finds himself in: a private bank that creates 100 Word documents daily and wants to save the files in electronic customer dossiers probably has no problem with embedding a subset of two font families in the documents. A large insurance company, creating 50 million documents a year (and this is a realistic number), has a considerable problem with it. If you assume that the two subsets together require 100 KB of storage space per document, then this user would save the fonts 50 million times a year and would need 50 million x 100 KB = 5 TB of storage per year for the fonts alone. And the thought that in 5 years some 25 TB of storage space is filled up only with the same fonts is so absurd that no-one would take it seriously.
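The arithmetic can be reproduced with a few lines of Python. This is only a sketch of the calculation in the text; the constant and function names are chosen for illustration, and decimal units (1 TB = 1,000,000,000 KB) are assumed.

```python
DOCS_PER_YEAR = 50_000_000   # documents created per year (figure from the text)
FONT_KB_PER_DOC = 100        # embedded font subsets per document, in KB (figure from the text)
KB_PER_TB = 1_000_000_000    # assumption: decimal units


def font_storage_tb(docs_per_year: int, font_kb_per_doc: int, years: int = 1) -> float:
    """Storage consumed by embedded fonts alone, in TB."""
    return docs_per_year * font_kb_per_doc * years / KB_PER_TB


if __name__ == "__main__":
    print(font_storage_tb(DOCS_PER_YEAR, FONT_KB_PER_DOC))           # one year
    print(font_storage_tb(DOCS_PER_YEAR, FONT_KB_PER_DOC, years=5))  # five years
```

For the private bank in the first example, the same function with roughly 100 documents per working day yields a negligible amount, which is why embedding is unproblematic there.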

Mind you: this does not include the document information itself. One could argue that the fonts only account for a small portion of the entire storage requirement and that the relationship should be kept in perspective. However, large organisations in particular generally create extremely small print streams in their computing centers (1403, AFP) that require only a few KB per document, since no storage capacity is wasted on visualization resources. Such objects are always linked and not embedded, or one even prints with a bare 1403 line printer format with an economical “retro-look” from the 70’s. There are a number of practical examples where the storage capacity required by the fonts can be up to 10 times larger than the storage capacity required by the content. In cases of such disproportion the user can say: we will use PDF/A but without embedded fonts. He uses the basic requirements for PDF/A and delegates almost the entire responsibility for long-term reproducibility to the ISO standard. The only exception is with regard to the fonts: these he regulates himself under his own responsibility, just as he did (in a legally binding way) in the time before PDF/A. He could for example archive all of the fonts that are used, but only once and not embedded in every document. Or he could embed only the critical fonts (for example the non-Latin fonts for list records or formula fields). Whatever he does, he can make exceptions to the pure doctrine, but carries the responsibility for these exceptions himself.
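The idea of archiving each font only once can be sketched as a small content-addressed store: every font file is saved under its hash, and documents keep only the hash as a reference. The class and method names below are hypothetical and not part of any product or of the PDF/A standard. Documents archived this way do not conform to PDF/A, so keeping fonts and documents together over the retention period remains the user's own responsibility, as described above.

```python
import hashlib


class FontStore:
    """Keeps each distinct font exactly once, keyed by its SHA-256 digest."""

    def __init__(self) -> None:
        self._fonts: dict[str, bytes] = {}

    def add(self, font_bytes: bytes) -> str:
        """Store the font if it is new and return the reference to record with the document."""
        digest = hashlib.sha256(font_bytes).hexdigest()
        self._fonts.setdefault(digest, font_bytes)
        return digest

    def get(self, digest: str) -> bytes:
        """Fetch the font again when a document has to be reproduced."""
        return self._fonts[digest]

    def stored_bytes(self) -> int:
        """Total storage used by all distinct fonts."""
        return sum(len(font) for font in self._fonts.values())


if __name__ == "__main__":
    store = FontStore()
    font = b"\x00" * 100_000  # stand-in for a 100 KB font subset
    references = [store.add(font) for _ in range(1000)]  # 1000 documents, same font
    print(store.stored_bytes())          # storage with one shared copy
    print(len(references) * len(font))   # storage if embedded in every document
```

The trade-off is the one the text names: storage shrinks dramatically, but reproducibility now depends on the store and the references staying intact for as long as the documents do.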

Summary

PDF/A is established as an archiving format. It is an excellent recipe when it comes to ensuring the long-term reproducibility of multifaceted document information. PDF/A and “the other” ISO standard 32000 (the “normal” PDF, based on PDF 1.7) give the user a lot of latitude. There is a tradeoff between almost guaranteed reproducibility (ISO 19005-1, PDF/A) and deviations due to functional or business necessities (ISO 32000, i.e. PDF 1.7, or proprietary formats). PDF/A is a recommendation that one should choose whenever reasonably possible. However, one can deviate from it should it be necessary. But even under these circumstances, PDF/A should be the compass that shows how a document should be archived, so that it can be reproduced for decades to come, without constant format and viewer changes, exactly as it looked when first created.

About Bernhard Zöller

Bernhard Zöller, President Zöller & Partner GmbH, Vice-Chairman VOI

Zöller & Partner GmbH is a vendor and product neutral consulting company, focussed on the fields of document management, enterprise content management and electronic archiving. Since we also deal intensively, on a technical / functional level, with implementing these solutions, the subject of archiving formats and its detailed aspects is a constant theme in virtually all DMS and archiving projects. This was already the case 25 years ago, when Bernhard Zöller conducted the first consulting project and study at Diebold Germany during the early stages of the market. But only since the approval of PDF/A have users had an ISO standardized format for long-term archiving available to them. Accordingly, the consulting work also comprises solution concepts and component selection, including creation, rendition, viewing and other format relevant components. Bernhard Zöller is vice-chairman of the VOI. The partnership between the VOI and the PDF/A Competence Center is coordinated in an executive team, where Bernhard Zöller is the contact person for technical questions. Zöller & Partner GmbH is also a member of the PDF/A Competence Center.
