Table of Contents
Current issues for PDF/A in consulting projects
The most common misunderstandings about PDF/A
- TIFF is a standard
- PDF/A is an audit-proof format
- PDF/A is a legal requirement
- All documents can be converted to PDF/A
- A bit of PDF/A is not permitted (e.g. PDF/A but without embedded fonts)
PDF/A Competence Center Members Introduce Themselves:
Zöller & Partner GmbH
How times have changed! There are hardly any silly discussions left about why TIFF is supposedly audit-proof (it never was) while PDF is not, or why PDF files can be manipulated (true, as with every other document format) while TIFF supposedly cannot (although its simple bitmap structure actually makes a TIFF easier to edit than a PDF). Some three years after PDF/A became available as an ISO standard, opinion about it has changed completely, thanks in part to the countless publications and presentations on the subject. PDF and PDF/A are now the preferred formats in the DMS and archiving market. TIFF is still acceptable for black-and-white scanned documents, but it is past its prime (although some vendors have not realized this yet). Unlike TIFF, PDF is suitable not only for input documents but also as a multi-format container for a variety of visual document content on different platforms. Anyone who reads a document on a PDA, mobile phone or iPod touch without first having purchased and downloaded a special application is most likely looking at a PDF file. In other words: PDF is the new “top dog”, and PDF/A is its “ISO-disciplined” offspring, allowing less creative freedom but offering greater reliability in long-term reproducibility. The question in many current projects is no longer PDF/A or TIFF, but rather PDF/A or PDF.
CeBIT 2009 is just around the corner, and we are curious to see how the ECM/DMS vendors have integrated PDF/A into their capture, rendition, signature and viewing components. The functionality and level of integration on offer differ greatly, but the direction is clear: PDF and PDF/A should be integrated into the end users' content processes as seamlessly as possible.
We believe that long-term archiving formats are moving towards PDF/A. TIFF will continue to exist in parallel for a while, if only because of the enormous volumes of TIFF files in existing archiving installations. But many new solutions no longer start with TIFF at all, and many old systems that have to be migrated for whatever reason are moving from TIFF to PDF or PDF/A. XPS (Microsoft’s XML Paper Specification) is not really out of the starting blocks yet, and may never get started as a viable document archiving format. In the meantime, PDF/A’s head start is growing. Issues such as the availability of creation and viewing components on different platforms, as well as support from a variety of manufacturers, play an important role for long-term archiving formats. Here, PDF/A is the clear winner. That is reason enough to use the forthcoming CeBIT as an opportunity to research the current status of relevant PDF/A topics.
PDF/A Competence Center
President Zöller & Partner GmbH
Current issues for PDF/A in consulting projects
By Bernhard Zöller
Do you still remember PCText/4, Wang Office, the EBCDIC character set or WordPro? How about tiled and banded images for scanned documents, or TIFFs with annotations? Even if you cannot remember ever hearing these terms, your migration agent knows the problems associated with them. (Just kidding. But who knows, we may soon need such a profession if the number of formats continues to grow.) These formats are not that old, but they already cause a lot of problems when users try to access their contents, for example external users working through a browser.
The longer the retention period, the greater the concern about proprietary formats, or about so-called standard formats (TIFF) with proprietary enhancements such as annotations. Retention itself is not the problem. But “retention obligation” actually means “reproducibility obligation over the entire retention period”, a small but important difference. The company must ensure that information that is worth retaining, or that must legally be retained, can be reliably reproduced in its original form. The more regulated the industry and the more sensitive the document, the greater the requirement that the reproduced document be reliably authentic. Anyone who has to archive accident and pension dossiers, patient or life insurance files, power generation and power plant documentation or similar documents for decades has a problem with continual conversion to other formats, and with the documented proof that the converted documents have the same content, and in some cases the same appearance, as the originals. The task involves more than just linking a growing variety of content sources (scanners, e-mail, ERP/CRM applications, print factories etc.) with the ECM repository. It is increasingly recognized that the growing number of sources has to be covered while, at the same time, the threatening explosion of different formats and format conversions has to be avoided.
Users did not have to wait for PDF/A to find an alternative to the numerous vendor formats. They already had one in the rapid spread of PDF as a multi-format container that can accommodate text, bitmaps and other content. With the creation of PDF/A as an ISO standard for archiving, users now have an explicit guarantee that the visual content of documents that meet the specification can be reproduced with their original appearance in the future, even several decades from now. However, as with many topics, the devil is in the details. The following typical misunderstandings, taken from different projects, illustrate this.
The most common misunderstandings about PDF/A
TIFF is a standard
Yes and no. TIFF is an industry standard because the specification from Aldus (now part of Adobe) is so widely adopted that one can assume that TIFF files conforming to the current TIFF specification 6.0 can be viewed with all TIFF viewers. However, TIFF has never been approved or standardized by a standardization organization (ISO, DIN etc.). TIFF also contains a number of proprietary aspects, because these are explicitly permitted. In addition to the “public tags” (these are defined in TIFF version 6) there are also so-called “private tags”. With these, vendors can introduce their own specifications, for example for creating annotations. The consequence, to stay with this example, is that the annotation functions of different vendors are not compatible with each other. There are countless examples of such proprietary functions (a further example is OCR extracts from scanned documents), and the more often they are used, the more often users will notice problems in their daily business, during migration or when accessing the files from different clients.
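The distinction between public and private tags can be checked mechanically. The following is a minimal sketch, assuming a classic (non-BigTIFF) file and inspecting only the first image file directory; it relies on the TIFF 6.0 convention that private tags are numbered 32768 or higher. The function names and the tag number 40000 used in the example are illustrative assumptions, not part of any vendor format.

```python
import struct

# TIFF 6.0 reserves tag numbers from 32768 upward for private (vendor) use.
PRIVATE_TAG_START = 32768


def list_first_ifd_tags(data: bytes) -> list:
    """Return the tag numbers found in the first IFD of a classic TIFF file."""
    byte_order = data[:2]
    if byte_order == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif byte_order == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file: unknown byte-order mark")

    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file: magic number is not 42")

    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    tags = []
    for i in range(entry_count):
        start = ifd_offset + 2 + i * 12  # each IFD entry is 12 bytes long
        (tag,) = struct.unpack(endian + "H", data[start:start + 2])
        tags.append(tag)
    return tags


def private_tags(data: bytes) -> list:
    """Return only the tags that fall into the private range."""
    return [tag for tag in list_first_ifd_tags(data) if tag >= PRIVATE_TAG_START]


def build_sample_tiff() -> bytes:
    """Build a tiny header-plus-IFD sample with one public and one private tag."""
    header = b"II" + struct.pack("<HI", 42, 8)
    entries = [
        struct.pack("<HHII", 256, 3, 1, 1),    # ImageWidth, a public tag
        struct.pack("<HHII", 40000, 2, 1, 0),  # hypothetical private tag
    ]
    ifd = struct.pack("<H", len(entries)) + b"".join(entries) + struct.pack("<I", 0)
    return header + ifd


if __name__ == "__main__":
    print(private_tags(build_sample_tiff()))
```

A file for which `private_tags` returns a non-empty list is not necessarily unreadable, but it carries content that a viewer from another vendor may ignore or misinterpret.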
PDF/A is an audit-proof format
False, at least not per se. PDF/A documents can be manipulated just like any other document. If you consider the requirements for audit-proof retention (i.e. effective and documented protection against intentional or accidental manipulation), then you need a system or process that protects the documents, data and similar files. There are no “audit-proof” document formats. It is amazing how persistently this requirement hangs around, even though it is technically impossible to fulfil. What PDF/A can contribute, together with a DMS, towards the goal of “audit-proof” archiving is retention that is safe and true to the original, and therefore unchanged in content and appearance. The problem of changing page structures also does not exist with PDF/A. For example, whether a 20-page Word document from 2009 will still have 20 pages in 2019 depends less on the version of Word than on the printer driver available at that time. That driver may have a different printable area and, in extreme cases, force Word to use different page breaks than it does today. This is only one example of how PDF/A helps to achieve an audit-proof archive.
PDF/A is a legal requirement
All documents can be converted to PDF/A
No. Some documents need content other than what is allowed by ISO 19005-1, for example video, audio or transparent annotations. In addition, not every document has a printable view, for example an MS Project file. Such documents must still be retained in another format, typically their original format, because there simply is no suitable standard format to which they can be converted. The guideline is therefore: convert all documents to PDF/A where it is possible and practical. Future versions of PDF/A with enhanced functionality may make some document and information types convertible even though the conversion is not possible today. PDF/A is not a static specification; it will be developed further to respond to new requirements from the market. One further variation is worth mentioning: some users (for example the regional archive in Baden-Württemberg) save specific documents both as PDF/A and in their original format. In this way they have both the valid visualization at the time the document was created and the formal version with its non-visible content.
A bit of PDF/A is not permitted (e.g. PDF/A but without embedded fonts)
Of course it is permitted. Validators can only work with binary logic: “conforming” or “not conforming”. There is no “almost conforming”. The user, however, can decide for himself in which format his documents should be retained, as long as he respects the legal requirements regarding content and visual fidelity. A frequent subject of discussion is the requirement in ISO 19005-1 that fonts be embedded. Without these fonts, a validator will identify a document as not conforming to PDF/A. This ultimately leads to an auditor, the regulating authority or whoever is tasked with accepting the documents rejecting the files, because the validation results show a large red X next to the conformity check. A reviewer, quite understandably, cannot accept the files, since he does not necessarily know (or cannot know) what the cause and consequences are. The reviewer adopts the binary logic “conforming / non-conforming” presented by the software and rejects the files.
Two examples illustrate the predicament of a user who wants to archive ALL of his documents as 100% PDF/A-conforming files. A private bank that creates 100 Word documents daily and wants to save the files in electronic customer dossiers probably has no problem with embedding a subset of two font families in each document. A large insurance company creating 50 million documents a year (and this is a realistic number) has a considerable problem with it. If you assume that the two subsets require 100 KB of storage space, then this user would save the same fonts 50 million times a year and would need 50 million x 100 KB = 5 TB of storage space per year for the fonts alone. The thought that after 5 years some 25 TB of storage space is filled with nothing but the same fonts is so absurd that no one would take the idea seriously.
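The arithmetic behind these figures can be reproduced with a short sketch. It uses the numbers from the two examples above and decimal units (1 TB = 1,000,000,000 KB). The function name and the assumption of 250 working days per year for the private bank are illustrative and not taken from a real project.

```python
KB_PER_TB = 1_000_000_000  # decimal units: 1 TB = 10^9 KB


def font_overhead_tb(docs_per_year: int, font_kb_per_doc: int, years: int = 1) -> float:
    """Storage consumed by embedded font subsets alone, in terabytes."""
    return docs_per_year * font_kb_per_doc * years / KB_PER_TB


if __name__ == "__main__":
    # Large insurance company: 50 million documents a year, 100 KB of fonts each.
    print(font_overhead_tb(50_000_000, 100))           # one year
    print(font_overhead_tb(50_000_000, 100, years=5))  # five years

    # Private bank: 100 documents a day, assuming 250 working days a year.
    print(font_overhead_tb(100 * 250, 100))
```

Under these assumptions the insurance company needs 5 TB per year and 25 TB over five years for fonts alone, while the private bank needs about 0.0025 TB (2.5 GB) per year.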
Mind you, this does not include the document information itself. One could argue that the fonts account for only a small portion of the total storage requirement and that the relationship should be kept in perspective. However, large users in particular generally create extremely small print streams in their computing centers (1403, AFP) that require only a couple of KB per document, since no space is wasted on visualization resources. Such resources are always linked rather than embedded, or one even prints in a bare 1403 line printer format with an economical “retro look” from the 1970s. There are a number of practical examples where the storage space required by the fonts is up to 10 times larger than the space required by the content. In cases of such disproportion the user can say: we will use PDF/A, but without embedded fonts. He applies the basic requirements of PDF/A and delegates almost the entire responsibility for long-term reproducibility to the ISO standard. The only exception is the fonts: these he handles under his own responsibility, just as he did (in a legally binding way) in the time before PDF/A. He could, for example, archive all of the fonts that are used, but only once and not embedded in every document. He could embed only the critical fonts (for example the non-Latin fonts for list records or formula fields). Whatever he does, he can make exceptions to the pure doctrine, but he bears the responsibility for these exceptions himself.
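The disproportion can be made concrete by comparing the two strategies. This is a rough sketch only: the 10 KB content size is an assumed value chosen to match the "10 times larger" ratio mentioned above, and the function name is hypothetical.

```python
def archive_size_kb(doc_count: int, content_kb: int, font_kb: int, embed_fonts: bool) -> int:
    """Total archive size in KB for one of the two font strategies.

    embed_fonts=True:  every document carries its own copy of the font subsets.
    embed_fonts=False: the fonts are archived once, alongside the documents.
    """
    if embed_fonts:
        return doc_count * (content_kb + font_kb)
    return doc_count * content_kb + font_kb


if __name__ == "__main__":
    docs = 50_000_000  # documents per year, from the insurance example
    content_kb = 10    # assumed size of the content per document
    font_kb = 100      # font subsets per document, from the insurance example

    embedded = archive_size_kb(docs, content_kb, font_kb, embed_fonts=True)
    shared = archive_size_kb(docs, content_kb, font_kb, embed_fonts=False)
    print(embedded, shared, round(embedded / shared, 1))
```

With these assumed values the fully conforming archive is roughly eleven times larger than the one that stores the fonts once. The second variant no longer passes PDF/A validation, which is exactly the trade-off for which the user has to take responsibility himself.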
PDF/A is established as an archiving format. It is an excellent recipe for ensuring the long-term reproducibility of multifaceted document information. PDF/A and “the other” ISO standard, 32000 (the “normal” PDF, based on PDF 1.7), give the user a lot of latitude. There is a tradeoff between almost guaranteed reproducibility (ISO 19005-1, PDF/A) and deviations due to functional or business necessities (ISO 32000, i.e. PDF 1.7, or proprietary formats). PDF/A is a recommendation that one should follow whenever reasonably possible. One can deviate from it should that be necessary. But even then, PDF/A should be the compass that shows how a document should be archived so that, without constant format and viewer changes, it can be reproduced for decades to come exactly as it looked when first created.
PDF/A Competence Center Members Introduce Themselves
Zöller & Partner GmbH
By Bernhard Zöller, President
Zöller & Partner GmbH is a strictly vendor- and product-neutral consulting company focused on document management, enterprise content management and electronic archiving. Since we also deal intensively with the technical and functional implementation of these solutions, archiving formats and their detailed aspects are a constant theme in virtually all of our DMS and archiving projects. This was already the case 25 years ago, when Bernhard Zöller conducted his first consulting project and study at Diebold Germany during the early stages of the market. But only since the approval of PDF/A have users had an ISO-standardized format for long-term archiving available to them. Accordingly, our consulting work also covers solution concepts and component selection, including creation, rendition, viewing and other format-relevant components.
Bernhard Zöller is vice-chairman of the VOI. The partnership between the VOI and the PDF/A Competence Center is coordinated by an executive team, in which Bernhard Zöller is the contact person for technical questions. Zöller & Partner GmbH is also a member of the PDF/A Competence Center.
More information can be found at: www.zoeller.de