PDF/A Making Inroads in the USA

Stephen Levenson, US District Courts and Convener of the ISO PDF/A Committee

Can a four-year-old standard have a past that anyone would care about? The answer is yes, if the past begins with the adoption of the personal computer and the proliferation of desktop document creation. The personal computer (PC), popularized by machines such as the Apple I that Steve Jobs and Steve Wozniak introduced in 1976, revolutionized the way we create and store business documentation. During the 1980s, Dr. John Warnock was developing PostScript as a printer language, and Dr. An Wang's standalone word processor would eventually be emulated on the PC by a product called MultiMate. MultiMate joined other word processing products for the PC from Microsoft, Corel and others. Word processing would become one of the dominant applications for the PC. This is where the future began and the trouble started. Feature-rich formats became problematic for preservation: features that help workflow often need to be turned off for preservation, and that tension persists.

No one was ever really worried about long-term preservation because there was always microfilm or paper. The general consensus was that as long as you could magnify a micrographic image, you had preservation. This seemed to work for homogeneous collections that were easy to organize, such as land records and library card catalogs. You could gather these up under a camera and film them into a complete set.

Paper became the leading choice of indecision. Many collections were processed in paper and then stayed in that form. As long as business documents were not needed in multiple locations these systems worked, but paper still has many problems. In the United States many paper collections have had mold contamination. At a minimum this requires an expensive cleaning process if the records are still needed for business purposes. Storms and floods have taken a large toll on primary record collections. Storms like Hurricane Katrina were particularly devastating because they covered a large geographic area and mitigation was unavailable or came too late to be effective.

Business processes began to require more functionality. Filing from anywhere (Internet), remote access, and remote processing began to be routine business demands. Thus began scanning and direct acceptance of digital files into the business process.

Digitization was becoming the norm and not the exception. Not many years ago the United States National Archives recognized only ASCII and EBCDIC as electronic formats. After all, what else was there? But the personal computer and the proliferation of desktop applications for document creation changed everything.

The cacophony of multiple document creation tools was noise and not instrumentation, and a maestro was needed to tune and tame them. The files they produced did not lend themselves to “easy to organize” sets, and some other technology was needed to meet the new challenges. A consistent format that you could reliably render far into the future (like microfilm) did not exist. That role was filled by PDF.

The good thing about PDF was that it maintained high product quality through market domination by Adobe Systems, Inc. This was true until many PDF writers from other vendors started to enter the market at the beginning of the 21st century. From then on, more than a “.pdf” at the end of a file name was needed to assure quality and reliability.

A business need arose for a testable, independent version of PDF that honoured the needs of the archival community but could still be a practicable business document format. Could we emulate some of microfilm’s properties (e.g. self-contained, verifiable, designed for the long term)?

What else is important for long-term preservation?

We cannot assume what hardware platforms will be used to open files in the future. In the desktop arena, UNIX, Windows and Mac still fight for market share. In fact, desktops will not be the only choice for opening and viewing documents. One only has to look at the proliferation of handheld devices from RIM and Apple today to see that it is not easy to predict where that market may go. It is clear that file formats must be device independent. Dependence on external files also makes preservation more difficult. It is an additional burden to assume that both the digital object and some external object needed to interpret it will be available twenty years from now. The digital object must stand alone as a self-contained object. As an example, if you use a font that is no longer available on the platform where you are rendering the object, then a substitute font is required. This violates the principle of accurate rendering. You are going to get unpredictable results, especially with pagination and special characters. The future of digital preservation is more self-containment and not less. PDF/A is a good start in this direction.

Digital objects must be able to provide documentation for provenance and repurposing. Documents may be collected at the end of a process and verification of that process is essential evidence adding value to the document. PDF/A takes full advantage of XMP to be able to store extensible metadata. In this way the document is able to inherit as much metadata as is necessary to understand fully how it fits into a collection and any other documentation an archivist or record manager might deem important. A minimal portion of the file should be able to be inspected by basic tools. Basic tools should render these parts not only to human readable content, but to machine sortable organization for inspection and audit.
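As a rough illustration of what inspection by basic tools could look like, the sketch below uses only the Python standard library to read an XMP packet and pull out the PDF/A identification properties (pdfaid:part and pdfaid:conformance) along with a Dublin Core title. The sample packet, the function names and the returned dictionary layout are assumptions made for this example; a real packet would be extracted from the PDF's metadata stream, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by XMP/RDF, the PDF/A identification schema, and Dublin Core.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "pdfaid": "http://www.aiim.org/pdfa/ns/id/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical sample packet, invented for illustration only.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <pdfaid:part>1</pdfaid:part>
      <pdfaid:conformance>B</pdfaid:conformance>
      <dc:title>
        <rdf:Alt>
          <rdf:li xml:lang="x-default">Sample court filing</rdf:li>
        </rdf:Alt>
      </dc:title>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def read_property(root, prefix, name):
    """Return a simple XMP property, whether stored as an element or an attribute."""
    # Element form: <pdfaid:part>1</pdfaid:part>
    text = root.findtext(f".//{prefix}:{name}", namespaces=NS)
    if text is not None:
        return text.strip()
    # Attribute form: <rdf:Description pdfaid:part="1" ...>
    key = f"{{{NS[prefix]}}}{name}"
    for description in root.iterfind(".//rdf:Description", namespaces=NS):
        if key in description.attrib:
            return description.attrib[key]
    return None


def summarize_xmp(packet):
    """Reduce an XMP packet to a small, sortable record for inspection or audit."""
    root = ET.fromstring(packet)
    title = root.findtext(".//dc:title/rdf:Alt/rdf:li", namespaces=NS)
    return {
        "part": read_property(root, "pdfaid", "part"),
        "conformance": read_property(root, "pdfaid", "conformance"),
        "title": title.strip() if title is not None else None,
    }


if __name__ == "__main__":
    for key, value in sorted(summarize_xmp(SAMPLE_XMP).items()):
        print(f"{key}: {value}")
```

Because the result is a plain dictionary, records from many documents can be sorted or filtered by part and conformance level, which is the kind of machine-sortable organization described above.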

No discussion of this type can occur without discussing XML. Many advocates contend that you convert everything to XML and you are done. This ignores some critical issues, the biggest of which is that forms and documents are more easily understood by end users. XML looks like computer code. Some end users are ready to work with this, but not many. PDF provides the type of human-readable appearance that is expected in business processes. When moving structured database data between processes, however, XML shines and there is no better format. There is a little-known story: PDF and XML are not mutually exclusive. For example, the Brazilian government will store the XML source in the private data area of the PDF document. This way, the PDF format is used and the XML is also available for reuse. It is the best of both worlds. A new version of PDF/A will consider incorporating XML. So when it comes to XML and PDF, it is not a question of either/or but of which to use based on the business case.

Where are we now with PDF/A?

Only good news has happened to date. The ISO committee has published ISO 19005-2. You may ask why ISO 19005-1 needs to be updated if it was meant to be forever. Well, it will be forever: all documents written to this standard will remain readable under all future updates. But technology moves on and PDF itself has had some changes. These should probably more correctly be called additions or new features. PDF/A picked PDF up at version 1.4, and ISO 19005-1 is based on that. Since that publication, PDF has moved to version 1.7. Adobe then rewrote the specification and gave it to the world, and it was renamed ISO 32000-1 after worldwide ISO adoption.

ISO 19005-1 (PDF/A) is based upon the Adobe specification for PDF 1.4. When we update to ISO 19005-2 it will be based on ISO 32000-1; in plain language, it will be a standard based on a standard.

Tremendous excitement has been generated by the PDF/A Competence Center and its activities over the last year. The PDF/A Competence Center is advancing the standard. In addition, many educational sessions have been held, with many more anticipated. Education has started in the United States, and sessions have already occurred in Chicago and Washington D.C. See www.pdfa.org for more information.

Where is the future?

PDF/A 19005-2 will have been voted upon by the time of the 2010 conference. Work has begun on 19005-3. Both standards will be better because a much larger community has contributed to them. This trend seems to be accelerating, with groups like the PDF/A Competence Center growing in membership and mission. PDF/A 19005-3 will be limited to additional file types that may be included in the format. For example, XML is under consideration as an acceptable embedded file type.

At the first international conference on PDF/A, my keynote acknowledged a “Hall of Fame” of participants who made a critical difference in the emergence and development of this standard. The PDF/A Competence Center has earned its honorable listing.

19005-2 and the coming 19005-3 would not be the quality additions they are without the wisdom of the PDF/A Competence Center. The standard has benefited from the participation of the ISO committee, AIIM, NPES and the PDF/A Competence Center. We look forward to adding you and your wisdom soon.

About PDF Association

Founded in 2006 as the PDF/A Competence Center, the PDF Association exists to promote the adoption and implementation of International Standards for PDF technology.
