About the contributor
Duff Johnson

A veteran of the electronic document space, Duff Johnson is an independent consultant, Executive Director of the PDF Association and ISO Project co-Leader (and US TAG chair) for ISO 32000 and ISO 14289.

PDFpackage – a lightweight content management platform


In a recent article I discussed using PDF as a container to organize, transport and archive collections of content. Since then I’ve had numerous discussions about this idea with members of the PDF technology and related communities. This article is an attempt to consolidate the substantial (and notably favorable) feedback. Written to stand alone, this article necessarily repeats several aspects of the previous article.

The basic facts

No-one owns PDF, but it’s accepted worldwide as the dominant end-state electronic document format.

The format includes rich capabilities in an ISO-standardized, vendor-independent specification.

Today, few customers leverage PDF’s richer capabilities.

That could change.

Background

Relevant technical attributes

PDF is a self-contained, platform-independent page-description model with electronic document features. PDF/A, the archival subset of PDF, ensures reliable archival-grade electronic documents while accommodating virtually any arrangement of text and graphics that can be rendered.

PDF includes a set of features that, taken together, are unique to the format.

  • A key feature of PDF is XMP metadata: onboard XML (via an XMP Extension Schema) that may be associated with any PDF document, page, object or semantic structure element.
  • PDF’s ‘embedded file streams’ (PDF 1.3) and ‘associated files’ (PDF/A-3 and PDF 2.0) features allow containment and characterization of arbitrary content (e.g., the kinds of files commonly found as email attachments) within the PDF file.
  • PDF includes rich interactive features: navigation (links and bookmarks), markup, annotations and JavaScript. Most of these features can use embedded content such as movies, 3D objects and MathML.
  • PDF files generated from structured content are readily created with built-in navigation and accessibility features.
  • Page-content may be reliably redacted to remove personally-identifiable information (PII) or other content.
  • PDF files from “analog” workflows (e.g. scanned or faxed pages) are readily intermixed with PDF pages from other sources.
  • PDF includes encryption capabilities that enable many types of workflows.
  • PDF includes digital signature capabilities for tamper-proofing, authentication and more.
  • PDF may be readily extended to facilitate rich experiences such as reflow and access to live data more commonly associated with HTML-based technologies.
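
To make the XMP item in the list above more concrete, here is a minimal sketch of what such onboard XML can look like, built with Python's standard library. The `x`, `rdf` and `dc` namespaces are the ones XMP packets commonly use; the `pkg` namespace and its `recordType` property are purely hypothetical stand-ins for an extension schema. The sketch omits the `<?xpacket?>` wrapper and does not show embedding the packet in a PDF file.

```python
import xml.etree.ElementTree as ET

NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    # Hypothetical namespace standing in for an extension schema.
    "pkg": "http://example.org/ns/pdfpackage/1.0/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def q(prefix, local):
    """Return a namespace-qualified tag name in ElementTree's {uri}local form."""
    return f"{{{NS[prefix]}}}{local}"


def build_xmp(title, record_type):
    """Build a small XMP-style metadata tree and return it as a string."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    # dc:title is stored as a language alternative (rdf:Alt) with a default entry.
    title_el = ET.SubElement(desc, q("dc", "title"))
    alt = ET.SubElement(title_el, q("rdf", "Alt"))
    li = ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"})
    li.text = title

    # Hypothetical custom property; a real profile would define its own.
    record = ET.SubElement(desc, q("pkg", "recordType"))
    record.text = record_type

    return ET.tostring(xmpmeta, encoding="unicode")


packet = build_xmp("Case 2018-001", "case-file")
print(packet)
```

Because the packet is ordinary XML, any XML-aware tool can read or index it without understanding the rest of the PDF file.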

The strategic marketplace

It may have been invented before the dawn of the modern web, but PDF has no real competitor in the broad and deep space it occupies in business and consumer technology. Indeed, the format’s mindshare continues to grow. Today, Google Docs, Office 365 and web-based technologies notwithstanding, Google Trends data shows that PDF is even more the de facto electronic document format than it was 10 years ago.

Why?

PDF supplies the lowest-common-denominator vehicle for sharing content. The format facilitates exchange between users and organizations because the format delivers equivalent – and flawless – representation on any system. Any technology that could replace PDF at this task would have to equal (at least) PDF’s enabling features.

Nonetheless, the organizations that rely on electronic documents have needs that extend beyond PDF’s broadly-supported “electronic paper” features. These include:

  • A standardized, vendor-neutral approach to retaining and future-proofing end-state user content such as email and case files.
  • A lightweight records and content-management technology platform enabling vertical solutions across every business sector.
  • A way to share rich, dynamic content experiences without giving up (or while leveraging) PDF’s fundamental features.
  • A roadmap for affirmatively managing electronic content for authentication, long-term retention and security (inc. GDPR) purposes.

Businesses focussed on each of these needs are already worth many billions of dollars, but today no-one addresses all of them in an independent, consistent and interoperable (i.e., PDF-like) way. None of them yet harnesses the power of PDF to drive their solutions, because few think of PDF beyond its core ‘electronic paper’ skill-set. And yet, PDF is undeniably familiar where it counts – with the users. The marketplace is already very comfortable with the PDF model, even if today’s vendors don’t yet take full advantage of it. For vendors, this is not a stable situation.

The PDFpackage concept

[Figure: The “PDF Package” from a 2008 post in Adobe’s Acrobat for Legal Professionals blog.]

It’s not a new idea per se – Adobe, Nuance, Foxit, BlueBeam and others have leveraged PDF’s embedded files feature to develop their own “package” implementations over the past 10-15 years. Although some of these are worthy efforts, they did not establish technology platforms that any user could leverage. They were not standards.

The PDFpackage future depends on providing enough structure to enable interoperability while letting the users (and user communities) define everything else their own way. Here’s what vendors need to do:

  1. Support PDF 2.0, including PDFpackage-related features (a handy list is included above).
  2. Add an industry-supported PDFpackage profile describing appropriate usage for email archiving (for example) in a PDFpackage.
  3. Stand back, and let the customer apply any content provisions or business rules they want.

Collecting institutions will then know that email records retain necessary header information, metadata, links and attachments in a consistent structure that they may characterize, index, validate, maintain and provide access to in an efficient way.

PDFpackage profiles can be leveraged to provide trusted, predictable containers for record types that often present workflow and preservation challenges, with email and case files (associated files in arbitrary formats) as primary use-cases. Community-developed profiles can solve utilization, preservation and access challenges for specific domains.

The big idea

PDF may be a relatively straightforward usage model away from a silver-bullet for archiving many types of electronic content, and useful in operational workflows as well.

Data retention, meet data protection

There’s something about rendering

A former developer at Sun Microsystems explained their mid-1990s rendering-based “distributed ledger” methodology to me:

“This was a bit before blockchain. We’d digitally sign a document, then publish the resulting MD5 hash in the next day’s Boston Globe. As a certificate authority it left something to be desired, but it was the best way we could think of to indelibly record the signature.”

Although counter-intuitive to HTML-oriented developers, PDF’s unique feature-set makes the format ideal for archiving email and “case files” – arbitrary collections of content.

Adobe’s PDF “portfolios” concept teased this idea, but was hobbled by the lack of a standardized means of leveraging PDF attachments via an interoperable set of profiles.

Why is PDF’s basis in a page-description model so important? Only a rendering can meet all handling requirements across the entire electronic document landscape. Accordingly, only an approach that considers rendering can readily accommodate existing rendered content.

One clear example of this need is the case of information security.

Modern data-protection regulations (notably the GDPR, which takes effect in the EU in May 2018) include penalties so stringent that they are driving businesses towards comprehensive control over the documents they keep. Data retention regulations, on the other hand, tend to demand that information be retained… and, if redactions are necessary, that information about the redaction (volume, context, purpose, etc.) be retained as well.

If documents and email are archived using data structures that do not include a rendering, the content may be somewhat easier for familiar software to re-use, but the fundamental need to retain content safely – and in a readily human-readable fashion – becomes difficult or impossible to meet.

The PDF paradigm for archiving… anything

A PDFpackage model would leverage PDF to suit the email and case-files archiving use-case by using PDF’s features in the following way:

  • Rendering: PDF/A presents a generic archival model for representing rendered electronic content.
  • Metadata: Appropriate metadata (such as that defined by PREMIS) may be included in the PDF using an XMP metadata schema identified in the PDFpackage model. Content semantics may be preserved for reuse by way of tagged PDF.
  • Attachments: PDFpackage files used for email archives would contain (and thus associate) email attachments with the baseline representation. The PDFpackage model would include requirements and encourage best-practices in using this feature, including attachment identification and processing requirements and recommendations. If appropriate to the use-case, source-files (mailboxes) may also be embedded in the PDFpackage. For case-files, the PDFpackage includes cover materials, tables of contents, indexes or other information as appropriate to the case file type. Associated files are included as attachments, with appropriate metadata stored in the containing PDF file’s XMP.
  • Navigation: The PDFpackage’s XMP metadata may be used to find attached documents; within these, outlines and links can provide rich navigation.
  • Data protection (contents): Rendered page content may be authoritatively redacted using conventional (and long-standing) tools in an archival context.
  • Data protection (document): PDF’s encryption facilities may be leveraged to protect the document in live workflow settings.
  • Authenticity: PDF’s digital signature facilities may be leveraged to verify authorship and prevent tampering.
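
One way to picture how a profile could constrain these features is a small manifest check. The sketch below is an illustration only: the `PackageManifest` and `Attachment` classes, the required header fields and the `check_email_profile` rules are all invented for this example and do not come from any published PDFpackage profile. The relationship names follow the values PDF/A-3 lists for associated files; treat the set as illustrative rather than exhaustive.

```python
from dataclasses import dataclass, field

# Relationship names as listed for associated files in PDF/A-3 (illustrative set).
AF_RELATIONSHIPS = {"Source", "Data", "Alternative", "Supplement", "Unspecified"}

# Hypothetical header fields an email profile might require in the metadata.
REQUIRED_EMAIL_HEADERS = ("from", "to", "date", "subject")


@dataclass
class Attachment:
    filename: str
    media_type: str
    relationship: str = "Unspecified"


@dataclass
class PackageManifest:
    record_type: str            # hypothetical label, e.g. "email" or "case-file"
    rendering_conformance: str  # e.g. "PDF/A-3b"
    metadata: dict = field(default_factory=dict)
    attachments: list = field(default_factory=list)


def check_email_profile(manifest):
    """Return a list of problems under a made-up email archiving profile."""
    problems = []
    if not manifest.rendering_conformance.startswith("PDF/A"):
        problems.append("rendering is not PDF/A")
    for key in REQUIRED_EMAIL_HEADERS:
        if key not in manifest.metadata:
            problems.append(f"missing header metadata: {key}")
    for att in manifest.attachments:
        if att.relationship not in AF_RELATIONSHIPS:
            problems.append(
                f"{att.filename}: unknown relationship {att.relationship!r}"
            )
    return problems


manifest = PackageManifest(
    record_type="email",
    rendering_conformance="PDF/A-3b",
    metadata={
        "from": "a@example.org",
        "to": "b@example.org",
        "date": "2018-01-15",
        "subject": "Quarterly report",
    },
    attachments=[Attachment("report.docx", "application/octet-stream", "Source")],
)
print(check_email_profile(manifest))
```

The point of the sketch is the division of labor: the profile fixes a small, checkable structure, and everything else is left to the user community.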

Leveraging PDF in this way as the universal solution for end-state content management invites vendors to consider supporting other features of modern PDF. If it’s possible to convert a PDFpackage-based email archive back into a mailbox, that same file will also be highly accessible and navigable, with excellent performance and extensibility in any reuse context.

PDF as package: an example

The German ZUGFeRD specification details the use of PDF/A-3 files in a live workflow for electronic invoicing purposes. The “human readable” PDF/A-3 invoice includes, as an attachment, a “machine-readable” XML version of the invoice. The result: automated invoice processing at a low cost and 100% ready for the archive.
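
The value of the machine-readable attachment can be sketched in a few lines: a receiving system reads the XML directly instead of interpreting the rendered page. The XML below is invented and heavily simplified for illustration; it is not the ZUGFeRD schema, and the function names are assumptions of this example. Extracting the attachment and the page text from the PDF is not shown.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Invented, simplified invoice XML; the real ZUGFeRD schema is far richer.
INVOICE_XML = """<invoice number="2018-0042" currency="EUR">
  <line description="Consulting" quantity="3" unit_price="120.00"/>
  <line description="Travel" quantity="1" unit_price="85.50"/>
</invoice>"""


def invoice_total(xml_text):
    """Sum quantity * unit_price over all line items, using exact decimals."""
    root = ET.fromstring(xml_text)
    return sum(
        (
            Decimal(line.get("quantity")) * Decimal(line.get("unit_price"))
            for line in root.iter("line")
        ),
        Decimal("0"),
    )


def matches_rendering(xml_text, rendered_text):
    """Check that the total computed from the XML appears in the page text."""
    return f"{invoice_total(xml_text):.2f}" in rendered_text


print(invoice_total(INVOICE_XML))
print(matches_rendering(INVOICE_XML, "Total due: EUR 445.50"))
```

A check like `matches_rendering` hints at why the hybrid approach suits archives: the human-readable page and the data travel together in one file and can be compared against each other.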

The business case

The potential for PDFpackage is nothing less than disruption of the current desktop, server and cloud-based paradigms for document creation, handling, finalizing and sharing. Even in the professional’s Office 365-dominated world, users already use PDF as their generic platform for sharing end-state content. PDFpackages, although a dramatic upgrade, will also seem very natural. The new world can simply grow from within the old one.

Leveraging the already-accepted electronic document format as a generic packaging model for making information portable means business in:

  • Authoring software designed to leverage PDF 2.0 and PDFpackage features directly.
  • Conversion software capable of creating PDFpackage files from a myriad of sources.
  • Processing (including business-intelligence) software to use the contents of PDFpackage files.
  • Management software to handle PDFpackage content (in addition to conventional PDF) as part of a vertical solution.
  • Mobile and server-based solutions to re-use PDFpackage content in appropriate ways.
  • Implementation consulting and training services… to replace all the obsolete document-management systems 🙂

Conclusion

Rendering, the process of resolving code into human-readable content, is what you do to make a sharable, accountable thing. As such, it’s rendering – not encoding – that is the truly meaningful act.

Mailbox files and proprietary source-content fit certain needs. These formats will never satisfy the universal requirement for a generic, self-contained and readily-consumable electronic record representing a given body of content at a given moment in time – its ‘rendering’. As such, PDF pages and XMP metadata, together with the other enabling features of the Portable Document Format, offer a practical and vendor-neutral, fully interoperable solution to finalizing, managing and using document-like electronic content.

Discussion

If the PDFpackage concept interests you, let’s discuss it at PDF Days Europe this coming May!

