About the contributor
Duff Johnson

A veteran of the electronic document space, Duff Johnson is an independent consultant, Executive Director of the PDF Association and ISO Project co-Leader (and US TAG chair) for ISO 32000 and ISO 14289.
PDFcontainer – a proposal


Introduction

PDF technology profiles can be leveraged to provide trusted, predictable containers for record types that often present workflow and preservation challenges, with email and case files (associated files in arbitrary formats) as primary use-cases. Community-developed profiles can solve utilization, preservation and access challenges for specific domains.

PDF may be just a relatively straightforward usage model away from being a silver bullet for archiving many types of electronic content, and it is useful in operational workflows as well. Using PDF as a vessel in which to transport other content and associated metadata is not a new idea, but it may be time to put it to work at scale.

Background

Accepted worldwide as the de facto electronic document format, PDF includes embedded-file, metadata, navigation, data-protection and accessibility/reuse features in an ISO-standardized, vendor-independent specification. Various subset specifications cater to diverse needs in many different industries. Even as HTML implementations expand, PDF’s mindshare continues to grow.

Technical background

PDF is a self-contained, platform-independent page-description model with electronic document features. PDF/A, the archival subset of PDF, ensures reliable archival-grade electronic documents, and accommodates virtually any arrangement of text and graphics that can be rendered.

  • A key feature of PDF is XMP metadata: onboard XML (via an XMP Extension Schema) that may be associated with any PDF document, page, object or semantic structure element.
  • PDF’s ‘embedded file streams’ (PDF 1.3) and ‘associated files’ (PDF/A-3 and PDF 2.0) features allow containment and characterization of arbitrary content (e.g., as is commonly found in email attachments) within the PDF file.
  • PDF includes interactive features to facilitate navigation (links and bookmarks) and markup.
  • PDF files generated from structured content are readily created with built-in navigation and accessibility features.
  • Page-content may be reliably redacted.
  • PDF files from “analog” workflows (e.g. scanned or faxed pages) are readily intermixed with other PDF files at the page level.
  • PDF includes encryption facilities vital to many types of workflows.
  • PDF includes digital signature capabilities that (in the published PDF 2.0 and upcoming PDF/A-4 specifications) support long-term validation of documents.
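
To make the XMP item above concrete, here is a minimal Python sketch of the kind of metadata a container file might carry. The `x`, `rdf`, `dc` and `pdfaid` namespaces are the standard ones; the `pdfc` namespace, its `containerType` property and the `build_xmp` helper are hypothetical names invented for this illustration. A real PDF/A-3 file would also need an extension schema description for any such custom property, plus the usual XMP packet wrapper; both are omitted here.

```python
import xml.etree.ElementTree as ET

# Standard XMP/RDF, Dublin Core and PDF/A identification namespaces,
# plus one hypothetical namespace for container-specific properties.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "pdfaid": "http://www.aiim.org/pdfa/ns/id/",
    "pdfc": "http://example.org/ns/pdfcontainer/1.0/",  # hypothetical
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, local):
    """Return the ElementTree qualified name for prefix:local."""
    return "{%s}%s" % (NS[prefix], local)


def build_xmp(title, container_type):
    """Build a small XMP metadata tree and return it as a string."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    # dc:title is a language alternative: rdf:Alt with one rdf:li per language.
    alt = ET.SubElement(ET.SubElement(desc, q("dc", "title")), q("rdf", "Alt"))
    ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"}).text = title

    # PDF/A identification: part 3, conformance level B.
    ET.SubElement(desc, q("pdfaid", "part")).text = "3"
    ET.SubElement(desc, q("pdfaid", "conformance")).text = "B"

    # Hypothetical profile property naming the kind of container.
    ET.SubElement(desc, q("pdfc", "containerType")).text = container_type
    return ET.tostring(xmpmeta, encoding="unicode")


print(build_xmp("Mailbox export", "email"))
```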

This collection of features is unique to PDF. Add a profile describing appropriate usage for email archiving (for example) in a PDFcontainer… and then stand back to let the customer apply any content provisions or business rules they want. Collecting institutions will then know that email records retain necessary header information, metadata, links and attachments in a consistent structure that they may characterize, validate, maintain and provide access to in an efficient way.

Data retention, meet data protection

There’s something about rendering

A former developer at Sun Microsystems explained their mid-1990s rendering-based “distributed ledger” methodology to me:

“This was a bit before blockchain. We’d digitally sign a document, then publish the resulting MD5 hash in the next day’s Boston Globe. As a certificate authority it left something to be desired, but it was the best way we could think of to indelibly record the signature.”

Although counter-intuitive to HTML-oriented developers, PDF’s unique feature set makes the format ideal for archiving email and “case files” – arbitrary collections of content.

Today’s PDF “portfolios” tease this idea, but are hobbled by the lack of an open best practice or specification for using PDF attachments, via an interoperable set of profiles, for the long-term storage of email and collections of arbitrary data.

Why is PDF’s basis in a page-description model so important? Only a rendering can meet all handling requirements across the entire electronic document landscape. Accordingly, only an approach that considers rendering can readily accommodate existing rendered content. One clear example of this need is the case of information security.

Modern data-protection regulations (notably the GDPR, in effect in the EU from May 2018) include penalties so stringent that they are driving businesses towards comprehensive control over the documents they keep. Data retention regulations, on the other hand, tend to demand that information be retained… and, if redactions are necessary, that information about the redaction (volume, context, purpose, etc.) be retained as well.

If documents and email are archived using data structures that do not include a rendering, the content may be somewhat easier for familiar software to re-use, but the fundamental need to retain content safely – and in a readily human-readable fashion – becomes difficult or impossible to meet.

The PDF paradigm for archiving… anything

A PDFcontainer model would leverage PDF to suit the email and case-files archiving use-case by using PDF’s features in the following way:

  • Rendering: PDF/A presents a generic archival model for representing rendered electronic content.
  • Metadata: Appropriate metadata (such as that defined by PREMIS) may be included in the PDF using an XMP metadata schema identified in the PDFcontainer model. Content semantics may be preserved for reuse by way of tagged PDF.
  • Attachments: PDFcontainer files used for email archives would contain (and thus associate) email attachments with the baseline representation. The PDFcontainer model would include requirements and encourage best practices in using this feature, including attachment identification and processing requirements and recommendations. If appropriate to the use-case, source files (mailboxes) may also be embedded in the PDFcontainer. For case files, the PDFcontainer includes cover materials, tables of contents, indexes or other information as appropriate to the case file type. Associated files are included as attachments, with appropriate metadata stored in the containing PDF file’s XMP.
  • Navigation: The PDFcontainer’s XMP metadata may be used to find attached documents; within these, outlines and links can provide rich navigation.
  • Data protection (contents): Rendered page content may be authoritatively redacted using conventional (and long-standing) tools in an archival context.
  • Data protection (document): PDF’s encryption facilities may be leveraged to protect the document in live workflow settings.
  • Authenticity: PDF’s digital signature facilities may be leveraged to verify authorship and prevent tampering.
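
As a rough illustration of what a profile adds, the Python sketch below checks an in-memory description of a container against a few rules of the kind a PDFcontainer profile for email might state. Everything here is an assumption made for illustration: the `Container` and `Attachment` classes, the list of required email keys and the `check_container` function are hypothetical and not taken from any published profile.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Attachment:
    filename: str
    media_type: str = ""    # e.g. "application/pdf"
    relationship: str = ""  # e.g. "Source" for an embedded mailbox file


@dataclass
class Container:
    kind: str                        # "email" or "case-file" (hypothetical)
    pdfa_part: Optional[int] = None  # PDF/A part claimed by the rendering
    metadata: Dict[str, str] = field(default_factory=dict)
    attachments: List[Attachment] = field(default_factory=list)


# Hypothetical rule: header fields an email profile might require in XMP.
REQUIRED_EMAIL_KEYS = ("from", "to", "date", "subject")


def check_container(container: Container) -> List[str]:
    """Return a list of human-readable problems; empty means no findings."""
    problems = []
    if container.pdfa_part is None:
        problems.append("rendering is not identified as PDF/A")
    if container.kind == "email":
        for key in REQUIRED_EMAIL_KEYS:
            if not container.metadata.get(key):
                problems.append("missing email metadata: " + key)
    for att in container.attachments:
        if not att.media_type:
            problems.append(att.filename + ": no media type recorded")
        if not att.relationship:
            problems.append(att.filename + ": no relationship recorded")
    return problems


incomplete = Container("email", None, {"from": "a@example.org"},
                       [Attachment("scan.tif")])
for problem in check_container(incomplete):
    print(problem)
```

The point of the sketch is the division of labor: the profile fixes where header information, metadata and attachments live, so a collecting institution can validate any conforming file with the same small set of checks.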

PDF as container: an example

The German ZUGFeRD specification details the use of PDF/A-3 files in a live workflow for electronic invoicing purposes. The “human readable” PDF/A-3 invoice includes, as an attachment, a “machine-readable” XML version of the invoice. The result: automated invoice processing at a low cost and 100% ready for the archive.
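
The underlying mechanics can be sketched with nothing but the Python standard library. The example below hand-assembles a bare-bones PDF whose single (blank) page is accompanied by an embedded XML file declared as an associated file. It is a structural sketch only: the output is neither a conforming PDF/A-3 file nor a ZUGFeRD invoice (there is no XMP, no output intent and no page content), the XML payload is invented, and the `build_pdf_with_attachment` helper is a hypothetical name.

```python
def build_pdf_with_attachment(filename: str, payload: bytes) -> bytes:
    """Assemble a minimal PDF carrying one embedded, associated file."""
    # Sketch limitation: ASCII file names without parentheses or backslashes.
    name = filename.encode("ascii")
    objects = [
        # 1: catalog, listing the file in the EmbeddedFiles name tree and /AF
        b"<< /Type /Catalog /Pages 2 0 R /AF [4 0 R]"
        b" /Names << /EmbeddedFiles << /Names [(" + name + b") 4 0 R] >> >> >>",
        # 2: page tree
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        # 3: one blank A4 page standing in for the human-readable rendering
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842]"
        b" /Resources << >> >>",
        # 4: file specification; /AFRelationship characterizes the attachment
        b"<< /Type /Filespec /F (" + name + b") /UF (" + name + b")"
        b" /AFRelationship /Alternative /EF << /F 5 0 R >> >>",
        # 5: the embedded file stream itself (media type text/xml)
        b"<< /Type /EmbeddedFile /Subtype /text#2Fxml /Length "
        + str(len(payload)).encode("ascii")
        + b" >>\nstream\n" + payload + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.7\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj"
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: each entry is exactly 20 bytes long.
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)


# Invented payload; a real invoice would follow the applicable XML schema.
invoice_xml = b'<invoice><total currency="EUR">119.00</total></invoice>'
pdf_bytes = build_pdf_with_attachment("invoice.xml", invoice_xml)
with open("invoice-sketch.pdf", "wb") as handle:
    handle.write(pdf_bytes)
```

In practice a PDF library and a PDF/A validator would do this work; the sketch is only meant to show that the “container” is ordinary PDF structure: a file specification, an embedded file stream, and a relationship entry that tells software how the attachment relates to the rendered pages.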

Conclusion

Rendering, the process of resolving code into human-readable content, is what you do to make a sharable, accountable thing. As such, it’s rendering – not encoding – that is the truly meaningful act.

Mailbox files and proprietary source-content fit certain needs. These formats will never satisfy the universal requirement for a generic, self-contained and readily-consumable electronic record representing a given body of content at a given moment in time – its ‘rendering’. As such, PDF pages and XMP metadata, together with the other enabling features of the Portable Document Format, offer a practical and vendor-neutral, fully interoperable solution to archiving email and other static electronic content.

Discussion

If the PDFcontainer concept interests you, let’s discuss it at PDF Days Europe this coming May!

