Archiving email into PDF containers: A Mellon Foundation project

Duff Johnson // July 10, 2019


In collaboration with the US National Archives and Records Administration (NARA), the Library of Congress and others, the PDF Association will participate in an Andrew W. Mellon Foundation project to identify the essential characteristics and optimal functional requirements of email messages and necessary related information in a PDF technology-based archive.

Running over six months, the project's objective is to publish a technical white paper defining how email messages and their identified essential characteristics and functionality should be converted into PDF containers that can be considered - in the context of captured information - provably authentic and complete email records.

Project deliverables include a published report and appendices that define the significant characteristics of email required to meet the needs of the email archiving community. In addition, the report will lay out use cases for email-to-PDF software, while providing recommendations that vendors can use to build such archiving capabilities into email clients or third-party tools.

Background

From 2016 to 2018, the Andrew W. Mellon Foundation and the Digital Preservation Coalition supported the Task Force on Technical Approaches for Email Archives, which released a report of its findings: The Future of Email Archives (PDF). Published by the Council on Library and Information Resources (CLIR) in August 2018, this report provides a detailed analysis of the technical challenges of preserving email and sets out a working agenda for the community to improve and refine the technical framework for email archiving, including developing interoperable toolkits to fill the identified gaps.

One of the gaps identified, and a goal shared by many other email archiving projects, is the need for a format or formats appropriate for storing email for long-term preservation. One strong contender identified in the Email Task Force report is PDF, and specifically the PDF/A subset.

Email in archival records

The traditional “print and file” approach to email archiving is cumbersome. As practical experience and digital preservation have advanced, the traditional method is now recognized as destructive: it loses contextual information such as metadata, changes the look and feel of email and the associated user experience, and dissociates messages from their attachments.

As validated by the U.S. Government's Managing Government Records Directive (PDF), which required that by 2016 federal agencies maintain all email records in an accessible electronic format, the records management and archival communities have long recognized that “print and file” is no longer an acceptable records management approach for email.

As part of its support for this directive, and in furtherance of its own mission to bolster the continued integrity of electronic records throughout the federal government, the National Archives and Records Administration (NARA) issues formal guidance to federal agencies defining the file formats they may use when transferring electronic records for permanent retention. While this guidance is directed at federal agencies, in practice it is widely adopted by the archival and records management communities at large.

Maintaining email for archival purposes using often unfamiliar and/or proprietary formats is a daunting task, so many organizations choose the safe and predictable PDF format, in no small part due to its similarity to the familiar "print and file" approach. With the sunset of the print and file method, saving email messages out of their native applications as PDF files provides one potential preservation pathway where records remain in a familiar format that can be managed in an electronic record-keeping system. Additionally, some legacy email systems and systems supporting email encryption only support exporting messages as PDF files. For these reasons, as well as other enabling features of the format, "saving as PDF" remains a compelling option for many organizations and archives.

Challenges

Since PDF is designed to replicate the look of paper documents, some of the same problems encountered when converting email messages to paper are evident in conversions to PDF. This grant seeks to begin addressing those problems.

Ironically, it is the flexibility of PDF that makes preserving email messages challenging. In the absence of an industry-supported profile of PDF for the purpose of archiving email there are simply too many ways to store and associate the various components of email in PDF format documents. Moreover, the lack of such a profile inhibits development of end user applications for interacting with such archived email.

Current tools vary widely in how they handle the archivally-significant properties of email; none address them in a manner that is considered fully “archival”, according to Principal Investigator Chris Prom. Email messages converted to PDF files exhibit the following known issues:

  • Email components such as headers, message bodies, and attachments may not be distinguishable.
  • Address fields may be inconsistently populated with either an address or an alias.
  • BCC recipients may not be displayed.
  • Attachments are not dealt with consistently.
  • Messages do not retain conversation threads unless the text is quoted or nested within a single message, which is then 'printed' to a PDF file. A response is easily disassociated from its original context.
  • Aliases are commonly used by mailing lists, but they aren’t always tied to a verified email address.
  • PDFs don’t indicate if the email was read or unread.
  • User classification systems (such as folders or tags) may not be rendered or recorded.
  • Hyperlinks may not be rendered properly and hyperlinked content is not included in the PDF.
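
To make several of these issues concrete, the following is a minimal sketch, using only Python's standard `email` package, of how a conversion tool might enumerate message properties before writing a PDF. It is not part of the project or of any proposed profile; the sample message, the `summarize_message` function and the dictionary keys are all hypothetical. It shows that addresses and their display names (aliases), BCC recipients, threading headers and attachments are distinct, machine-readable parts of a message that a flat "print" rendering may merge or drop.

```python
from email import message_from_string
from email.policy import default
from email.utils import getaddresses

# Hypothetical sample message, invented for illustration only.
SAMPLE = """\
From: Alice Example <alice@example.org>
To: "Records Team" <records@example.org>
Bcc: auditor@example.org
Subject: Re: Retention schedule
Message-ID: <reply-1@example.org>
In-Reply-To: <original-1@example.org>
References: <original-1@example.org>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="BOUNDARY"

--BOUNDARY
Content-Type: text/plain; charset="utf-8"

Schedule attached.
--BOUNDARY
Content-Type: text/csv; charset="utf-8"
Content-Disposition: attachment; filename="schedule.csv"

series,years
email,7
--BOUNDARY--
"""


def summarize_message(raw):
    """Return a dict of message properties (hypothetical helper)."""
    msg = message_from_string(raw, policy=default)

    def addresses(header):
        # Each entry is a (display name, address) pair, so an alias and
        # the address behind it are kept apart rather than merged.
        values = [str(v) for v in msg.get_all(header, [])]
        return getaddresses(values)

    def text(header):
        value = msg.get(header)
        return str(value) if value is not None else None

    body = msg.get_body(preferencelist=("plain",))
    return {
        "from": addresses("From"),
        "to": addresses("To"),
        "bcc": addresses("Bcc"),  # present only in copies that retain it
        "subject": text("Subject"),
        "message_id": text("Message-ID"),
        "in_reply_to": text("In-Reply-To"),  # links a reply to its thread
        "references": text("References"),
        "body": body.get_content().strip() if body is not None else None,
        "attachments": [p.get_filename() for p in msg.iter_attachments()],
    }


if __name__ == "__main__":
    for key, value in summarize_message(SAMPLE).items():
        print(f"{key}: {value}")
```

Read/unread status and folder or tag assignments do not appear in this sketch because they are typically held by the mail store or client rather than in the message itself, which is one reason they are easily lost in conversion.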

Possibilities

In spite of these issues, inconsistencies, and complexities, PDF is a viable option for email archiving:

  • PDF is widely adopted and familiar
  • Many email clients include PDF as a native export option
  • Many repositories already have functionality to preserve PDF, so PDF’d email will integrate more easily with existing systems than other email-specific formats
  • PDF includes rich metadata features
  • PDF includes rich semantic features to ensure accessibility and enable content reuse
  • PDF includes a broadly implemented model for attachments
  • PDF is readily and reliably redacted (to remove sensitive information), and even includes features specifically to enable redaction workflows
  • PDF includes a proven archival model (PDF/A)
  • PDF is an open, ISO-standardized technology supported by a broad ecosystem of developers

A profile of PDF for archiving email

PDF could be a powerful solution for archiving email, but developing the profile of PDF needed to meet archival requirements will require significant work. This Andrew W. Mellon Foundation-funded project assembles a team to draft and publish a report detailing the specific properties of email that are key to archiving in the PDF context, along with use cases for converting archival email to PDF and recommendations to the vendor community regarding development of the profile during a planned second phase of the project.

The project is led by its Principal Investigator (PI), Christopher Prom, Professor and Dean for Digital Strategies at the University of Illinois at Urbana-Champaign. The team consists of a wide variety of government, academic and industry experts, including members and representatives of the PDF Association.

Expected outcomes

The project's intent is to provide building blocks for interoperable technical solutions for email archiving. Once the archival needs are defined in a practical way, systems builders can use them as functional requirements, leading to consistent email packages across multiple vendors and platforms.

The primary outcome from the current project will be a technical white paper identifying the significant properties of email and making them actionable in a transformation to PDF. A secondary outcome is direct engagement between the email and PDF communities. By establishing working groups and formal connections between PDF vendors and members of the email community in the academic and government sphere, the goal is to serve real needs for advanced email archiving with simplified pathways to help users more easily find their way to best practice in record-keeping.

The work done during this project will aid a planned follow-up project in which both PDF and email industry experts will develop:

  • A profile of PDF designed for email archiving (“Archival PDF profile for email”)
  • A best practice guide to producing PDF files leveraging the profile
  • Technical Notes or Application Notes necessary or desirable as supplements to the above

Want to learn more?

Members of the PDF Association and others interested in learning more about the organization's involvement with this project should contact Executive Director Duff Johnson.


ABOUT THE AUTHORS

Duff Johnson

Duff serves the PDF industry as ISO Project co-Leader and US TAG chair for both ISO 32000 (the PDF specification) and ISO 14289 (PDF/UA). As Executive Director of the PDF Association, Duff coordinates several working groups, speaks at a wide variety of industry events and promotes the advancement and adoption of PDF technology worldwide. An independent consultant, Duff Johnson is a veteran …

© 2019 Association for Digital Document Standards e.V.