Publish smarter – the Internet Standards Series

Dietrich von Seggern // September 2, 2020

Most publishers today do not stick to a single format or communication channel. The choices include (at least) the World Wide Web (formats are HTML, CSS, JavaScript …), apps for tablets (internally using mostly PDF or proprietary page-based formats) and paper (produced using PDF). Publishing houses selling content usually use a variety of formats to reach as many (paying) customers as possible. Non-commercial publishers often focus on a single approach. In this article I want to talk about an example in which the internet community came up with a smart solution for a very specific set of digital publications: their own technology standards.

The question isn't, and never has been, whether HTML or PDF is the better format. That's like asking whether a phone call is better than an email, or whether a truck is better than a race car. In a digital world that enables rich communication, the choice of channel or format depends on the publication, and it's not always an easy question to answer.

The RFC Series is the home for internet standards, related best practices and informational documentation developed by the responsible organizations, namely the Internet Engineering Task Force (IETF), the Internet Research Task Force (IRTF) and the Internet Architecture Board (IAB), as well as through independent submissions. The collection is produced by the RFC Editor, which is no longer a person but an organization. The RFC Series had its 50th anniversary in 2019, which was celebrated in RFC 8700, a document that provides a nice overview of how the whole system has evolved.

Although HTML might seem like the obvious choice for this content, when you check out RFC 8700 you will notice that the RFC Editor has made it available in four formats: HTML, text, PDF and XML. Why is PDF included? Not because it's a "better" format, but because PDF has certain qualities that web technologies don’t.

50 years ago RFCs were rather informal documents published as individual Requests For Comments; these plain-text documents were limited to ASCII character codes. ASCII text is not the most powerful format: its character set excludes umlauts and other diacritical characters, which makes it difficult to, for example, write a specification about how to encode umlauts. In addition, graphics can only be drawn as ASCII art (see example). The RFC Editor began looking for a better solution almost as soon as the institution of the RFC Editor was created.
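To make the ASCII limitation concrete, here is a minimal Python sketch. The function names and the sample string are my own illustration and are not part of any RFC tooling; the sketch only shows which characters fall outside the 7-bit ASCII range and what is lost when text is forced into it.

```python
def ascii_problems(text: str) -> list[str]:
    """Return the characters in `text` that 7-bit ASCII cannot represent."""
    return [ch for ch in text if ord(ch) > 127]


def encode_as_ascii(text: str) -> bytes:
    """Encode to ASCII, substituting '?' for every unrepresentable character."""
    return text.encode("ascii", errors="replace")


# Illustrative sample: "Grüße aus Köln", written with escapes to be unambiguous.
sample = "Gr\u00fc\u00dfe aus K\u00f6ln"

print(ascii_problems(sample))   # the umlauts and the sharp s are reported
print(encode_as_ascii(sample))  # b'Gr??e aus K?ln'
```

The lossy result on the last line is the reason a specification that discusses such characters cannot comfortably be written in plain ASCII.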

The natural choice for publishing internet technology standards was HTML, and it is no surprise that HTML is one of the formats the RFC Editor offers today. Nonetheless, the organization has acknowledged that providing HTML alone has downsides. The documents can’t easily be downloaded, and they have no (working) pagination concept, which would be necessary for printing. Vector graphics are supported only via SVG, which limits the applications that can be used to create them. Versions and updates present challenges, and although HTML has some structure, it can’t be compared with XML in this regard.

The RFC Editor came up with what I believe is a very smart way to publish technical specifications. First of all, new documents can be downloaded in a variety of formats: HTML, TXT, XML and PDF, each with its specific advantages. Sidenote: graphics are still not as high-quality as they could be in PDF, since they are understandably only created once for all formats (see for example: https://www.rfc-editor.org/rfc/rfc8728.pdf).
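The PDF link above suggests a simple URL scheme. As a sketch, the following Python helper builds the download links for all four formats of a given RFC. It assumes that the other formats follow the same pattern as the PDF link, with only the file extension changing; the function name is hypothetical, and older RFCs may not be available in every format.

```python
# Base path taken from the PDF link quoted in the text.
BASE_URL = "https://www.rfc-editor.org/rfc"

# Assumption: each format name doubles as the file extension.
FORMATS = ("html", "txt", "xml", "pdf")


def rfc_format_urls(number: int) -> dict[str, str]:
    """Map each format name to the presumed download URL for one RFC."""
    if number < 1:
        raise ValueError("RFC numbers start at 1")
    return {fmt: f"{BASE_URL}/rfc{number}.{fmt}" for fmt in FORMATS}


for fmt, url in rfc_format_urls(8728).items():
    print(f"{fmt:>4}: {url}")
```

The helper performs no network access, so whether a particular file actually exists still has to be checked by requesting it.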

One feature that makes the current generation of RFCs stand out is the RFC Editor's choice to invest in PDF beyond its pagination and graphics support. Rather than plain PDF, the RFC Editor chose PDF/A-3u. What does that mean?

The "A" stands for "archival", as in "archival quality". Conformance level "u" establishes that all text in the document has a Unicode representation, guaranteeing searchability and text extraction. Choosing part "3" of PDF/A allows files of arbitrary formats to be embedded in the PDF.
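A PDF/A file declares its part and conformance level in its XMP metadata, using the properties `pdfaid:part` and `pdfaid:conformance`. The following Python sketch reads that declaration from raw bytes. It assumes the XMP packet is stored uncompressed and in an ASCII-compatible encoding such as UTF-8; the function names are hypothetical. It only reports what the file claims and does not validate conformance.

```python
import re
from typing import Optional


def _find(name: bytes, data: bytes) -> Optional[str]:
    """Find a pdfaid property written as an XML attribute or as an element."""
    pattern = rb"pdfaid:" + name + rb"""\s*(?:=\s*["']|>)\s*([^"'<\s]+)"""
    match = re.search(pattern, data)
    return match.group(1).decode("ascii", "replace") if match else None


def pdfa_flavour(data: bytes) -> Optional[str]:
    """Return the declared flavour, e.g. 'PDF/A-3u', or None if undeclared."""
    part = _find(b"part", data)
    if part is None:
        return None
    level = _find(b"conformance", data) or ""
    return f"PDF/A-{part}{level.lower()}"


# Hand-written XMP fragment for illustration, not taken from a real RFC.
sample = b"""<rdf:Description rdf:about=""
    xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/"
    pdfaid:part="3" pdfaid:conformance="U"/>"""

print(pdfa_flavour(sample))  # PDF/A-3u
```

To try it on a downloaded file, pass the file's bytes to `pdfa_flavour`, for example `pdfa_flavour(open("rfc8728.pdf", "rb").read())`.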

From accessibility to ZUGFeRD, there are many use cases for combining PDF/A documents with structured information, and that is the case here as well: each RFC in PDF format includes an embedded XML structure so that interested parties can easily extract structured content for use in their own data repositories.
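What "extracting structured content" can look like is sketched below in Python. The sketch assumes the embedded XML has already been pulled out of the PDF with a PDF library of your choice, and that it follows the RFC XML vocabulary with an `<rfc>` root, a `<front>` block holding `<title>` and `<author>`, and named sections under `<middle>`. The sample document and the function name are invented for illustration; real files are considerably richer.

```python
import xml.etree.ElementTree as ET

# Invented, heavily abbreviated sample; not the content of any real RFC.
SAMPLE = """<rfc number="0000" version="3">
  <front>
    <title>An Example Specification</title>
    <author fullname="Example Author"/>
  </front>
  <middle>
    <section><name>Introduction</name><t>Body text.</t></section>
  </middle>
</rfc>"""


def summarize(xml_text: str) -> dict:
    """Collect a few catalogue fields from an RFC XML document."""
    root = ET.fromstring(xml_text)
    title = root.findtext("front/title", default="").strip()
    authors = [a.get("fullname", "") for a in root.iterfind("front/author")]
    sections = [s.findtext("name", default="")
                for s in root.iterfind("middle/section")]
    return {
        "number": root.get("number"),
        "title": title,
        "authors": authors,
        "sections": sections,
    }


print(summarize(SAMPLE))
```

A record like this could be written straight into a catalogue or search index, which is much harder to do reliably from rendered pages.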

This is a very smart way to use PDF features for publishing technical specifications! In fact, it's better than what “we” (the PDF industry) do in publishing PDF standards using only PDF (and EPUB), since “our” ISO procedures do not allow us to do otherwise.

Congrats to the RFC Editor!


ABOUT THE AUTHORS

Dietrich von Seggern

Dietrich von Seggern received his degree as a printing engineer, and in 1991 started his professional career as head of desktop prepress production in a reproduction house. He became involved in research projects for digital transmission of print files, and moved to the German Newspaper Marketing Organisation (ZMG). There Dietrich was responsible for a project to enable the digital transmission of …

© 2020 Association for Digital Document Standards e.V.