PDF/A Competence Center Newsletter: Issue 19

Topics include: PDF/A-2 Ratified, PDF/A Coming to U.S. Courts, D-Lib Magazine.

Table of Contents:

Overview
Current News:
PDF/A-2 Ratified
PDF/A Coming to U.S. Courts
D-Lib Magazine
Main Article: Each PDF Page is a Painting
PDF/A COMPETENCE CENTER MEMBERS PRESENT THEMSELVES:
Appligent Document Solutions
NEW MEMBERS IN THE PDF/A COMPETENCE CENTER

Duff Johnson

Dear Readers,

On November 30, 2010, the committee for ISO 19005 met in Ottawa, Canada and ratified Part 2 of ISO 19005 (PDF/A)! What is Part 2 really all about?

While PDF/A-1 is based on PDF 1.4, PDF/A-2 takes advantage of features that only became available in later versions of PDF, up to and including PDF 1.7. PDF/A-2 is no longer based on a specification published by Adobe, but on the internationally approved ISO 32000-1:2008.

PDF/A-2 adds:

  • Support for transparencies. PDF/A-1 prohibited transparencies due to the immaturity of the technology when the original PDF/A standard was written. PDF/A-2 allows document authors to use transparency without compromising archivability.
  • Support for OpenType Fonts. OpenType is itself standardized as ISO/IEC 14496-22. PDF/A-2 allows these fonts to be directly embedded without first having to convert them into PostScript Type 1 or TrueType.
  • Support for JPEG 2000. An advanced image compression format that is especially useful for scanned documents, JPEG 2000 is supported in ISO 32000-1:2008 and is thus supported in PDF/A-2.
  • Support for PDF/A Collections (“Portfolios” in Acrobat). Collections allow a single PDF to be used as a container and navigation mechanism for a set of associated PDF/A files, for example digitally signed documents.
  • Support for Optional Content. PDF/A-2 supports optional content, also known as “layers”. Optional content provides a method of grouping content for display purposes, useful for technical drawings and plans, or for multilingual documents.
  • New Conformance Level: PDF/A-2u. U is for Unicode. A new conformance level “PDF/A-2u” is a “slimmed-down” version of conformance level “a”. PDF/A-2u delivers the advantages of Unicode with respect to text searching and copying text without the logical structure requirements of the “a” conformance level.
  • Updated Annotations Support. PDF/A-2 revises ISO 19005 support for annotations. Some types are still prohibited, while others (for example, text-editing annotations) are now permitted.
  • PAdES Digital Signatures. PDF/A allows electronic signatures in order to facilitate authenticity. PDF/A-2 ensures interoperability by including provisions from the ETSI PAdES (PDF Advanced Electronic Signatures) standard, TS 102 778.

PDF/A-2 does not supersede PDF/A-1; it’s a new Part of ISO 19005 that allows new features in PDF. There is no need to convert or update existing PDF/A-1 files because a PDF/A-1 file is always a valid PDF/A-2 file.

Let’s hear it for PDF/A-2!

Duff Johnson
Vice Chairman, PDF/A Competence Center
CEO Appligent Document Solutions

CURRENT NEWS

PDF/A-2 Ratified

ISO 19005-2 was ratified on November 30, 2010 in Ottawa, Canada. The PDF/A Competence Center was represented at the table by our Chairman Olaf Drümmer and Vice-Chairman Duff Johnson.

PDF/A Coming to U.S. Courts

A docketing standard requiring PDF/A formatting is soon to be implemented for filing PDF case documents with the U.S. federal courts. Although it is not currently required, it is suggested that all interested parties begin using the PDF/A format when filing documents through their Case Management / Electronic Case Files (CM/ECF) systems.

D-Lib Magazine

PDF/A is a “Viable Addition to the Preservation Toolkit”, according to Daniel W. Noonan, Amy McCrory and Elizabeth L. Black of the Ohio State University Archives. Their article was published in the November/December issue of D-Lib, the Magazine of Digital Library Research.

MAIN ARTICLE

Each PDF Page is a Painting

Why PDF “reading order” is irrelevant to accessibility

Introduction

This article attempts to explain the concept of “reading order” in PDF files. Why is this necessary?

End users are often frustrated by inconsistent and frequently illegible results when they read PDF files on mobile devices, search for PDF content online, or use assistive technology (AT) to read.

Content authors and managers tasked with ensuring accessibility or Section 508 compliance in PDF documents often focus on objects rather than tags, thus missing the mark.

Software developers are (understandably) confused by “reading order” as presented in the current PDF Reference (ISO 32000), the technical description of PDF.

Many have come to use the term “reading order” as functionally synonymous with the logical order imposed by tags, but this interpretation is incorrect.

A technical annex is included for those who want to see what “reading order” really means in PDF.

The PDF Paintbrush

When you create a PDF, you’re painting a picture. Your paintbrush is the combination of the software used to create the source document and the software you’ve chosen to convert that source document into the universal electronic document format we all know as PDF.

Like the painter’s brushstrokes, each character, each line and each image is fundamentally independent, but they can interact with each other to produce a particular visual effect. On the PDF page, objects are connected by a coordinate system and not much else. There’s no logical, semantic connection between the letters comprising a word; characters simply happen at a series of locations on the rendered page.

As originally designed, PDF is fundamentally a system for painting objects onto a page, plus a whole lot of other features we aren’t talking about right now! There’s no innate concept of words, sentences, paragraphs, columns, headings, images, tables, lists, footnotes – any of the semantic structures that distinguish a “document” from a meaningless heap of letters, shapes and colors. PDF is fundamentally about how the document appears on the page, not how it looks when abstracted from the page.

When a PDF includes instructions to paint more than one object in the same spot (it happens all the time), the items stack on top of each other, with the last item painted appearing on the top of the stack. Unlike watercolors, each brushstroke only appears to blend with the others if one or more of them is semi-transparent.
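The stacking rule can be illustrated with a very small sketch. The character-grid "canvas" and the `paint` helper below are hypothetical names invented for this illustration; they are not part of any PDF library, and the sketch models opaque objects only.

```python
def paint(canvas, x, y, text):
    """Paint opaque characters onto the canvas, overwriting whatever is there."""
    for offset, ch in enumerate(text):
        canvas[y][x + offset] = ch


# A tiny "page": one row of twelve blank cells.
canvas = [[" "] * 12]

# Objects are painted in the order in which they occur in the file.
paint(canvas, 0, 0, "############")  # painted first, ends up underneath
paint(canvas, 2, 0, "PDF")           # painted last, ends up on top

row = "".join(canvas[0])
print(row)  # ##PDF#######
```

Swapping the two `paint` calls would hide the word entirely, which is the only sense in which painting order matters to the rendered page.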

Another example: A PDF creator may choose to paint all the Times-Roman text on the page first, then come back and paint the text that appears in other fonts. Since it’s a painting, the order doesn’t really matter, any more than it matters whether Monet painted his water lilies from left-to-right or from right-to-left, or from the inside-out, for that matter.

If we think that these objects have meaning, that’s because we impose semantics on the objects as we read. If you encounter a word that starts at the end of one column and ends at the top of the next, your mind stitches the two together without conscious thought. Likewise, if you see a line of 16 point text followed by a paragraph of 12 point text, you naturally assume the 16 point text was a heading.

OK, it’s all very well to paint a picture – but what if we want to copy and paste the text, or reflow it for display on a mobile phone? What if the “consumer” is actually a search engine trying to index the document? What if the user is blind or otherwise disabled, and requires assistive technology to read and to operate the computer?

Universal Accessibility

What does it mean to say that an electronic document is “accessible”? If a document’s contents are structured and organized such that the meaning of the document is available to every consumer, then we can say that the document is accessible.

It’s not about file format. Word, HTML, PDF, Excel, Flash… they all have capabilities and limitations as file-formats for electronic documents. In most cases, each format can be made accessible, but it never happens by accident. Accessibility requires intention, and the difficulty of achieving real accessibility tends to vary as a function of the complexity of the content.

In PDF, accessibility is assured by adding “tags” – markers that identify the correct order of objects and the semantics of the document. Tags strongly resemble the HTML tags on which they were modeled.

What’s the “correct order”? There may be more than one; after all, there’s no “correct” way to read a newspaper. The idea of “correct order” is simply that whichever order the author selects for their PDF, it must make sense. It’s not OK, for example, to mix two separate articles together simply because the columns of text are adjacent – but it’s perfectly legitimate to do so in the “reading order” (as the example in the technical annex makes clear).

Conclusion

PDF tags and PDF tags alone define the logical order of the document’s content, and thus, its accessibility. To the extent a PDF is tagged, it might be accessible. To determine whether it is in fact accessible, the tags need to be checked and, if necessary, corrected to ensure correct logical order and usage.

Users seeking to ensure their PDFs are accessible should focus on the tags. The “reading order” of the content on the PDF page just isn’t a factor in accessibility, as we demonstrate below.

Technical Annex: What “Reading Order” in PDF really means

The term “reading order” might lead one to think that it is relevant to accessibility, but it’s not, notwithstanding the confusing representation of the issue in ISO 32000-1:2008, Section 14.8.2.3.

In PDF, “reading order” refers simply to the order in which the computer reads the file. It has nothing whatsoever to do with “logical order”, the sequence people use, which is defined in PDF by tags.

Section 14.8.2.3 will be modified in a new part of ISO 32000 to clear up this confusion over the significance of reading order when re-using PDF page content for accessibility or other purposes.

You can buy an official copy of ISO 32000-1:2008 directly from ISO, or download an authorized copy for free from Adobe Systems.

Demonstration

PDF is capable of extraordinary complexity, sophistication and accuracy in rendering content. From typography to transparencies, from alpha channel to z-order, the range of possibilities in generating the file’s reading order is effectively infinite, even for the same content!

The following image represents an example of content as rendered on a PDF page. Simple though it is, this example nonetheless demonstrates how reading order and logical order are utterly distinct in a PDF file.

[Image: Quick brown fox text example]

What follows is one possible example of actual PDF code for the above text. This code has been dramatically simplified to make things as clear as possible. Note the rendered text (see the image above) as it occurs in the PDF’s “reading order”, below.

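% Flip the y-axis so that coordinates run downward from the top of a 432-point-high page
q 1 0 0 -1 0 432 cm
% Set the fill (g) and stroke (G) colors to black
0 g 0 G
% Begin a text object
BT
% Each Tm places text at the x, y given by its last two numbers; each Tj shows a string
14 0 0 -14 72 84 Tm /F1.0 1 Tf (The quick) Tj
14 0 0 -14 147.6 84 Tm (the lazy) Tj
14 0 0 -14 72 100 Tm (brown fox) Tj
14 0 0 -14 147.6 100 Tm (dog.) Tj
14 0 0 -14 72 116 Tm (jumps over) Tj
% End the text object and restore the graphics state
ET Q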

Of course, the “reading order” in this case is semantically incorrect, because the PDF creation software “painted” each line of text across the page, crossing the columns as it did so. Nonetheless, this example is 100% legitimate PDF, as per ISO 32000-1:2008.

If the example code given above included container information (not included to make the example more readable to non-developers) and tags, it would conform to the forthcoming ISO 14289-1 (PDF/Universal Accessibility), even though the “reading order” makes no sense.

If your PDF viewer cannot consume tags, you’ll get your text in the above order, i.e.: “The quick the lazy brown fox dog. jumps over”. That’s NTDE (Not The Desired Effect), as we like to say.
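To see how a tag-unaware consumer arrives at that result, here is a minimal sketch that pulls the shown strings out of the simplified stream in the order they occur. The names `CONTENT_STREAM`, `SHOW_TEXT` and `text_in_stream_order` are made up for this illustration, and the regular expression assumes strings without escaped or nested parentheses, which holds for this example but not for PDF in general.

```python
import re

# The simplified content stream from the demonstration.
CONTENT_STREAM = """\
q 1 0 0 -1 0 432 cm
0 g 0 G
BT
14 0 0 -14 72 84 Tm /F1.0 1 Tf (The quick) Tj
14 0 0 -14 147.6 84 Tm (the lazy) Tj
14 0 0 -14 72 100 Tm (brown fox) Tj
14 0 0 -14 147.6 100 Tm (dog.) Tj
14 0 0 -14 72 116 Tm (jumps over) Tj
ET Q
"""

# Matches "(string) Tj". Assumption: no escaped or nested parentheses.
SHOW_TEXT = re.compile(r"\(([^()]*)\)\s*Tj")


def text_in_stream_order(stream: str) -> str:
    """Join the shown strings in the order they occur in the stream."""
    return " ".join(SHOW_TEXT.findall(stream))


print(text_in_stream_order(CONTENT_STREAM))
# The quick the lazy brown fox dog. jumps over
```

Nothing in the sketch is wrong as far as the file is concerned; it simply has no information about the order a person would read in.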

If the PDF is correctly tagged and the viewing software supports tags for content extraction and reuse, the text will appear in correct logical order and with appropriate semantics (in this case, a simple paragraph) as follows:

The quick brown fox jumps over the lazy dog.

And that’s why we can safely and responsibly ignore reading order when considering accessibility in PDF.
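The difference tags make can be sketched in a few lines. This is an illustrative model only: the dictionaries below stand in for marked-content identifiers (MCIDs) on the page and for a structure tree with a single paragraph, and every name and value in them is an assumption made for the example rather than data from a real file or the API of a real library.

```python
# Page content keyed by marked-content ID, numbered in stream order.
# (Hypothetical values for this illustration.)
content_by_mcid = {
    0: "The quick",
    1: "the lazy",
    2: "brown fox",
    3: "dog.",
    4: "jumps over",
}

# A hypothetical structure tree: one paragraph (P) whose kids name the
# marked-content IDs in logical order, independent of stream order.
structure_tree = {
    "type": "Document",
    "kids": [{"type": "P", "kids": [0, 2, 4, 1, 3]}],
}


def logical_text(node, content):
    """Walk the structure tree depth-first and collect content in tag order."""
    parts = []
    for kid in node["kids"]:
        if isinstance(kid, int):
            parts.append(content[kid])  # leaf: a reference to page content
        else:
            parts.append(logical_text(kid, content))  # nested structure element
    return " ".join(parts)


print(logical_text(structure_tree, content_by_mcid))
# The quick brown fox jumps over the lazy dog.
```

The page content is untouched; only the tree decides the order, which is why correcting tags fixes accessibility without repainting the page.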

If you are unhappy with your results extracting content for reuse, using assistive technology, or otherwise consuming PDFs, be sure your software supports tagged PDF.

Key Takeaways

A PDF is accessible not by reference to its “reading order”, but by reference to its tags.

If the PDF has no tags, or if the tags are incorrect, that PDF is not accessible or reliably reusable.

If the creation, viewing or extraction software cannot create or use PDF tags (as appropriate), that software doesn’t support accessible PDF.

PDF/A COMPETENCE CENTER MEMBERS PRESENT THEMSELVES

Appligent Document Solutions

Appligent Document Solutions is one of the oldest and most innovative independent PDF technology companies in the world, with customers in the government, financial services, insurance, manufacturing, publishing and legal sectors, among others. The company invented PDF redaction and form-flattening and was first to market with PDF-specific server applications for forms, stamping, appending, encryption and digital signatures, all available on the leading server OS platforms. The company created the first business document service bureau for PDF files in 1996, and now offers PDF forms development, Section 508 compliance, document automation and publication imaging services, among others.

Appligent Document Solutions is a leader in the continuing development of PDF international standards, with active participation in ISO 32000 (PDF), ISO 19005 (PDF/A) and ISO/DIS 14289 (PDF/UA).

Appligent Document Solutions and PDF/A

In October 2002, Appligent Document Solutions became one of the first organizations to join the effort to create the PDF/A international standard for archiving electronic documents. Appligent’s CTO, Mark Gavin, is one of the original members of the PDF/A ISO working group, contributing heavily to the process of drafting ISO 19005-1.

Appligent Document Solutions’ involvement in the development of PDF/A continues to the present, with CEO Duff Johnson joining AIIM’s US Committee for PDF/A in 2009. Duff Johnson also chairs the US Committee for ISO/DIS 14289 (PDF/UA, or PDF/Universal Accessibility), a standards effort closely related to PDF/A and targeted for publication in 2011.

NEW MEMBERS IN THE PDF/A COMPETENCE CENTER

We welcome the following companies as members in the PDF/A Competence Center:

Coextant, Germany

