NOTE: Since this article was published the ISO specification for PDF/A has added new "parts". Visit the PDF/A resource page for current information on published versions of PDF/A.
PDF/A-1a is the higher of the two conformance levels for PDF/A. This article explains that the "a" stands for "accessible", and provides an overview of the end-user, business, regulatory and operational significance of conformance level a. Finally, we introduce PDF/UA, the forthcoming International Standard for accessible PDF.
In HTML, accessibility is simple. The content and the logical structure organizing paragraphs and headings and images into a document are seamless.
PDF is a different world; a world of objects, coordinate references, dictionaries and content streams. Deep within the core technology of PDF, the characters, words and paragraphs and pages so clearly evident to the visually-oriented reader have no logical connection to each other at all.
PDF was originally designed to provide multiplatform fidelity on screen and in print, where the only objective was painting a picture. There's no such thing as a paragraph or even a word. Runs of text are painted by text-showing operators, commonly known as TJ operators. TJs appear in a sequence suiting the software that produced the PDF. Don't confuse words with TJ operators; that's like confusing a sentence with the movements of the print-head used to physically print that sentence.
It may seem too obvious for words. It's not. "Reading order", the term commonly used for the order in which content streams occur in the PDF file, is something of a misnomer. In this context, reading order actually refers to the order in which a computer reads the file's contents. Humans, by contrast, read in logical order. The two often appear similar but should never be confused with each other. Unfortunately, many developers have interpreted typesetting arrangements as equivalent to logical order, with disastrous results.
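To make the distinction concrete, here is a minimal sketch in Python. It is not a PDF parser: the `TextRun` class, the coordinates and the `logical_index` field are all hypothetical stand-ins, assuming a two-column page whose producing software painted text row by row across both columns.

```python
from dataclasses import dataclass

@dataclass
class TextRun:
    text: str           # characters painted by one text-showing operation
    x: float            # horizontal page position (made-up units)
    y: float            # vertical page position (made-up units)
    logical_index: int  # position in logical order, as tags would supply it

# Runs in the order the (hypothetical) producer wrote them to the file.
paint_order = [
    TextRun("Column one, line one.", 50, 700, 0),
    TextRun("Column two, line one.", 320, 700, 2),
    TextRun("Column one, line two.", 50, 680, 1),
    TextRun("Column two, line two.", 320, 680, 3),
]

def stream_order_text(runs):
    """Text as a tag-unaware program sees it: plain file order."""
    return " ".join(run.text for run in runs)

def logical_order_text(runs):
    """Text as a human reads it: ordered by the logical structure."""
    ordered = sorted(runs, key=lambda run: run.logical_index)
    return " ".join(run.text for run in ordered)

print(stream_order_text(paint_order))
print(logical_order_text(paint_order))
```

The first line printed interleaves the two columns; the second keeps each column together. Nothing on the visible page distinguishes the two cases, which is why the file order cannot be trusted as a logical order.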
As of 1999, PDFs could be made accessible through tags: the addition of logical ordering structures (headings, lists, tables, footnotes, form fields, etc.) to document content (text, images).
Tagged PDF makes PDF/A-1a possible, because tags are the mechanism for expressing logical document-structuring concepts in PDF files. Since tags organize the non-visual means of accessing content on the page, correct tagging is essential to the intent of PDF/A-1a.
Given the history of PDF and the way most PDFs are built, achieving logical in addition to visual reproducibility is a substantial challenge. Requiring a reproducible visual appearance over the long term is profoundly different from requiring that the same document's contents be accessible. The two conformance levels of PDF/A exist to allow for both.
Conventionally, the typical consumer of assistive technology (AT) is a blind person equipped with a computerized braille reader or screen reader software. Their chosen AT device provides text-to-speech, keyboard interaction or other features to make computers usable to those without sight. There are many disabilities, however, and a correspondingly wide variety of assistive technology devices, both software and hardware, are available to enable disabled individuals to read and interact with web-pages, forms and electronic documents.
Governments are increasingly requiring their agencies and contractors to deliver accessible products and services. From websites to forms, regulations, product manuals and reports, documents in the US Federal government must comply with Section 508 accessibility regulations, in effect since 2001. Several state governments have similar laws, as do governments in Canada, various EU member states, Australia and elsewhere. A number of organizations, including the retailer Target, have been found liable, with significant monetary damages, for their failure to provide equal access to content.
That said, accessibility isn't just about the needs of disabled users. Human beings are not the only consumers of electronic content; search and indexing engines are also readers of PDF files. There are several conventional business and operational reasons to ensure PDF files are tagged to high standards and thus achieve meaningful PDF/A-1a compliance.
Blind users are prominent in calling for content accessibility, but the technology that makes documents readable by blind users is directly applicable to the mainstream business needs of civil servants, attorneys, archivists and others considering PDF/A. Properly tagged PDF files offer a series of functional effects with significant benefits for users of archival material.
After all, while visually reproducible pages are obviously critical, if you can't find the document in the first place because it's not tagged correctly, reproducibility becomes somewhat moot.
The key advantages of accessibility for the institutional or business archivist are:
Searchability, because logical ordering of content ensures that words and phrases are made available to the search engine irrespective of page position, print order, or other, non-semantic factors. Additionally, well-tagged PDFs include alternate text for each semantically significant image, providing additional content to search engines.
Search Engine Optimization (SEO), because tagging-aware search engines understand the logical structure elements (such as headings) in tagged PDF and can use them in their metrics.
Content extraction (assuming your preferred PDF viewer is aware of PDF tags) is enhanced at two levels. First and foremost, proper tagging ensures that text is selected and extracted in the correct logical order. It's not OK to have page header text interrupting a sentence, or to mix up columns in a multiple-column document. Secondly, proper tagging ensures that complex logical structures such as tables may be exported to spreadsheets without error, while document text may be exported with key structural information such as headings and lists intact.
Of course, to gain the benefits of tagged PDF, your PDF software must process PDF tags!
In a PDF, just like in HTML, you have to use as many tags as are required to correctly convey the logical structure of the content. Each paragraph, for example, needs a <P> tag. Headings get tags such as <H1> and <H2>, while lists consist of a group of <LI> tags nested within an <L> tag. Tables (minimally) consist of a collection of <TH> and <TD> elements grouped into a set of <TR> tags, themselves contained within a <Table> tag. There are many other such rules for tags, tag attributes, artifacts, images, languages, fonts and so on.
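The nesting rules just described can be sketched as a small check in Python. This is an illustration only: the tag tree is modeled as hypothetical `(tag, children)` tuples rather than read from a real PDF, and the `REQUIRED_PARENT` table covers only the minimal list and table rules mentioned above, not the full set of rules for tagged PDF.

```python
# Minimal parent rules taken from the paragraph above (illustrative subset).
REQUIRED_PARENT = {
    "LI": {"L"},       # list items live inside a list
    "TR": {"Table"},   # rows live inside a table (minimal case)
    "TH": {"TR"},      # header cells live inside a row
    "TD": {"TR"},      # data cells live inside a row
}

def nesting_errors(node, parent=None, path=""):
    """Walk a (tag, children) tree and report tags found under the wrong parent."""
    tag, children = node
    here = f"{path}/{tag}"
    errors = []
    allowed = REQUIRED_PARENT.get(tag)
    if allowed is not None and parent not in allowed:
        errors.append(f"{here}: <{tag}> found inside {parent}, expected {sorted(allowed)}")
    for child in children:
        errors.extend(nesting_errors(child, tag, here))
    return errors

# A well-formed sample tree and a malformed one (both made up).
good_tree = ("Document", [
    ("H1", []),
    ("P", []),
    ("L", [("LI", []), ("LI", [])]),
    ("Table", [
        ("TR", [("TH", []), ("TH", [])]),
        ("TR", [("TD", []), ("TD", [])]),
    ]),
])
bad_tree = ("Document", [("LI", []), ("Table", [("TD", [])])])

print(nesting_errors(good_tree))  # no problems reported
for problem in nesting_errors(bad_tree):
    print(problem)
```

A check like this only confirms that tags are nested plausibly. It says nothing about whether a given <P> really is a paragraph, which is a question of meaning rather than syntax.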
A full description of what accessibility meant for PDF files was unavailable when PDF/A was first developed between 2001 and 2005. For this reason, PDF/A-1a offers only the broadest outlines of what's required to make a PDF file fully accessible. Technically, it's possible to comply with PDF/A-1a with a single tag for each page, irrespective of the document's contents. That's the key reason why claims of conformance or validation of PDF/A-1a are, by themselves, essentially meaningless.
What's lacking is a technical description of PDF/A-1a's true intent: the preservation of not only a visually reproducible document, but an accessible one as well. This description is the subject of another ISO standard, PDF/UA, which we'll discuss shortly.
Accessibility of electronic content is not a concept invented for PDF alone. From IBM's GML and SGML through to HTML and XML the need to mark up text with structure has led a steady march towards a more or less universally comprehensible, and thus accessible, set of concepts.
Large-scale authoring of structured content began with the birth of the internet and the associated explosion in the use of HTML. NIMAS and DAISY provided important options for published materials, but not all content is formally published. To establish accessibility guidelines and to provide a baseline standard for consistent delivery of logical structure in web pages, in 1999 the W3C's Web Accessibility Initiative published the first Web Content Accessibility Guidelines (WCAG) 1.0. WCAG 1.0 was replaced in 2008 by a far more advanced, less HTML-specific document, WCAG 2.0.
The Federal regulations known as Section 508 have been in force since 2001. More recently, compliance has improved across most Federal agencies, with new websites and documents undergoing at least cursory examination for Section 508 compliance. While large volumes of content remain unvalidated, the trend is for new documents to be either created accessible or made accessible prior to release.
In late 2004, while PDF/A was preparing for its debut as ISO 19005-1, the industry's main standards development organization was gearing up an ambitious effort to produce an international standard for PDF accessibility: PDF/UA.
Since PDF is a format for arbitrary documents, not just published content, NIMAS and DAISY are fundamentally inapplicable. WCAG 1.0 was specific to HTML, and Section 508 is general and vague and leaves much to be desired. WCAG 2.0, while generally technology agnostic, doesn't specify technical requirements for accessible PDF files. Just as with PDF/X, PDF/A and then PDF/E, a new PDF standard was required to describe accessible PDF in technically complete terms.
Recognizing this need, AIIM, the ANSI-accredited organization leading electronic document standards development and education in the US, initiated the PDF/UA (Universal Accessibility) standards committee in 2004. The objective of PDF/UA: to set clear normative standards for developers seeking to create, manipulate or read accessible PDF files.
In 2009, PDF/UA became ISO/AWI 14289, a candidate International Standard. As of August 2010, the document is a Committee Draft, with hopes to publish in 2011. Alongside the Standard itself, the Committee plans to publish an authoritative Developer's Guide to PDF/UA, explaining core concepts for software developers, as well as Best Practices for PDF/UA, a guide to tagging PDF files for end-users.
The key thing to understand is that a really good PDF/A-1a file is one that also complies with PDF/UA.
Creating accessible PDF automatically, directly from an authoring application, is possible, but first and foremost requires the PDF creation software to be capable of generating PDF tags. A wide variety of applications, from Adobe Acrobat's plugin to Microsoft Word to Adobe's InDesign and FrameMaker, as well as free applications such as Open Office, can create tagged PDF.
However, it's not enough to simply use the right software and push the right buttons. Tags must correctly represent the logical structure of the document. Ensuring tags are correctly applied requires strict guidelines governing document authoring, layout and production. Styles must be appropriately named and/or role-mapped, and then employed consistently and correctly. Table structure must be well-considered and implemented; images need alternate text; heading tags should descend from H1 to H2 and H3 without skipping, and so on.
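One of these guidelines, that heading levels descend without skipping, is simple enough to check mechanically. The following Python sketch assumes the document's tags have already been flattened into a list of tag names in logical order; the function name `skipped_headings` and the sample lists are hypothetical.

```python
import re

def skipped_headings(tags):
    """Return (index, previous_level, level) for every heading that skips a level.

    A document is expected to start at H1, so the level before the first
    heading is treated as 0.
    """
    problems = []
    previous = 0
    for index, tag in enumerate(tags):
        match = re.fullmatch(r"H([1-6])", tag)
        if not match:
            continue  # not a numbered heading tag
        level = int(match.group(1))
        if level > previous + 1:
            problems.append((index, previous, level))
        previous = level
    return problems

well_ordered = ["H1", "P", "H2", "H3", "P", "H2", "P"]
skips_a_level = ["H1", "P", "H3", "P"]

print(skipped_headings(well_ordered))   # []
print(skipped_headings(skips_a_level))  # [(2, 1, 3)]
```

Moving back up the hierarchy (from H3 to H2, say) is fine; only a jump downward past a level is reported.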
Manual validation work may be minimized or eliminated through authoring practices that are sensitive to accessibility requirements. Absent careful, accessibility-oriented authoring, alternate text for images, complex layouts, tables and forms will require human validation for the foreseeable future.
In principle, structured documents are just better. Teach the authors how to write documents with an eye for the concerns of accessibility and the problem is solved in the most cost-effective possible way.
PDFs can be created from any software that can print, an overwhelmingly important reason why PDF is so successful. However, the ease of PDF creation poses a special challenge in terms of accessibility because today, most PDF creation software can't create a tagged PDF.
For this reason, most PDFs are untagged, and most tagged PDFs are unvalidated. If you're trying to achieve high quality PDF/A-1a conforming files from existing untagged PDFs, the only mainstream software currently capable of editing tags in a PDF file is Adobe's Acrobat Professional.
Adobe's Acrobat includes automation triggered by the “Add Tags” function in the Advanced → Accessibility menu (Acrobat Professional 9). This feature scans the PDF and builds a tag tree for the document. You can get lucky on the simplest files, but the Add Tags function invariably makes mistakes, and results must always be checked.
If the software which created the PDF was relatively well-behaved, simple documents may require very few corrections. The more complex the page layouts, especially when tables, multiple columns and graphics are involved, the more difficult it is to check and correct a tagged PDF to ensure accessibility.
The ten commandments at the 'core' of a PDF Accessibility Best Practices Workflow are as follows:
Semantic content is the material of value, the significant text and graphics conveying the meaning of a document. Non-semantic content includes repeating page headers and numbers, image borders, lines separating columns and so on. These are artifacts of the page, the design and the layout, and must be marked as artifact rather than tagged so as not to interfere with the logical flow of the document.
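The effect of marking artifacts can be sketched in a few lines of Python. The data here is made up: each page item is a hypothetical `(kind, text)` pair, where `"artifact"` stands for non-semantic page furniture and `"tagged"` for real content, and the sample shows a running header falling in the middle of a sentence that spans a page break.

```python
# Page content in file order; each item is (kind, text). Sample data only.
page_items = [
    ("tagged", "Properly tagged PDF files offer"),
    ("artifact", "Sample Report | Page 3"),  # running header at the page break
    ("tagged", "significant benefits for users."),
]

def extract_everything(items):
    """What a reader gets if artifacts are not distinguished from content."""
    return " ".join(text for _, text in items)

def extract_semantic_text(items):
    """What a tag-aware reader gets: artifacts are skipped."""
    return " ".join(text for kind, text in items if kind != "artifact")

print(extract_everything(page_items))
print(extract_semantic_text(page_items))
```

In the first result the header interrupts the sentence; in the second the sentence reads straight through, which is what a screen reader or search engine needs.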
As such the three basic questions when validating for PDF/A-1a conformance are:
The answer to all three questions must be “yes”.
At present, there is no way to automatically validate conformance with PDF/A-1a. Automated checkers are important, but can do little more than verify that tags are present, that a language is specified, that images include alternate text, and perform similar, limited checks. In many cases, validating the logical order, table or list structures is still a job for a human.
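To show how little such checks can establish, here is a rough Python sketch of the three checks just named. It does not use any real PDF library: the document is a hypothetical dictionary with a `lang` entry and a flat `tags` list, and the function name `automated_checks` is an assumption made for illustration.

```python
def automated_checks(doc):
    """Run the limited, mechanical checks an automated tool can perform."""
    findings = []
    tags = doc.get("tags") or []
    if not tags:
        findings.append("no tags present")
    if not doc.get("lang"):
        findings.append("document language not specified")
    figures = [tag for tag in tags if tag.get("type") == "Figure"]
    for number, figure in enumerate(figures):
        if not figure.get("alt"):
            findings.append(f"figure {number} has no alternate text")
    return findings

# A file with real structure, and one with a single generic tag per page.
structured = {
    "lang": "en-US",
    "tags": [{"type": "H1"}, {"type": "P"}, {"type": "Figure", "alt": "Sales chart"}],
}
one_tag_per_page = {"lang": "en-US", "tags": [{"type": "Sect"}, {"type": "Sect"}]}
missing_basics = {"tags": [{"type": "Figure"}]}

print(automated_checks(structured))        # []
print(automated_checks(one_tag_per_page))  # [] as well
print(automated_checks(missing_basics))
```

Both of the first two samples pass, even though the second has no meaningful structure at all. That is the gap a human reviewer still has to fill.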
To ensure PDF/A-1a conformance is meaningful rather than notional, it is necessary to ensure the file's contents are accessible. PDF/UA will provide a clear set of file-format requirements to flesh out the details of conformance with the spirit of PDF/A-1a.
A PDF/A-1a conforming file is not just visually reproducible, it's logically reproducible as well. As servers the world over grow into vast silos of PDF and other content, PDF/A-1a files will offer a key benchmark for the long-term preservation of document structure, ensuring high-quality search, reference and reuse for the lifetime of the file.