PDF/A-1a is the higher of the two conformance levels for PDF/A. This article explains that the “a” stands for “accessible”, and provides an overview of the end-user, business, regulatory and operational significance of conformance level “a”. Finally, we introduce PDF/UA, the forthcoming International Standard for accessible PDF.
The role of accessibility in PDF/A
In HTML, accessibility is simple. The content and the logical structure organizing paragraphs and headings and images into a document are seamless.
PDF is a different world; a world of objects, coordinate references, dictionaries and content streams. Deep within the core technology of PDF, the characters, words and paragraphs and pages so clearly evident to the visually-oriented reader have no logical connection to each other at all.
PDF was originally designed to provide multiplatform fidelity on screen and in print, where the only objective was painting a picture. There’s no such thing as a “paragraph” or even a “word”. Runs of text are known as “TJ operators”. TJs appear in a sequence suiting the software that produced the PDF. Don’t confuse words with TJ operators; that’s like confusing a sentence with the movements of the print-head used to physically print that sentence.
This may seem too obvious for words, but it is not. The order in which content streams occur in the PDF file is commonly referred to as the “reading order”, a term that is something of a misnomer. In this context, “reading order” actually refers to the order in which a computer reads the file’s contents. Humans, by contrast, read in “logical order”. The two often appear similar but should never be confused with each other. Unfortunately, many developers have interpreted typesetting arrangements as equivalent to logical order, with disastrous results.
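To make the distinction concrete, here is a minimal sketch in Python. The `TextRun` class and the sample data are hypothetical, invented purely for illustration; real content streams are far more involved. The sketch models a two-column page that the producing software painted row by row across both columns, so the order in the file interleaves the columns.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    text: str
    stream_index: int   # position in the content stream (what the machine reads)
    logical_index: int  # position a human reader expects (what tags express)


# Hypothetical two-column page, painted row by row across both columns.
runs = [
    TextRun("Column one, line one.", stream_index=0, logical_index=0),
    TextRun("Column two, line one.", stream_index=1, logical_index=2),
    TextRun("Column one, line two.", stream_index=2, logical_index=1),
    TextRun("Column two, line two.", stream_index=3, logical_index=3),
]


def in_stream_order(runs):
    """Text as software sees it when it ignores logical structure."""
    return [r.text for r in sorted(runs, key=lambda r: r.stream_index)]


def in_logical_order(runs):
    """Text as a person would read it: column one first, then column two."""
    return [r.text for r in sorted(runs, key=lambda r: r.logical_index)]
```

Reading the runs in stream order jumps from column one to column two after every line, while the logical order finishes column one before starting column two. Both orders paint exactly the same page.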
As of 1999, PDFs could be made accessible through “tags” – the addition of logical ordering structures (headings, lists, tables, footnotes, form fields, etc.) to document content (text, images).
Tagged PDF makes PDF/A-1a possible, because tags are the mechanism for expressing logical document-structuring concepts in PDF files. Since tags organize the non-visual means of accessing content on the page, correct tagging is essential to the intent of PDF/A-1a.
Given the history of PDF and the way most PDFs are built, achieving logical in addition to visual reproducibility is a substantial challenge. Requiring a reproducible visual appearance over the long-term is profoundly different from requiring that the same document’s contents be accessible. The two conformance levels of PDF/A exist to allow for both.
Who needs accessibility?
Conventionally, the typical consumer of assistive technology (AT) is a blind person equipped with a computerized braille reader or “screen reader” software. Their chosen AT device provides text-to-speech, keyboard interaction or other features to make computers usable to those without sight. There are many disabilities, however, and a correspondingly wide variety of assistive technology devices, both software and hardware, are available to enable disabled individuals to read and interact with web-pages, forms and electronic documents.
Governments are increasingly requiring their agencies and contractors to deliver accessible products and services. From websites to forms, regulations, product manuals and reports, documents in the US Federal government must comply with Section 508 accessibility regulations, in effect since 2001. Several state governments have similar laws, as do governments in Canada, various EU member states, Australia and elsewhere. A number of organizations, including the retailer Target, have faced legal action and paid significant monetary damages for failing to provide equal access to content.
That said, accessibility isn’t just about the needs of disabled users. Human beings are not the only “consumers” of electronic content; search and indexing engines are also “readers” of PDF files. There are several conventional business and operational reasons to ensure PDF files are tagged to high standards and thus achieve meaningful PDF/A-1a compliance.
Accessibility benefits every user
Blind users are prominent in calling for content accessibility; but the technology that makes documents readable by blind users is directly applicable to the mainstream business needs of civil servants, attorneys, archivists and others considering PDF/A. Properly tagged PDF files offer a series of functional effects with significant benefits for users of archival material.
After all, visually reproducible pages are obviously critical, but if you can’t find the document in the first place because it’s not tagged correctly, reproducibility becomes somewhat moot.
The key advantages of accessibility for the institutional or business archivist are:
Searchability, because logical ordering of content ensures that words and phrases are made available to the search engine irrespective of page position, print order, or other, non-semantic factors. Additionally, well-tagged PDFs include alternate text for each semantically significant image, providing additional content to search engines.
Search Engine Optimization (SEO), because tagging-aware search engines understand the logical structure elements (such as headings) in tagged PDF and can use them in their metrics.
Content extraction (assuming your preferred PDF viewer is aware of PDF tags) is enhanced at two levels. First and foremost, proper tagging ensures that text is selected and extracted in the correct logical order. It’s not OK to have page header text interrupting a sentence, or to mix up columns in a multiple-column document. Secondly, proper tagging ensures that complex logical structures such as tables may be exported to spreadsheets without error, while document text may be exported with key structural information such as headings and lists intact.
Of course, to gain the benefits of tagged PDF, your PDF software must process PDF tags!
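As a rough illustration of the table-export benefit, the following Python sketch writes a tagged table out as CSV. The `table` data and the `table_to_csv` helper are hypothetical; the sketch assumes the rows and cells have already been read from the tag tree by tag-aware software, which is the hard part and is not shown here.

```python
import csv
import io

# Hypothetical tagged table: a list of rows (<TR>), each a list of
# (tag, text) cells where the tag is "TH" or "TD".
table = [
    [("TH", "Year"), ("TH", "Revenue")],
    [("TD", "2009"), ("TD", "1200")],
    [("TD", "2010"), ("TD", "1450")],
]


def table_to_csv(rows):
    """Write one CSV line per <TR>, keeping cells in their tagged order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([text for _tag, text in row])
    return buffer.getvalue()
```

Because the rows and cells come from the tags rather than from positions on the page, the export does not have to guess where one cell ends and the next begins.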
Why PDF/A-1a is insufficient
In a PDF, just as in HTML, you must use as many tags as are required to correctly convey the logical structure of the content. Each paragraph, for example, needs a <P> tag. Headings get tags such as <H1> and <H2>, while lists consist of <LI> tags nested within an <L> tag. Tables (minimally) consist of <TH> and <TD> elements grouped into a set of <TR> tags, themselves contained within a <Table> tag. There are many other such rules for tags, tag attributes, artifacts, images, languages, fonts and so on.
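The nesting rules mentioned above can be sketched as a small check in Python. This is an illustration only: the `Tag` class, the `ALLOWED_CHILDREN` table and the `structure_errors` function are hypothetical names, and the rules cover just the minimal list and table nesting described in this article, not the full set of structure types that real tagged PDF permits.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    name: str
    children: list = field(default_factory=list)


# Minimal nesting rules taken from the paragraph above (an assumption,
# deliberately incomplete): parent tag -> child tags it may contain.
ALLOWED_CHILDREN = {
    "L": {"LI"},
    "Table": {"TR"},
    "TR": {"TH", "TD"},
}


def structure_errors(tag, path=""):
    """Walk the tag tree and report children that break the rules."""
    here = f"{path}/{tag.name}"
    errors = []
    allowed = ALLOWED_CHILDREN.get(tag.name)
    for child in tag.children:
        if allowed is not None and child.name not in allowed:
            errors.append(f"{here}: unexpected child <{child.name}>")
        errors.extend(structure_errors(child, here))
    return errors


good_table = Tag("Table", [
    Tag("TR", [Tag("TH"), Tag("TH")]),
    Tag("TR", [Tag("TD"), Tag("TD")]),
])
bad_table = Tag("Table", [Tag("TD")])  # cell placed directly under <Table>
```

Here `structure_errors(good_table)` returns an empty list, while `structure_errors(bad_table)` reports the stray <TD>. A check like this says nothing about whether the table's content is actually tabular; that still takes a human.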
A full description of what accessibility means for PDF files was unavailable when PDF/A was first developed between 2001 and 2005. For this reason, PDF/A-1a offers only the broadest outlines of what’s required for accessible PDF. Technically, it’s possible to comply with PDF/A-1a using a single tag for each page, irrespective of the document’s contents. That’s the key reason why claims of conformance or validation of PDF/A-1a are, by themselves, essentially meaningless.
What’s lacking from the standard is a technical description of PDF/A-1a’s true intent; the preservation of not only a visually reproducible document, but an accessible one as well. This description is the subject of another ISO Standard – PDF/UA – which we’ll discuss shortly.
Existing concepts of accessibility
From IBM’s GML and SGML through to HTML and XML, the need to mark up text with structure has led a steady march towards a more or less universally comprehensible, and thus accessible, set of concepts.
Large-scale authoring of structured content began with the birth of the Internet and the associated explosion in the use of HTML. NIMAS and DAISY provided important options for published materials, but not all content is formally published. To establish accessibility guidelines and to provide a baseline standard for consistent delivery of logical structure in web pages, the W3C’s Web Accessibility Initiative published the first Web Content Accessibility Guidelines (WCAG) 1.0 in 1999. WCAG 1.0 was replaced in 2008 by a far more advanced, less HTML-specific document, WCAG 2.0.
The Federal regulations known as Section 508 have been in force since 2001. More recently, compliance has improved across most Federal agencies, with new websites and documents undergoing at least cursory examination for Section 508 compliance. While large volumes of content remain unvalidated, the trend is for new documents to be either created accessible or made accessible prior to release.
In late 2004, while PDF/A was preparing for its debut as ISO 19005-1, the industry’s main standards development organization was gearing up an ambitious effort to produce an international standard for PDF accessibility: PDF/UA.
Since PDF is a format for any document, not just published content, NIMAS and DAISY are fundamentally inapplicable. WCAG 1.0 was specific to HTML, and Section 508 leaves much to be desired. WCAG 2.0, while generally technology agnostic, doesn’t specify technical requirements for accessible PDF files. Just as with PDF/X, PDF/A and then PDF/E, a new PDF standard was required to describe accessible PDF in technically complete terms.
Recognizing this need, AIIM, the ANSI-accredited organization leading electronic document standards development and education in the US, initiated the PDF/UA (Universal Accessibility) standards committee in 2004. The objective of PDF/UA: to set clear normative standards for developers seeking to create, manipulate or read accessible PDF files.
In 2009, PDF/UA became ISO/AWI 14289, a candidate International Standard. As of August, 2010, the document is a Committee Draft, with hopes to publish in 2011. Alongside the Standard itself, the Committee plans to publish an authoritative Developer’s Guide to PDF/UA, explaining core concepts for software developers, as well as Best Practices for PDF/UA, a guide to tagging PDF files for end-users.
Creating PDF/A-1a (accessible) PDFs
The key thing to understand is that a really good PDF/A-1a file is one that also complies with PDF/UA.
Creating accessible PDF directly from an authoring application is possible, but first and foremost requires the PDF creation software to be capable of generating PDF tags. A wide variety of applications can create tagged PDF, including Adobe Acrobat’s plugin for Microsoft Word, Adobe’s InDesign and FrameMaker, and free applications such as Open Office.
However, it’s not enough to simply use the right software and push the right buttons. Tags must correctly represent the logical structure of the document. Ensuring tags are correctly applied requires strict guidelines governing document authoring, layout and production. Styles must be appropriately named and/or role-mapped, and then employed consistently and correctly. Table structure must be well-considered and implemented; images need alternate text; heading tags should descend from H1 to H2 and H3 without skipping, and so on.
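The heading rule is one of the few requirements here that lends itself to a mechanical check. The sketch below is a hypothetical helper, not part of any PDF tool: it assumes the heading levels have already been collected from the tag tree in logical order as a plain list of integers (1 for H1, 2 for H2 and so on).

```python
def skipped_headings(levels):
    """Return (position, previous_level, level) for each heading that
    descends by more than one level at a time."""
    problems = []
    previous = 0  # no heading seen yet, so the first heading should be H1
    for position, level in enumerate(levels):
        if level > previous + 1:
            problems.append((position, previous, level))
        previous = level
    return problems


well_formed = [1, 2, 3, 2, 3]  # H1, H2, H3, back up to H2, then H3
skips = [1, 3, 2, 4]           # H1 straight to H3, later H2 straight to H4
```

Moving back up the hierarchy (H3 to H2, for instance) is fine; only downward jumps are flagged. For the `skips` list, the helper flags the second and fourth headings.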
Manual validation work may be minimized or eliminated through authoring practices that are sensitive to accessibility requirements. Absent careful, accessibility-oriented authoring, alternate text for images, complex layouts, tables and forms will require human validation into the foreseeable future.
In principle, structured documents are just better. Teach the authors how to write documents with an eye for the concerns of accessibility and the problem is solved in the most cost-effective possible way.
Practices to avoid:
- Using color or contrast alone to indicate meaning
- Using design to convey meaning in a way that can’t be expressed through the document’s text
- Spanning table cells, and complex table structures in general
- Table tags without tabular data
- Illustrations composed of many small vector graphics
- Overlapping elements
- Background images
Practices to adopt:
- Prefer simpler layouts
- Address the need for alternate text for graphics early in the authoring process
Making untagged PDFs accessible
PDFs can be created from any software that can print, one important reason why PDF is so successful. However, the ease of PDF creation poses a special challenge in terms of accessibility because today, most PDF creation software can’t create a tagged PDF.
For this reason, most PDFs are untagged, and most tagged PDFs are unvalidated. If you’re trying to achieve high quality PDF/A-1a conforming files from existing untagged PDFs, the only mainstream software currently capable of editing tags in a PDF file is Adobe’s Acrobat Professional.
Acrobat includes automation triggered by the “Add Tags” function in the Advanced → Accessibility menu (Acrobat Professional 9). This feature scans the PDF and builds a tag tree for the document. You can get lucky on the simplest files, but on more complex content, the Add Tags function invariably makes mistakes, and results must always be checked.
If the software which created the PDF was relatively well-behaved, simple documents may require very few corrections. The more complex the page layout, especially when tables, multiple columns and graphics are involved, the more difficult it is to check and correct a tagged PDF to ensure accessibility.
The ten commandments at the ‘core’ of a PDF Accessibility Best Practices Workflow are as follows:
- Identify and resolve low contrast and color-used-as-content situations
- (If a scanned document) OCR and correct the output. OCR errors are not permitted in accessible documents.
- Add hyperlinks as required, or check existing links for validity
- Run “Add Tags” in Adobe Acrobat Professional, or in other PDF tagging software
- Check and correct tag order (text and graphics tagged in correct logical order, artifacts marked)
- Check and correct heading, list, and table structures and language attributes
- Add alternative text to image tags
- Ensure file metadata is correct and the document’s language property is set
- Add bookmarks (outlines) if the document is longer than ten or so pages
- Quality control, optimize and deliver
How to Validate PDF/A-1a
Semantic content is the material of value, the significant text and graphics conveying the meaning of a document. Non-semantic content includes repeating page headers and numbers, image borders, lines separating columns and so on. These are artifacts of design and layout, and must be marked as such rather than tagged so as not to interfere with the logical flow of the document.
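A short Python sketch shows why marking artifacts matters. The `PageItem` class and the sample content are hypothetical; the sketch assumes a sentence that runs across a page break, with a page number and a running header sitting between its two halves.

```python
from dataclasses import dataclass


@dataclass
class PageItem:
    text: str
    is_artifact: bool  # True for headers, page numbers, rules and similar


# Hypothetical content around a page break, in the order it is encountered.
items = [
    PageItem("Revenue grew in the third", is_artifact=False),
    PageItem("Page 7", is_artifact=True),               # page number
    PageItem("Annual Report 2010", is_artifact=True),   # running header
    PageItem("quarter.", is_artifact=False),
]


def all_text(items):
    """What a reader gets if artifacts are not marked as such."""
    return " ".join(item.text for item in items)


def logical_text(items):
    """What a reader gets when artifacts are marked and skipped."""
    return " ".join(item.text for item in items if not item.is_artifact)
```

With the artifacts skipped, the sentence reads “Revenue grew in the third quarter.” Without the marking, the page number and header land in the middle of it.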
The three basic questions when validating for PDF/A-1a conformance are:
- Is all semantic content tagged in correct logical order?
- Are the headings, lists, tables and other tags in the tag tree correctly structured?
- Is the non-semantic content marked as artifact?
The answer to all three questions must be “yes”.
At present, there is no way to automatically validate conformance with PDF/A-1a. Automated checkers are important, but can do little more than verify that tags are present, that a language is specified and that images include alternate text, along with similar limited checks. In many cases, validating the logical order, table or list structures is still a job for a human.
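The limits of automation are easy to see in a sketch. The Python below is a hypothetical checker, not the behavior of any real validation product; it assumes the document has already been reduced to a simple dictionary with a `lang` entry and a flat list of `tags`, each with a `type` and an optional `alt` text.

```python
def automated_checks(doc):
    """Run the kind of checks a machine can perform reliably."""
    findings = []
    tags = doc.get("tags", [])
    if not tags:
        findings.append("document has no tags")
    if not doc.get("lang"):
        findings.append("document language is not specified")
    for index, tag in enumerate(tags):
        if tag.get("type") == "Figure" and not tag.get("alt"):
            findings.append(f"tag {index}: Figure has no alternate text")
    return findings


# Hypothetical document: tagged, language set, but one image lacks alt text.
sample = {
    "lang": "en-US",
    "tags": [
        {"type": "H1"},
        {"type": "Figure", "alt": ""},
        {"type": "P"},
    ],
}

# Hypothetical document whose paragraph precedes its own heading. The
# checks above cannot see that the logical order is wrong.
scrambled = {
    "lang": "en-US",
    "tags": [{"type": "P"}, {"type": "H1"}],
}
```

The checker catches the missing alternate text in `sample`, yet reports nothing at all for `scrambled`, even though a human reader would spot the problem immediately.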
To ensure PDF/A-1a conformance is meaningful rather than notional, it is necessary to ensure the file’s contents are accessible. PDF/UA will provide a clear set of file-format requirements to flesh out the details of conformance with the spirit of PDF/A-1a.
A PDF/A-1a conforming file is not just visually reproducible; its content is reproducible as well. As servers world-wide become vast silos of PDF and other content, PDF/A-1a files will offer a key benchmark for the long-term preservation of document structure, ensuring high-quality search, reference and reuse for the lifetime of the file.