PDF/A-1a is the higher of the two conformance levels for PDF/A. This article explains that the “a” stands for accessible, and provides an overview of the end-user, business, regulatory and operational significance of conformance level a. Finally, we introduce PDF/UA, the forthcoming International Standard for accessible PDF.
In HTML, accessibility is simple. The content and the logical structure organizing paragraphs and headings and images into a document are seamless.
PDF is a different world; a world of objects, coordinate references, dictionaries and content streams. Deep within the core technology of PDF, the characters, words and paragraphs and pages so clearly evident to the visually-oriented reader have no logical connection to each other at all.
PDF was originally designed to provide multiplatform fidelity on screen and in print, where the only objective was painting a picture. There's no such thing as a paragraph or even a word. Runs of text are known as TJ operators. TJs appear in a sequence suiting the software that produced the PDF. Don't confuse words with TJ operators; that's like confusing a sentence with the movements of the print-head used to physically print that sentence.
It may seem too obvious, as it were, for words. It's not. The order in which content-streams occur in the PDF file, commonly referred to as the “reading order”, is something of a misnomer. In this context, reading order actually refers to the order in which a computer reads the file's contents. Humans, by contrast, read in logical order. The two often appear similar but should never be confused with each other. Unfortunately, many developers have interpreted typesetting arrangements as equivalent to logical order, with disastrous results.
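To make the point concrete, here is a minimal Python sketch of what a single TJ operator might carry. The operand string is invented for illustration, and the small parser is an assumption that ignores the escape sequences and hex strings real content streams can contain. It shows only that the operand is a list of glyph fragments and spacing adjustments, with nothing to mark where a word begins or ends.

```python
import re

# Hypothetical operand of one TJ operator (invented for this example).
# Strings are glyph fragments; numbers are position adjustments
# expressed in thousandths of a unit of text space.
TJ_OPERAND = "[(Ac) 30 (cessi) -15 (ble PD) 40 (F)]"


def tj_fragments(operand: str) -> list[str]:
    """Return the literal string fragments of a TJ array.

    Simplified: no escape sequences, nested parentheses or hex strings.
    """
    return re.findall(r"\(([^()\\]*)\)", operand)


fragments = tj_fragments(TJ_OPERAND)
print(fragments)           # fragment boundaries follow kerning, not words
print("".join(fragments))  # the visible text, once the fragments are joined
```

The fragments break wherever the producing software adjusted spacing, so one word spans three fragments and another fragment straddles a word boundary. Software reading the stream sees only this sequence.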
As of 1999, PDFs could be made accessible through tags: the addition of logical ordering structures (headings, lists, tables, footnotes, form fields, etc.) to document content (text, images).
Tagged PDF makes PDF/A-1a possible, because tags are the mechanism for expressing logical document-structuring concepts in PDF files. Since tags organize the non-visual means of accessing content on the page, correct tagging is essential to the intent of PDF/A-1a.
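The difference between content-stream order and logical order can be sketched in a few lines of Python. Everything here is a simplified assumption: the `Tag` class, the `runs` table and the sample text are hypothetical, and a real PDF connects tags to page content through marked-content identifiers rather than a Python dictionary. The sketch imagines a two-column page whose producer wrote text line by line across both columns.

```python
from dataclasses import dataclass, field


@dataclass
class Tag:
    """One node of a simplified, hypothetical tag tree."""
    kind: str                                             # e.g. "H1", "P"
    content_ids: list[int] = field(default_factory=list)  # runs owned by this tag
    children: list["Tag"] = field(default_factory=list)


# Text runs keyed by their position in the content stream (invented data).
runs = {
    0: "Annual Report",
    1: "Revenue rose",      # column 1, line 1
    2: "Costs fell",        # column 2, line 1
    3: "by ten percent.",   # column 1, line 2
    4: "by five percent.",  # column 2, line 2
}

tree = Tag("Document", children=[
    Tag("H1", [0]),
    Tag("P", [1, 3]),  # first column is one paragraph
    Tag("P", [2, 4]),  # second column is another
])


def logical_order(tag: Tag):
    """Walk the tag tree depth-first, yielding (kind, text) pairs."""
    if tag.content_ids:
        yield tag.kind, " ".join(runs[i] for i in tag.content_ids)
    for child in tag.children:
        yield from logical_order(child)


stream_text = " ".join(runs[i] for i in sorted(runs))
tagged_text = list(logical_order(tree))

print(stream_text)
for kind, text in tagged_text:
    print(f"{kind}: {text}")
```

Read in stream order, the two columns interleave into nonsense. Read through the tag tree, the same runs come out as a heading followed by two intact paragraphs, which is the order a screen reader or search engine needs.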
Given the history of PDF and the way most PDFs are built, achieving logical in addition to visual reproducibility is a substantial challenge. Requiring a reproducible visual appearance over the long term is profoundly different from requiring that the same document's contents be accessible. The two conformance levels of PDF/A exist to allow for both.
Conventionally, the typical consumer of assistive technology (AT) is a blind person equipped with a computerized braille reader or screen reader software. Their chosen AT device provides text-to-speech, keyboard interaction or other features to make computers usable to those without sight. There are many disabilities, however, and a correspondingly wide variety of assistive technology devices, both software and hardware, are available to enable disabled individuals to read and interact with web-pages, forms and electronic documents.
Governments are increasingly requiring their agencies and contractors to deliver accessible products and services. From websites to forms, regulations, product manuals and reports, documents in the US Federal government must comply with Section 508 accessibility regulations, in effect since 2001. Several state governments have similar laws, as do governments in Canada, various EU member states, Australia and elsewhere. A number of organizations, including the retailer Target, have been found liable, with significant monetary damages, for their failure to provide equal access to content.
That said, accessibility isn't just about the needs of disabled users. Human beings are not the only consumers of electronic content; search and indexing engines are also readers of PDF files. There are several conventional business and operational reasons to ensure PDF files are tagged to high standards and thus achieve meaningful PDF/A-1a compliance.
Blind users are prominent in calling for content accessibility; but the technology that makes documents readable by blind users is directly applicable to the mainstream business needs of civil servants, attorneys, archivists and others considering PDF/A. Properly tagged PDF files offer a series of functional effects with significant benefits for users of archival material.
After all, while visually reproducible pages are, obviously, critical, if you can't find the document in the first place because it's not tagged correctly, reproducibility becomes somewhat moot.
The key advantages of accessibility for the institutional or business archivist are:
Searchability, because logical ordering of content ensures that words and phrases are made available to the search engine irrespective of page position, print order, or other, non-semantic factors. Additionally, well-tagged PDFs include alternate text for each semantically significant image, providing additional content to search engines.
Search Engine Optimization (SEO), because tagging-aware search engines understand the logical structure elements (such as headings) in tagged PDF and can use them in their metrics.
Content extraction (assuming your preferred PDF viewer is aware of PDF tags) is enhanced at two levels. First and foremost, proper tagging ensures that text is selected and extracted in the correct logical order. It's not OK to have page header text interrupting a sentence, or to mix up columns in a multiple-column document. Secondly, proper tagging ensures that complex logical structures such as tables may be exported to spreadsheets without error, while document text may be exported with key structural information such as headings and lists intact.
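As a rough illustration of the table export described in the last point, the Python sketch below turns a tagged table into CSV. The nested-tuple representation and the sample figures are hypothetical; only the tag names (Table, TR, TH, TD) follow PDF's standard structure types. It assumes a simple table with no merged cells.

```python
import csv
import io

# Hypothetical, simplified tagged table: (tag, children) for containers
# and (tag, text) for cells. The figures are invented.
table = ("Table", [
    ("TR", [("TH", "Year"), ("TH", "Revenue")]),
    ("TR", [("TD", "2009"), ("TD", "1,200")]),
    ("TR", [("TD", "2010"), ("TD", "1,350")]),
])


def table_rows(table_tag):
    """Return the cell text of each TR, in the order the tags define."""
    kind, rows = table_tag
    if kind != "Table":
        raise ValueError("expected a Table tag")
    return [[text for _cell_kind, text in cells] for _row_kind, cells in rows]


def to_csv(rows):
    """Serialize rows as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(table_rows(table))
print(csv_text)
```

Because the rows and cells come from the tags rather than from guesses about page coordinates, each value lands in the right column. Without tags, an exporter has to infer the grid from glyph positions.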
Of course, to gain the benefits of tagged PDF, your PDF software must process PDF tags!
In a PDF, just as in HTML, you must use as many tags as are required to correctly convey the logical structure of the content. Each paragraph, for example, needs a
tag. Headings get tags such as