How veraPDF does PDF/A validation

Boris Doubrov // May 19, 2015



A number of commercial and open-source PDF preflight and validation engines are available. Nonetheless, validating PDF documents for long-term archival against ISO 19005 (the PDF/A standard) remains a challenging task for a number of reasons.

This article explains how veraPDF’s purpose-built validation model is designed to address these problems, and how it opens the door to a generic model for file-format validation.

Why PDF/A validation is so special

PDF is very different from XML syntax, where Schema validation has been an established technology for a decade.

The PDF format is highly flexible and extremely complex

Apart from 1000+ pages of an ISO standard, there is no further formal description of the PDF format itself. And yet, the core value proposition of PDF is simple: reliability when shared. PDF must be fully portable - entirely self-contained - as well as flexible and capable. Accordingly, PDF technology is undeniably complex.

PDF’s specification has been published and available for unencumbered use since the initial launch of PDF in 1993. Over time, the flexibility and rich capability of PDF technology have facilitated diverse applications across a wide variety of market segments. From time to time, different vendors interpret the specification differently, which sometimes causes customer confusion.

Since it is a complex format, and since PDF’s reliability is at the core of the format’s value proposition, standardized creation and processing of PDF data structures is of value to any PDF developer.

PDF/A itself is complex

It does not help that the PDF/A standard comes in 3 different versions (technically known as “parts”) and 3 conformance levels, relies on two different PDF specifications, has two technical corrigenda, and refers to over a dozen external specifications covering fonts, ICC profiles, image compression, XML and more. Broken fonts or invalid ICC profiles embedded in a PDF document can be as critical for long-term preservation as missing metadata or inconsistent use of color models, so from the user’s perspective these specifications need to be addressed as well.

PDF/A validation in particular needs to meet very specific standards

The PDF/A standard aims for long-term (100+ years!) preservation of electronic documents. This means that the results of a validation performed today may still be relevant a long time afterwards, which calls for a much more formal analysis of the requirements.

Test corpora as a starting point

The first necessary step in a formal analysis of PDF/A requirements is to identify all potential violations and generate a suite of test files that covers every PDF/A requirement. The initial step in this direction was taken back in 2008, when the Isartor Test Suite for PDF/A-1b was created through the joint efforts of PDF Association members. This was followed by other corpora, for a total collection of 323 atomic test files.

The veraPDF consortium’s initial analysis shows that an extra 500 test files need to be generated to fill the remaining gap for all PDF/A versions and levels. This is an open, deliberative and consensus-driven process involving the PDF developer community through the PDF Association’s PDF Validation Technical Working Group (TWG), and of course, the work of QA engineers to generate real test files.

veraPDF’s goal is to become the commonly accepted PDF/A Conformance Checker, so the process for its development must be highly responsive to industry concerns at first instance and at the lowest-possible level of engagement. Accordingly, all test files go through a thorough acceptance process:

  • Initial discussions of border cases by the TWG
  • Creation and submission of initial drafts of test files to a public GitHub repository
  • Internal presentation of these files to the TWG for review and discussion
  • Finally, their formal acceptance months later

This model maximizes veraPDF’s engagement with the industry that produces, edits and displays PDF documents. Ensuring that commercially-interested parties have ample opportunity to have their concerns heard and resolved in an open manner is critical to establishing the credibility and acceptance veraPDF requires for success.

Abstract validation model

In parallel with test corpora creation, veraPDF is formalizing all validation rules to ensure they are as precise as, for example, XML Schemas, and easy to verify by PDF experts with different technical backgrounds. From the first meetings of the veraPDF TWG it was clear that veraPDF needed a formalized validation model to cover all the requirements.

Based on further technical analysis, we came to the following architecture:

  • An abstract validation model consisting of an object-oriented hierarchy of object types to be validated. Each object type contains a predefined inheritable set of simple properties as well as named links to lists of objects of other types.
  • A validation profile that lists all requirements for each object type, or validation rules in formal terminology. Each rule is a certain Boolean expression built from the object properties, elementary arithmetic, and Boolean operations.
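To make these two parts concrete, here is a minimal Python sketch rather than veraPDF's actual implementation. The names `ModelObject`, `Rule` and `check` are hypothetical, and the sketch assumes that a rule declared for a type also applies to its subtypes, which is one reading of "inheritable".

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ModelObject:
    """Base of a hypothetical type hierarchy: simple properties plus named links."""
    properties: Dict[str, object] = field(default_factory=dict)
    links: Dict[str, List["ModelObject"]] = field(default_factory=dict)


class CosObject(ModelObject):
    pass


class CosArray(CosObject):
    pass


@dataclass(frozen=True)
class Rule:
    """One entry of a validation profile: a Boolean test bound to an object type."""
    rule_id: str
    object_type: type
    test: Callable[[ModelObject], bool]


def check(obj, profile):
    """Return ids of rules that apply to obj's type (or a supertype) and fail."""
    return [rule.rule_id for rule in profile
            if isinstance(obj, rule.object_type) and not rule.test(obj)]


# Assumed example profile with a single rule on array size.
profile = [Rule("rule8", CosArray, lambda o: o.properties["size"] < 8192)]
small = CosArray(properties={"size": 3})
huge = CosArray(properties={"size": 9000})
```

Calling `check(huge, profile)` reports the failing rule id, while `check(small, profile)` returns an empty list. The model and the profile stay separate, so either can change without touching the other.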

Note that the term “PDF” does not appear in the above items; this approach is designed to be as generic as possible. We thereby gain the following advantages:

  • On one hand our model fits the purposes of PDF validation and, in particular, is aligned with the internal PDF syntax.
  • On the other hand, the same model is readily employed for validating other file formats in digital content such as ICC profiles, images and fonts.

Formal syntax

As the validation model resembles object-oriented programming, we decided that the optimal approach is a simplified programming language, which we invented from scratch. The PDF model description starts as follows:

type Object {};

type CosObject extends Object {};

type CosArray extends CosObject {
  property size: Integer;
  link values: CosObject*;
};
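Because links such as `values` point to further objects, a checker has to walk the whole object graph to reach everything that needs validating. The Python sketch below shows one possible traversal. The `Node` class and `walk` function are hypothetical names for this example, and visited objects are tracked because PDF structures can reference each other cyclically.

```python
from collections import deque


class Node:
    """Hypothetical model object: a type name, properties, and named links."""

    def __init__(self, type_name, **properties):
        self.type_name = type_name
        self.properties = properties
        self.links = {}  # link name -> list of Node


def walk(root):
    """Yield every object reachable from root exactly once (breadth-first)."""
    seen = {id(root)}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        yield node
        for targets in node.links.values():
            for target in targets:
                if id(target) not in seen:
                    seen.add(id(target))
                    queue.append(target)


inner = Node("CosArray", size=0)
outer = Node("CosArray", size=2)
outer.links["values"] = [inner, outer]  # second entry is a deliberate cycle
visited = [node.properties["size"] for node in walk(outer)]
```

Each yielded object can then be checked against the rules declared for its type.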

The veraPDF validation model is generic enough to deal not only with PDF arrays and dictionaries, but also with page content operators and low-level technical requirements:

type Operator extends Object {};

type qOperator extends Operator {
  property nestingLevel: Integer;
};
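As an illustration of how a property such as `nestingLevel` could be derived, the Python sketch below scans a simplified list of content-stream operator names and records the depth reached at each `q`. The function name and the flat operator list are assumptions for this example; a real parser works on tokenized content streams. To our knowledge, the PDF Reference 1.4 implementation limits referenced by PDF/A-1 cap `q`/`Q` nesting at 28 levels, so a hypothetical rule on `qOperator` could read `nestingLevel <= 28`.

```python
def q_nesting_levels(operators):
    """Return the nesting depth reached by each 'q' operator, in order."""
    depth = 0
    levels = []
    for op in operators:
        if op == "q":
            depth += 1
            levels.append(depth)
        elif op == "Q":
            # An unbalanced Q is ignored here; a real checker would report it.
            depth = max(depth - 1, 0)
    return levels


levels = q_nesting_levels(["q", "q", "Q", "q", "re", "f", "Q", "Q"])
```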

Validation profiles are structured collections of particular validation rules, so they are naturally serialized to a plain XML syntax. A typical validation rule in PDF/A-1 requires that any PDF array shall have fewer than 8192 elements. The XML syntax of the validation profile reflects this as follows:

id="rule8" object="CosArray">
  Maximum capacity of array in elements less than 8192
  size < 8192
  Capacity of array greater than 8191
  ISO19005-1
  6.1.12
  PDF Reference 1.4
  Table C.1

The main advantages of this approach are the easy-to-read syntax for formalized validation requirements and independence from any particular implementation technology. In fact, this model can be coded in any programming language from C/C++ to Python, and every implementation should produce identical validation results on the same input data.
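To show how little machinery such rules need, here is a minimal Python sketch of an evaluator for expressions like `size < 8192`. It is an illustration under stated assumptions, not veraPDF's rule engine: it borrows Python's own expression grammar (`and`, `or`, `not`) as a stand-in for the profile's expression syntax, and the `evaluate` function is a hypothetical name. Only property names, constants, elementary arithmetic, comparisons and Boolean operations are accepted.

```python
import ast
import operator

_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}
_COMPARE = {ast.Lt: operator.lt, ast.LtE: operator.le, ast.Gt: operator.gt,
            ast.GtE: operator.ge, ast.Eq: operator.eq, ast.NotEq: operator.ne}


def evaluate(expr, props):
    """Evaluate a restricted Boolean/arithmetic expression over object properties."""
    return _eval(ast.parse(expr, mode="eval").body, props)


def _eval(node, props):
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name):
        return props[node.id]  # a property of the object being validated
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
        return _BINARY[type(node.op)](_eval(node.left, props),
                                      _eval(node.right, props))
    if (isinstance(node, ast.Compare) and len(node.ops) == 1
            and type(node.ops[0]) in _COMPARE):
        return _COMPARE[type(node.ops[0])](_eval(node.left, props),
                                           _eval(node.comparators[0], props))
    if isinstance(node, ast.BoolOp):
        # All operands are evaluated; there is no short-circuiting here.
        values = [_eval(value, props) for value in node.values]
        return all(values) if isinstance(node.op, ast.And) else any(values)
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.Not):
        return not _eval(node.operand, props)
    # Anything else (calls, attribute access, ...) is rejected.
    raise ValueError("unsupported expression: " + ast.dump(node))
```

For example, `evaluate("size < 8192", {"size": 8191})` holds, while the same test with `size` set to 8192 fails. Rejecting every other syntax node keeps the rule language declarative, which is what lets it be re-implemented in other languages.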

Beyond PDF validation

The PREFORMA project presents additional challenges to PDF/A validation. As metadata carries key information about the document’s PDF/A conformance, one wants to fix any incorrect claims of specific PDF/A validity or, conversely, to mark the document as a conforming PDF/A document if proper metadata identification is the only problem. This component of veraPDF is the Metadata Fixer, and it is being developed alongside the Conformance Checker.

In addition to industry-supported PDF/A validation, end users might have extra requirements for their archived PDF documents. veraPDF helps users impose these requirements in two stages.

  1. The veraPDF library generates a PDF Features Report containing data such as the number of pages, embedded fonts, images and other resources, along with other user-friendly information about a given PDF document.
  2. The PDF Features Report is passed to a so-called Policy Checker that verifies it against specific Policy requirements formalized in Schematron syntax.

Users will be able to add their own specific checks, and even update them at a later stage, with no need to regenerate the PDF Features Report.
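The two-stage idea can be sketched in Python as follows. This is a stand-in under stated assumptions: the report layout (`featuresReport`, `pages`, `font`) is invented for the example and is not the real PDF Features Report schema, and plain Python predicates take the place of Schematron assertions so that the example stays self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical report layout; the real PDF Features Report schema may differ.
REPORT = """
<featuresReport>
  <pages count="12"/>
  <fonts>
    <font name="Helvetica" embedded="true"/>
    <font name="Courier" embedded="false"/>
  </fonts>
</featuresReport>
"""

# Each policy pairs a message with a predicate over the parsed report.
POLICIES = [
    ("document has at most 100 pages",
     lambda report: int(report.find("pages").get("count")) <= 100),
    ("every font is embedded",
     lambda report: all(font.get("embedded") == "true"
                        for font in report.iter("font"))),
]


def check_policies(report_xml, policies):
    """Return the messages of all policies that the report violates."""
    root = ET.fromstring(report_xml)
    return [message for message, test in policies if not test(root)]


violations = check_policies(REPORT, POLICIES)
```

Because the policies only read the report, the `POLICIES` list can be edited and re-run against a stored report without parsing the PDF again.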

Collaborative approach

PDF is not the only format one needs to validate in order to achieve long-term preservation of archival electronic content. The EU’s PREFORMA project, which is funding development of veraPDF, includes TIFF image and audiovisual file validation in addition to PDF. PREFORMA’s requirements include a common interface for the three formats. This might be an interesting challenge for our generic validation model, but it should also help prove the degree to which our approach applies to file-format specifications in general.

Finally, we believe our approach will pave the way to resolving yet another challenge of PDF/A validation: smooth interoperability with code designed to address objects that are encoded in PDF but defined outside the PDF specification itself. This is achieved by a plug-in architecture providing interoperability with third-party components that validate file formats embedded in PDF.
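One common way to structure such a plug-in architecture is a registry that maps each embedded format to a validator. The Python sketch below illustrates the pattern only; the names `register` and `validate_embedded` are hypothetical and do not describe veraPDF's plug-in interface. The sample ICC check relies on the `acsp` signature that ICC profile headers carry at byte offset 36.

```python
from typing import Callable, Dict, List

# Registry mapping an embedded-format name to a validator callable.
# Each validator takes raw bytes and returns a list of problem messages.
_VALIDATORS: Dict[str, Callable[[bytes], List[str]]] = {}


def register(format_name):
    """Decorator that registers a validator for one embedded format."""
    def decorator(func):
        _VALIDATORS[format_name] = func
        return func
    return decorator


@register("icc")
def validate_icc(data):
    # Deliberately shallow check: only the header signature is inspected.
    if len(data) < 40 or data[36:40] != b"acsp":
        return ["missing 'acsp' profile signature"]
    return []


def validate_embedded(format_name, data):
    """Dispatch to a plug-in; report when no plug-in handles the format."""
    validator = _VALIDATORS.get(format_name)
    if validator is None:
        return ["no validator registered for " + format_name]
    return validator(data)
```

A third party could add, say, a JPEG2000 validator by registering one more function, without any change to the dispatching code.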

Just imagine embedded ICC profiles and JPEG2000 images validated along with their PDF parent! For such a critical file format, this would be a true demonstration of collaborative software development and knowledge sharing: a rising tide that lifts all boats.

Learn more about veraPDF

If you develop PDF technology and want to learn more about veraPDF, contact the PDF Association's veraPDF coordinator.


ABOUT THE AUTHORS

Boris Doubrov

Boris Doubrov is CEO of Dual Lab, a company specializing in product development services in the areas of Computer Graphics, CAD/CAM Modelling and other science-intensive areas. Boris Doubrov holds a PhD in Mathematics and has been working for more than 15 years in PDF technologies as a software developer, project manager, and business owner. He is an active participant of …
