
A SafeDocs update

Peter Wyatt // April 15, 2020


As announced in June 2019 (“SafeDocs: DARPA Does PDF”), the PDF Association is now serving as an industry partner in the Defense Advanced Research Projects Agency (DARPA)-funded Safe Documents (SafeDocs) program. The goal of this fundamental research program is to develop novel parser methodologies for ensuring safety in digital content, whether document formats (such as PDF), specialized imaging formats (such as NITF) or streaming data protocols (such as DNS or MAVLink). The philosophy underlying SafeDocs' approach is that of ‘language-theoretic security’ (LangSec), which posits that “the only path to trustworthy software that takes untrusted inputs is treating all valid or expected inputs as a formal language, and the respective input-handling routines as a recognizer for that language. The recognition must be feasible, and the recognizer must match the language in required computation power.”
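
As a minimal illustration of the LangSec idea quoted above (not code from the SafeDocs program), the sketch below treats one tiny piece of input, a PDF header line of the form `%PDF-M.N`, as a regular language and recognizes it in full before anything else touches it. The function names `recognize_header` and `handle_untrusted` are hypothetical, and the single-digit version pattern is a simplifying assumption made for this example.

```python
import re

# Assumed grammar for this sketch: "%PDF-" digit "." digit, and nothing else.
# This is a regular language, so a regular expression (a finite automaton)
# is a recognizer of matching computational power.
HEADER = re.compile(rb"%PDF-([0-9])\.([0-9])")


def recognize_header(line: bytes):
    """Return (major, minor) if the whole line is in the language, else None."""
    m = HEADER.fullmatch(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def handle_untrusted(line: bytes) -> str:
    """Recognize first; only recognized input reaches the processing step."""
    version = recognize_header(line)
    if version is None:
        return "rejected"
    major, minor = version
    return f"accepted PDF {major}.{minor}"


if __name__ == "__main__":
    for sample in (b"%PDF-1.7", b"%PDF-2.0", b"%PDF-1.7junk", b"PDF-1.7"):
        print(sample, "->", handle_untrusted(sample))
```

The point of the design is that `handle_untrusted` never inspects input that the recognizer has not accepted in its entirety; trailing bytes such as `junk` cause rejection rather than being silently ignored.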

Through the PDF Association's active involvement in the SafeDocs program, researchers from BAE Systems, Galois Inc., SRI International, Northrop Grumman Systems Corp., Lockheed Martin Corp., Kudu Dynamics, NASA JPL and the many university and cyber-security research teams supporting their efforts have effectively ramped up their understanding of PDF, both in terms of the file format itself and the realities of PDF's utilization by industry and users. Under SafeDocs, each research team is approaching the same high-level objectives from different directions while also working collaboratively. The challenge for us, as their technology and industry guide, has been to support the many different kinds of microscopes being applied to PDF!

From the PDF industry perspective, in the short time since SafeDocs commenced, a number of visible positive outcomes have already occurred:

  • A number of subtle but important wording changes to ISO/DIS 32000-2:202x (the PDF 2.0 dated revision) were triggered by SafeDocs researchers and accepted during the last ISO TC 171 SC 2 WG 8 meeting in December 2019. These tweaks corrected clear mistakes (e.g. did anyone notice that curly braces were permitted as valid token delimiters in PDF and not just for Type 4 PostScript Functions?), filled in missing statements and removed contradictory ones, resolved ambiguities and unstated assumptions (e.g. <> is a valid empty hex string and hex strings can have mixed case), and generally improved the language around the basic PDF COS syntax. But, as developers already know, a file format specification can only go so far in defining what an implementation has to do: working with real-world files while meeting customer expectations is beyond a file format specification!
  • PolyFile and PolyTracker are two new open-source tools developed by security researchers at Trail of Bits. A detailed walkthrough on the background and use of both these tools can be found in this video, but, in a nutshell:
    • PolyTracker is an automated instrumentation framework that efficiently tracks each byte from an input file through the execution of a program (such as a parser) with the goal of associating functions with the byte offsets of the input files on which they operate; and
    • PolyFile is a polyglot-aware file identification utility that can be used with PolyTracker to semantically label each byte of an input file by its usage, across a multitude of file formats.
  • Hashashin is a new open-source extension to the popular Binary Ninja tool, developed by security researchers at River Loop Security. Hashashin's purpose is to apply Binary Ninja annotations developed on one instantiation of a parsing library to other instances of the library, and is described in detail in two blog articles, "Binary hashing: motivations and algorithms", and "Hashashin: Using binary hashing to port annotations".
  • The researchers have uncovered a number of issues and vulnerabilities in both OSS and proprietary PDF implementations, with responsible reporting practices being followed.
  • Apache Tika was selected as SafeDocs' preferred ‘parser hosting framework’. The initial integrations with researcher technologies and testing with a number of PDF-centric corpora identified a number of issues in Tika. The resulting improvements included better disambiguation of FDF files (TIKA-2986 and TIKA-2988) and applying Tika to all embedded XMP instances (TIKA-3058).
  • Driven by some of the complexities of PDF, ongoing capability improvements have been committed to the C-based “Hammer” parser-combinator library, which supports developing formally-provable parsers by writing grammars as inline domain-specific languages.
  • As part of the most recent evaluation event for researchers, a “PDF Observatory” experiment in a similar style to the SSL Observatory was run. In this model, “extant data” (unfiltered real-world files, including malformations) and “extant parsers” (open source and black-box PDF processors) were monitored and correlated in an effort to detect “parser differentials” caused by identifiable syntax structures, regardless of syntax validity or degree of malformation. This experiment aimed to determine whether SafeDocs' early-stage technology could detect evolution in a file format, when new features are added but not all parsers support them.
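
The notion of a “parser differential” described above can be sketched with two toy readers for PDF hex strings. This is an illustration only, not the PDF Observatory tooling: `strict_hex`, `lenient_hex` and `differentials` are hypothetical names written for this example. Both readers accept `<>` as an empty string and mixed-case digits, and both pad an odd number of digits with a trailing 0; how they treat non-hex bytes is an assumption chosen here so that they disagree on malformed input.

```python
import string

HEX_DIGITS = set(string.hexdigits.encode("ascii"))
# PDF white-space characters: NUL, TAB, LF, FF, CR, SPACE.
WHITESPACE = set(b"\x00\t\n\x0c\r ")


def _decode(digits: bytes) -> bytes:
    # An odd number of digits is padded with a trailing 0.
    if len(digits) % 2:
        digits += b"0"
    return bytes.fromhex(digits.decode("ascii"))


def strict_hex(token: bytes):
    """Hypothetical strict reader: rejects bytes that are neither hex digits nor white-space."""
    if len(token) < 2 or token[:1] != b"<" or token[-1:] != b">":
        return None
    body = bytes(b for b in token[1:-1] if b not in WHITESPACE)
    if any(b not in HEX_DIGITS for b in body):
        return None
    return _decode(body)


def lenient_hex(token: bytes):
    """Hypothetical lenient reader: silently skips bytes it does not understand."""
    if len(token) < 2 or token[:1] != b"<" or token[-1:] != b">":
        return None
    body = bytes(b for b in token[1:-1] if b in HEX_DIGITS)
    return _decode(body)


def differentials(samples, parsers):
    """Return the samples on which the parsers do not all agree."""
    report = []
    for sample in samples:
        results = {name: parse(sample) for name, parse in parsers.items()}
        if len(set(results.values())) > 1:
            report.append((sample, results))
    return report


if __name__ == "__main__":
    corpus = [b"<>", b"<4a6F>", b"<48 65 6C>", b"<48zz65>", b"<ABC>"]
    parsers = {"strict": strict_hex, "lenient": lenient_hex}
    for sample, results in differentials(corpus, parsers):
        print(sample, results)
```

In this toy corpus the well-formed tokens produce identical results from both readers, while the malformed token `<48zz65>` is rejected by one and decoded by the other. Correlating such disagreements with the syntax structure that triggers them is the idea behind differential testing at corpus scale.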

Researchers initially focused on the GovDocs1 corpus (containing 239K PDF files), primarily because it is a freely available, well-studied real-world corpus collected from US government websites (.gov). However, from a PDF industry perspective, GovDocs1 is over a decade old and has limited PDF technical diversity due to the way it was sourced. SafeDocs' researchers are now scaling their technologies to include additional corpora, including Common Crawl and an exciting new PDF-centric issue-tracker corpus under development by NASA JPL with guidance from the PDF Association. The Sixth LangSec IEEE S&P Workshop at the IEEE Security & Privacy Symposium 2020 will be held on May 21, and will include many SafeDocs researchers presenting their latest work on corpora, topological difference testing and other methodologies.

Researchers are demonstrating early progress towards stretching the underlying principles of linguistics, formal language theory, topology and applied category theory, type theory and various other advanced disciplines. New tooling is being rapidly prototyped supporting novel ideas, so that diverse combinations of theory and practice can be easily assessed for feasibility. PDF is being used to push the boundaries of these domains, with research outcomes destined to be broadly applicable to other digital formats.

Some components of the PDF Association's original plan of work, such as surveying industry for security processes and practices around PDF development, and technical benchmarking and metrics of “hidden corpora”, will become increasingly relevant once core research problems are addressed and SafeDocs technologies have matured.

From the PDF industry perspective, there is still a long way to go before the research outcomes in development are applicable to the day-to-day business of engineering reliable and interoperable PDF technologies (such as parser construction toolkits). In true research style, no one is entirely sure where these efforts will take us. Along the way, we believe there is real potential to gain important insights and leverage intermediate outcomes to improve interoperability, reliability and security across the PDF industry.


This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001119C0079. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). Approved for public release.


Peter Wyatt

Peter Wyatt is an independent technology consultant with deep file format and parsing expertise, who has been actively working on PDF technologies as a developer and researcher for more than 18 years. He is currently Project Co-Leader of ISO 32000 (the core PDF standard), a member of the Board of the PDF Association, and co-Chair of the PDF Association PDF TWG. He is …


