
"Making sense of PDF structures in the wild at scale" at PDF Days Online 2021

Presented at PDF Days Europe 2021
(September 2021)

Making sense of PDF structures in the wild at scale

Building a file observatory to support secure parser development

About the presenter(s)

Tim has been working in content/metadata extraction (and evaluation), advanced search and relevance tuning for nearly 20 years. Tim is the founder of Rhapsode Consulting LLC, and he currently works …


Tim Allison
Jet Propulsion Laboratory

Description

PDFs in the wild offer a bewildering amount of variation in syntax, features and structure. For those building or evaluating parsers, it is critical to have a broad-coverage corpus available to assess and discover distributions of issues “in the wild” or on specific client document sets. In this talk, our team will present our work building a File Observatory to support the Defense Advanced Research Projects Agency (DARPA) SafeDocs program, which has an initial primary focus on PDF and related formats (e.g. JPEG, ICC, fonts, XMP).

The talk will focus on a) gathering interesting PDFs and b) making features searchable and patterns easily discoverable with open source search technologies. In the first part, we’ll discuss gathering millions of PDFs from Common Crawl and thousands of files from open source PDF parser bug tracker sites. In the second, we’ll outline the capabilities of the File Observatory to run multiple parsers against the files, extract features (runtime exceptions and error messages as well as structural features, including PDF DOM keys and values and other semantic components within the PDFs’ structures) and make those features searchable with Elasticsearch. We will also briefly demonstrate how the observatory enables the discovery of spelling variations (e.g. /Subtype vs. /SubType) and of structural features that are statistically correlated with specific creator tools.
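
To make the spelling-variation idea concrete, here is a minimal sketch in Python. It is not the File Observatory's implementation: it assumes that per-key occurrence counts have already been extracted from a corpus (for example, exported from a search index), and the function name `find_key_variants` and all counts shown are hypothetical. PDF name objects are case-sensitive, so a key such as /SubType is a different name from /Subtype, and grouping keys by their case-folded form surfaces such near-duplicates.

```python
from collections import Counter, defaultdict


def find_key_variants(key_counts):
    """Group PDF dictionary keys that differ only in letter case.

    key_counts maps each key exactly as it appeared in the files
    (e.g. "/Subtype") to the number of times it was observed.
    Returns only the groups that contain more than one spelling,
    with each group's spellings ordered from most to least frequent.
    """
    groups = defaultdict(dict)
    for key, count in key_counts.items():
        groups[key.lower()][key] = count

    return {
        folded: dict(sorted(spellings.items(), key=lambda kv: -kv[1]))
        for folded, spellings in groups.items()
        if len(spellings) > 1
    }


# Hypothetical counts, invented purely for illustration.
key_counts = Counter({
    "/Subtype": 120000,
    "/SubType": 37,
    "/Type": 250000,
    "/Length": 90000,
    "/length": 2,
})

variants = find_key_variants(key_counts)
for folded, spellings in variants.items():
    print(folded, spellings)
```

Rare spellings that sit next to a dominant one are candidates for producer bugs rather than intentional keys. Joining each rare spelling with the creator or producer metadata of the files it came from would be one way to look for tool-specific patterns, though that step is not shown here.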

