PRESENTATION

Making more sense of PDF structures in the wild at scale

Progress and outcomes of analysis on 8 million PDFs gathered from Common Crawl

Presenter: Tim Allison, Jet Propulsion Laboratory
Language: English
Date/Time: Sep 13, 2022 - 15:30

Description

This is a follow-on talk from our 2021 PDF Days presentation on the File Observatory. Our team built the File Observatory to support the Defense Advanced Research Projects Agency (DARPA) SafeDocs program by enabling parser developers to understand features of PDFs in the wild at scale.

In the first part of our presentation, we'll offer an overview of the capabilities of the observatory, from gathering files, to running numerous parsers on the files, to searching and analyzing the features extracted by the parsers. In the second part, we'll detail progress on building and packaging the "observatory in a box" for transition. In the third part, we'll present some of the findings from an analysis of roughly 8 million PDFs from Common Crawl. This section will cover parser warnings, exceptions, and errors on the set of files, along with statistical summaries of PDF features, including versions, languages, creator tools/producers, and more interesting syntactic features.

This presentation is part of PDF Days Europe 2022.