Stressful PDF Corpus

The "Issue Tracker" corpus of stressful PDF files was originally developed under the DARPA-funded "SafeDocs" program as discussed on pdfa.org.

If a “stressful PDF” is any file that causes problems for a parser, then examining the problems that diverse parsers have run into can be a great learning experience.
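As a rough illustration of that idea, the sketch below runs a parser over a set of files and records which ones fail and why. The names `triage` and `looks_like_pdf` are hypothetical and not part of the corpus or of any PDF library; `looks_like_pdf` is only a toy stand-in (a simple byte-marker heuristic, not a conformance check) for whatever real parser is under test.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable


def looks_like_pdf(data: bytes) -> None:
    """Toy stand-in for a real parser: raise ValueError on obviously broken input.

    Assumption: this heuristic only looks for the usual header and trailer
    markers near the start and end of the file. It is not a validity check.
    """
    if b"%PDF-" not in data[:1024]:
        raise ValueError("no %PDF- marker in the first 1024 bytes")
    if b"%%EOF" not in data[-1024:]:
        raise ValueError("no %%EOF marker in the last 1024 bytes")


def triage(paths: Iterable[Path], parse: Callable[[bytes], None]) -> Dict[str, str]:
    """Run `parse` on each file and map file path -> error summary for failures."""
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            parse(path.read_bytes())
        except Exception as exc:  # stressful files can fail in many different ways
            failures[str(path)] = f"{type(exc).__name__}: {exc}"
    return failures
```

In practice `parse` would wrap a call into the parser being studied. A parser that hangs or crashes the process, rather than raising an exception, would need to be run in a separate process with a timeout, which this sketch does not attempt.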

This corpus now includes bug attachment data from 35 issue tracker repositories across 32 PDF technologies, comprising 31 GB and over 32,500 stressful PDF files.

These issue trackers now span a broad variety of PDF technologies written in a wide range of programming languages. Because of its size, we have packaged the corpus into six compressed tarballs (.tgz files), each containing the data from multiple repositories, to make downloading more convenient.

| PDF technology | Folder | Issue Tracker URL | # files | Size | .tgz file |
|---|---|---|---|---|---|
| Android PDF Viewer (Java) | androidpdfviewer | https://github.com/barteksc/AndroidPdfViewer | 13 | 3.2M | 5 |
| Cairo | cairo | https://bugs.freedesktop.org | 166 | 33M | 5 |
| Cairo | cairo-gitlab | https://gitlab.freedesktop.org/cairo/cairo | 29 | 12M | 6 |
| DeJaVu | dejavu | https://bugs.freedesktop.org | 39 | 2.7M | 5 |
| eSignature DSS | DSS | https://ec.europa.eu/cefdigital/tracker/projects/DSS | 243 | 89M | 5 |
| GNOME Evince | evince | https://gitlab.gnome.org/GNOME/evince | 241 | 591M | 6 |
| Apache FOP | FOP | https://issues.apache.org/jira/projects/FOP | 808 | 157M | 5 |
| GhostScript (C/C++) | GHOSTSCRIPT | https://bugs.ghostscript.com/ | 5,458 | 5.6G | 2 |
| Snappy PDF (laravel, PHP) | laravel-snappy | https://github.com/barryvdh/laravel-snappy | 5 | 1.8M | 5 |
| Libre Office | LIBRE_OFFICE | https://bugs.documentfoundation.org/ | 5,572 | 1.4G | 4 |
| libvips image library | libvips | https://github.com/libvips/libvips | 18 | 384M | 5 |
| Mozilla | MOZILLA | https://bugzilla.mozilla.org/ | 6,879 | 3.9G | 3 |
| Apache Nutch | NUTCH | https://issues.apache.org/jira/projects/NUTCH | 13 | 976K | 5 |
| OCRmyPDF (Python) | ocrmypdf | https://github.com/jbarlow83/OCRmyPDF | 205 | 501M | 5 |
| Apache OpenOffice.org | OOO | https://bz.apache.org/ooo | 1,564 | 253M | 4 |
| OpenPDF (Java) | openpdf | https://github.com/LibrePDF/OpenPDF | 32 | 3.2M | 5 |
| parsr (JS) | parsr | https://github.com/axa-group/Parsr | 28 | 12M | 5 |
| Mozilla pdf.js (JS) | pdf.js | https://github.com/mozilla/pdf.js | 2,368 | 4.5G | 4 |
| Apache PDFBOX (Java) | PDFBOX | https://issues.apache.org/jira/projects/PDFBOX | 3,832 | 2.7G | 1 |
| pdfcpu (Go) | pdfcpu | https://github.com/pdfcpu/pdfcpu | 100 | 218M | 5 |
| Chromium PDFium (C++) | PDFIUM | https://bugs.chromium.org/p/pdfium/issues/list | 379 | 212M | 5 |
| pdfkit (JS) | pdfkit | https://github.com/foliojs/pdfkit | 38 | 35M | 5 |
| pdfminer.six (Python) | pdfminer.six | https://github.com/pdfminer/pdfminer.six | 123 | 106M | 5 |
| PikePDF (Python) | pikepdf | https://github.com/pikepdf/pikepdf | 23 | 30M | 5 |
| Apache POI | POI | https://bz.apache.org/bugzilla/ | 11 | 940K | 5 |
| Poppler (C/C++) | poppler | https://bugs.freedesktop.org | 1,585 | 6.2G | 5 |
| Poppler (C/C++) | poppler-gitlab | https://gitlab.freedesktop.org/poppler/poppler | 463 | 926M | 6 |
| Prawn PDF (Ruby) | prawn | https://github.com/prawnpdf/prawn | 53 | 69M | 5 |
| qpdf (C++) | qpdf | https://github.com/qpdf/qpdf | 111 | 324M | 5 |
| react-pdf (JS) | react-pdf | https://github.com/diegomura/react-pdf | 14 | 2.2M | 5 |
| Redhat Linux | REDHAT | https://bugzilla.redhat.com/ | 1,712 | 1.3G | 5 |
| Sumatra PDF (C/C++) | sumatrapdf | https://github.com/sumatrapdfreader/sumatrapdf | 320 | 788M | 5 |
| tabula | tabula | https://github.com/tabulapdf/tabula | 2 | 172K | 5 |
| tabula-java (Java) | tabula-java | https://github.com/tabulapdf/tabula-java | 77 | 45M | 5 |
| Apache TIKA (Java) | TIKA | https://issues.apache.org/jira/projects/TIKA | 155 | 156M | 2 |
| TOTAL: | 35 | - | 32,679 | 31G | - |

This README file describes the overall issue tracker corpus and how data has been collated.

This README file describes the PDF-centric issue tracker corpus that is pre-packaged into six compressed tarballs (.tgz files); see https://corpora.tika.apache.org/base/packaged/pdfs/. The broader multi-format Issue Tracker corpus, which includes many more formats than just PDF and is used for testing Apache Tika, is at https://corpora.tika.apache.org/base/docs/. However, these files are not pre-packaged.
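Once one of the .tgz files has been downloaded, its contents can be surveyed without unpacking it. The following is a minimal sketch using Python's standard `tarfile` module. It assumes that each archive holds one top-level directory per issue tracker folder; the internal layout is not described here, so treat that as an assumption to check against the README. The function name `count_members_by_folder` is hypothetical.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def count_members_by_folder(tgz_path: str) -> Counter:
    """Count regular files per top-level folder without extracting anything."""
    counts: Counter = Counter()
    with tarfile.open(tgz_path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            parts = PurePosixPath(member.name).parts
            # Assumption: the first path component names the tracker folder.
            folder = parts[0] if len(parts) > 1 else "(top level)"
            counts[folder] += 1
    return counts
```

Calling `count_members_by_folder("some-archive.tgz")`, where the file name is a placeholder for a locally downloaded archive, returns a `Counter` that can be compared against the "# files" column for the folders in that archive. Listing members first is also a cautious habit with a corpus of deliberately problematic files, since nothing is written to disk until the contents have been reviewed.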

For more information and to stay up to date with the “Issue Tracker” PDF corpus, please join the corpora-dev@tika.apache.org mailing list (via https://tika.apache.org/mail-lists.html). PDF Association members can also provide feedback or comments in the PDF TWG.

The PDF Association again wishes to thank the NASA JPL and Apache Tika teams, and particularly Dr. Tim Allison, for their efforts in maintaining the technology and collating the data. We also wish to thank Maruan Sahyoun of PDF Association member FileAffairs GmbH, part of the Apache PDFBox team, for hosting the “Issue Tracker” PDF corpus as a valuable new industry resource.

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001119C0079. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). Approved for public release.
