Presented at OctoberPDFest online (October 2020)

Evaluating Text Extraction at Scale

Case Study from Apache Tika

About the presenter(s)

Tim has been working in content/metadata extraction (and evaluation), advanced search and relevance tuning for nearly 20 years. Tim is the founder of Rhapsode Consulting LLC, and he currently works …


Tim Allison
Jet Propulsion Laboratory

Description

Apache Tika is widely used as a critical enabling technology for search in Apache Solr and other search systems. This open-source library extracts text and metadata from numerous file formats, including PDF via an integration with Apache PDFBox. When text extraction goes wrong, search and other natural language processing (NLP) applications become far less reliable.

Over the last 5 years, the Tika project has gathered and published a large corpus of files (https://corpora.tika.apache.org/base/docs), and we have developed an evaluation module (tika-eval) and methodology to identify regressions in text extraction and areas for improvement in our parsers.
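
To make the idea of a regression check concrete, here is a minimal sketch that compares the text extracted from the same file by two parser versions. It is an illustration under stated assumptions, not the tika-eval implementation: the function names (`token_counts`, `dice_overlap`) and the choice of a Dice coefficient over token counts are hypothetical.

```python
import re
from collections import Counter

# Assumption: a simple word-character tokenizer is good enough for a sketch.
TOKEN_RE = re.compile(r"\w+")


def token_counts(text):
    """Return a multiset (Counter) of lowercased tokens in the text."""
    return Counter(token.lower() for token in TOKEN_RE.findall(text))


def dice_overlap(text_a, text_b):
    """Dice coefficient over token multisets: 1.0 means identical token
    counts, 0.0 means no shared tokens. Two empty extracts count as equal."""
    counts_a = token_counts(text_a)
    counts_b = token_counts(text_b)
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        return 1.0
    # Counter intersection keeps the minimum count of each shared token.
    shared = sum((counts_a & counts_b).values())
    return 2.0 * shared / total
```

Running something like `dice_overlap(old_extract, new_extract)` over every file in a corpus and sorting by the lowest scores would surface the files whose extracted text changed most between versions, which are the candidates worth inspecting by hand.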

This talk offers an overview of Tika’s publicly available regression corpus as well as the tika-eval module. We’ll discuss ways of scaling its metrics, which are based on NLP and language modeling, to identify potential mojibake, corrupt text and/or bad OCR. These techniques are applicable for PDF parser developers, search system integrators and those interested in archiving and accessibility.
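
One simple language-based signal for junk text is the share of tokens that are common words in the expected language. The sketch below is a rough illustration of that idea only. The tiny English word list, the function name `common_token_ratio` and the tokenizer are all assumptions made for the example and are not taken from tika-eval.

```python
import re

# Assumption: a tiny illustrative list. A real check would use a much
# larger per-language list of frequent words.
COMMON_WORDS = frozenset(
    {"the", "and", "of", "to", "in", "a", "is", "that", "for", "it"}
)

# Runs of word characters, excluding decimal digits and underscores.
WORD_RE = re.compile(r"[^\W\d_]+")


def common_token_ratio(text, common_words=COMMON_WORDS):
    """Fraction of tokens found in the common-word list.

    Returns 0.0 when the text contains no tokens at all.
    """
    tokens = [token.lower() for token in WORD_RE.findall(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in common_words)
    return hits / len(tokens)
```

Ordinary prose in the expected language tends to score well above zero, while mojibake or badly OCRed text tends to score near zero, so a low ratio on a long extract is a cheap flag for closer inspection. Any threshold would have to be tuned per language and corpus.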

