PRESENTATION

Evaluating Text Extraction at Scale

Case Study from Apache Tika

Tim Allison
Presenter: Tim Allison, Jet Propulsion Laboratory
Language: English
Accessibility Archiving Date/Time: 0

Description

Apache Tika is widely used as a critical enabling technology for search in Apache Solr and other search systems. This open source library extracts text and metadata from numerous file formats, including PDF via an integration with Apache PDFBox. When text extraction goes wrong, search and other natural language processing (NLP) applications built on the extracted text become far less reliable.

Over the last five years, the Tika project has gathered and published a large corpus of files (https://corpora.tika.apache.org/base/docs), and we have developed an evaluation module (tika-eval) and a methodology to identify regressions in text extraction and areas for improvement in our parsers.

This talk offers an overview of Tika's publicly available regression corpus as well as the tika-eval module. We'll discuss ways of scaling its NLP- and language-modeling-based metrics to identify potential mojibake, corrupt text and/or bad OCR across large collections of files. These techniques are applicable for PDF parser developers, search system integrators and those interested in archiving and accessibility.
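To give a rough feel for what a language-based extraction metric can look like, here is a minimal Python sketch. It is not tika-eval's implementation: the word list, function names and example sentence are all hypothetical, chosen only for illustration. The idea is to measure what fraction of extracted tokens fall outside a small list of common words, and to watch that fraction rise when the text is garbled.

```python
import re

# Hypothetical, tiny stand-in for a per-language list of common words.
# A real evaluation would use a much larger list for each language.
COMMON_WORDS = frozenset(
    ["das", "ist", "für", "die", "über", "und", "nicht", "können"]
)

# Runs of word characters, excluding decimal digits and underscores.
TOKEN_RE = re.compile(r"[^\W\d_]+")


def out_of_vocabulary_ratio(text, common_words=COMMON_WORDS):
    """Return the fraction of tokens that are not in the common-word list."""
    tokens = [token.lower() for token in TOKEN_RE.findall(text)]
    if not tokens:
        # No usable tokens at all is treated as the worst case.
        return 1.0
    unknown = sum(1 for token in tokens if token not in common_words)
    return unknown / len(tokens)


def simulate_mojibake(text):
    """Encode as UTF-8, then wrongly decode as Latin-1 (a classic mojibake source)."""
    return text.encode("utf-8").decode("latin-1")


if __name__ == "__main__":
    clean = "das ist für die über und nicht können"
    garbled = simulate_mojibake(clean)
    print(garbled)
    print(out_of_vocabulary_ratio(clean))
    print(out_of_vocabulary_ratio(garbled))
```

Comparing the ratio for two extractions of the same file (for example, before and after a parser change) is one simple way to flag files worth a closer look. A single number like this is only a hint, since short documents and unsupported languages will also score poorly.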

This presentation is part of PDF Days Europe 2022.