The TET Plugin provides easy access to the PDFlib Text Extraction Toolkit (TET). Although the TET Plugin runs as an Acrobat plugin, the underlying content extraction features do not use Acrobat functions, but are based entirely on TET. The TET Plugin is provided as a free tool which demonstrates the power of PDFlib TET. Since the TET Plugin is more powerful than Acrobat’s built-in text and image extraction tools and offers a number of convenient user interface features, it is useful as a replacement for Acrobat’s built-in copy and find features. PDFlib TET can successfully process many documents for which Acrobat produces only garbage when trying to extract the text.
The TET Plugin offers the following functions:
- Copy the text from a PDF document in plain text to the system clipboard or a disk file. Enhanced clipboard controls facilitate the use of copy/paste.
- Convert a PDF to an XML dialect called TETML and place it in the clipboard or a disk file.
- Copy XMP document metadata to the clipboard or a disk file.
- Find words in the document. The search text can be supplied literally or in hex syntax to facilitate the search for unusual characters.
- Highlight all instances of a search term on the page simultaneously.
- Extract images from the document as TIFF, JPEG, or JPEG 2000 files.
- Display color space and position information for images.
- Adjust text and image extraction to your requirements with detailed configuration settings. Configuration sets can be saved and reloaded.
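To illustrate why a hex notation helps when searching for unusual characters, here is a minimal Python sketch. It is not the plugin's code, and the plugin's actual hex syntax is not specified in this description: the format used here (whitespace-separated hexadecimal Unicode code points) and the function names `decode_hex_term` and `find_all` are assumptions made purely for illustration.

```python
def decode_hex_term(term: str) -> str:
    """Turn whitespace-separated hex code points into a Unicode string.

    Assumed input format (hypothetical): "0063 0061 0066 00E9".
    """
    return "".join(chr(int(token, 16)) for token in term.split())


def find_all(page_text: str, needle: str) -> list[int]:
    """Return the start offset of every occurrence of needle in page_text."""
    if not needle:
        return []
    offsets = []
    start = page_text.find(needle)
    while start != -1:
        offsets.append(start)
        # Continue one character further so overlapping matches are found too.
        start = page_text.find(needle, start + 1)
    return offsets


if __name__ == "__main__":
    term = decode_hex_term("0063 0061 0066 00E9")
    print(term, find_all("café au café", term))
```

Collecting every offset, rather than stopping at the first hit, mirrors the idea of highlighting all instances of a search term on a page at once.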
Advantages over Acrobat’s copy function
The copy feature of the TET Plugin offers several advantages over Acrobat’s built-in copy facility:
- The output can be customized to match different application requirements.
- TET is able to correctly interpret the text in many cases where Acrobat copies only garbage to the clipboard.
- Unknown glyphs (for which no proper Unicode mapping can be established) are highlighted in red and can be replaced with a user-selected character (e.g. a question mark).
- TET processes documents much faster than Acrobat.
- Images can be selected interactively for export, or all images on the page or in the document can be extracted.
- Tiny image fragments are merged into usable images.
What is PDFlib TET?
The PDFlib Text Extraction Toolkit (TET) is the underlying engine of the TET Plugin. TET is a developer product for reliably extracting text from PDF documents. TET provides the text contents of a PDF as Unicode strings, along with detailed glyph and font information and the position of the text on the page. In addition, TET contains advanced content analysis algorithms for determining word boundaries, grouping text into columns, and removing redundant text such as shadows or artificially bolded text. Using the auxiliary pCOS interface you can retrieve arbitrary objects from the PDF, such as metadata or interactive elements. With PDFlib TET you can:
- Implement a search engine for PDF documents;
- Extract text from PDFs, e.g. to store it in a database;
- Convert text contents of PDFs to other formats, such as XML;
- Process PDFs based on their contents.
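As a rough illustration of the search engine use case, the following Python sketch builds a small inverted index over page text. It does not call TET: it assumes the text of each page has already been extracted into plain Unicode strings by some means, and the names `build_index` and `search` are hypothetical helpers, not part of any PDFlib product.

```python
import re
from collections import defaultdict


def build_index(pages: list[str]) -> dict[str, set[int]]:
    """Map each lowercased word to the set of 1-based page numbers containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for page_number, text in enumerate(pages, start=1):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(page_number)
    return dict(index)


def search(index: dict[str, set[int]], query: str) -> list[int]:
    """Return the sorted page numbers that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    # Intersect the page sets of all query words (AND semantics).
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)


if __name__ == "__main__":
    # Placeholder page texts standing in for extracted PDF content.
    pages = ["PDF text extraction", "Text in columns", "Images only"]
    index = build_index(pages)
    print(search(index, "text"))
    print(search(index, "text extraction"))
```

A real search engine would add ranking, stemming, and persistent storage; the sketch only shows how page-level Unicode text can be turned into a searchable structure.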
TET is available as a programming library (component) for various development environments, and as a command-line tool for batch operations. Both offer similar features, but are suitable for different deployment tasks.
Fully functional evaluation versions of PDFlib TET for a variety of platforms are available from PDFlib GmbH.
Producer: PDFlib GmbH