PDFlib GmbH
Status: Full Member
Country: DE
Sector: All industries
Contact:
Joined at: Sep 06
Website: http://www.pdflib.com/

Linked User
Rainer Plöckl
Stephan Mühlstrasser
Thomas Merz

PDFlib TET Plugin



The TET Plugin provides easy access to the PDFlib Text Extraction Toolkit (TET). Although the TET Plugin runs as an Acrobat plugin, the underlying content extraction features do not use Acrobat functions but are based entirely on TET. The TET Plugin is provided as a free tool which demonstrates the power of PDFlib TET. Since the TET Plugin is more powerful than Acrobat’s built-in text and image extraction tools and offers a number of convenient user interface features, it is useful as a replacement for Acrobat’s built-in copy and find features. PDFlib TET can successfully process many documents for which Acrobat provides only garbage when trying to extract the text. The TET Plugin offers the following functions:

  • Copy the text from a PDF document in plain text to the system clipboard or a disk file. Enhanced clipboard controls facilitate the use of copy/paste.
  • Convert a PDF to an XML dialect called TETML and place it in the clipboard or a disk file.
  • Copy XMP document metadata to the clipboard or a disk file.
  • Find words in the document. The search text can be supplied literally or in hex syntax to facilitate the search for unusual characters.
  • Highlight all instances of a search term on the page simultaneously.
  • Extract images from the document as TIFF, JPEG, or JPEG 2000 files.
  • Display color space and position information for images.
  • Detailed configuration settings are available to adjust text and image extraction to your requirements. Configuration sets can be saved and reloaded.

Advantages over Acrobat’s copy function

The copy feature of the TET Plugin offers several advantages over Acrobat’s built-in copy facility:

  • The output can be customized to match different application requirements.
  • TET is able to correctly interpret the text in many cases where Acrobat copies only garbage to the clipboard.
  • Unknown glyphs (for which a proper Unicode mapping cannot be established) are highlighted in red and can be replaced with a user-selected character (e.g. a question mark).
  • TET processes documents much faster than Acrobat.
  • Images can be selected interactively for export, or all images on the page or in the document can be extracted.
  • Tiny image fragments are merged into usable images.
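
The replacement of unknown glyphs mentioned above can be illustrated on already extracted text. The following is a minimal sketch, not the plugin's implementation: the function name `replace_unknown` is hypothetical, and it assumes that glyphs without a usable Unicode mapping show up in the extracted text as U+FFFD or as private-use code points.

```python
import unicodedata


def replace_unknown(text: str, replacement: str = "?") -> str:
    """Replace characters that carry no usable Unicode mapping.

    Assumption for this sketch: unknown glyphs arrive either as
    U+FFFD (REPLACEMENT CHARACTER) or as private-use code points
    (Unicode general category "Co").
    """
    out = []
    for ch in text:
        if ch == "\ufffd" or unicodedata.category(ch) == "Co":
            out.append(replacement)
        else:
            out.append(ch)
    return "".join(out)


# Example: one U+FFFD and one private-use character are replaced.
cleaned = replace_unknown("A\ufffdB\ue000")
```

Passing a different `replacement` argument corresponds to choosing another user-selected character.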

What is PDFlib TET?

The PDFlib Text Extraction Toolkit (TET) is the underlying engine of the TET Plugin. TET is a developer product for reliably extracting text from PDF documents. TET makes available the text contents of a PDF as Unicode strings, plus detailed glyph and font information as well as the position on the page. In addition, TET contains advanced content analysis algorithms for determining word boundaries, grouping text into columns and removing redundant text, such as shadows or artificially bolded text. Using the auxiliary pCOS interface you can retrieve arbitrary objects from the PDF, such as metadata, interactive elements, etc. With PDFlib TET you can:

  • Implement a search engine for processing PDF;
  • Extract text from PDFs, e.g. to store it in a database;
  • Convert text contents of PDFs to other formats, such as XML;
  • Process PDFs based on their contents.
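
As an illustration of the first use case, the sketch below builds a small inverted index over page texts and answers simple word queries. It does not call TET: the page texts are assumed to have been extracted already and are supplied as plain strings, and the names `build_index` and `search` are hypothetical.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercased word to the sorted page numbers containing it.

    `pages` is a dict of page number -> extracted text (assumed input).
    """
    index = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(page_number)
    return {word: sorted(numbers) for word, numbers in index.items()}


def search(index, query):
    """Return the pages that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


# Sample page texts standing in for text extracted from a PDF.
pages = {
    1: "PDFlib TET extracts text",
    2: "Extract text and images from PDF",
    3: "Fonts and glyphs",
}
index = build_index(pages)
hits = search(index, "text images")
```

A production search engine would add stemming, ranking, and persistent storage; the sketch only shows how page-level text feeds an index.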

TET is available as a programming library (component) for various development environments, and as a command-line tool for batch operations. Both offer similar features, but are suitable for different deployment tasks.
Fully functional evaluation versions of PDFlib TET for a variety of platforms are available on the PDFlib website.

Location
Franziska-Bilek-Weg 9, 80339 München, Deutschland



Related Products
PDFlib FontReporter


PDFlib FontReporter is a free plugin for analyzing fonts in PDF documents.

PDFlib Products for Mobile Devices and Embedded Platforms
PDFlib products for generating and processing PDF documents are available for mobile devices, such as smartphones and tablets, and for embedded platforms.

PDFlib pCOS – PDF Information Retrieval Tool


PDFlib pCOS provides a simple and elegant facility for retrieving any information from a PDF document which is not part of the page contents.

PDFlib PLOP DS - PDF Linearization, Optimization, Protection, Digital Signature


PLOP DS (Digital Signature) is a versatile tool for linearizing, optimizing, repairing, analyzing, encrypting, decrypting, and digitally signing PDF documents.

PDFlib PLOP - PDF Linearization, Optimization, Protection



PDFlib TET Plugin


The free TET Plugin provides easy access to the PDFlib Text Extraction Toolkit (TET).

PDFlib TET PDF IFilter - Enterprise PDF Search for Windows



PDFlib TET


PDFlib TET (Text and Image Extraction Toolkit) reliably extracts text, images and metadata from PDF documents. TET makes available the text contents of a PDF as Unicode strings, plus detailed colour, glyph and font information as well as the position on the page.

PDFlib Personalization Server (PPS)


The PDFlib Personalization Server (PPS) includes PDFlib+PDI plus additional functions for variable data processing using PDFlib Blocks.

PDFlib+PDI


PDFlib+PDI includes all PDFlib functions, plus the PDF Import Library (PDI).

PDFlib


PDFlib is the leading developer toolbox for generating and manipulating files in the Portable Document Format (PDF).