Datalogics
Status: Partner Member
Country: US
Sector: All industries
Contact:
Joined at: Feb 08
Website: http://www.datalogics.com/


Datalogics PDF Alchemist, a Developer Toolkit for Converting PDF to HTML, Now Available

Chicago, IL, July 15, 2015 – Datalogics, the premier source for Adobe PDF and eBook technologies, announced the release of Datalogics PDF Alchemist, a new SDK for converting PDF documents to HTML. PDF Alchemist recovers critical text flows that were lost during the initial conversion of the source document to PDF. These text flows are essential for repurposing document contents in a number of ways, including:

  • Optimizing the viewing experience on mobile phones and tablets by enabling intelligent text reflow;
  • Enabling improved semantic text search in content repositories and document management systems;
  • Enabling the reconstruction of editable source documents in situations where the original was lost;
  • And more.

PDF Alchemist employs advanced heuristics and sophisticated algorithms to scan across columns and pages of a PDF, linking related text and paragraphs together in the final output. Images are extracted as separate files and referenced inline in the HTML output, and formatting, including text styling, indentation and justification, is also preserved.
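To make the idea of linking text across columns concrete, here is a minimal Python sketch of one simple positional heuristic: group text blocks into columns by their left edge, then read each column top to bottom. It is an illustration only, not PDF Alchemist's actual algorithm; the `TextBlock` type, the `reading_order` function and the `column_gap` threshold are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """A hypothetical positioned text block taken from one PDF page."""
    x: float   # left edge of the block, in points
    y: float   # top edge, measured downward from the top of the page
    text: str


def reading_order(blocks, column_gap=50.0):
    """Order blocks column by column, top to bottom within each column.

    Blocks whose left edge lies within `column_gap` points of the leftmost
    block of a column are treated as part of that column (an assumption
    that tolerates small indents but would fail on unusual layouts).
    """
    columns = []  # each column is a list of blocks with similar left edges
    for block in sorted(blocks, key=lambda b: b.x):
        if columns and block.x - columns[-1][0].x < column_gap:
            columns[-1].append(block)
        else:
            columns.append([block])

    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.y))
    return ordered


if __name__ == "__main__":
    # Blocks listed in raw top-to-bottom order, which interleaves two columns.
    page = [
        TextBlock(72, 100, "Left column, first paragraph."),
        TextBlock(320, 100, "Right column, first paragraph."),
        TextBlock(90, 300, "Left column, indented second paragraph."),
        TextBlock(320, 300, "Right column, second paragraph."),
    ]
    for block in reading_order(page):
        print(block.text)
```

A production converter would also need to handle headings that span columns, text that continues onto the next page, and style changes, which is where the style "hints" mentioned in this release would come in.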

“Recovering the text structure of PDFs is a bit of a ‘holy grail’ of PDF processing,” notes Greg Manuel, Vice President of Marketing. “Since the PDF format was first and foremost a page description language, encoding semantic text flow information within the document was never a priority. Consequently, today there’s a wide variety of tools for creating PDFs, each encoding the internal structure in a different way. PDF Alchemist employs a number of techniques, including using positional and style ‘hints,’ to reconstruct the flow of text across columns and pages. The resulting output contains text as a human would read it; and recovering these text flows is becoming increasingly important as companies look to improve the utilization of the institutional knowledge locked within these documents.”

PDF Alchemist is available for Windows, Linux and MacOS platforms, and exposes a full-featured API to facilitate integration into other software applications. The package also includes a command-line executable and sample files to get started quickly and easily. It accepts one input file and creates one output .zip file containing an HTML file, a CSS stylesheet, a set of extracted images, and a set of extracted fonts.
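Because the output is a single .zip package, downstream code can inspect it with standard tooling. The Python sketch below groups the members of such an archive into HTML, CSS, image and font files by extension. The member names and the extension map are assumptions made for illustration; the actual layout and file names of a PDF Alchemist package are not described in this release.

```python
import io
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath

# Extension-to-category map; these groupings are assumptions for illustration.
CATEGORIES = {
    ".html": "html", ".htm": "html",
    ".css": "css",
    ".png": "image", ".jpg": "image", ".jpeg": "image", ".gif": "image",
    ".ttf": "font", ".otf": "font", ".woff": "font",
}


def summarize_package(zip_source):
    """Group the member names of a .zip archive by content category.

    `zip_source` may be a file path or a binary file-like object.
    """
    groups = defaultdict(list)
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            suffix = PurePosixPath(name).suffix.lower()
            groups[CATEGORIES.get(suffix, "other")].append(name)
    return dict(groups)


def build_demo_package():
    """Build an in-memory archive with hypothetical member names."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("document.html", "<html><body></body></html>")
        archive.writestr("styles.css", "p { margin: 0; }")
        archive.writestr("images/figure1.png", b"")
        archive.writestr("fonts/body.ttf", b"")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for category, names in summarize_package(build_demo_package()).items():
        print(category, names)
```

Passing the path of a real output package to `summarize_package` in place of the demo buffer gives a quick check that the HTML, stylesheet, images and fonts are all present before the package is published.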

Additional features of PDF Alchemist include:

  • Font style detection (bold, italic, underline) and conversion to HTML markup
  • Text justification detection (left, center, right) and conversion to HTML markup
  • Text flow indentation detection and conversion to HTML markup
  • Text flow margin detection and conversion to HTML markup
  • List detection and conversion to HTML markup
  • Image detection and extraction, with images referenced in place in the HTML output
  • Table detection and conversion to HTML table markup
  • External URL link detection and conversion to HTML markup
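As an illustration of what converting detected styles to HTML markup can look like, the following Python sketch maps a sequence of styled text runs to a paragraph with bold, italic, underline and alignment markup. The `Run` type and both functions are hypothetical; they show the general technique and do not represent PDF Alchemist's API or its exact output.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Run:
    """A hypothetical run of text with the style flags detected for it."""
    text: str
    bold: bool = False
    italic: bool = False
    underline: bool = False


def run_to_html(run):
    """Wrap escaped text in the tags that match the run's style flags."""
    markup = escape(run.text)
    if run.underline:
        markup = f"<u>{markup}</u>"
    if run.italic:
        markup = f"<em>{markup}</em>"
    if run.bold:
        markup = f"<strong>{markup}</strong>"
    return markup


def paragraph_to_html(runs, align="left"):
    """Join runs into one paragraph carrying the detected justification."""
    if align not in ("left", "center", "right"):
        raise ValueError(f"unsupported alignment: {align}")
    body = "".join(run_to_html(run) for run in runs)
    return f'<p style="text-align: {align}">{body}</p>'


if __name__ == "__main__":
    runs = [Run("Fast & "), Run("accurate", bold=True), Run(" conversion")]
    print(paragraph_to_html(runs, align="center"))
```

Escaping the text before wrapping it in tags keeps characters such as `&` and `<` from being interpreted as markup in the generated HTML.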

 

Datalogics PDF Alchemist is available to software developers, integrators and IT professionals exclusively from Datalogics. Additional information about PDF Alchemist, including details about the free evaluation program, is available at www.datalogics.com.

 

About Datalogics
Chicago-based Datalogics, Incorporated, an Adobe Portfolio Company, has dedicated over 45 years to delivering the highest quality software technologies and services that meet the most demanding customer needs. Datalogics is the premier source for Adobe eBook and PDF developer technologies, including the Adobe PDF Library, Datalogics PDF Java Toolkit, Datalogics PDF WebAPI, Datalogics PDF Alchemist, Adobe Normalizer, Adobe Content Server and Adobe Reader Mobile SDK. Datalogics is a member of the International Digital Publishing Forum (IDPF) and the Readium Foundation, and is on the board of the PDF Association.
For more information, visit www.datalogics.com.

Related Products
Adobe PDF Library


The Adobe PDF Library SDK is a low-level PDF library that contains a powerful set of native C/C++ APIs, with interfaces for .NET and Java. Systems integrators, independent software vendors (ISVs), enterprise IT developers, and others can integrate Adobe PDF functionality within custom applications in a client and/or server environment.

PDF Java Toolkit


Datalogics PDF Java Toolkit is a native Java library that provides high-level APIs for automating PDF workflows like processing PDF forms, verifying digital signatures, and extracting text. It also offers low-level APIs for working directly with the structure of the PDF for those times you need it.

Adobe Normalizer


Adobe Normalizer is an API that allows developers to quickly and easily convert Encapsulated PostScript (EPS) and PostScript (PS) files to Adobe’s Portable Document Format (PDF). The industry-standard Adobe Distiller and Distiller Server are themselves built upon PDF Converter SDK, and now this API is available separately to application developers.

Adobe PDF Print Engine


The Adobe PDF Print Engine is a common rendering engine technology, packaged as a software development kit (SDK). It can be the basis for a variety of products for previewing and printing Adobe Portable Document Format (PDF) documents at different stages of the professional print workflow.

PDF2IMG


Datalogics PDF2IMG is a command-line utility that converts PDF files to a variety of image formats including PNG, JPG, TIFF, BMP, and more. It is built upon the Adobe PDF Library and uses Adobe technology for unrivaled color management during the PDF conversion process.

PDF Alchemist


Datalogics PDF Alchemist is a new (C/C++) SDK for intelligently extracting text and images from PDFs and exporting to HTML 5 or EPUB. It employs sophisticated techniques to identify and reconstruct “text flows” within the PDF.