Datalogics PDF Alchemist, a Developer Toolkit for Converting PDF to HTML, Now Available
Chicago, IL, July 15, 2015 - Datalogics, the premier source for Adobe PDF and eBook technologies, today announced the release of Datalogics PDF Alchemist, a new SDK for converting PDF documents to HTML. PDF Alchemist recovers critical text flows that were lost when the source document was originally converted to PDF. These text flows are essential for repurposing document contents in a number of ways, including:
- Optimizing the viewing experience on mobile phones and tablets by enabling intelligent text reflow;
- Enabling improved semantic text search in content repositories and document management systems;
- Enabling the reconstruction of editable source documents in situations where the original was lost;
- And more.
PDF Alchemist employs advanced heuristics and sophisticated algorithms to scan across columns and pages of a PDF, linking related text and paragraphs together in the final output. Images are extracted as separate files and referenced inline in the HTML output. Formatting, including text styling, indentation and justification, is also preserved.
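To illustrate the general idea of ordering text by position, here is a minimal Python sketch. It is not PDF Alchemist's algorithm or API; the `TextBlock` type, the `reading_order` function and the `column_gap` threshold are all hypothetical names chosen for this example. It assumes each block of text already carries the coordinates of its top-left corner, groups blocks into columns by their left edge, and reads each column from top to bottom.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """Hypothetical text block; not a PDF Alchemist type."""
    x: float   # left edge of the block
    y: float   # top edge, measured downward from the top of the page
    text: str


def reading_order(blocks, column_gap=50.0):
    """Return block texts column by column, top to bottom within each column.

    column_gap is an assumed threshold: a block whose left edge lies more
    than this distance to the right of a column's first block starts a
    new column.
    """
    columns = []
    for block in sorted(blocks, key=lambda b: b.x):
        if columns and block.x - columns[-1][0].x <= column_gap:
            columns[-1].append(block)
        else:
            columns.append([block])

    ordered = []
    for column in columns:
        ordered.extend(sorted(column, key=lambda b: b.y))
    return [b.text for b in ordered]


if __name__ == "__main__":
    page = [
        TextBlock(300, 100, "Right column, first paragraph"),
        TextBlock(72, 100, "Left column, first paragraph"),
        TextBlock(80, 300, "Left column, indented paragraph"),
        TextBlock(300, 300, "Right column, second paragraph"),
    ]
    for line in reading_order(page):
        print(line)
```

A fixed threshold on the left edge is the simplest possible column test. Real documents with uneven columns, sidebars or text that continues across pages would need considerably more than this sketch provides.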
“Recovering the text structure of PDFs is a bit of a ‘holy grail’ of PDF processing,” notes Greg Manuel, Vice President of Marketing. “Since the PDF format was first and foremost a page description language, encoding semantic text flow information within the document was never a priority. Consequently, today there’s a wide variety of tools for creating PDFs, each encoding the internal structure in a different way. PDF Alchemist employs a number of techniques, including using positional and style hints, to reconstruct the flow of text across columns and pages. The resulting output contains text as a human would read it, and recovering these text flows is becoming increasingly important as companies look to improve the utilization of the institutional knowledge locked within these documents.”
PDF Alchemist is available for Windows, Linux and MacOS platforms, and exposes a full-featured API to facilitate integration into other software applications. The package also includes a command-line executable and sample files to get started quickly and easily. It accepts one input file and creates one output .zip file containing an HTML file, a CSS stylesheet, a set of extracted images, and a set of extracted fonts.
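As an illustration of how an integrator might work with the output package described above, the following Python sketch groups the members of a .zip archive into HTML, CSS, image and font files using only the standard library. It does not call PDF Alchemist itself. The file names, the folder layout and the extension lists are assumptions made for this example, since the actual contents of the archive are not specified here.

```python
import io
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed extension groups; the real archive may use other file types.
CATEGORIES = {
    "html": {".html", ".htm"},
    "css": {".css"},
    "images": {".png", ".jpg", ".jpeg", ".gif"},
    "fonts": {".ttf", ".otf", ".woff"},
}


def summarize_archive(zip_file):
    """Group the member names of a zip archive by file category."""
    groups = defaultdict(list)
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            suffix = PurePosixPath(name).suffix.lower()
            category = next(
                (c for c, exts in CATEGORIES.items() if suffix in exts),
                "other",
            )
            groups[category].append(name)
    return dict(groups)


def build_demo_archive():
    """Create an in-memory stand-in archive with hypothetical member names."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in ("report.html", "report.css",
                     "images/img1.png", "fonts/font1.ttf"):
            archive.writestr(name, b"")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for category, names in summarize_archive(build_demo_archive()).items():
        print(category, names)
```

In practice, `summarize_archive` would be given the path of the .zip file produced by the conversion rather than the in-memory demo archive.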
Additional features of PDF Alchemist include:
- Font style detection (bold, italic, underline) and conversion to HTML markup
- Text justification detection (left, center, right) and conversion to HTML markup
- Text flow indentation detection and conversion to HTML markup
- Text flow margin detection and conversion to HTML markup
- List detection and conversion to HTML markup
- Image detection and extraction, with in-place references in the HTML output
- Table detection and conversion to HTML table markup
- External URL link detection and conversion to HTML markup
Datalogics PDF Alchemist is available to software developers, integrators and IT professionals exclusively from Datalogics. Additional information about PDF Alchemist, including details about the free evaluation program, is available at www.datalogics.com.
Chicago-based Datalogics, Incorporated, an Adobe Portfolio Company, has dedicated over 45 years to delivering the highest quality software technologies and services that meet the most demanding customer needs. Datalogics is the premier source for Adobe eBook and PDF developer technologies, including the Adobe PDF Library, Datalogics PDF Java Toolkit, Datalogics PDF WebAPI, Datalogics PDF Alchemist, Adobe Normalizer, Adobe Content Server and Adobe Reader Mobile SDK. Datalogics is a member of the International Digital Publishing Forum (IDPF) and the Readium Foundation, and is on the board of the PDF Association.
For more information, visit www.datalogics.com.