The most notable feature of our PDF Alchemist scriptable server tool and SDK has always been its ability to intelligently extract text and images from PDFs. With our earliest product releases, we employed sophisticated techniques to automatically identify and reconstruct “text flows” in non-structured PDFs, but we initially offered few customization features for focusing the extraction on particular data or controlling the output. In essence, this made for a practical PDF-to-HTML conversion tool, but it did not let users specify which PDF data they wanted to extract or how they wanted that data presented outside of the PDF.
Over the years, we’ve listened to our customers about the many ways in which PDF Alchemist’s core technology has been deployed to solve problems. Sometimes users are only interested in tabular data detected in their PDFs. Sometimes they only need particular text regions, pages, or even particular images from their PDFs. Sometimes HTML serves their needs, but others find formats such as XML easier to integrate with their workflows. Sometimes they want to hide repeated sections of content on every page, display borders around borderless tables, ignore hidden text…the list goes on.
Two patterns that have emerged have been:
With that in mind, we decided to redirect our efforts to evolve PDF Alchemist into a flexible data extraction tool with an extended focus on configurable formatting. Let’s walk through a timeline of enhancements we’ve introduced to empower users with more adjustable control:
From the earliest release, PDF Alchemist has offered users the option to discard page content that it identifies as repeated headers and/or footers, including detected page numbers and running titles. This behavior is on by default and can be turned off by setting the -keepHeaderFooter option to true.
With PDF Alchemist 2.2, we introduced XML as a new output format, adding to the existing HTML and EPUB formats via the -outputFormat option. This has reportedly been useful for users aiming to capture the structure of PDF data without the need for preserving visual styling components.
With PDF Alchemist 2.3, we introduced optical character recognition (OCR) support for retrieving text from images within PDF files (-ocrMode), unlocking previously trapped data and information. The two new OCR modes allow users to either retrieve image text from OCR as alternate text for pictures in the output (tag mode) or remove images and replace them with their textual content equivalents (replace mode).
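As an illustration of how tag-mode output might be consumed downstream, here is a minimal Python sketch that gathers alternate text from HTML. It assumes the OCR text is emitted as a standard `alt` attribute on `<img>` elements, which the release notes do not spell out; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class AltTextCollector(HTMLParser):
    """Collect (src, alt) pairs for every <img> that has non-empty alt text."""

    def __init__(self):
        super().__init__()
        self.alt_texts = []

    def handle_starttag(self, tag, attrs):
        # Assumption: OCR text appears in the standard alt attribute.
        if tag == "img":
            attributes = dict(attrs)
            alt = attributes.get("alt")
            if alt:
                self.alt_texts.append((attributes.get("src", ""), alt))


def collect_alt_text(html):
    """Return the (src, alt) pairs found in an HTML string."""
    collector = AltTextCollector()
    collector.feed(html)
    collector.close()
    return collector.alt_texts


sample = '<p><img src="a.png" alt="Total: 42"><img src="b.png"></p>'
pairs = collect_alt_text(sample)
```

Images without alternate text are skipped, so the result lists only the pictures for which text was recovered.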
With PDF Alchemist 2.3.9, we introduced an option to emit all text as black text, making it easier to view white or light-colored text that was presented over dark-colored backgrounds in the PDF. When these background colors are not preserved, the text can become difficult to read in rendered HTML, so this option keeps all text clearly visible.
With PDF Alchemist 2.4, we introduced an option for the user to specify any number of page ranges for extraction from the PDF (-pageRanges). This gives users a direct way to further refine the content that they wish to pull out of a PDF. With this release, we also introduced an option to remove invisible text (-removeInvisibleText), which can be important for users who want to preserve the visible content of the PDF without displaying hidden content, such as white text displayed over a white background in the PDF input.
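For scripted use, a small helper can assemble these two options from structured input. The sketch below is only illustrative: the option names come from the release notes above, but the `1-3,7` range syntax, the `true` value for -removeInvisibleText, and the helper names are assumptions, so check the PDF Alchemist documentation for the actual formats.

```python
from typing import Iterable, List, Tuple


def format_page_ranges(ranges: Iterable[Tuple[int, int]]) -> str:
    """Join (first, last) page pairs into a string such as "1-3,7".

    The comma/hyphen syntax is an assumption, not the documented format.
    """
    parts = []
    for first, last in ranges:
        if first < 1 or last < first:
            raise ValueError(f"invalid page range: {first}-{last}")
        parts.append(str(first) if first == last else f"{first}-{last}")
    return ",".join(parts)


def extraction_args(
    ranges: Iterable[Tuple[int, int]],
    remove_invisible_text: bool = False,
) -> List[str]:
    """Build the argument list for the two options introduced in 2.4."""
    args = ["-pageRanges", format_page_ranges(ranges)]
    if remove_invisible_text:
        # Assumption: boolean options take an explicit "true" value.
        args += ["-removeInvisibleText", "true"]
    return args


args = extraction_args([(1, 3), (7, 7)], remove_invisible_text=True)
```

Validating the ranges before launching a conversion keeps malformed input from reaching the tool at all.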
With PDF Alchemist 2.5, we added a “tables only” option (-tablesOnly) to limit the extracted content by only including data detected within tables in PDFs. This was a highly requested feature and supports many use cases that involve transitioning tabular data from PDFs into spreadsheets, relational databases, or other data-centric applications.
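To make the spreadsheet use case concrete, here is a minimal Python sketch that turns HTML tables into CSV using only the standard library. It assumes the tables-only HTML output uses ordinary `<table>`, `<tr>`, `<th>`, and `<td>` elements without nesting; the class and function names are hypothetical and not part of PDF Alchemist.

```python
import csv
import io
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect the cell text of every <table> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def _close_cell(self):
        # Collapse runs of whitespace and store the finished cell.
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing closing tag
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()
            self.tables[-1].append(self._row)
            self._row = None


def tables_to_csv(html):
    """Return one CSV string per table found in the HTML."""
    collector = TableCollector()
    collector.feed(html)
    collector.close()
    results = []
    for table in collector.tables:
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(table)
        results.append(buffer.getvalue())
    return results


sample = (
    "<table><tr><th>Item</th><th>Qty</th></tr>"
    "<tr><td>Widget</td><td> 3 </td></tr></table>"
)
csv_tables = tables_to_csv(sample)
```

Each resulting string can be written to its own .csv file or loaded into a database. Merged cells and nested tables would need extra handling that this sketch omits.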
With PDF Alchemist 2.6, we added an option to specify the styling of table borders (-tableBorders), specifically allowing for HTML borders to be turned on or off for all tables or to match the detected styling of the tables in the PDF. This is useful to purposely show or hide the structure of detected tables via rendered HTML. We also added the ability to set the file location and name of external output files, including images (-imageDirectoryPath, -imageFilenamePrefix), the CSS stylesheet (-stylesheetPath), and fonts (-fontDirectoryPath, -fontFilenamePrefix). This was requested by a customer to avoid post-processing steps to achieve their desired file hierarchy.
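A script that drives the tool could build these output-location options from a single root folder. The following Python sketch is a hedged example: the option names are those listed above, while the images/, fonts/, and css/ layout, the assumption that -stylesheetPath takes a full file path, and the function name are illustrative only.

```python
from pathlib import PurePath, PurePosixPath
from typing import List, Optional


def output_location_args(
    output_root: PurePath,
    image_prefix: Optional[str] = None,
    font_prefix: Optional[str] = None,
) -> List[str]:
    """Build the output-location arguments for one conversion run.

    The images/, fonts/, and css/ layout is a hypothetical hierarchy.
    """
    args = [
        "-imageDirectoryPath", str(output_root / "images"),
        "-fontDirectoryPath", str(output_root / "fonts"),
        # Assumption: the stylesheet option takes a file path.
        "-stylesheetPath", str(output_root / "css" / "style.css"),
    ]
    if image_prefix is not None:
        args += ["-imageFilenamePrefix", image_prefix]
    if font_prefix is not None:
        args += ["-fontFilenamePrefix", font_prefix]
    return args


args = output_location_args(PurePosixPath("site/report"), image_prefix="fig_")
```

Deriving every path from one root keeps the generated file hierarchy consistent across runs, which is the post-processing step these options are meant to remove.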
PDF Alchemist has come a long way on its journey, and we are aiming to continue on this path of improvement, but we’re looking for your feedback to help drive useful new options for content extraction and output formatting. Whether PDF Alchemist meets your current needs, or whether you feel specific enhancements would offer valuable new solutions, we are always looking to hear and learn from you.
Original Post: https://blogs.datalogics.com/2019/10/08/data-extraction-enhancements-pdf-alchemist/
Eric Shore is the Vice President of Engineering at Datalogics, where he leads a talented team of software developers and PDF experts. Eric has an extensive background in engineering management and software development focused on native code toolkits and SDKs, document processing, production pipeline efficiency, digital asset management, and data analytics. He is also a father, artist, traveler, solution-finder, and …