Datalogics’ New Text Extraction Code Samples

The Datalogics team has created new extraction samples on GitHub for C++, .NET, and .NETCore to help you create precise workflows for your requirements.
About the author: Lindsey is a marketing professional with over 10 years of experience working with small and large companies alike. She is passionate about telling stories and connecting with others through digital … Read more
Lindsey Schroeder

October 20, 2022

Member News



Get Information You Need with Precise & Simple Commands

Extracting text has become an essential part of the PDF workflow for many organizations. The Datalogics team has created new extraction samples on GitHub for C++, .NET, and .NETCore to help you create precise workflows for your requirements.

Text Extraction Samples & Use Cases

Fillable Forms

The ExtractAcroformFieldData sample shows how to extract text from the AcroForm fields in a PDF document. This is useful for those who work with fillable PDF forms and need the field text as a .JSON file that can be opened in a text editor or web browser.

Patterns

ExtractTextByPatternMatch searches the text of a document for patterns, such as phone numbers, and writes the matches to a .TXT file. For example, phone numbers in the U.S. follow the format ###-###-####, but that format varies worldwide. The sample makes it easy to extract phone numbers by using the named pattern ‘PHONE_PATTERN’ in the code instead of writing out the full regular expression, ((1-)?(\()?\d{3}(\))?(\s)?(-)?\d{3}-\d{4}), each time.
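To make the named-pattern idea concrete, here is a minimal sketch that uses only the C++ standard library (`std::regex`), not the Adobe PDF Library API. It assumes the text has already been extracted into a string. `PHONE_PATTERN` mirrors the name mentioned above; `findMatches` is a hypothetical helper introduced purely for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// U.S.-style phone numbers, e.g. 312-555-1234 or 1-(312) 555-1234.
// The name mirrors the article; defining it as a std::string here is an assumption.
const std::string PHONE_PATTERN =
    R"re(((1-)?(\()?\d{3}(\))?(\s)?(-)?\d{3}-\d{4}))re";

// Hypothetical helper: returns every non-overlapping match of `pattern` in `text`.
std::vector<std::string> findMatches(const std::string& text,
                                     const std::string& pattern) {
    const std::regex re(pattern);  // ECMAScript grammar by default
    std::vector<std::string> matches;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it) {
        matches.push_back(it->str());
    }
    return matches;
}
```

Calling `findMatches(pageText, PHONE_PATTERN)` keeps the call site readable, and supporting another country's format would only require defining another named pattern.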

The ExtractCJKTextByPatternMatch sample shows how to search for Unicode text, such as Chinese, Japanese, and Korean (CJK) characters. With more than 1.5 billion people speaking those languages (and growing), organizations must be able to extract a very large range of characters correctly. The sample on GitHub uses a Korean character in its code.

Regions

ExtractTextByRegion extracts text from a specific region of a page in a PDF document and saves it to a .TXT file. For example, a company with thousands of invoices that share the same layout can pull the invoice number from the same region of each PDF, and the IRS could pull Social Security numbers from the relevant section of its 1040 forms.
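Conceptually, region-based extraction comes down to keeping only the text whose position falls inside a rectangle on the page. The sketch below illustrates that idea with the C++ standard library only. `Rect`, `Word`, `contains`, and `textInRegion` are hypothetical names, and the code assumes a PDF library has already reported each word together with its bounding box; it does not show the Datalogics sample itself.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Axis-aligned rectangle in page coordinates (assumed: points, origin at bottom-left).
struct Rect {
    double left, bottom, right, top;
};

// Hypothetical stand-in for a word plus the bounding box a PDF library would report.
struct Word {
    std::string text;
    Rect box;
};

// True when `inner` lies entirely inside `outer`.
bool contains(const Rect& outer, const Rect& inner) {
    return inner.left >= outer.left && inner.right <= outer.right &&
           inner.bottom >= outer.bottom && inner.top <= outer.top;
}

// Joins, in input order, the words whose boxes fall fully inside `region`.
std::string textInRegion(const std::vector<Word>& words, const Rect& region) {
    assert(region.left <= region.right && region.bottom <= region.top);
    std::string result;
    for (const Word& w : words) {
        if (!contains(region, w.box)) continue;
        if (!result.empty()) result += ' ';
        result += w.text;
    }
    return result;
}
```

Requiring full containment is one possible design choice; a workflow that tolerates slightly shifted scans might prefer an overlap test instead.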

ExtractTextFromMultiRegions processes the PDF files in a folder, extracts text from multiple specific regions of their pages, and saves the text to a .CSV file. For example, this sample can create a single file with all the invoice numbers, dates, order numbers, customer IDs, and totals from the invoices in the folder, so you have all the data you need in one view.
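When several extracted values are written to one CSV row, values that contain commas (such as a total of 1,250.00) need quoting so that columns stay aligned. The following standard-library sketch shows RFC 4180-style quoting. `csvEscape` and `csvRow` are hypothetical helpers, and the sketch makes no claim about how the Datalogics sample formats its output.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Quotes a field when it contains a comma, double quote, or line break,
// doubling any embedded double quotes (RFC 4180 style).
std::string csvEscape(const std::string& field) {
    if (field.find_first_of(",\"\r\n") == std::string::npos) return field;
    std::string out = "\"";
    for (char c : field) {
        if (c == '"') out += '"';  // an embedded quote becomes two quotes
        out += c;
    }
    out += '"';
    return out;
}

// Builds one CSV row (without a trailing newline) from the extracted values.
std::string csvRow(const std::vector<std::string>& fields) {
    std::string row;
    for (std::size_t i = 0; i < fields.size(); ++i) {
        if (i > 0) row += ',';
        row += csvEscape(fields[i]);
    }
    return row;
}
```

One row per invoice, with one column per region, gives the single consolidated view described above.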

Consolidating Annotations

PDFs can contain thousands of annotations, and the ExtractTextFromAnnotations sample shows how to pull that information out and save it to a separate .JSON file. For example, contract negotiations may include comments and questions that have been accepted or rejected, and this sample can extract that data.

Style Preservation

The ExtractTextPreservingStyleAndPositionInfo sample extracts all text from the PDF, along with information about that text such as its style, color, font size, and position, and saves it to a .JSON file for style preservation.

Check out the Datalogics GitHub Repository for more information on Adobe PDF Library and samples for the creation, modification and management of PDF documents.


Datalogics, Inc. provides a complete SDK for PDF creation, manipulation and management for companies around the globe. Built on Adobe source code, our flagship product Adobe® PDF Library offers a choice of programming platforms and languages along with unsurpassed customer service, proven by our 94% customer retention rate. Datalogics offers…
