
OCR with the Adobe PDF Library .NET and Java Interface – Datalogics, Inc.

About the author: Lindsey is a marketing professional with over 10 years of experience working with small and large companies alike. She is passionate about telling stories and connecting with others through digital …

Here at Datalogics, we are continuously innovating and providing our customers with more value to better assist them with their PDF document needs. Over the past few months, we’ve added Optical Character Recognition (OCR) support to many of our products. We are excited to announce that OCR support is now available within the Java and .NET interfaces of the Adobe PDF Library. We’ve combined the power of the Adobe PDF Library with Tesseract (a widely used open source OCR engine) to allow users to access and process the data and text within images.

One of the most common use cases for OCR is in preparing documents for searching or extracting the data into another process. By using our OCR APIs, the text data within these images is accessible without modifying the look of the input document. Let's walk through some of the key components of the API using .NET. You can view the full code by visiting our public sample GitHub repository.

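// Configure the OCR engine before processing any images.
OCRParams ocrParams = new OCRParams();
// Let the engine decide how to segment the page for text detection.
ocrParams.PageSegmentationMode = PageSegmentationMode.Automatic;
// Favor accuracy over speed.
ocrParams.Performance = Performance.BestAccuracy;
OCREngine ocrEngine = new OCREngine(ocrParams);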

Setting the PageSegmentationMode to Automatic lets the OCR engine choose how to segment the page for text detection. The Performance parameter offers several levels that trade speed against accuracy. In this case, we select BestAccuracy, the mode that favors accuracy over speed. This is a common setting when you are unsure of the quality of your input document. OCRParams defaults to English; use the Languages parameter to select other languages. Multiple languages can be selected at the same time.

// Select Japanese instead of the default English.
List<LanguageSetting> langs = new List<LanguageSetting>();
LanguageSetting set = new LanguageSetting(Language.Japanese, false);
langs.Add(set);
ocrParams.Languages = langs;

Once the OCREngine is configured, we can loop through the content of the document, identify the images, and apply the OCR processing:

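Element e = content.GetElement(index);
if (e is Datalogics.PDFL.Image)
{
  // Build a form holding the original image with the recognized text behind it.
  Form form = ocrEngine.PlaceTextUnder((Image)e, doc);
  // Swap the image for the form at the same position in the content.
  content.RemoveElement(index);
  content.AddElement(form, index - 1);
}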

The image object is replaced by a form, which contains the original image and the identified text laid out behind it. Once this step is complete, the resulting document can be saved, and it will contain both the original content and the identified text.
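The remove-then-add step in the snippet above can be easier to follow with a small model. The sketch below is plain Java with no Adobe PDF Library dependency: ContentModel, its methods, and the string "elements" are hypothetical stand-ins, not the library's API. It assumes that adding an element at a given index inserts it after that index, which is how we read the `index - 1` in the snippet. Under that assumption, removing the image and adding the form at `index - 1` puts the form back in the image's original slot.

```java
import java.util.ArrayList;
import java.util.List;

class ReplaceInPlaceSketch {
    // Hypothetical stand-in for a page's content list; NOT the Adobe PDF Library API.
    static final class ContentModel {
        private final List<String> elements = new ArrayList<>();

        ContentModel(List<String> initial) { elements.addAll(initial); }

        int numElements() { return elements.size(); }
        String getElement(int index) { return elements.get(index); }
        void removeElement(int index) { elements.remove(index); }

        // Assumed semantics: insert AFTER the given index (-1 means before the first element).
        void addElement(String element, int afterIndex) { elements.add(afterIndex + 1, element); }

        List<String> snapshot() { return new ArrayList<>(elements); }
    }

    // Wrap every "image:*" element in a "form(...)" element, keeping the element order.
    static List<String> replaceImages(List<String> input) {
        ContentModel content = new ContentModel(input);
        for (int index = 0; index < content.numElements(); index++) {
            String e = content.getElement(index);
            if (e.startsWith("image:")) {
                String form = "form(" + e + ")";      // stands in for the OCR result
                content.removeElement(index);         // drop the image...
                content.addElement(form, index - 1);  // ...and put the form in its slot
            }
        }
        return content.snapshot();
    }

    public static void main(String[] args) {
        System.out.println(replaceImages(
            List.of("text:title", "image:scan1", "path:rule", "image:scan2")));
    }
}
```

Because the form lands in the slot the image occupied, the loop index does not need adjusting after a replacement, and the surrounding elements keep their order.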

The .NET and Java interfaces currently support Dutch, English, French, German, Italian, Portuguese and Spanish, with Chinese, Japanese and Korean to be added shortly. Try it out yourself by requesting a free evaluation, and feel free to take a look at our full sample code for Java and .NET (which includes how to start this process from an image rather than a PDF) under the OpticalCharacterRecognition section inside Sample_Source.


