
Presented at OctoberPDFest online (October 2020)

Support of complex scripts in PDF

Lego game of text composition and text extraction algorithms


Description

Complex writing systems have always required special attention. Arabic, Devanagari, and Thai are examples of such complex scripts, but there are many more. The PDF graphics model poses two key challenges when processing text in complex scripts: first, how to shape the correct visual representation out of glyphs and subglyphs, which often have to be shifted both horizontally and vertically; and second, how to extract the original Unicode representation of that text again.

In this talk we will study common mistakes in implementing complex text composition, including those that arise when the input fonts are imperfect. We will also explore the lifebuoys that the PDF specification provides for converting a logical text sequence into a visual one and vice versa. Finally, we will look at a couple of real-world PDFs where text extraction becomes tricky, including cases where best practices are violated, and try to come up with algorithms to overcome those difficulties.
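
As a small illustration of the extraction challenge described above, the following Python sketch uses the Devanagari syllable कि. It is not taken from the talk: the function name `visual_to_logical` and its single-character reordering rule are assumptions made purely for illustration. In Unicode (logical) order the consonant KA (U+0915) comes before the vowel sign I (U+093F), yet the vowel sign is drawn to the left of the consonant. An extractor that maps glyphs back to Unicode in painted order would therefore tend to return the two code points swapped.

```python
import unicodedata

# DEVANAGARI VOWEL SIGN I: stored after its consonant, drawn to its left.
PRE_BASE_MATRA = "\u093F"


def visual_to_logical(glyph_text: str) -> str:
    """Toy reordering (hypothetical helper): move a pre-base vowel sign
    behind the single letter that follows it in painted order.

    Assumption: every cluster is one consonant plus the vowel sign.
    Conjuncts and other pre-base forms are deliberately not handled.
    """
    chars = list(glyph_text)
    i = 0
    while i < len(chars) - 1:
        next_is_letter = unicodedata.category(chars[i + 1]) == "Lo"
        if chars[i] == PRE_BASE_MATRA and next_is_letter:
            # Swap so the consonant comes first, as in logical order.
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)


logical = "\u0915\u093F"  # KA + VOWEL SIGN I, as stored in Unicode text
painted = "\u093F\u0915"  # the same syllable in left-to-right glyph order

assert painted != logical
assert visual_to_logical(painted) == logical
```

Real text needs far more than this single swap, since a pre-base vowel sign can precede a whole consonant cluster and other scripts reorder differently. The sketch only shows why painted glyph order alone is not enough to recover the original text.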

