PRESENTATION

Support of complex scripts in PDF

Lego game of text composition and text extraction algorithms

Presenter: Alexey Subach, iText Group NV
Language: English

Description

Complex writing systems have always required special attention. Arabic, Devanagari, and Thai are examples of such complex scripts, but there are many more. In the PDF graphics model, processing text in complex scripts poses two key challenges: how to shape the correct visual representation out of glyphs and subglyphs, which often must be shifted both horizontally and vertically, and how to then recover the original Unicode representation of that text.

In this talk we will study the mistakes one often makes when implementing complex text composition, sometimes with imperfect fonts as input, and explore the lifebuoys that the PDF specification provides for converting a logical text sequence into a visual one and vice versa. We will look at a couple of real-world PDFs where text extraction becomes tricky, including cases where best practices are violated, and try to come up with algorithms to overcome those difficulties.

This presentation is part of PDF Days Europe 2022.