Lego game of text composition and text extraction algorithms
Complex writing systems have always required special attention. Arabic, Devanagari, and Thai are well-known examples of such complex scripts, but there are many more. In the PDF graphics model, processing text in complex scripts poses two key challenges: how to build the correct visual representation out of glyphs and subglyphs, often shifting them both horizontally and vertically, and how to then recover the original Unicode representation of that text.
In this talk we will study common mistakes in implementing complex text composition, including those that arise when the input fonts are imperfect, and explore the lifebuoys that the PDF specification provides for converting a logical text sequence into a visual one and vice versa. We will look at a couple of real-world PDFs where text extraction becomes tricky, including cases where best practices are violated, and try to come up with algorithms to overcome those difficulties.