Imagine a scenario where you have copy-pasted some text from your PDF to another location and it comes out totally different from what you had copied (like in the screenshot below). Now what? You would want to explore the text inside the PDF, wouldn’t you?
The body of a PDF usually contains all object information, such as fonts, images, text, bookmarks, form fields and so on. Ideally, any fonts used in a layout are also embedded in the PDF file in their original format, so that the file can be viewed and printed exactly as the designer created it. That means a PDF with text and an embedded font carries, within its own structure, all the information needed to work with that text.
In order to see what's inside a PDF, we need a low-level PDF analysis tool. Every PDF developer has such a tool. We at callas decided to put it into our products as the “Explore PDF” viewer. It has a 'Resource View' that summarizes all available information about the embedded font and its glyphs (glyphs are the character outlines that are specific to the font and provided through the font file). For a simple worksheet with embedded fonts, it looks like the screenshot below, which shows several indicators, each describing a property of the embedded font we are looking at:
If such an indicator is red (like e, 1, 2, s, L or W in the screenshot), it means that the corresponding property of the indicator applies to the font.
Let us see what we can learn about this particular font.
The indicator lookup tells us that 'e' stands for glyphs without a contour and 's' for such empty glyphs that have a width. So, in fact, one of the glyphs in the font is a space. The capital 'W' means that the glyph width is used for positioning: a glyph is placed not by explicit coordinates but by the width of the previous glyph. 'L' stands for ligature, which means that a single glyph (outline) represents two or more characters. This does not contradict the 'space' finding, because the screenshot shows summary information for all glyphs in the font, and different indicators can apply to different glyphs. Further down, the 'Glyph properties' section gives the same information for each individual glyph.
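To make the 'W' indicator more concrete, here is a minimal sketch of width-based positioning. It assumes the common PDF convention that glyph widths are given in thousandths of the font size, and it ignores character spacing, word spacing and horizontal scaling. The width values and the function name `glyph_origins` are made up for illustration.

```python
# Toy advance widths in thousandths of the font size.
# These numbers are invented; a real font stores its own widths.
ADVANCE_WIDTHS = {"P": 600, "D": 700, "F": 550, " ": 250}


def glyph_origins(text, font_size, start_x=0.0):
    """Return the x origin of each glyph when positions come from widths."""
    origins = []
    x = start_x
    for ch in text:
        origins.append(x)
        # The next glyph starts where this one ends: advance by its width.
        x += ADVANCE_WIDTHS[ch] * font_size / 1000
    return origins


print(glyph_origins("PDF", font_size=10))
```

Only the first glyph needs an explicit starting coordinate; every later position follows from the widths, which is why a wrong width shifts all the glyphs after it.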
'1' and '2' indicate a potential problem with the Unicode representation of at least one glyph in the font. What does that mean? When you look at text in the PDF, there are actually two different lookups (encodings) taking place: one is for the glyph (outline) and is needed to display the character on the screen; the other is for the meaning (semantics) of the character and is needed to search for text or copy it out of the PDF file. In PDF, we usually say that the text needs to have a Unicode representation, since the Unicode standard defines the semantics of all characters, e.g. it associates the outline 'A' with 'Latin Capital Letter A', which has the Unicode code point U+0041. By the way, there is a Check in pdfToolbox with the name 'Text cannot be mapped to Unicode' that allows you to find out whether there is a Unicode representation for all text in a PDF.
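The two lookups can be pictured as two independent tables keyed by the same character code. The following sketch is a simplified model, not how any particular product stores this information. The tables `CODE_TO_GLYPH` and `CODE_TO_UNICODE` and the function `describe` are hypothetical; only `unicodedata` is a real Python standard-library module.

```python
import unicodedata

# Hypothetical font with two character codes. Each code has two lookups.
CODE_TO_GLYPH = {0x01: "A", 0x02: "B"}              # glyph names, used for display
CODE_TO_UNICODE = {0x01: "\u0041", 0x02: "\u0042"}  # semantics, used for search and copy


def describe(code):
    """Return the glyph name, the code point and the Unicode name for a code."""
    glyph_name = CODE_TO_GLYPH[code]
    char = CODE_TO_UNICODE[code]
    return glyph_name, f"U+{ord(char):04X}", unicodedata.name(char)


print(describe(0x01))  # ('A', 'U+0041', 'LATIN CAPITAL LETTER A')
```

Because the tables are separate, the display lookup can succeed while the Unicode lookup is missing or wrong, and the page will still look correct on screen.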
And now we are back at our initial question: why was the text in the PDF displayed accurately, yet came out as garbage when copied? The reason is that the glyph lookup worked just fine, but the lookup of the Unicode representation did not. And '1' and '2' indicate that there is a mismatch between two ways to resolve to a Unicode code point: via a ToUnicode table in the PDF and via the information present in the font encoding itself. Since the ToUnicode table has priority, this is not necessarily a problem, but it is an indication that there could be one.
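A mismatch check of this kind can be sketched as a comparison of two mappings. The data below is invented, and `find_mismatches` and `extract_text` are hypothetical helpers rather than part of any callas product. The sketch assumes that both sources have already been reduced to simple code-to-character dictionaries.

```python
# Hypothetical mappings for one embedded font, keyed by character code.
TO_UNICODE = {1: "H", 2: "#", 3: "l", 4: "o"}     # from the ToUnicode table
FONT_ENCODING = {1: "H", 2: "e", 3: "l", 4: "o"}  # derived from the font encoding


def find_mismatches(to_unicode, font_encoding):
    """Return the codes for which the two Unicode resolutions disagree."""
    shared = to_unicode.keys() & font_encoding.keys()
    return sorted(c for c in shared if to_unicode[c] != font_encoding[c])


def extract_text(codes, to_unicode, font_encoding):
    """Resolve codes to text, letting ToUnicode win when both sources exist."""
    return "".join(
        to_unicode.get(c, font_encoding.get(c, "\ufffd")) for c in codes
    )


print(find_mismatches(TO_UNICODE, FONT_ENCODING))              # [2]
print(extract_text([1, 2, 3, 3, 4], TO_UNICODE, FONT_ENCODING))  # H#llo
```

In this toy case the glyphs would still render as "Hello", but the copied text is "H#llo" because the ToUnicode entry for code 2 is wrong and takes priority.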
Now that we have successfully explored the font information in the PDF structure, it's time to explore the font itself. Font files can be very large and highly complex, with many glyphs, support for dozens of non-Roman scripts and a rich set of features, or they can be very small, containing just a few icons for a website. Another explorer, the ‘Font Explorer’, lets you view the internal structure of fonts embedded in a PDF. It is similar to 'Explore PDF' but goes into greater detail than the preflight results, adding a graphical view that shows the outline and coordinates of each glyph. Below, you can see a ligature (f and i) which has the Unicode code point U+FB01 'Latin Small Ligature fi':
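The Unicode side of such a ligature can be checked with Python's standard `unicodedata` module. The short sketch below looks up the name of U+FB01 and uses compatibility normalization to expand it into the two letters it stands for. The helper name `expand_ligatures` is made up for this example, and note that NFKC normalization also rewrites other compatibility characters, not only ligatures.

```python
import unicodedata

ligature = "\ufb01"

# The code point and its official Unicode name.
print(f"U+{ord(ligature):04X}", unicodedata.name(ligature))


def expand_ligatures(text):
    """Replace compatibility characters such as U+FB01 with plain letters."""
    return unicodedata.normalize("NFKC", text)


print(expand_ligatures("\ufb01le"))  # file
```

This is why text copied from a PDF can contain a single ligature character where you expected two letters: the copied code point is whatever the Unicode lookup returned for that glyph.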
Akash Choudhary, Product Manager at callas software GmbH considers himself as a lifelong student of PDF products and technologies. With a bachelor’s degree in Computer Science and Engineering and a Master of Business Administration (International Management), Akash has gathered experience in small and big firms in software performance testing, marketing, business development and content manager. Working as a point of …