Documents must be archived. Electronic archiving has become a universally recognized and practical method of digitally maintaining information. The formats that are used vary from simple raster formats (BMP, PNG and so on) to formats that have complex structures (MO:DCA, AFP) and also include PDF and PDF/A. As the complexity of the individual formats increases, the requirements for the structure and completeness of the documents must be adjusted accordingly and realized consistently. The aim is to ensure that you can reproduce these documents even after a long time. It only becomes clear at the time of the reproduction whether the criteria for a successful reproduction were also consistently implemented and realized.
What are the criteria in this case? With a view to reproducing documents at a later stage, we can divide the criteria into three categories: the format of the document must be identifiable, all of the required data must be available and complete, and all of the data must conform to the relevant standards.
These seem to be very basic criteria and, in theory, they should be met without any problems. Unfortunately, here too there is a significant gap between theory and reality.
The aim of this track is to identify in detail the requirements for a successful reproduction, to show why PDF/A is suitable for archiving and to highlight how these requirements can be met in PDF and PDF/A.
You want to reproduce a document after 10 years of digital archiving. To do this, you require the relevant tool that can interpret and reproduce the document.
How do you choose a suitable tool (for example, a viewer or a converter)? The data format of the document is extremely important here: the format, and therefore the method of interpretation, must be identifiable from the data itself. Depending on the format, a problem may occur if insufficient information is available. An example of this is simple text documents, which do not usually contain any information about the type of data. If the document was not restricted to ASCII when it was created, it is also difficult to identify the correct encoding. As a result, special characters, for example, may not be converted correctly. In contrast, it is very simple to identify PDF documents: the header of a document (specifically %PDF-1.x) specifies not only that this is a PDF document, but also the version of the PDF specification on which the document is based.
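As a rough illustration of how simple this identification is, the following Python sketch reads the version from the header. The function name pdf_version and the 1024-byte search window are assumptions made for this example (some lenient viewers accept a header that does not start at the very first byte); it is not a complete format detector.

```python
import re

# Matches "%PDF-<major>.<minor>", the marker that opens a PDF file.
PDF_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_version(data: bytes):
    """Return (major, minor) from the PDF header, or None if no header is found."""
    # Assumption: only the first 1024 bytes are searched, as lenient viewers do.
    match = PDF_HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

For a file that begins with %PDF-1.4 the function returns (1, 4); for a plain text file it returns None, which mirrors the identification problem described above.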
Two basic elements identify a PDF/A document. Firstly, the document contains the PDF header that is mentioned above. Secondly, the XMP metadata contains further information about PDF/A conformance. This information specifies the PDF/A version on which the document is based and the conformance level that was reached.
Once the data format of a document has been identified, the document can be reproduced using a suitable tool. It is then important that all of the required data exists and is complete. What does this mean exactly, and which information does it concern? For PDF, it is not easy to give a precise answer about the required data. It is clear that the body of the document must follow the rules that are defined in the relevant version of the standard.
However, it is not immediately obvious that additional standards (or at least specified formats) are used within a document. For example, TrueType, Type 1 and (as of PDF 1.6) OpenType fonts can be embedded in PDF documents. Image data formats such as JPEG, JBIG2 and JPEG 2000 (as of PDF 1.5) may also be used.
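The PDF/A identification is stored in the XMP packet in the properties pdfaid:part and pdfaid:conformance. The sketch below extracts both values with a regular expression; the function name pdfa_identification is an assumption for this example, and a production tool would use a real XML/RDF parser and would first have to locate and decode the metadata stream in the PDF file.

```python
import re


def pdfa_identification(xmp: str):
    """Return (part, conformance) declared in an XMP packet, or None if absent."""

    def field(name):
        # Accepts the element form <pdfaid:part>1</pdfaid:part>
        # as well as the attribute form pdfaid:part="1".
        pattern = rf'pdfaid:{name}(?:>\s*|=["\'])([^<"\']+)'
        match = re.search(pattern, xmp)
        return match.group(1).strip() if match else None

    part = field("part")
    if part is None:
        return None
    return part, field("conformance")
```

Note that this value is only what the document declares about itself. Whether the document really conforms can be established only by validating it.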
An important topic, and one of the reasons why PDF is increasingly used, is that textual content can be stored without destructively converting it into another format. Such a conversion is destructive because a raster-based format (for example, TIFF) is often used for archiving: the text can still be reproduced visually but cannot be used for any other purpose. Thus, texts and fonts in a PDF document are of particular interest and are therefore the focus of this examination.
According to the PDF specification (as opposed to PDF/A), embedding fonts in a document is optional. This obviously contradicts the requirement that all of the information must be available. If the PDF specification allows documents without embedded fonts, why is it important that fonts are embedded for the reproduction of archived documents? To answer this, we must consider in more detail what happens when textual content is displayed.
To describe the details as clearly as possible, the concepts that are used below must first be clarified. Three basic components are important for texts: character codes, an encoding and glyphs. Character codes are data that usually consist of one or more bytes. Each code is intended to represent a specific character. A glyph is the graphic representation of a character that is provided by a font. Character codes and glyphs are not directly linked to each other. An encoding establishes the link: it specifies how to map the character codes to the glyphs. A simple example of this is ASCII, which defines that a byte that has the value 0x41 must be interpreted as the character A.
How is this procedure specified for PDF? To reproduce textual content, the system reads the character codes that are stored in the document and maps these to the glyphs in a font. This procedure is described below using TrueType as the example. The procedure for other font formats varies somewhat in order to accommodate the idiosyncrasies of those formats. However, the basic principle is similar.
Figure 1: Locating glyphs, using TrueType as the example
Figure 1 displays the basic procedure for mapping character codes to TrueType glyphs. A number of steps are required to reproduce the characters. Firstly, the character codes must be extracted from the PDF document. To map the character codes to glyphs, an encoding is required. The encoding is defined in the document using a base encoding and any adjustments to it. You can now map the character codes to character names, for example, /A, /at or /atilde. Based on these character names, you can use two methods to select a glyph: by way of a Unicode value or by way of the PostScript name of the glyph.
When creating documents, we make a number of assumptions regarding the font that is to be used. Examples of these assumptions are the expectation that a specific Unicode value leads to a specific character, or that a specific PostScript name can be used to select a glyph. If the font is available in the document, these assumptions do not cause any problems, because the font to be used for the reproduction is already determined when the document is generated.
If the font is not embedded, the processing application must find this specific font on the system and, if it does not exist, must find a suitable replacement. The application uses the font data that is stored in the document to narrow down the search for a replacement. Usually, the application searches through the fonts that exist on the system and selects a suitable replacement font. If, in the worst-case scenario, the system has only one font, the procedure is not very likely to succeed. The assumptions mentioned above may or may not hold for the font that is used as a substitute. The results depend on the capabilities of the processing application and, in certain circumstances, on the operating system that is used. This frequently results in texts that contain random characters or incorrect special characters (for example, checkboxes).
The defined aim of PDF/A is to prevent these problems. This is achieved by, for example, requiring that fonts are always embedded. This restriction ensures that all of the required information is available. In addition to requiring embedding, PDF/A also clearly regulates further details regarding fonts and texts. Some attributes that are marked as optional in the PDF specification are required for a valid PDF/A document. These clear definitions increase the chances that the document will be successfully reproduced, even after a long period of archiving.
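In a PDF file, an embedded font program is referenced from the font descriptor under one of the keys /FontFile (Type 1), /FontFile2 (TrueType) or /FontFile3 (other formats). The following sketch checks for these keys. It works on plain Python dictionaries that stand in for already parsed PDF objects, so the function names and the input layout are assumptions for this example. It also ignores composite fonts, whose descriptor sits on the descendant font, and Type 3 fonts, which have no font program at all.

```python
# Keys under which a font descriptor references an embedded font program.
FONT_PROGRAM_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def is_embedded(font: dict) -> bool:
    """Return True if the font dictionary references an embedded font program."""
    descriptor = font.get("/FontDescriptor", {})
    return any(key in descriptor for key in FONT_PROGRAM_KEYS)


def fonts_without_program(fonts: dict) -> list:
    """Return the resource names of all fonts that are not embedded, sorted."""
    return sorted(name for name, font in fonts.items() if not is_embedded(font))
```

A document for which fonts_without_program returns a non-empty list depends on the fonts of the reproducing system, which is exactly the situation that PDF/A rules out.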
In accordance with PDF/A, all of the required data (fonts, colour profiles and so on) is now embedded in the document and is available for the reproduction. In spite of these preparations, the images in some documents and the texts in others are displayed incorrectly or not displayed at all. What is the problem here? In addition to completeness, there is another requirement that is extremely important for a reproduction: all of the data must conform to the standard. This means that the additional resources (such as fonts, images and so on) must also be correct.
Problems with embedded resources are rarely identified at first glance, but they may lead to problems in the long term. If, for example, the data for a font is incorrect, the processing application must deal with this, and there is no specification of how such data must be processed. Heuristics are often used in an attempt to identify what the data was intended to mean. These heuristics are not documented; they are an implementation detail of the processing application. Most of today's products already behave differently from one another in this case, and how applications will behave in the future is uncertain.
If the data cannot be used, despite any corrections, it is common to use a font that already exists on the system as a substitute. This is the same situation as when the font is not embedded in the document, with all of the associated disadvantages. The same problems occur not only for fonts but also for all types of additional data. JPEG decompressors contain, for example, a number of heuristics to determine the colour space of the image data if this has not been explicitly defined.
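To give an impression of what such a heuristic looks like, the following sketch guesses a colour space from the number of colour components and, if present, the transform flag of an Adobe APP14 marker (0 = no transform, 1 = YCbCr, 2 = YCCK). It is a simplified illustration written for this text, not the logic of any particular decoder, and the function name is an assumption.

```python
def guess_jpeg_colour_space(num_components, adobe_transform=None):
    """Guess the colour space of JPEG data that does not declare it explicitly.

    adobe_transform is the transform flag of an Adobe APP14 marker,
    or None if the file contains no such marker.
    """
    if num_components == 1:
        return "Gray"
    if num_components == 3:
        # Without a hint, three components are assumed to be YCbCr.
        return "RGB" if adobe_transform == 0 else "YCbCr"
    if num_components == 4:
        # Without a hint, four components are assumed to be CMYK.
        return "YCCK" if adobe_transform == 2 else "CMYK"
    raise ValueError(f"unsupported number of components: {num_components}")
```

The defaults are the critical part: a three-component image that actually stores RGB samples but carries no marker is treated as YCbCr here and is displayed with incorrect colours. A decoder with different defaults would produce a different result for the same file.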
PDF is a very comprehensive format. It offers a large number of options for processing the data that it contains and for providing additional information (for example, metadata). To reproduce a document successfully so that it is true to the original, a number of requirements must be observed and adhered to. This is made easier by using PDF/A, because the standard clearly identifies potential problems and, where required, makes optional information compulsory. As a result, PDF/A documents are, to a large extent, self-contained. Questionable or incomplete aspects of the PDF reference were clarified, excluded from PDF/A or restricted.
PDF/A is a future-proof format and is being used more and more. It offers various advantages when compared to the classic formats, such as TIFF and MO:DCA. The completeness of PDF/A and the fact that PDF/A is an international standard are good reasons to use PDF/A for digital long-term archiving.
In this context, we must also mention the work of the PDF/A Competence Center, a large community that concerns itself with the topic of PDF/A. It keeps the PDF/A specification in the spotlight and provides a forum for exchanging experiences. An additional group is the TWG (Technical Working Group), which deals with the details of the PDF/A standard and, through teamwork, clarifies many questionable aspects of the standard.