The long-term archiving format PDF/A is a relatively new standard that opens up new possibilities for many industry sectors and users with regard to enabling digital documents to be read and processed even in years to come. This document focuses on some of the technical aspects of PDF/A.
As the name indicates, PDF/A-1 is the first part of the ISO range of standards for PDF/A. This international standard was formulated by the ISO Technical Committee TC 171 SC2 WG5 and was published on October 1, 2005. The Technical Corrigendum of 2007 further improved ISO 19005-1. PDF/A-1 is based on PDF 1.4 (the PDF version introduced with Acrobat 5).
PDF/A is not a newly invented PDF format. Instead, a PDF/A file is a completely normal PDF file that meets a set of minimum requirements, with certain features prescribed and others prohibited. Traditional PDF does not always have to be completely unambiguous, but PDF/A has the clear aim of ensuring that documents are displayed unambiguously, both today and in the future.
PDF was invented by Adobe Systems, who specifically released the format for standardization.
What does PDF/A aim to achieve? PDF/A aims to produce files with static content that can be visually reproduced exactly, both today and in many years' time. Files that are subject to long-term archiving should work regardless of the device or operating system used. The future usability of PDF/A files must also be guaranteed independently of any manufacturer, and this includes Adobe. PDF/A is a self-contained format: PDF files that comply with the PDF/A standard are complete in themselves and use no external references or non-PDF data. The PDF/A-1 standard is based on PDF specification 1.4, which means that it works within the technical scope of the functions available in Acrobat 5.
A range of rules must be observed when generating PDF/A files in order to meet the goals named above. For example, when generating PDF/A, it is important to embed all fonts and clearly specify all colors. Forms, comments, and notes are only permitted to a limited extent. Compression is allowed as a general rule, but LZW and JPEG2000 are excluded. Transparent objects and layers (Optional Content Groups) are not permitted. PDF/A uses rules for metadata that are based on XMP (Extensible Metadata Platform). Finally, a PDF/A file must identify itself as such.
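The rules listed above can be read as a checklist. The following Python sketch shows one way such a checklist might be expressed. The DocumentSummary class and its fields are hypothetical: they stand in for facts that a real PDF parser would have to extract first, and the sketch covers only the rules named in the paragraph above, not the full standard. LZWDecode and JPXDecode are the PDF filter names for LZW and JPEG2000.

```python
from dataclasses import dataclass, field

# PDF filter names for the compression methods the text names as excluded.
FORBIDDEN_FILTERS = {"LZWDecode", "JPXDecode"}  # LZW and JPEG2000


@dataclass
class DocumentSummary:
    """Hypothetical, pre-extracted facts about a PDF (not a real parser API)."""
    fonts_embedded: bool
    filters_used: set = field(default_factory=set)
    uses_transparency: bool = False
    uses_optional_content: bool = False
    has_xmp_metadata: bool = True
    declares_pdfa: bool = True


def pdfa1_rule_violations(doc: DocumentSummary) -> list:
    """Return readable descriptions of the violated rules (empty if none)."""
    problems = []
    if not doc.fonts_embedded:
        problems.append("not all fonts are embedded")
    forbidden = sorted(doc.filters_used & FORBIDDEN_FILTERS)
    if forbidden:
        problems.append("forbidden compression: " + ", ".join(forbidden))
    if doc.uses_transparency:
        problems.append("transparent objects are not permitted")
    if doc.uses_optional_content:
        problems.append("layers (Optional Content Groups) are not permitted")
    if not doc.has_xmp_metadata:
        problems.append("XMP metadata is missing")
    if not doc.declares_pdfa:
        problems.append("file does not identify itself as PDF/A")
    return problems


if __name__ == "__main__":
    candidate = DocumentSummary(fonts_embedded=True, filters_used={"LZWDecode"})
    for problem in pdfa1_rule_violations(candidate):
        print(problem)
```

An empty result only means that none of these particular rules was violated; it does not establish PDF/A conformance.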
There are two PDF/A-1 levels: PDF/A-1a and PDF/A-1b. These two conformance levels allow for the fact that different user groups have different requirements of a file format for long-term archiving and that the source material can vary greatly.
PDF files that comply with the PDF/A-1a standard must fulfill additional prerequisites that enable PDF/A-1a to offer the benefits outlined above:
To enable precise searchability, all text must be reproducible using Unicode. Unicode is a system that enables characters (letters, digits, and symbols in international and historical font systems) to be precisely mapped to a code.
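As a small illustration of what mapping characters to a code means, the following Python sketch lists the Unicode code point and official character name for each character in a string. It uses only the standard library; the function name describe is chosen for this example.

```python
import unicodedata


def describe(text: str) -> list:
    """Return (character, code point, Unicode name) for each character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


if __name__ == "__main__":
    # A Latin letter, a German sharp s, and the euro sign.
    for ch, code_point, name in describe("A\u00df\u20ac"):
        print(ch, code_point, name)
```

Text extracted from a PDF/A-1a file is searchable because every glyph can be traced back to a code point of this kind.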
The publication of PDF/A-2 is planned for the end of 2008/start of 2009. This PDF/A standard will enable transparency and more recently implemented PDF features, since it is based on PDF 1.7 (PDF/A-1 is based on PDF 1.4). These new features include the following: JPEG2000, PDF layers (Optional Content Groups), UserUnit (page scaling), new comment types implemented since PDF 1.4, and Unicode paths for hyperlinks.
PDF/A-2 will also apply stricter rules for glyphs in embedded fonts. PDF/A-2 will not allow the use of .notdef glyphs and will only permit so-called empty glyphs for white space.
The following three conformance levels will exist for PDF/A-2:
Documents that are stored in PDF/A-1 format today will remain valid following the introduction of PDF/A-2. However, future PDF/A versions will not always be backwards-compatible.
For PDF/A, some PDF areas have to fulfill certain prerequisites so that they can be unambiguously reproduced and therefore be considered to be future-proof.
Text is displayed using fonts. PDF/A outlines certain rules for fonts in order to enable the precise reproduction of content today and for a long time in the future.
First, it is important to ensure that all fonts used in a PDF file are embedded. All glyphs used must be stored within the PDF itself; a simple reference (load font xyz here) is not sufficient for PDF/A. Another requirement is that the character encoding must be set up in a way that produces the intended depiction of the text. If, for example, the glyph widths (tracking) are incorrect, a PDF file cannot be precisely reproduced and is therefore not PDF/A-compliant.
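In a PDF, an embedded font program is referenced from the font descriptor under one of the keys FontFile (Type 1), FontFile2 (TrueType), or FontFile3 (other formats such as CFF). The following Python sketch checks for these keys. It assumes that the font dictionaries have already been parsed into plain Python dicts, which is a simplification made for this example. It also ignores composite (Type 0) fonts, whose descriptors sit in a descendant font.

```python
# Font descriptor keys that point to an embedded font program.
EMBEDDED_FONT_KEYS = ("FontFile", "FontFile2", "FontFile3")


def is_embedded(font: dict) -> bool:
    """True if the font descriptor references an embedded font program."""
    descriptor = font.get("FontDescriptor", {})
    return any(key in descriptor for key in EMBEDDED_FONT_KEYS)


def unembedded_fonts(fonts: list) -> list:
    """Return the BaseFont names of all fonts that are only referenced."""
    return [
        font.get("BaseFont", "<unknown>")
        for font in fonts
        if not is_embedded(font)
    ]


if __name__ == "__main__":
    # Hypothetical parsed font dictionaries; the stream contents are omitted.
    fonts = [
        {"BaseFont": "ABCDEF+Helvetica", "FontDescriptor": {"FontFile3": b""}},
        {"BaseFont": "Arial", "FontDescriptor": {}},
    ]
    print(unembedded_fonts(fonts))
```

A font that passes this check is embedded, but the check says nothing about whether every glyph used in the document is present in the embedded subset.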
Glyphs are graphical representations of characters (letters, digits, and symbols). All of us have seen cases where problems have occurred when trying to display font characters. For example, characters that are completely missing result from an incorrect sub-setting for TrueType fonts – only an empty glyph is displayed in this case.
In the case of Type 1 fonts, however, the system uses a .notdef glyph (replacement character) instead. This is often simply a box containing an X, as shown in the graphic below.
Inappropriately using several subsets of the same font is generally a side effect of problems during PDF generation.
Bugs in the viewer or printer can also result in incorrect reproductions – the incorrect caching of a font instance can have negative effects.
Problems can occur before and/or during the generation of PDFs. A PDF/A file can be formally correct yet still have incorrect glyphs. Only a careful visual check can uncover this problem. Because generation problems also affect Unicode mapping, the problem also becomes apparent when the extracted text is checked visually.
In PDF/A, text and font usage is specified unambiguously enough that it cannot be misinterpreted.
If viewers or printers do not offer complete support for encoding systems, this can result in problems with regard to PDF/A.
Glyph width specifications can be inconsistent. To prevent this, the widths given in the PDF font dictionary must correspond to the widths in the embedded font. In practice, slight deviations resulting from differing scaling are unavoidable and must be tolerated. The values in the Widths array in PDF are usually integers, but this is not mandatory.
PDF generation: Problems occur when the width (tracking) specifications are based on a font other than the font that is actually embedded. The subsequent embedding of fonts therefore also involves risks.
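The width comparison described above can be sketched in a few lines of Python. The function name, the input lists, and the default tolerance of one unit are assumptions made for this example; the text states only that slight deviations must be tolerated, so the acceptable threshold should be taken from the standard or its technical notes.

```python
def width_mismatches(pdf_widths, font_widths, tolerance=1.0):
    """Return (index, pdf_width, font_width) for widths that differ too much.

    pdf_widths:  values from the Widths array of the PDF font dictionary
    font_widths: widths of the same glyphs read from the embedded font,
                 already converted to the same unit scale
    tolerance:   largest accepted absolute difference (assumed value)
    """
    if len(pdf_widths) != len(font_widths):
        raise ValueError("both lists must describe the same glyphs")
    return [
        (index, pdf_w, font_w)
        for index, (pdf_w, font_w) in enumerate(zip(pdf_widths, font_widths))
        if abs(pdf_w - font_w) > tolerance
    ]


if __name__ == "__main__":
    # Hypothetical widths: the third glyph is clearly inconsistent.
    print(width_mismatches([500, 278, 722], [500.4, 278, 700]))
```

The first two glyphs pass because their differences stay within the tolerance, which reflects the rounding effects mentioned in the text.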
Displaying/printing PDF/A: Problems can occur when displaying or printing PDF if the viewer or printer being used uses a replacement font rather than the embedded font.
Incorrect colors can result in a completely different message being imparted by an image than was originally intended. It is therefore important for colors to be reliably reproducible in PDF/A. In relation to this task, color profiles act as instruction leaflets that give information on how to handle colors.
The precise display of colors via color management is important not only for photographs in PDF/A but also for fonts or graphics – just think, for example, of the corporate image of a company.
PDF/A provides several ways of ensuring the exact reproduction of colors.
If using CMYK, it is important to remember that the ICC profiles can be extremely large – in particular, prtr profiles (output profiles for printers) can require memory space of between 500 KB and 2 MB.
Note for Separation/DeviceN color spaces: the so-called alternate space is not subject to the same requirements as the process color spaces.
Some comments are allowed, but other comment types are prohibited. Comments in the form of movies, audio clips, and attachments are not permitted (additional programs are required to display these types of comment, and the programs in question may not be available in the future). In addition, any comment type that was introduced after PDF 1.4 is not PDF/A-1-compatible. This includes Polygon, PolyLine, Caret, Screen, Watermark, and 3D. Highlight markup annotation is permitted but not always desirable, since this type of annotation usually uses transparency, which is not permitted in PDF/A-1.
With regard to hyperlinks, the special Link annotation type is permitted with or without activated Appearance. As a rule, URL links and other links are permitted. A PDF/A viewer must display them in some way, but it does not have to execute them. As far as the standard is concerned, it does not matter whether or not a link references a valid target.
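The comment rules above lend themselves to a simple lookup. The following Python sketch classifies an annotation by its subtype name. Movie, Sound, and FileAttachment are the PDF subtype names for movies, audio clips, and attachments. The function name and the returned strings are chosen for this example, and any subtype the text does not mention is reported as not covered instead of being assumed to be permitted.

```python
# Subtypes that need external programs to be displayed.
NEEDS_EXTERNAL_PROGRAM = {"Movie", "Sound", "FileAttachment"}

# Subtypes introduced after PDF 1.4, as listed in the text.
INTRODUCED_AFTER_PDF_1_4 = {
    "Polygon", "PolyLine", "Caret", "Screen", "Watermark", "3D",
}

# Subtypes the text explicitly names as permitted.
NAMED_AS_PERMITTED = {"Highlight", "Link"}


def annotation_status(subtype: str) -> str:
    """Classify an annotation subtype according to the rules in the text."""
    if subtype in NEEDS_EXTERNAL_PROGRAM:
        return "prohibited: needs an external program"
    if subtype in INTRODUCED_AFTER_PDF_1_4:
        return "prohibited: introduced after PDF 1.4"
    if subtype in NAMED_AS_PERMITTED:
        return "permitted"
    return "not covered by this sketch"


if __name__ == "__main__":
    for name in ("Movie", "Caret", "Highlight", "Stamp"):
        print(name, "->", annotation_status(name))
```

A Highlight annotation is reported as permitted here, but as noted above it may still cause problems if its appearance relies on transparency.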
If you still wish to convert a form with critical functions into a PDF/A file, you can flatten its form fields. While reducing/flattening form fields achieves the required PDF/A-1 conformity, the assignment of content to the form fields in question is lost. As an alternative, you can avoid problematic components in advance with the following measures:
The only mandatory metadata entry is the PDF/A identification in the XMP metadata: the PDF/A part (1 for PDF/A-1) together with the relevant conformance level (A for PDF/A-1a or B for PDF/A-1b). All other metadata is optional.
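The identification entry is written with the pdfaid schema, which uses the properties pdfaid:part and pdfaid:conformance. The following Python sketch builds the corresponding RDF description as a string. The function name is hypothetical, and the fragment is only the inner description: a complete XMP packet also needs the surrounding rdf:RDF element and packet wrapper, which are omitted here.

```python
# Namespace URI of the PDF/A identification schema.
PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"


def pdfa1_identification(conformance: str) -> str:
    """Return the XMP description that marks a file as PDF/A-1a or PDF/A-1b."""
    if conformance not in ("A", "B"):
        raise ValueError("PDF/A-1 defines only conformance levels A and B")
    return (
        f'<rdf:Description rdf:about="" xmlns:pdfaid="{PDFAID_NS}">\n'
        "  <pdfaid:part>1</pdfaid:part>\n"
        f"  <pdfaid:conformance>{conformance}</pdfaid:conformance>\n"
        "</rdf:Description>"
    )


if __name__ == "__main__":
    print(pdfa1_identification("B"))
```

Writing this entry only declares conformance. Whether the file actually meets the requirements still has to be established by validation.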
If using schemata that are not predefined in the XMP specification, the schema in question must be embedded in the XMP.
If traditional document properties exist, they must be mirrored in the XMP metadata in order to achieve PDF/A conformity. This applies to the following: Title, Author, Subject, Keywords, Creator (application), Producer (PDF created with ), CreationDate, and ModDate.
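The mirroring requirement can be sketched as a lookup from the traditional document properties to their XMP counterparts. The mapping below reflects the crosswalk as commonly given for ISO 19005-1 and should be verified against the standard before use. The function name and the dict-based inputs are assumptions made for this example. The sketch checks only that a counterpart exists and does not compare values, because dates and author lists are stored in different formats on each side.

```python
# Traditional document property -> XMP property (verify against ISO 19005-1).
INFO_TO_XMP = {
    "Title": "dc:title",
    "Author": "dc:creator",
    "Subject": "dc:description",
    "Keywords": "pdf:Keywords",
    "Creator": "xmp:CreatorTool",
    "Producer": "pdf:Producer",
    "CreationDate": "xmp:CreateDate",
    "ModDate": "xmp:ModifyDate",
}


def unmirrored_properties(info: dict, xmp: dict) -> list:
    """Return document properties that have no counterpart in the XMP data."""
    return [
        info_key
        for info_key, xmp_key in INFO_TO_XMP.items()
        if info_key in info and xmp_key not in xmp
    ]


if __name__ == "__main__":
    # Hypothetical, already-parsed metadata from one file.
    info = {"Title": "Annual report", "Author": "J. Doe"}
    xmp = {"dc:title": "Annual report"}
    print(unmirrored_properties(info, xmp))
```

Properties that are absent from the traditional document properties are skipped, since the requirement applies only to entries that exist.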
The namespace and metadata mapping sections of the standard are formulated in a way that is not completely clear. For a better explanation, see the Technical Corrigendum (cf. TechNotes 0001 and 0003).