Nowadays, activities between enterprises and end-users are collectively referred to as B2C (Business to Consumer). This class of business, also commonly known as e-business, typically involves a high volume of communication in the form of offers, invoices, order confirmations, performance reports, policies or bank statements. While the volume of individual physical documents (in the vernacular, letters) continues to fall in almost all countries, the share of electronic documents distributed by e-mail or via web portals is growing disproportionately. When these documents have to be facsimiles of the original paper form, there is no getting around PDF and, as a consequence, PDF/A.
From a technical perspective, two types of IT system have to deal with these documents. The output management system (OMS) creates the documents and provides the dispatch logic, while the classic document management system (DMS) archives the same documents according to the relevant regulations, for periods ranging from months (e.g. for itemized billing) to many years (e.g. life insurance documentation). In recent years a higher-level discipline, Enterprise Content Management (ECM), has come to be seen as uniting both requirements. At first glance, both sets of needs can be fulfilled by PDF/A, but in practice differing technologies are used:
- High-volume printing: in industrial production, documents are collected into a correspondingly large spool file that may contain up to one million letters in a single day. A typical format for this is AFP, developed over 20 years ago and particularly suited to resource optimization on large printing systems. The datastream may also be PostScript or PCL.
- Archiving: selected individual documents may also be stored in an archiving system to secure customer documents or processes. Unfortunately, due to weaknesses in the software, the TIFF raster format is often encountered here. More and more companies have come to recognize the advantages of PDF, and of PDF/A in particular, and have changed their archiving systems accordingly.
The conflict is readily apparent. Printing involves collecting a large number of documents, so that resources such as fonts or images need to be present only once in the datastream. For archiving, documents have to be stored individually, with the result that the relevant fonts and images have to be embedded in every document. A frequently used argument against PDF/A is that it requires all fonts to be embedded, although from the perspective of long-term archiving this should not really be a matter of contention. In any case, regardless of how documents are archived, they should be delivered as PDF/A, since the end-user may well want to save them in an archive of their own (perhaps a smaller one).
The following sections detail the options for overcoming this dilemma.
Optimised File Size
When documents are saved as individual files together with their embedded resources, the simple question is: how large will each individual file be? In many cases the features available for creating an efficient PDF/A are not fully used, so that the PDF/A files are almost as large as the corresponding TIFF files. If the following points are taken into account, PDF/A files should be smaller than the related TIFF files:
- Select the right compression: PDF/A offers a wide selection of compression options appropriate for each of the different data types. For example, JBIG2 compression for black and white images is significantly better than Fax G4, and JPEG is usually worse for colour line drawings than Flate compression.
- Re-use resources: PDF/A offers almost unlimited options to store a single instance of often used resources such as images or overlays. Unfortunately many applications, in particular print drivers, make little or no use of these options.
- Use font subsets: PDF/A supports the use of font subsets. This means that only the characters actually used in a document are saved with the font. The advantage is that large fonts containing hundreds of characters and occupying several hundred kilobytes are reduced to just a few kilobytes.
- Use a limited number of fonts: A single font, even as a font subset, takes up a multiple of the storage space required for the text on a page. Apart from this, keeping the number of fonts to a minimum not only saves space but also looks better typographically.
- Check the colour profile: PDF/A forces the use of device-independent colours, or of colour profiles for RGB or CMYK, to ensure correct colour reproduction. For many archiving requirements a colour profile can be dispensed with by specifying device-independent colour, for example CIELAB. If a colour profile has to be used, it does not always have to be more than 1 megabyte in size.
- Remove unwanted information: Metadata is wonderful, but it can lead to very large files containing information that is not required in the archive.
Optimised PDF/A files often need less than 100 kilobytes for a multi-page document, and the advantage of PDF/A over TIFF grows as the size of the document increases.
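The point about choosing a compression method suited to the data can be illustrated with a minimal sketch. It uses Python's standard zlib module, which implements the DEFLATE algorithm that PDF's Flate compression is based on. The synthetic image (two flat colour areas, as in a simple logo or line drawing) and the names WIDTH, HEIGHT and flat_colour_image are assumptions made purely for illustration; real images will compress less dramatically.

```python
import zlib

WIDTH, HEIGHT = 200, 100  # illustrative image dimensions in pixels


def flat_colour_image(width, height):
    """Build a synthetic 8-bit RGB image made of two flat colour areas."""
    red, white = bytes((255, 0, 0)), bytes((255, 255, 255))
    rows = []
    for y in range(height):
        colour = red if y < height // 2 else white
        rows.append(colour * width)  # one row of identical pixels
    return b"".join(rows)


raw = flat_colour_image(WIDTH, HEIGHT)
compressed = zlib.compress(raw, 9)  # level 9 = best compression

# Flate is lossless: decompressing restores every pixel exactly.
assert zlib.decompress(compressed) == raw
print(len(raw), "bytes raw ->", len(compressed), "bytes deflated")
```

Because large areas of identical colour are highly repetitive, a lossless method like this shrinks them to a tiny fraction of the raw size without the artefacts a lossy method would introduce along sharp edges.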
Keep Documents Together
If separating individual documents creates an unacceptable increase in storage requirements, it can be worthwhile to convert the complete spool files to PDF/A and then archive them. In this case individual documents are extracted from the combined files when the archive is read. Converting arbitrary print data to PDF/A while retaining resource management is nowadays no longer a significant technical challenge. On the archive side, however, there is more work to be done: the archive index has to record the starting page number and size of the target document within the large spool file, and retrieving an individual document requires an active component that extracts the target pages. In terms of performance there will be no noticeable delay, because extracting a number of pages from a PDF/A file containing many thousands of pages takes only fractions of a second on a system with fast disks. The PDF format has internal structures designed specifically to meet this requirement. These processes are by no means new; they have been part of archiving systems developed for spool-based archiving for many years.
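The index bookkeeping described above can be sketched as follows. This is only an illustration under stated assumptions: IndexEntry, build_index and extract_pages are hypothetical names, and a plain Python list stands in for the pages of the spool PDF/A file. In a real archive the extraction step would be carried out by a PDF library or by the archive's active component.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """Hypothetical archive index record for one document in a spool file."""
    doc_id: str
    spool_file: str
    first_page: int  # 1-based page number within the spool file
    page_count: int  # number of pages belonging to the document


def build_index(spool_file, docs):
    """Build index entries from (doc_id, page_count) pairs in spool order."""
    index = {}
    next_page = 1
    for doc_id, page_count in docs:
        index[doc_id] = IndexEntry(doc_id, spool_file, next_page, page_count)
        next_page += page_count
    return index


def extract_pages(spool_pages, entry):
    """Return the pages of one document; spool_pages stands in for the PDF pages."""
    start = entry.first_page - 1  # convert to a 0-based list offset
    return spool_pages[start:start + entry.page_count]


# Illustrative spool file with three documents of 2, 1 and 3 pages.
index = build_index("spool-0001.pdf", [("A-100", 2), ("A-101", 1), ("A-102", 3)])
spool_pages = [f"page {n}" for n in range(1, 7)]
print(extract_pages(spool_pages, index["A-102"]))
```

Storing the starting page together with the page count keeps each index entry self-contained, so a document can be located without scanning the entries of its neighbours.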
If the effort needed to save a document in an archive is too high compared with the effort needed to retrieve a single document, late conversion may be an appropriate alternative. This means that a document is converted from its original print datastream into PDF/A only when it is read from the archive. If hundreds of thousands of documents need to be archived daily, but only a few need to be retrieved, this method should be considered, taking the total cost per page into account. From the point of view of the archive, the work involved is not significantly greater than with the method described previously, but it should be noted that support for the datastream must be guaranteed for many years to come and that a digital signature cannot be implemented with this approach.
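The trade-off can be made concrete with a back-of-the-envelope sketch. The function name daily_conversions and the volumes used are assumptions chosen for illustration, not measured figures. The sketch counts only conversions and ignores the other factors mentioned above, such as long-term support for the datastream.

```python
def daily_conversions(archived_per_day, retrievals_per_day, late_conversion):
    """Number of PDF/A conversions per day under each strategy.

    Early conversion converts every archived document once; late
    conversion converts a document each time it is retrieved.
    """
    return retrievals_per_day if late_conversion else archived_per_day


# Illustrative volumes only.
archived = 300_000
retrievals = 50

early = daily_conversions(archived, retrievals, late_conversion=False)
late = daily_conversions(archived, retrievals, late_conversion=True)
print("early:", early, "late:", late, "ratio:", early // late)
```

As long as retrievals remain far below the archived volume, late conversion performs only a small fraction of the conversions, which is why the cost per page deserves a closer look in such scenarios.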
In summary, it is desirable to keep the number of formats in an archive to the absolute minimum. For example, single documents coming from office applications, as well as scanned input documents, should be saved as PDF/A immediately, and only the high-volume mass data should be treated in a special way. If only AFP and PDF/A data have to be managed, there is a high level of certainty that both formats can continue to be supported.
Re-creating Print Data from PDF/A Documents
The reverse direction, from archived data back to a print datastream, is by no means an unusual task. It arises when documents are stored in a portal application directly as PDF/A, but a printout request must nonetheless be fulfilled. The conversion from PDF/A to a print datastream is no longer a noteworthy challenge; a number of products and printing systems are available which can print PDF/A directly.
One difficulty may well be hidden in this approach: when many small PDF/A files with font subsets are concatenated or converted, it is not unusual for the resulting print file to contain hundreds if not thousands of fonts, which not only causes performance problems but can also lead to unprintable files. This problem can be circumvented by creating a single font from those fonts which are the same but include different characters. This is referred to as font supersetting, which is a minor challenge for all those involved.
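The bookkeeping behind supersetting can be sketched in a few lines. The sketch relies on the PDF convention that a subset font name carries a tag of six uppercase letters followed by a plus sign; the function names and sample data are hypothetical. It only merges the character sets. A real implementation also has to merge the glyph programs themselves and to verify that two subsets really come from the same font version.

```python
from collections import defaultdict


def base_font_name(name):
    """Strip a subset tag such as 'ABCDEF+' from a PDF font name."""
    tag, sep, rest = name.partition("+")
    if sep and len(tag) == 6 and tag.isascii() and tag.isalpha() and tag.isupper():
        return rest
    return name  # not a subset name, keep unchanged


def superset_fonts(subsets):
    """Merge (font_name, characters) subsets into one character set per base font."""
    merged = defaultdict(set)
    for name, chars in subsets:
        merged[base_font_name(name)].update(chars)
    return dict(merged)


# Illustrative subsets as they might appear in three small documents.
subsets = [
    ("ABCDEF+Helvetica", "Invoice"),
    ("GHIJKL+Helvetica", "Order"),
    ("MNOPQR+Times-Roman", "Total"),
]
merged = superset_fonts(subsets)
print(len(subsets), "subsets ->", len(merged), "fonts")
print("".join(sorted(merged["Helvetica"])))
```

Grouping by base font name means the number of fonts in the print file grows with the number of distinct typefaces rather than with the number of documents.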
There is no obvious silver bullet for choosing an archiving process for mass data. What is important is to study the alternatives carefully and to find a solution that is independent of the existing components being used. The cost associated with large documents in an archiving system may still play a role today, but within a few years this could be irrelevant. So, at this stage: better PDF/A today than tomorrow.