Nowadays, activities between enterprises and end users are collectively referred to as B2C (business to consumer). This class of business, also commonly known as e-business, typically involves a high volume of communication in the form of offers, invoices, order confirmations, performance reports, policies or bank statements. While the volume of individual physical documents (colloquially, letters) continues to fall in almost all countries, the share of electronic documents distributed by e-mail or via web portals is growing disproportionately. When these documents have to be facsimiles of the original paper form, there is no getting around PDF, and as a consequence PDF/A.
From a technical perspective there are two types of IT system that have to deal with these documents: the output management system (OMS) is used to create documents and provide the dispatch logic, while the classic document management system (DMS) archives the same documents in line with the relevant regulations, for periods ranging from months (e.g. itemized billing) to many years (life insurance documentation). In recent years a higher-level discipline, enterprise content management (ECM), has come to be seen as uniting both sets of requirements. At first glance both can be fulfilled by PDF/A, but in practice differing technologies are used.
The conflict is readily apparent: printing involves collecting many documents into one datastream, so that resources such as fonts or images are present only once, whereas for archiving, documents have to be stored individually, with the result that the relevant fonts and images must be embedded in every single document. A frequently heard argument against PDF/A is precisely this obligation to embed all fonts, although from the perspective of long-term archiving it should not really be contentious. In any case, independently of how documents are archived, they should be delivered to end users as PDF/A, since recipients may well want to store them in their own (perhaps smaller) archives.
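The trade-off between shared and per-document resources can be made concrete with some back-of-the-envelope arithmetic. All figures below are invented purely for illustration; real sizes depend on the fonts, images and compression involved.

```python
# Hypothetical figures chosen only to illustrate the conflict described
# above between a shared print datastream and individually archived files.
FONT_RESOURCES_KB = 400      # embedded font programs (subsets)
DOC_BODY_KB = 30             # page content of a single document
NUM_DOCS = 10_000            # documents in one production run

# Print datastream: resources appear once and are shared by every document.
spool_kb = FONT_RESOURCES_KB + NUM_DOCS * DOC_BODY_KB

# Individually archived PDF/A files: each file embeds its own resources.
archive_kb = NUM_DOCS * (FONT_RESOURCES_KB + DOC_BODY_KB)

print(f"spool datastream: {spool_kb / 1024:,.0f} MB")
print(f"single files:     {archive_kb / 1024:,.0f} MB")
print(f"overhead factor:  {archive_kb / spool_kb:.1f}x")
```

With these (made-up) numbers, storing each document as a self-contained file costs roughly fourteen times the space of the shared spool, which is exactly the pressure the following sections address.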
The following sections outline the options for overcoming this dilemma.
When documents are saved as individual files together with their embedded resources, the obvious question is how large each individual file will be. In many cases the features available for creating efficient PDF/A are not fully exploited, so the PDF/A files end up almost as large as the corresponding TIFF files. With appropriate optimisation, however, PDF/A file sizes should be smaller than the related TIFF files.
Optimised PDF/A files often require less than 100 kilobytes to store multi-page documents, and the advantage of PDF/A over TIFF grows as the size of the document increases.
If separating out individual documents would create an unacceptable increase in storage requirements, it can be worthwhile to convert the spool files as a whole to PDF/A and archive them in that form. Individual documents are then extracted from the combined files when the archive is read. Converting arbitrary print data to PDF/A while retaining shared resource management is nowadays no longer a significant technical challenge. On the archive side, however, there is more work to do: the archive index must record the position and length of each target document within the large spool file, and an active component must be available to extract the target pages when an individual document is retrieved. Performance is not a concern; extracting a handful of pages from a PDF/A file containing many thousands of pages takes fractions of a second on a system with fast disks, and the PDF format has internal structures designed specifically for this kind of random access.
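The archive-side bookkeeping just described can be sketched as a small index structure. The names and layout below are illustrative only and are not taken from any real archiving product; a production index would of course live in a database rather than a Python dict.

```python
# Sketch of an archive index for spool-based archiving: each entry
# records where one logical document sits inside a large PDF/A spool file.
from dataclasses import dataclass

@dataclass
class IndexEntry:
    spool_file: str   # which PDF/A spool file holds the document
    first_page: int   # 0-based page offset within that file
    page_count: int   # length of the document in pages

archive_index = {
    "INV-2024-001": IndexEntry("spool_0427.pdf", 0, 2),
    "INV-2024-002": IndexEntry("spool_0427.pdf", 2, 5),
}

def pages_to_extract(doc_id: str) -> range:
    """Return the page range the active extraction component must pull."""
    entry = archive_index[doc_id]
    return range(entry.first_page, entry.first_page + entry.page_count)

print(list(pages_to_extract("INV-2024-002")))  # → [2, 3, 4, 5, 6]
```

The actual page extraction would then be delegated to any PDF library that supports random page access, which is precisely what PDF's internal cross-reference structures make cheap.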
These processes are by no means new; archiving systems built for spool-based archiving have included them for many years.
If the effort of saving a document into the archive is too high in comparison to the effort of retrieving a single document, late conversion may be an appropriate alternative. Here the document is converted into PDF/A from its original print datastream only when it is read from the archive. If hundreds of thousands of documents need to be archived daily but only a few are ever taken out again, this method is worth considering, always with an eye on the total cost per page. From the archive's point of view the work involved is not significantly higher than in the method described previously, but it should be noted that support for the original datastream must be guaranteed for many years to come, and that a digital signature cannot be applied with this approach.
In summary, it is desirable to keep the number of formats in an archive to an absolute minimum. Single documents, for example from office applications or from scanned input, should be saved immediately as PDF/A, and only the high-volume mass data should be given special treatment. If only AFP and PDF/A data have to be managed, there is a high level of certainty that both formats can continue to be supported.
The reverse direction, from archived data back to a print datastream, is by no means an everyday task, but it can arise when documents are stored directly as PDF/A in a portal application and a printout must nonetheless be produced. The conversion itself from PDF/A to a print datastream is no longer a noteworthy challenge; a number of products and printing systems can print PDF/A directly.
One difficulty may well be hidden in this approach: when many small PDF/A files with font subsets are concatenated or converted, it is not unusual for the resulting print file to contain hundreds if not thousands of fonts, bringing not just performance problems but potentially unprintable files. This can be circumvented by building a single font out of subsets that belong to the same base font but cover different characters. This is referred to as font super-setting, and it is no minor challenge for those involved.
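The core idea of super-setting, merging subsets of the same base font so that their combined character coverage lives in one font resource, can be sketched as follows. This is a toy model only: real super-setting must also reconcile glyph programs, metrics and encodings, which is where the real difficulty lies.

```python
# Toy model of font super-setting. Each subset maps characters to their
# glyph data; merging the subsets of one base font yields a single font
# covering the union of their characters, so the concatenated print file
# needs only one font resource instead of many near-duplicates.
def superset(subsets: list[dict[str, bytes]]) -> dict[str, bytes]:
    merged: dict[str, bytes] = {}
    for subset in subsets:
        for char, glyph in subset.items():
            merged.setdefault(char, glyph)  # keep first definition seen
    return merged

subset_a = {"A": b"<glyph A>", "B": b"<glyph B>"}
subset_b = {"B": b"<glyph B>", "C": b"<glyph C>"}
print(sorted(superset([subset_a, subset_b])))  # → ['A', 'B', 'C']
```

Instead of two subset fonts, each carrying its own copy of "B", the print file now references one super-set font covering A, B and C.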
There is no silver-bullet solution when choosing an archiving process for mass data. What matters is to study the alternatives carefully and to find a solution that is independent of the components already in place. The cost associated with large documents in an archiving system may still play a role today, but within a few years it could be irrelevant. So, at this stage: better PDF/A today than tomorrow.