JHOVE is an open-source tool for identifying, characterizing and validating common formats such as PDF, TIFF, JPEG, AIFF and WAVE. JHOVE includes validation modules for twelve different file formats, including PDF.
The PDF format is widely used by memory institutions and is one of the most commonly used long-term archiving formats. However, because of the wide variety of potential source formats, conversion tools and PDF readers in use, a large percentage of PDF files fail to actually meet the PDF standard.
Naturally, many memory institutions use JHOVE's PDF module on a daily basis for digital long-term archiving. I wish to discuss the extent to which JHOVE's PDF validation tools can be used for risk management and quality assurance when seeking to assure long-term access to documents. Here, a valid PDF means that the PDF meets all criteria of the specification; in other words, that it is correct. Invalid, on the other hand, means that the file fails to meet at least one point of the specification and is therefore incorrect.
During the Open Preservation Foundation's (OPF) PDF Hackathon in Hamburg, held together with the ZBW (the German National Library of Economics) and Goportis (the Leibniz Library Network for Research Information), the PDF Association's Olaf Drummer reported a JHOVE error message that incorrectly flags a breach of the PDF specification.
Pages within a PDF file are usually stored as a page tree, allowing the user to reach a given page as quickly as possible. This is often represented as a balanced page tree. Although the PDF standard references this option, it does not by any means prescribe it.
A page tree that is not balanced also complies with the PDF standard, for instance, if only a single page tree node exists which directly references all the pages of the document. The only drawback is that this makes it less efficient to access a given page (making for slower navigation through the document), particularly when a PDF contains a very large number of pages. JHOVE, however, reports an error if pages are not stored in a balanced page tree. As this is not an error and does not present a risk to long-term archiving, the message can be ignored.
Common advice for long-term archiving is to preferentially use the PDF/A format. However, this no longer matches the day-to-day reality of many workflows which use JHOVE for validation tests.
JHOVE's PDF module is certainly capable of validating PDF/A files. According to JHOVE's developer, however, this feature was implemented late, is unstable, and does not work well. Additionally, the process does not analyze the content of the data streams, meaning that it cannot validate PDF/A compliance in line with ISO standards.
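To make the efficiency difference concrete, here is a minimal sketch of a page lookup in a flat versus a balanced page tree. It is a toy model, not JHOVE's or any PDF reader's implementation: the names `Page`, `PageTreeNode`, `find_page`, `build_flat` and `build_balanced` are hypothetical, and the `count` field merely stands in for the page count that each tree node records.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Page:
    label: str


@dataclass
class PageTreeNode:
    kids: List[Union["PageTreeNode", Page]]
    count: int = field(init=False)

    def __post_init__(self) -> None:
        # Stand-in for the stored page count: leaf pages below this node.
        self.count = sum(
            k.count if isinstance(k, PageTreeNode) else 1 for k in self.kids
        )


def find_page(root: PageTreeNode, index: int) -> Tuple[Page, int]:
    """Return (page, kids_examined) for a zero-based page index."""
    examined = 0
    node = root
    while True:
        for kid in node.kids:
            examined += 1
            size = kid.count if isinstance(kid, PageTreeNode) else 1
            if index < size:
                if isinstance(kid, Page):
                    return kid, examined
                node = kid  # descend into the subtree holding the page
                break
            index -= size  # skip this whole subtree
        else:
            raise IndexError("page index out of range")


def build_flat(n: int) -> PageTreeNode:
    # One single node that references every page directly.
    return PageTreeNode([Page(f"page {i + 1}") for i in range(n)])


def build_balanced(n: int, fanout: int) -> PageTreeNode:
    # Group pages, then groups of groups, until one root remains.
    nodes: list = [Page(f"page {i + 1}") for i in range(n)]
    while len(nodes) > fanout:
        nodes = [
            PageTreeNode(nodes[i:i + fanout])
            for i in range(0, len(nodes), fanout)
        ]
    return PageTreeNode(nodes)
```

In this model, reaching the last page of a 1,000-page document means examining 1,000 entries in the flat tree, but only 30 in a balanced tree with ten children per node. Both trees describe the same document, which is why the flat form costs some speed without being a validity problem.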
The PDF/A format differs from the standard PDF format in that it explicitly forbids or requires certain options considered essential for long-term archiving. A PDF/A-1b file forbids the following:
A PDF/A file also requires the following:
Additionally, PDF/A-1 is based on PDF 1.4. All functions introduced from PDF 1.5 onwards, including JPEG2000 compression and layering, are lost when converting to PDF/A-1 format. PDF/A-2, in contrast, is based on PDF 1.7.
The Isartor Test Suite provided by the PDF Association consists of 204 invalid PDF/A files, only one of which was recognized by JHOVE as invalid. JHOVE failed to identify 51 of these as PDF/A files at all. As a result, JHOVE only checked to see whether these files met the PDF 1.4 standard.
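For context on what identifying a file as PDF/A involves: a PDF/A file declares its part and conformance level in its XMP metadata through the `pdfaid:part` and `pdfaid:conformance` properties. The sketch below reads that declaration with a naive byte scan. It is an illustration only, under the assumption that the metadata is stored uncompressed; how JHOVE itself performs the identification is not described here, and the function name `declared_pdfa` is hypothetical. Reading the declaration says nothing about whether the file actually conforms.

```python
import re
from typing import Optional, Tuple

# XMP allows both element form (<pdfaid:part>1</pdfaid:part>)
# and attribute form (pdfaid:part="1"); match either.
_PART = re.compile(rb"pdfaid:part\s*(?:=\s*[\"']|>)\s*(\d)")
_CONF = re.compile(rb"pdfaid:conformance\s*(?:=\s*[\"']|>)\s*([A-Za-z])")


def declared_pdfa(data: bytes) -> Optional[Tuple[int, str]]:
    """Return (part, conformance) as declared in XMP, or None if absent."""
    part = _PART.search(data)
    if part is None:
        return None
    conf = _CONF.search(data)
    level = conf.group(1).decode("ascii").upper() if conf else ""
    return int(part.group(1)), level
```

A file for which this returns `None` carries no visible PDF/A claim, so a tool working this way could only check it against the base PDF specification.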
These tests are more than sufficient to prove that JHOVE is not suited to PDF/A validation. Since JHOVE is a component in several out-of-the-box long-term archiving solutions, however, it is still interesting to know what JHOVE does deliver in terms of PDF/A, to allow well-informed decision making.
JHOVE's PDF module checks to see whether a file is a PDF/A, in which case it uses the PDF/A profile subcategory. In this case, JHOVE tests whether a file identified as a PDF/A contains:
If so, the file is identified as invalid. JHOVE also checks whether the file contains:
If not, then the PDF/A file will again be recognized as invalid.
A quick test of 670 PDF/A files identified as invalid by PDFBox shows that JHOVE flags only five of them, as not well-formed (4) or invalid (1). These are all breaches of the standard PDF specification, however: in these cases, JHOVE failed to identify the PDF/A files correctly and therefore only tested their validity as standard PDFs.
Modern conversion tools make it difficult to integrate components into a PDF/A file which JHOVE's PDF module would test and identify as errors. In fact, I am not aware of any currently available PDF/A conversion tool which would even permit it. For example, the protection applied to encrypted PDF files means that either they cannot be converted to PDF/A at all, or the encryption is identified and removed by the tool during the conversion process. The same applies to other components which are forbidden in PDF/A and checked by JHOVE.
JHOVE's PDF/A validation tools are therefore so rudimentary that I am unsure if any PDF/A files at all would break its rules and be detected by JHOVE as invalid.
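Expressed as percentages, the two tests give the following rates. The snippet is plain arithmetic on the figures quoted above; the helper name `detection_rate` is only for illustration.

```python
def detection_rate(flagged: int, total: int) -> float:
    """Share of known-invalid files that were flagged, in percent."""
    return 100.0 * flagged / total


isartor = detection_rate(1, 204)      # Isartor Test Suite
sample = detection_rate(5, 670)       # the 670-file sample

print(f"Isartor Test Suite: {isartor:.2f}%")   # 0.49%
print(f"670-file sample:    {sample:.2f}%")    # 0.75%
```

Both rates are below one percent, and in the second test even the five flagged files were caught for breaches of the base PDF specification rather than for PDF/A violations.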
JHOVE is not suited to PDF/A validation. I know of no alternative to JHOVE for validating standard PDFs. As many memory institutions primarily use the PDF format and the quality of their files is not always enough of an argument for converting them to PDF/A, I believe that a standard PDF validator remains as necessary as it always has been. In general, JHOVE will continue to be used, despite its limitations, and decisions regarding the archivability of a given file will be dependent on the results JHOVE gives.
JHOVE can still be useful, provided users understand its error reports and are aware of ways to resolve them. So far there is not a great deal of documentation on this issue. Both nestor (AG Format Recognition) and the Open Preservation Foundation aim to do their part to improve this situation soon.
Some JHOVE errors, such as Invalid PDF trailer, have proven to be a very useful part of day-to-day work. PDF files which trigger this error usually cannot be opened. This is because the file has not been fully uploaded or downloaded, resulting in an incomplete file. The ability to automatically recognize such serious errors and report them to the data provider for correction is very useful when working with large archives. Other errors affecting a PDF's structure can also be identified using JHOVE and easily fixed.
Even before taking into account the meaning and the effects of a JHOVE error report, to say nothing of the resolution options, JHOVE remains an excellent option for providing initial guidance. Since JHOVE developer Gary McGath himself warns against using JHOVE as the final word on the matter, we will continue to avoid making our decisions and workflows dependent on JHOVE alone.
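To illustrate why a truncated transfer shows up as a trailer problem: a complete PDF normally starts with a `%PDF-` header and ends with a `startxref` entry followed by an `%%EOF` marker, so a file cut off mid-transfer is missing its end. The following is a minimal sketch of such a completeness check. It is not JHOVE's own test, the function name `looks_complete` is hypothetical, and the 1,024-byte search window is an assumption modelled on how tolerant readers commonly search for these markers.

```python
def looks_complete(data: bytes, window: int = 1024) -> bool:
    """Heuristic: does the byte string have a PDF header and end-of-file marker?"""
    head = data[:window]
    tail = data[-window:]
    has_header = b"%PDF-" in head
    # The trailer section ends with 'startxref', an offset, and '%%EOF'.
    has_trailer_end = b"startxref" in tail and b"%%EOF" in tail
    return has_header and has_trailer_end
```

A file would typically be checked with `looks_complete(open(path, "rb").read())`. Passing this check only shows that the end of the file is present; it does not show that the file is valid.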