PDF/raster: An Overview
Earlier in 2017, the PDF Association and TWAIN Working Group published the first public draft of the PDF/raster specification. In this article, I’ll introduce PDF/raster, explain its relationship to PDF, and discuss its use cases. I suggest you grab a copy of version 1.0 of the specification to follow along.
The PDF/raster format aims to provide the scanning and document industry with a standardized format for creating and exchanging sets of page images. Typically, these images would be created by a device such as a scanner. Each page of scanned input is represented either by one image, or by a series of image strips that can be put together to form a page image. These images may be compressed, and may be color, gray, or black and white depending on the capture device. A document is a collection of images of scanned input pages.
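To make the strip idea concrete, here is a minimal sketch of stacking strips into one page image. It assumes the strips are already decoded, arrive in top-to-bottom order, and share one width. The `Raster` type and `assemble_page` function are hypothetical names for illustration and do not come from the specification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Raster:
    """Hypothetical decoded image: one bytes object per pixel row."""
    width: int          # pixels per row
    height: int         # number of rows
    rows: List[bytes]   # rows in top-to-bottom order


def assemble_page(strips: List[Raster]) -> Raster:
    """Stack strips top to bottom to form a single page image."""
    if not strips:
        raise ValueError("a page needs at least one strip")
    width = strips[0].width
    rows: List[bytes] = []
    for strip in strips:
        # Assumption: every strip on a page has the same width.
        if strip.width != width:
            raise ValueError("all strips on a page must share one width")
        if len(strip.rows) != strip.height:
            raise ValueError("strip height does not match its row count")
        rows.extend(strip.rows)
    return Raster(width=width, height=len(rows), rows=rows)
```

A page stored as a single image is simply the one-strip case of the same function.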
PDF/raster improves upon the most popular existing formats – TIFF and JPEG – in the following key ways:
- As PDF files, PDF/raster files follow an open ISO standard. TIFF is based on a vendor standard and a quarter-century of both “generally understood” and proprietary tags (extensions) to the format.
- PDF/raster files can contain multiple pages and can represent pages as multiple page strips. JPEG files are limited to containing one raster.
- PDF/raster files may mix color, gray, and black and white page images – and mix multiple color spaces for page images – for maximum compression.
- PDF/raster files support data encryption in the file format itself, for better security of data in transit and at rest.
The PDF format already contains support for all these features. PDF/raster uses only the syntax from PDF that is required to support its use cases. This syntax is simple enough that a full-fledged PDF parser is not required to create or read PDF/raster files. The PDF/raster format is well-suited for creation and consumption by restricted-CPU environments, such as scanners and MFP (multifunction peripheral) output preview stations.
A PDF/raster file is a PDF file
PDF/raster is a substantially restricted subset of PDF syntax. Because it relies on no features outside the PDF specification, every PDF/raster file is a valid PDF file, and any conforming PDF reader can open it. Unencrypted PDF/raster files can be viewed in any viewer compatible with PDF 1.7 (ISO 32000-1). Encrypted PDF/raster files, which are based on PDF 2.0 (ISO 32000-2), can be opened and viewed by any viewer that can handle PDF 2.0 files.
Creating PDF/raster files requires explicit support from a PDF creator or processor, for two reasons. First, a PDF/raster file is identified by a specific PDF comment, placed just before the startxref keyword, that indicates conformance. General PDF processors are not required to write any particular comments, or to preserve them at all (other than the beginning-of-file and end-of-file comments). To write a PDF/raster file, a processor must therefore know to write this comment in its specific location. Second, because PDF/raster allows only very specific page content sequences and commands, PDF creators need to understand and write exactly that syntax.
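As an illustration of the first point, the following sketch looks for a conformance comment on the line directly before the final startxref. The exact spelling of the comment (assumed here to be `%PDF-raster-` followed by a major.minor version) and the function name `pdf_raster_version` are assumptions for this example. Check the specification for the authoritative form.

```python
import re

# Assumed form of the conformance comment; verify against the specification.
RASTER_TAG = re.compile(rb"%PDF-raster-(\d+)\.(\d+)")


def pdf_raster_version(data: bytes):
    """Return (major, minor) if the assumed PDF/raster comment is the last
    non-blank line before the final startxref, otherwise None."""
    pos = data.rfind(b"startxref")
    if pos < 0:
        return None
    # Only the tail before startxref matters; avoid splitting a huge file.
    tail = data[max(0, pos - 1024):pos]
    lines = [line.strip() for line in tail.splitlines() if line.strip()]
    if not lines:
        return None
    match = RASTER_TAG.fullmatch(lines[-1])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Hand-made file tails, not real PDF files.
tagged = b"trailer\n<< /Size 1 >>\n%PDF-raster-1.0\nstartxref\n123\n%%EOF\n"
untagged = b"trailer\n<< /Size 1 >>\nstartxref\n123\n%%EOF\n"
print(pdf_raster_version(tagged))    # (1, 0)
print(pdf_raster_version(untagged))  # None
```

A general PDF processor that rewrites the file without preserving this comment would produce a file for which such a check returns None, even though every page is still a plain raster image.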
There are many important PDF concepts and constructs that are not permitted in PDF/raster files. Some of the most important restrictions include:
- Page contents may only include images in CCITT G4 (FAX), DCT (JPEG), and uncompressed (RAW) formats. These images are restricted to RGB color, grayscale, and black and white (monochrome) format.
- Page contents may not include anything other than raster images: no text, line art, forms, or other graphical elements.
- No annotations, AcroForms, or XFA are permitted.
- Transparency and layers are not permitted.
- No compression of non-image data is permitted. Compressed object streams are disallowed.
- Only page content streams and document metadata are allowed. Other elements such as interactive actions, bookmarks, search indexes, and marked content are explicitly not permitted.
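The image restrictions above lend themselves to a simple allow-list check. The sketch below assumes a caller has already extracted each image's filter name and a rough color classification. `CCITTFaxDecode` and `DCTDecode` are the standard PDF filter names for CCITT fax and JPEG data. The color labels and the `check_image` function are hypothetical, and the sketch does not verify that fax data is actually Group 4.

```python
# None stands for an uncompressed (raw) image with no filter.
ALLOWED_FILTERS = {None, "CCITTFaxDecode", "DCTDecode"}

# Hypothetical labels for the three permitted kinds of page image.
ALLOWED_COLOR_KINDS = {"rgb", "gray", "bw"}


def check_image(filter_name, color_kind):
    """Return a list of problems; an empty list means the image passes
    this (deliberately incomplete) check."""
    problems = []
    if filter_name not in ALLOWED_FILTERS:
        problems.append(f"filter not permitted: {filter_name}")
    if color_kind not in ALLOWED_COLOR_KINDS:
        problems.append(f"color kind not permitted: {color_kind}")
    return problems


print(check_image("DCTDecode", "rgb"))     # []
print(check_image("FlateDecode", "cmyk"))  # two problems reported
```

A real validator would also need to confirm the other restrictions in the list, such as the absence of annotations, object streams, and non-image page content.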
In other words, PDF/raster is intentionally a very limited subset of PDF. It focuses on storing and transmitting scanned page data and is not intended to support updating or annotating that data. However, because PDF/raster files are always valid PDF files, it is easy to annotate or update a PDF/raster file and save the result as a general-purpose PDF file.