Why the need to redact implies using PDF

How do you redact content in HTML? The format doesn’t really include that concept. In both principle and practice, committing to a rendering is definitive; final, portable and generic; it’s what the use case demands. Redaction demands PDF.

About the author: As CEO of the PDF Association and as an ISO Project Leader, Duff coordinates industry activities, represents industry stakeholders in a variety of settings and promotes the advancement and adoption of … Read more

Duff Johnson
May 30, 2019

Article

Historically, the process of redaction - removing content from documents for national security, intellectual property, privacy or other reasons – has involved physical implements such as razors, grease-paint, magic-markers and photocopiers.

Redacted document with the word "document" highlighted. Then came PDF.

Once people were able to share documents without using paper the physical tools became obsolete. The need for redaction as a feature, however, exploded.

Redaction differs from editing, which is key to understanding why redacting PDF differs from simply deleting text. To be useful, redaction must leave un-redacted content fully intact, but it must also account for the fact that a redaction has occurred, typically by introducing a colored box in place of the redacted content.

The proliferation of websites emails, posts, documents and other content bearing (potentially) personally identifiable information (PII) and other sensitive material, from court records to medical information to trade secrets, Freedom of Information Act (FOIA) requests and government investigations, makes redaction a critical capability in managing digital content.

The Mueller Report

A Technical and Cultural Assessment of the Mueller Report PDF

Even with OCR, the Mueller Report PDF isn't fully searchable

DoJ reposts the Mueller Report

The Mueller Report

A Technical and Cultural Assessment of the Mueller Report PDF

Even with OCR, the Mueller Report PDF isn't fully searchable

DoJ reposts the Mueller Report!

Further, in an era where document distribution no longer requires physical paper; when it's possible to transfer millions of pages in seconds, and search them in even less time... it's hard not to notice that end users' ideas about redaction remain stapled firmly to the age of paper.

We see this in the prototypical error users make when redacting digital documents – the use of a highlighter tool, which only changes background color, and doesn't actually result in redaction. Some users; even legal professionals using desktop computers for 20 years or more, even those who have always used digital documents, sometimes fail to understand the difference between highlighting and redacting.

When they make this mistake, these users are confusing the color of the highlight for the functionality they'd expect from paper. It's an ironic testament to the success of PDF in representing paper documents in a digital context.

It's time to stop thinking of documents as paper, and start thinking about them as PDF files!

Of course, PDF is not the answer to all documents, or even most. It is, however, the only good answer in one extraordinarily horizontal use case: documents that require redaction. It's a big category, since almost any type of document (business records, correspondence, invoice, etc.) in any format (web page, email, word-processor, spreadsheet, scan, etc.) may require redaction.

If your workflow might include redaction, then your format should be PDF. With tagged PDF, it's possible to have your source file information and semantics, and a rendering too.

The future of redaction

There's no (good) solution in HTML for the idea of removing specific content while leaving an indication of removal and somehow leaving other content undisturbed. One can imagine all-new XML constructs to handle the requirements of redaction workflows in HTML contexts, but it's not easy. Committing to a rendering is definitive; final, portable... and generic. PDF happens to have a standardized model, and it's already ubiquitous.

Just a little extra support - most notably in tagged PDF - and the web could begin to integrate rendered content far more cleverly than is possible today.

PDF provides a highly capable rendering model that already... took over the world
PDF provides a proven platform and solution for redaction irrespective of layout, format, CSS, implementation etc.
Tagged PDF provides a solution for semantic reuse of PDF content, including for accessibility purposes
PDF 2.0 provides a specific, targeted solution for rich reuse of redacted documents
PDF 2.0 even includes namespaces (ISO 32000-2, 14.8.6), a mechanism for capturing arbitrary source content semantics in the PDF context

In a world where privacy needs demand solutions from content technology providers everywhere, PDF offers a universal, and universally-accepted solution. Since PDF content is reusable, it's possible to use PDF for critical document features (e.g., redaction) without loss of unredacted data.

Technical Annex: Redaction features in the PDF specification

Alone among file formats, PDF explicitly includes redaction features, including new features added in the latest iteration of the PDF specification, PDF 2.0. Let's take a quick tour.

Redaction annotations (PDF 1.7)

Redaction annotations (introduced in PDF 1.7 back in 2006) are designed to allow users to mark content for redaction using familiar box-drawing or text-selection tools, manual or programmatic. Users may think of redaction annotations as "draft" or "candidate" redactions.

The redaction annotation feature is completely described and fully standardized, any vendor can support the redaction annotations model in PDF. Redaction annotations added by one party, with one type of software, may be changed by another party, with different software.

Interoperability is a key feature of PDF, and this principle extends throughout the specification.

Redaction annotations are used by the redacting software to effect the desired redactions and remove the specified content. Of course, the annotations are themselves removed from the document during the final redaction process. In addition, software intended to perform redaction typically offers a "sanitization" process to eliminate metadata and other potentially sensitive information from the file.

Semantic enhancements for redaction (PDF 2.0)

ISO 32000-2 (PDF 2.0) introduces two enhancements to the basic redaction model. Superficially modest, these features point the way towards flexible reuse of redacted content, including for accessibility purposes.

Although the redaction annotations defined in PDF 1.7 allowed for exemption codes (such as "Grand Jury") to be included in the redacted version of the document, that specification did not provide for indicating any semantic information about such artifacts. In PDF 2.0, the new Artifact structure element type solves this problem by allowing redactions to be indicated in the logical content.

New Artifact structure element type

Example of line numbers. The Artifact structure element type was introduced in PDF 2.0 to "...accommodate artifact content in cases that have positional context relative to real content within the structure tree." (ISO 32000-2:2017, 14.8.2.2.2).

What is "positional context"? Think of line numbers in a contract. As a user, you'd like to be able to read the contract without being forced to also read the line numbers. But on the other hand, you need to know what line you are reading. You might also want to reflow the text and retain information about the line numbers.

In PDF, content semantics are established by means of "tagged PDF" which establishes a logical ordering of the semantic content (headings, paragraphs, lists, etc.) in a structure tree, making it available for reuse.

If we assume logical reading order in the structure tree, then using Artifact structure elements to enclose the line numbers associated with each line of text makes it possible to represent this use case in a reusable fashion.

New subtypes for use with the Artifact structure element type

In PDF 2.0, achieving these objectives in PDF syntax means simply tagging the line numbers with an Artifact structure element with the appropriate subtype. The rest is up to supporting implementations, which may then choose to represent the line numbers to the end user, or not, as may be desired.

Line numbers were the first and most obvious case considered for an "artifact with logical position with respect to real content", and are addressed via Artifact structure elements with subtype LineNum (see ISO 320000-2, Table 363).

Screenshot indicating a redaction with Artifact tag and subtype of "Redaction". The second use case was redaction.

On the same basis as line numbers, PDF 2.0 Artifact structure elements with subtype Redaction make it possible to semantically locate a redaction (or more properly, the content identifying a redaction) in the logical order of the PDF file.

Well-tagged PDF 2.0 files make it possible for any (in principle) PDF content to be located, marked for redaction, and then redacted, with the redaction preserved for any downstream reuse.

Much as a razor shows redactions via a hole in the paper, PDF redaction annotations and structure elements provide a significant improvement in the end product.

Featured articles

Discover pdfa.org

Key resources

Get involved

Why the need to redact implies using PDF

The Mueller Report

The Mueller Report

The future of redaction

Technical Annex: Redaction features in the PDF specification

Redaction annotations (PDF 1.7)

Semantic enhancements for redaction (PDF 2.0)

New Artifact structure element type

New subtypes for use with the Artifact structure element type