Closing known routes for leakage in redacted documents
There is a well-defined, two-step mechanism for redacting PDF files: first, “redaction annotations” are created to mark up the items to be redacted; second, those annotations are applied all at once.
While simple in theory, in practice this process has been found to leak information. In particular, in a number of recent high-profile cases, PDF-based redaction has been found to have leaked data.
Recently, an Australian Government-sponsored research project found significant information leakage even when supposedly secure solutions were used. Adobe has attempted to address this with a technical note suggesting a range of additional (involved) operations that should be performed to close the currently known routes for leakage, but few other solutions yet implement all of these suggestions.
Even with these mechanisms fully implemented, it is by no means clear that additional problems will not be found in future.
Because of PDF’s history of failures in this field, the standard way of redacting documents appears to be to print them out, apply black marker, and scan them back to PDF. An optional OCR step can be used to ensure that the remaining (unredacted) text is still searchable. This has obvious quality and cost implications.
We propose an equivalent process (without the printing or scanning), dubbed High-Security Redactions. In this method, standard PDF redactions are first applied and the pages are converted to images. Second, the images are OCRed and repackaged into PDF pages, with the OCRed text drawn invisibly to provide searchability, selection/copy, and text annotation (e.g. highlighting) capability. In this way, we can guarantee that ‘what you see is what you get’, i.e. all the information in the final document is either the page images themselves or derived from those images.
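The data flow of this process can be sketched in a few lines of Python. This is a minimal illustration and not the actual implementation: every name in it (`high_security_redact`, `Word`, `OutputPage`, and the three `fake_*` stages) is hypothetical, and the stages are passed in as plain functions so that no particular PDF or OCR library is assumed. The point it shows is structural: the OCR stage receives only the rendered image, so nothing from the source page that was not drawn can reach the output.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

# NOTE: all names below are hypothetical and model the data flow only.
Rect = Tuple[float, float, float, float]  # x0, y0, x1, y1


@dataclass(frozen=True)
class Word:
    text: str
    bbox: Rect


@dataclass(frozen=True)
class OutputPage:
    image: bytes                   # visible content: raster of the redacted page
    hidden_text: Tuple[Word, ...]  # invisible text layer, derived from `image`


def high_security_redact(
    pages: Sequence[Any],
    apply_redactions: Callable[[Any], Any],
    rasterize: Callable[[Any], bytes],
    ocr: Callable[[bytes], Tuple[Word, ...]],
) -> List[OutputPage]:
    """Redact, flatten to pixels, then rebuild pages from the pixels alone."""
    output = []
    for page in pages:
        redacted = apply_redactions(page)  # standard PDF redaction
        image = rasterize(redacted)        # page converted to an image
        words = ocr(image)                 # OCR sees only the image
        output.append(OutputPage(image=image, hidden_text=words))
    return output


# --- Stand-in stages, for illustration only (assumed, not real tools) ---

def fake_apply_redactions(page: dict) -> dict:
    # Drop the words marked for redaction from the visible content.
    kept = [w for w in page["words"] if w not in page["redact"]]
    return {**page, "words": kept}


def fake_rasterize(page: dict) -> bytes:
    # Only visible words reach the "pixels"; metadata is never drawn.
    return " ".join(page["words"]).encode("utf-8")


def fake_ocr(image: bytes) -> Tuple[Word, ...]:
    # Recover words from the "pixels", giving each a dummy bounding box.
    tokens = image.decode("utf-8").split()
    return tuple(
        Word(t, (float(i), 0.0, float(i + 1), 1.0)) for i, t in enumerate(tokens)
    )


page = {
    "words": ["Pay", "Alice", "100"],
    "redact": {"Alice"},
    "metadata": "author: Alice",  # hidden data that never reaches the output
}
result = high_security_redact(
    [page], fake_apply_redactions, fake_rasterize, fake_ocr
)
```

In a real pipeline the three stages would be supplied by a PDF renderer and an OCR engine; the stand-ins here only mimic their inputs and outputs. Because `OutputPage` is built solely from `image` and the OCR result, hidden data such as the `metadata` field has no path into the final document.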