Datalogics
Status: Partner Member
Country: US
Sector: All industries
Joined at: Feb 08
Website: http://www.datalogics.com/

Linked User
Maryanne Pavlin
Matt Kuznicki
Nicki Bullock
Vel Genov
Emma Kaschke
Leonard Ho

Is the Information You Just Redacted Really Gone?

So, your organization is redacting sensitive information, like social security numbers, out of documents prior to making them available to the public. One common practice used to be adding a black box over the targeted content in Word, then converting the document to a PDF file using Word’s built-in converter. The resulting PDF looks perfectly redacted; after all, the content is blacked out. Everything is going great until, a couple of years down the line, someone in your organization realizes that anyone can just move the black box “redaction” to uncover the social security number underneath.

This happens more often than you would think. I was on a call with a company that had done just that. They had thousands of incorrectly redacted documents and were looking for an automated solution to perform real redactions on those documents.

I came to realize that many of the problems around bad redactions may stem from the fact that it’s not clear what a real redaction is. So, what is redaction? In PDF, redaction is the act of removing content directly from the content stream of the page. An optional piece of content is usually added in place of the removed content to indicate that something has changed. This is traditionally a black box, but it does not have to be a box, and the color does not have to be black. The important and mandatory part of redaction is that the content is permanently removed from the document.
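To make the difference between covering and removing concrete, here is a minimal Python sketch. It works on a hand-written, uncompressed content stream that uses standard PDF operators (`Tj` shows text, `re` and `f` draw and fill a rectangle). The stream, the function names, and the regular expressions are illustrative assumptions; real content streams are usually compressed and can encode text in several other ways, so this is not how a production tool parses PDF.

```python
import re

# A simplified, uncompressed page content stream (illustrative only).
# The text is drawn first, then a filled black rectangle is painted over it.
CONTENT_STREAM = """\
BT /F1 12 Tf 72 700 Td (SSN: 123-45-6789) Tj ET
0 0 0 rg
70 695 150 16 re f
"""

SSN = r"\d{3}-\d{2}-\d{4}"


def extract_shown_strings(stream: str) -> list[str]:
    """Return the literal strings passed to the Tj (show text) operator."""
    return re.findall(r"\((.*?)\)\s*Tj", stream)


def drop_text_showing(stream: str, pattern: str) -> str:
    """Toy 'real' redaction: delete lines whose shown text matches pattern."""
    kept = []
    for line in stream.splitlines():
        if any(re.search(pattern, s) for s in extract_shown_strings(line)):
            continue  # the text object itself is removed, not just covered
        kept.append(line)
    return "\n".join(kept) + "\n"


# The black box alone hides nothing from a text extractor.
covered = extract_shown_strings(CONTENT_STREAM)

# After removal, the box is still painted but there is no text under it.
redacted_stream = drop_text_showing(CONTENT_STREAM, SSN)
redacted = extract_shown_strings(redacted_stream)

print(covered)
print(redacted)
```

In the first case the social security number is still sitting in the stream underneath the rectangle; in the second, only the rectangle remains.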

Redaction is typically a two-step process, with an optional third step:

  1. The content to be redacted is identified and redaction annotations are placed over it
  2. The redaction annotations are reviewed and applied, permanently removing the content
  3. An additional step is to sanitize the document, cleaning up sneaky data like metadata, bookmarks, links, and anything that could have content in it that you do not want available
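The three steps above can be sketched in Python on a toy document model. Everything here is hypothetical: `ToyDocument`, `Mark`, and the three functions are stand-ins invented for illustration, not the API of Acrobat or of any PDF library, and the social security number pattern is deliberately simple.

```python
import re
from dataclasses import dataclass, field

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a PDF: page text plus ancillary data."""
    pages: list[str]
    metadata: dict[str, str] = field(default_factory=dict)
    bookmarks: list[str] = field(default_factory=list)


@dataclass(frozen=True)
class Mark:
    """A pending redaction: page index and character span."""
    page: int
    start: int
    end: int


def identify(doc: ToyDocument) -> list[Mark]:
    """Step 1: mark every match for review; nothing is removed yet."""
    return [Mark(i, m.start(), m.end())
            for i, text in enumerate(doc.pages)
            for m in SSN.finditer(text)]


def apply_marks(doc: ToyDocument, marks: list[Mark]) -> None:
    """Step 2: permanently replace each marked span with filler characters."""
    # Work from the end of each page so earlier offsets stay valid.
    for mark in sorted(marks, key=lambda m: (m.page, m.start), reverse=True):
        text = doc.pages[mark.page]
        filler = "#" * (mark.end - mark.start)
        doc.pages[mark.page] = text[:mark.start] + filler + text[mark.end:]


def sanitize(doc: ToyDocument) -> None:
    """Step 3 (optional): drop ancillary data that may repeat the content."""
    doc.metadata.clear()
    doc.bookmarks.clear()


doc = ToyDocument(
    pages=["Name: J. Doe, SSN 123-45-6789"],
    metadata={"Subject": "Record for SSN 123-45-6789"},
    bookmarks=["Doe 123-45-6789"],
)
marks = identify(doc)      # review these before applying
apply_marks(doc, marks)
sanitize(doc)
print(doc.pages[0])
```

Keeping identification separate from application is what makes the review in step 2 possible: the marks can be inspected, added to, or discarded before anything is permanently removed.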

As you can see, if those steps are not followed properly, many things can go wrong, and you might end up distributing documents that still contain sensitive information. The most common example of incorrectly redacted documents is the one that I started the article with. We focus on the optional black box that goes over the content, and don’t realize the content is still readily available in the document. What if we decide to manually select and delete the content, and then manually add a black box over it? Aside from this being a laborious process, there are some major downsides to it. A lot of tools keep versions of a document without us ever realizing that. Those versions will contain previously deleted content. Metadata can also contain previously deleted content, or references to it. Properly redacting a document will take care of all of those issues.

Another common mistake while attempting to redact a document is to change the font color of sensitive information to simply match the background. The idea is that if you can’t see the text, it’s not there. This is perhaps the least secure of all the incorrect redaction methods available. Simply selecting all the text on a page will reveal all the “hidden” text. Furthermore, this text can be searched for, and changing the font color back to a visible one is easy.

You want to use a tool that is designed for proper redaction. But what if you use a tool that claims to redact a document but does a poor job? How do you know? Such tools are more common than you would think, so let me give you some pointers on redaction:

  1. Make sure the document is sanitized after the redaction. As I mentioned before, the document’s metadata can contain sensitive information. There can be bookmarks and links. Previous versions of documents can contain information we thought was redacted, making it readily available. Search indexes and review comments are also good hiding spots for sensitive data. A good PDF redaction tool will clean up all those, and more, during sanitization.
  2. Use a tool that redacts cleanly. Some tools can redact just fine, but they are what I call too ‘loose’. Instead of redacting just the social security number, for example, you also lose nearby content above, below, left, and right of what you were targeting. See image below for an example of what I mean.
  3. Redaction is commonly used with text. However, redaction can apply to different types of content – diagrams for example. While you are trying to partially redact sensitive information out of a diagram, the tool you are using might not be able to do that and redact the whole image instead. Make sure the tool you are using handles image redaction properly.

The screenshot above is from a document redacted with a popular PDF tool. The tool not only redacted the desired information, but also text on one line above and below each redaction.
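One way to picture the difference between a loose and a clean redaction is as a question of how much of a word's bounding box must fall inside the redaction rectangle before the word is removed. The Python sketch below is an assumption-laden illustration: the `Box` class, the coordinates, and the `min_covered` threshold are all made up, and real tools typically work on individual glyphs rather than whole words.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page units; (x0, y0) is the lower-left corner."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    def overlap(self, other: "Box") -> float:
        """Area shared by the two rectangles (0.0 if they do not intersect)."""
        w = min(self.x1, other.x1) - max(self.x0, other.x0)
        h = min(self.y1, other.y1) - max(self.y0, other.y0)
        return max(0.0, w) * max(0.0, h)


def words_to_remove(words, region, min_covered=0.5):
    """Select words whose box lies at least `min_covered` inside `region`."""
    return [text for text, box in words
            if box.area() > 0 and box.overlap(region) / box.area() >= min_covered]


# Three lines of text; the redaction rectangle is drawn around the middle
# line but spills one unit into the lines above and below it.
words = [
    ("Account", Box(72, 714, 120, 726)),
    ("123-45-6789", Box(72, 700, 150, 712)),
    ("Balance", Box(72, 686, 118, 698)),
]
region = Box(70, 697, 152, 715)

tight = words_to_remove(words, region)                    # majority coverage
loose = words_to_remove(words, region, min_covered=0.01)  # any real overlap
print(tight)
print(loose)
```

With the loose rule, the neighbouring lines are removed along with the number. The threshold cuts both ways, though: set it too high and a partially covered piece of sensitive text could survive, so any such rule still needs review.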

This is what this document should look like when it’s properly redacted.  See image below.

What tools and methods can we use to redact PDF documents properly? Adobe Acrobat is the industry leader when it comes to end-user PDF tools. It has a very comprehensive set of redaction tools, and it’s the preferred and recommended tool of many US government institutions. Plus, there are great resources available that explain the redaction process in detail. You can see instructions on how to redact a PDF document using Acrobat DC here. If you are using Acrobat X, check out Rick’s Acrobat X Redaction Guide.

While Acrobat is great for redacting single documents, what happens when you want to redact batches of documents? That’s where Datalogics comes in. We offer the Adobe PDF Library, the tool that drives Acrobat and its redaction process. We also offer another Adobe tool that can help you redact documents, the Datalogics PDF Java Toolkit. Both support batch redaction as well as automated, on-the-fly redaction of individual documents.
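For a backlog like the thousands of incorrectly redacted documents mentioned earlier, a batch audit is one way to find which files still need work. The Python sketch below is only an outline under stated assumptions: `audit_folder` and `read_as_text` are hypothetical names, and the text extractor is passed in by the caller because the extraction itself would come from whatever PDF library you use. The demo files are plain text with a `.pdf` suffix, standing in for real PDFs.

```python
import re
import tempfile
from pathlib import Path
from typing import Callable

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def audit_folder(folder: Path,
                 extract_text: Callable[[Path], str],
                 suffix: str = ".pdf") -> dict[str, int]:
    """Map each file name to the number of SSN-like strings still extractable."""
    report = {}
    for path in sorted(folder.glob(f"*{suffix}")):
        hits = len(SSN.findall(extract_text(path)))
        if hits:
            report[path.name] = hits
    return report


def read_as_text(path: Path) -> str:
    """Placeholder extractor for the demo; a real one would parse the PDF."""
    return path.read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "a.pdf").write_text("SSN 123-45-6789", encoding="utf-8")
    (folder / "b.pdf").write_text("nothing sensitive", encoding="utf-8")
    report = audit_folder(folder, read_as_text)

print(report)
```

An empty report from a check like this is not proof that a document is clean, since metadata, bookmarks, and older versions also need to be inspected, but a non-empty one is a clear sign that a redaction did not remove the content.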

Redaction is a very important tool in the document market. If it is performed incorrectly, sensitive information can leak to the public, potentially leading to lawsuits and scandals. To avoid that, make sure your documents are redacted correctly, the right processes are in place, and you are using the right tools.

For more information about redaction, contact us.

Related Products
Adobe PDF Library


The Adobe PDF Library SDK is a low-level PDF library that contains a powerful set of native C/C++ APIs, with interfaces for .NET and Java. Systems integrators, independent software vendors (ISVs), enterprise IT developers, and others can integrate Adobe PDF functionality within custom applications in a client and/or server environment.

PDF Java Toolkit


Datalogics PDF Java Toolkit is a native Java library that provides high-level APIs for automating PDF workflows like processing PDF forms, verifying digital signatures, and extracting text. It also offers low-level APIs for working directly with the structure of the PDF for those times you need it.

Adobe Normalizer


Adobe Normalizer is an API that allows developers to quickly and easily convert Encapsulated PostScript (EPS) and PostScript (PS) files to Adobe’s Portable Document Format (PDF). The industry-standard Adobe Distiller and Distiller Server are themselves built upon PDF Converter SDK, and now this API is available separately to application developers.

Adobe PDF Print Engine


The Adobe PDF Print Engine is a common rendering engine technology, packaged as a software development kit (SDK). It can be the basis for a variety of products for previewing and printing Adobe Portable Document Format (PDF) documents at different stages of the professional print workflow.

PDF2IMG


Datalogics PDF2IMG is a command-line utility that converts PDF files to a variety of image formats including PNG, JPG, TIFF, BMP, and more. It is built upon the Adobe PDF Library and uses Adobe technology for unrivaled color management during the PDF conversion process.

PDF Alchemist


Datalogics PDF Alchemist is a new (C/C++) SDK for intelligently extracting text and images from PDFs and exporting to HTML 5 or EPUB. It employs sophisticated techniques to identify and reconstruct “text flows” within the PDF.