For a few years now, the PDF industry has been working on a new mechanism to improve integrity protection in encrypted documents. ISO/TC 171/SC 2, the committee that manages the PDF standard, took ownership of that effort in the form of the ISO 32004 project.
A few months ago, that project moved into the DTS stage — the final step prior to publication! Now's a great time to go over which problems ISO/TS 32004 sets out to solve, and perhaps more importantly, which problems it doesn't solve. Since I've been leading the project for the past year, I figured that I was in a good position to provide that write-up.
The goal of this post is not to explain what ISO/TS 32004 says exactly, nor to tell you how you should implement it at the technical level. Rather, I want to help other PDF developers understand the context of the document so they can make better use of it.
A few years ago, a team of security researchers at the Ruhr University Bochum (RUB) described a number of attacks against PDF's encryption feature set.
Broadly, the attacks announced fall into two categories.
The former category is more properly classified as exploiting the behaviour of specific viewers, and that's not what this post is about. However, the latter set of attacks really points to issues with the specification itself. Here are some of the highlights.
First up, we have cryptographic malleability issues.
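To get a feel for what malleability means in practice, here's a minimal sketch using a toy XOR keystream cipher. This is deliberately not the cipher mode used by PDF; the `keystream` and `toy_encrypt` helpers are made up for this illustration. The point is only that encryption without integrity protection lets someone who knows (or guesses) the plaintext rewrite it without ever learning the key.

```python
import hashlib


def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream (SHA-256 over key + counter). For illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def toy_encrypt(key: bytes, data: bytes) -> bytes:
    # XOR with the keystream; decrypting is the same operation.
    return xor_bytes(data, keystream(key, len(data)))


key = b"known only to sender and recipient"
plaintext = b"Pay 100 EUR to Alice"
ciphertext = toy_encrypt(key, plaintext)

# The attacker never touches `key`: they only XOR the ciphertext with the
# difference between the known plaintext and the text they want instead.
target = b"Pay 900 EUR to Carol"
tampered = xor_bytes(ciphertext, xor_bytes(plaintext, target))

# The legitimate recipient decrypts with the correct key...
recovered = toy_encrypt(key, tampered)
# ...and ends up with the attacker's text.
```

Real-world attacks on block cipher modes are more involved, but the underlying weakness is the same: nothing in the ciphertext tells the recipient that it was modified.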
Moreover, there are also fundamental ways in which the design of the PDF standard's encryption features leaves documents wide open to manipulation, irrespective of any cryptographic issues.
So, the upshot is this: given a legitimate encrypted PDF document, a sufficiently clever attacker can make the document say whatever they want without knowing the key. If an unsuspecting user then opens the document in their favourite viewer, they'll get a password prompt (as they probably expected). When the correct password is entered, the viewer will dutifully display the attacker-controlled content, and our poor user is none the wiser.
This is a problem, because in the mind of a business user — and the same probably goes for most tech people! — something being "password-protected" is pretty much synonymous with "kept under lock and key". People expect password-protected data to be totally inaccessible and untouchable without knowledge of the password. This expectation is being subverted by PDF's approach to encrypting documents.
Clearly, something had to be done...
The mismatch between users' (by no means unreasonable) expectations and reality is due to the conflation of the following two security properties:
Historically, the authentication aspect was not a concern in the design of PDF's encryption features. That point of view is now very dated[1], but the benefit of hindsight doesn't really help us solve the issue at hand. So, what can we do?
In modern cryptographic practice, the authentication problem is typically addressed by augmenting the ciphertext with a Message Authentication Code (MAC). A MAC is a kind of keyed[2] digest, computed over the ciphertext. Tampering with the ciphertext invalidates the MAC, and to recompute the MAC, you need access to the key. The idea is that the receiving party independently computes the MAC over the given ciphertext prior to decryption, and rejects the message if the result doesn't match the value supplied by the sender.
It's crucial here that validating a MAC requires being able to recreate it. Both operations require access to the same key. There's no public/private key distinction. All participants in the process have exactly the same capabilities when it comes to producing and verifying MACs.
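As a generic illustration of that flow (not the construction used in ISO/TS 32004), here's a minimal sketch with HMAC-SHA256 from Python's standard library. The function names and the sample ciphertext are assumptions made for the example. Note that verification is nothing more than recomputing the MAC with the same key and comparing.

```python
import hashlib
import hmac
import secrets


def compute_mac(mac_key: bytes, ciphertext: bytes) -> bytes:
    """Sender side: keyed digest over the ciphertext."""
    return hmac.new(mac_key, ciphertext, hashlib.sha256).digest()


def verify_mac(mac_key: bytes, ciphertext: bytes, received_mac: bytes) -> bool:
    """Receiver side: recompute with the same key, compare in constant time."""
    expected = compute_mac(mac_key, ciphertext)
    return hmac.compare_digest(expected, received_mac)


mac_key = secrets.token_bytes(32)           # shared between both parties
ciphertext = b"opaque encrypted bytes"      # stand-in for real ciphertext
tag = compute_mac(mac_key, ciphertext)

ok_untouched = verify_mac(mac_key, ciphertext, tag)
ok_tampered = verify_mac(mac_key, ciphertext + b"!", tag)
ok_wrong_key = verify_mac(secrets.token_bytes(32), ciphertext, tag)
```

Only the untouched ciphertext checked with the right key passes. Anyone holding `mac_key` can call `compute_mac` just as easily as `verify_mac`, which is exactly the symmetry described above.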
There are many different ways to construct MACs. Here are some examples:
Most authenticated encryption schemes in common use (including AES-GCM, AES-CCM, ChaCha20-Poly1305 and many others) are constructed by combining an encryption primitive with a MAC function[3].
So, having digested all that information, it seems that all we have to do is to apply a MAC to our encrypted data. Now, let's figure out how that's supposed to work in a PDF document.
The initial requirements of the PDF MAC project were more or less the following.
The MAC scheme standardised in ISO/TS 32004 integrates into PDF in much the same way as digital signatures.
The following figure illustrates what an ISO/TS 32004 MAC looks like in context. The colour coding indicates the parts of the covered byte range:
This ByteRange-based approach works well enough, but there's a snag: any given revision of a PDF file can only have a single "complete" ByteRange! Early drafts of ISO/TS 32004 solved this problem by decreeing that MACs could only be used in unsigned documents.
Since MACs and digital signatures serve very different purposes, that incompatibility didn't sit well with me. Especially since there's an easy, backwards-compatible solution! PDF signatures use CMS `SignedData`, which supports attributes, so we could simply let the MAC token "hitch a ride" on the signature. That way we only need a single ByteRange to make both the signature and the MAC work[5]. The structure of the MAC token is otherwise pretty much the same as in the unsigned case.
While signatures in encrypted documents aren't a very common sight, we occasionally come across signed encrypted documents. By allowing MACs in any encrypted document, we can achieve (more or less) the same integrity guarantees for all such documents. This uniformity also benefits validation: a MAC checker with zero signature validation capabilities shouldn't have to make judgment calls about whether documents with a signature (and no MAC) are adequately protected. In addition, the fix was simple enough that sacrificing compatibility wasn't worth it. After some discussion, we decided to put it in the spec.
For an example where the separation of concerns between MACs and signatures is even more clear: a document timestamp signature ordinarily has no authenticating value. Timestamping servers don't care about what they sign, and it's possible to add a timestamp to an encrypted document without knowing the key. In other words, there's no accountability at all. Adding an ISO/TS 32004 MAC token as an (unsigned) attribute on the signature is a way to solve that problem.
In the PDF world, backwards compatibility is a big deal. When new functionality is considered for standardisation, one of the most important criteria involves evaluating how existing software would cope with the change.
This was no different for ISO/TS 32004: a document with a MAC still needs to be understood by software that doesn't know how MACs work. That, in itself, is a good thing.
The converse problem is more tricky, though: how can a MAC-aware processor tell the difference between a "legacy" document without a MAC, and a document from which the MAC has been (maliciously) stripped?[6] Paranoid implementations could perhaps enforce MACs rigorously, but that might not be feasible for everyone.
To address this concern, ISO/TS 32004 defines an extra permission bit to indicate whether a MAC is expected to be present. Since the permission bits are already protected by a "pseudo-MAC" of sorts[7] in PDF 2.0, there's a degree of tamper-resistance built in.
PDF's standard security handler distinguishes between "user passwords" and "owner passwords". It's somewhat common for people to apply encryption to a PDF document, but leave the user password empty, while still setting the owner password to something else. This is the digital equivalent of a "No Trespassing" sign on an unguarded fence. Sure, bona fide viewers will enforce permission bits if the owner password is not supplied, but nonetheless, anyone can compute the file encryption key if the user password is left empty.
In other words, from a purely technical perspective, anyone can decrypt and modify the document content. This is no different in a scenario where MACs are used: if the user password is empty, anyone can validate, but also regenerate the MAC. In other words, a MAC offers precisely zero protection in this situation.
Remember: a MAC that anyone can verify is a MAC that anyone can forge. MACs are based on shared secrets. They're most useful if the relationship between the parties involved in a workflow is symmetric (i.e. everyone has the same access level).
In fact, if I were to write some piece of software that required MACs for everything, I'd actively reject PDF files with empty user passwords.
A MAC should cover the entire document to which it is applied (other than the MAC container itself). As with digital signatures, coverage is indicated by the associated ByteRange. When receiving a document with a MAC, it's very important to check the ByteRange: if the covered region is too small, unauthorised changes could still lurk in the "unprotected" regions. Processors incrementally updating a document are also expected to update the MAC (including the coverage range).
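A coverage check along those lines could look like the sketch below. It assumes the ByteRange uses the same flat `[offset, length, offset, length]` layout as PDF signatures, with a single gap for the container; the function name is made up for this example. It doesn't verify that the gap lines up exactly with the MAC container, which a real validator would also have to check.

```python
from typing import Sequence


def covers_whole_file(byte_range: Sequence[int], file_size: int) -> bool:
    """Check that a two-segment ByteRange spans the file, minus one gap."""
    if len(byte_range) != 4:
        return False
    first_off, first_len, second_off, second_len = byte_range
    # The first segment has to start at the very beginning of the file.
    if first_off != 0 or first_len <= 0 or second_len <= 0:
        return False
    # The second segment has to start after the first, leaving a gap
    # for the container.
    if second_off <= first_len:
        return False
    # The second segment has to run to the end of the file, so nothing
    # can hide behind the covered region.
    return second_off + second_len == file_size


# 500-byte file, container placeholder in bytes 100..199
fresh = covers_whole_file([0, 100, 200, 300], 500)
# Same MAC after 150 bytes were appended in an incremental update
stale = covers_whole_file([0, 100, 200, 300], 650)
```

Here `fresh` is `True` and `stale` is `False`: the appended bytes fall outside the covered range.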
This presents a problem: what to do when one receives a document with a "stale" MAC (i.e. a MAC that doesn't cover the full document anymore). That could be the result of a malicious edit, but also due to an authorised change by a tool that doesn't implement ISO/TS 32004.
Again, this one boils down to choosing compatibility vs. security, and there's no easy answer. Personally, I would write my implementation to reject such documents by default.
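To make the trade-off tangible, here's one way such a policy could be written down. Everything in this sketch is hypothetical: the `MacFindings` fields stand for facts that a validator would have established earlier (including the "MAC expected" permission flag), and the `strict` switch is just one possible way of exposing the compatibility vs. security choice.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MacFindings:
    """Facts a (hypothetical) validator already established about a file."""
    mac_present: bool
    mac_expected: bool         # the "MAC expected" permission flag
    mac_valid: bool            # only meaningful if mac_present
    covers_whole_file: bool    # only meaningful if mac_present
    user_password_empty: bool


def rejection_reasons(f: MacFindings, strict: bool = True) -> List[str]:
    """Return the reasons to reject a file; an empty list means 'accept'."""
    reasons = []
    if f.mac_present:
        if not f.mac_valid:
            reasons.append("MAC does not match the document")
        elif not f.covers_whole_file and strict:
            reasons.append("stale MAC: data outside the covered range")
    elif f.mac_expected or strict:
        # Either the file claims to have a MAC, or we insist on one anyway.
        reasons.append("MAC missing")
    if strict and f.user_password_empty:
        reasons.append("empty user password: anyone can regenerate the MAC")
    return reasons


stale = MacFindings(mac_present=True, mac_expected=True, mac_valid=True,
                    covers_whole_file=False, user_password_empty=False)
stripped = MacFindings(mac_present=False, mac_expected=True, mac_valid=False,
                       covers_whole_file=False, user_password_empty=False)

strict_verdict = rejection_reasons(stale)
lenient_verdict = rejection_reasons(stale, strict=False)
stripped_verdict = rejection_reasons(stripped, strict=False)
```

With these inputs, the strict policy rejects the stale MAC while the lenient one lets it through, and a stripped MAC is rejected in both modes because the file itself says a MAC should be there.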
Cryptographically speaking, adding MACs to all[8] encrypted PDF documents seems like the obvious thing to do. Ultimately, the security of the MAC process is protected by the same shared secret as the document encryption, so there's no need to involve external keying material (as would've been the case with signatures). In particular, the user doesn't necessarily need to do anything special to benefit from MACs.
That said, MACs do come with an I/O performance cost. As with signatures, the PDF data needs to be serialised to a place where the MAC can be inserted in place later.
Depending on the application architecture, that cost might be completely negligible, or prohibitive[9]. I'd still recommend turning it on by default if you can, especially when updating files that already have MACs!
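Part of that cost can be kept in check by digesting the covered ranges in chunks instead of loading the whole file. The sketch below shows only that step, under a few assumptions: SHA-256 as the digest, a seekable stream, and a ByteRange given as flat `(offset, length)` pairs. The helper name is invented for this example, and wrapping the digest into an actual MAC token is not shown.

```python
import hashlib
import io
from typing import BinaryIO, Sequence


def digest_byte_range(stream: BinaryIO, byte_range: Sequence[int],
                      chunk_size: int = 64 * 1024) -> bytes:
    """Digest the covered ranges of a seekable stream, one chunk at a time."""
    md = hashlib.sha256()
    # byte_range is a flat sequence of (offset, length) pairs.
    for offset, length in zip(byte_range[0::2], byte_range[1::2]):
        stream.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = stream.read(min(chunk_size, remaining))
            if not chunk:
                raise ValueError("byte range extends past the end of the data")
            md.update(chunk)
            remaining -= len(chunk)
    return md.digest()


# Stand-in for a serialised file with a 13-byte placeholder in the middle.
data = b"A" * 10 + b"<placeholder>" + b"B" * 10
digest = digest_byte_range(io.BytesIO(data), [0, 10, 23, 10], chunk_size=4)
```

The placeholder bytes never enter the digest, so they can be overwritten afterwards without invalidating it. The same function works on a file opened with `open(path, "rb")`.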
ISO/TS 32004 provides you with a MAC-based tool to protect your encrypted PDF files from malicious tampering. The MAC is bootstrapped from the same shared secret as the encryption uses.
Additionally, ISO/TS 32004 is fully backwards compatible and can be used in conjunction with all PDF 2.0 features, including digital signatures.
Things to keep in mind:
Oh, and if you're keen to implement ISO/TS 32004 yourself once it's out, please give Annex B a proper read for some extra info on what to look out for when validating MACs in PDF documents.
1: The technology and standards to include this kind of integrity protection have also been around for a long time. For example, the research around HMAC dates from the mid-90s, and the IETF first standardised it in '97. PDF 2.0, which included a major restructuring of the file encryption functionality in PDF, first saw the light of day two decades later.
2: The MAC key and the encryption key are usually derived from a common piece of secret data that is shared between the communicating parties.
4: To be pedantic: the MAC key in itself actually isn't derived from the password or the file encryption key in ISO/TS 32004. Rather, it's encrypted using a key derived from the file encryption key, following a common pattern used with CMS `AuthenticatedData` and `EncryptedData`.
5: In a signed revision of a PDF document, the MAC token is actually computed over the digest of the byte range together with a digest of the signature. As such, it protects both the signature and the document content.
7: The "pseudo-MAC" being a separate entry computed by encrypting the permission bits using AES-ECB (!). Perhaps ironically, this piece of known AES plaintext was a large part of why we needed ISO/TS 32004 in the first place...
9: There are ways around that, though. In unsigned revisions, ISO/TS 32004 forces the MAC container to reside in (a subdictionary of) the document trailer, which means that it's somewhere near the end of the file. That can be leveraged to generate MACs very efficiently, even on large documents that don't fit into memory. Retrofitting that onto a legacy codebase is of course easier said than done.
A version of this article was originally posted at https://mvalvekens.be/blog/2022/about-iso32004.html
Independent PDF expert and FOSS developer, co-representing the Belgian standards body NBN in ISO/TC 171/SC 2. I’m the current project leader for ISO/TS 32004. I was employed as a Research Engineer at iText in Ghent between 2020 and 2022. Before that, I spent a few years doing mathematical research at KU Leuven, where I obtained my PhD in mathematics in …