MD5 To Be Considered Harmful Someday
I’ve been doing some analysis on the MD5 collision announced by Wang et al. Short version: Yes, Virginia, there is no such thing as a safe hash collision, at least in a function that’s specified to be cryptographically secure. The full details may be acquired at the following link:
A tool, Stripwire, has been assembled to demonstrate some of the attacks described in the paper. It may be acquired at the following address:
Incidentally, the expectations management is by no means accidental: the paper is titled “MD5 To Be Considered Harmful Someday” for a reason. Some people have said there are no applied implications to Joux and Wang’s research. They’re wrong; arbitrary payloads can be successfully integrated into a hash collision. The attacks are not wildly practical, though, and in most cases exposure remains thankfully limited, for now. Still, the risks are real enough that responsible engineers should take note. This is not merely an academic threat: systems designed with MD5 now need to take far more care than they would if they were employing an unbroken hashing algorithm, and the problems are only going to get worse.
Some highlights from the paper:
- The attack itself is pretty limited — essentially, we can create “doppelganger” blocks (my term) anywhere inside a file that may be swapped out, one for another, without altering the final MD5 hash. This lets us create any number of binary-inequal files with the same md5sum.
- MD5 uses an appendable cascade construction. In other words, if you happen to find yourself with two files that MD5 to the same hash, the same arbitrary payload can be appended to both files and they’ll still have the same hash. This leads to…
- Attacks are possible using only the proof of concept test vectors released by Wang — the actual attack is not necessary.
- Stripwire emits two binary packages. They both contain an arbitrary payload, but the payload is encrypted with AES. Only one of the packages (“Fire”) is decryptable and thus dangerous; the other (“Ice”) shields its data behind AES. Both files share the same MD5 hash.
- Digital signature systems are vulnerable, as they almost always sign a hashed representation of data rather than the data itself.
- This is an excellent vector for malicious developers to get unsafe code past a group of auditors, perhaps to acquire a required third party signature. Alternatively, build tools themselves could be compromised to embed safe versions of dangerous payloads in each build. At some later point, the embedded payload could be safely “activated”, without the MD5 changing. This has implications for Tripwire, DRM, and several package management architectures.
- HMAC’s invulnerability has been slightly overstated. It’s definitely possible, given the key, to create two datasets with the same HMAC. Attacker possession of the key violates MAC presumptions, so the impact of this is particularly questionable.
- Very interesting possibilities open up once the full attack is made available — among other things, we can create self-decrypting executables (fire.exe and ice.exe) that exhibit differential behavior based on their internal colliding payloads. They’ll still have the same MD5 hash.
- Several doppelgangers may (relatively quickly, as per Joux) be computed within a single multicollision-friendly block. As such, the particular selection of doppelganger sets within a file can itself be made to represent data. It’s relatively straightforward to embed a 128-bit signature inside an arbitrary file, in such a way that no matter the value of the signature, a constant MD5 hash is maintained. This is curiously steganographic.
- Many popular P2P networks (and innumerable distributed content databases) use MD5 hashes as both a reliable search handle and a mechanism to ensure file integrity. This makes them blind to any signature embedded within MD5 collisions. We can use this blindness to track MP3 audio data as it propagates from a custom P2P node. “Strikeback” capacity against executable trafficking is even more pronounced — it’s possible to create application installers that self-modify with host identifying characteristics but still successfully retransmit on P2P networks under the global search hash.
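The doppelganger, append, and signature-embedding points above can be illustrated with a toy model. The sketch below is not MD5 and not Wang's attack: it assumes a deliberately weak hash with the same iterated structure (a 16-bit chaining state built from truncated MD5, with 8-byte blocks) so that colliding blocks can be brute-forced almost instantly. All names (`toy_hash`, `find_doppelgangers`, `build_pairs`, `encode`, `decode`) are hypothetical and are not part of Stripwire.

```python
import hashlib
import struct

BLOCK = 8             # toy block size in bytes (real MD5 uses 64)
STATE = 2             # toy chaining state in bytes (real MD5 uses 16)
IV = b"\x00" * STATE  # assumed toy initial value, not MD5's


def compress(state, block):
    """Toy compression function: truncated MD5 of state || block."""
    return hashlib.md5(state + block).digest()[:STATE]


def toy_hash(msg):
    """Iterate compress() over padded blocks, mirroring MD5's overall shape."""
    padded = msg + b"\x80"
    padded += b"\x00" * (-(len(padded) + 8) % BLOCK)
    padded += struct.pack("<Q", len(msg) * 8)  # length padding
    state = IV
    for i in range(0, len(padded), BLOCK):
        state = compress(state, padded[i:i + BLOCK])
    return state


def find_doppelgangers(state):
    """Brute-force two distinct blocks that map `state` to the same next state."""
    seen = {}
    counter = 0
    while True:
        block = struct.pack("<Q", counter)
        out = compress(state, block)
        if out in seen:
            return seen[out], block, out
        seen[out] = block
        counter += 1


def build_pairs(k):
    """Chain k doppelganger pairs, each found from the state the last one left."""
    state = IV
    pairs = []
    for _ in range(k):
        a, b, state = find_doppelgangers(state)
        pairs.append((a, b))
    return pairs


def encode(pairs, bits, payload):
    """Pick one block per pair according to `bits`, then append the payload."""
    body = b"".join(pair[bit] for pair, bit in zip(pairs, bits))
    return body + payload


def decode(pairs, data):
    """Recover the embedded bits by checking which block of each pair is present."""
    return [pairs[i].index(data[i * BLOCK:(i + 1) * BLOCK])
            for i in range(len(pairs))]


pairs = build_pairs(8)
payload = b"arbitrary appended payload"
file_a = encode(pairs, [0] * 8, payload)
file_b = encode(pairs, [1, 0, 1, 1, 0, 0, 1, 0], payload)

assert file_a != file_b                      # the files differ byte for byte
assert toy_hash(file_a) == toy_hash(file_b)  # yet the toy hash is identical
print(toy_hash(file_a).hex(), decode(pairs, file_b))
```

Because every pair leaves the chaining state unchanged, any of the 2^8 selections hashes identically, and the appended payload never disturbs that. Under the same assumptions, 128 chained pairs would carry a 128-bit signature. Against real MD5 the pairs cannot be brute-forced this way; they would have to come from the actual collision attack.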
I hope this paper proves useful to the security community at large, and I welcome feedback.