Why collisions are dangerous

The md5 hashing algorithm has been broken since 2004. Today, anyone can generate md5 collisions in a matter of seconds using the hashclash tool, written by Marc Stevens. Note, however, that this tool will not let you generate any reasonable text. It also won’t let you to find a text with a prechosen hash (e.g. a preimage attack). Instead, you will just get 2 random colliding binary blobs.

Here is one example why having 2 colliding texts could be dangerous:

good.py

#!/usr/bin/python

# -*- coding: utf-8 -*-

blob = """
*binary blob #1 generated by hashclash, see the file by link*
"""
from hashlib import sha256
if sha256(blob).hexdigest() == '*sha256 hash of blob #1*':
	print "I come in peace."
elif sha256(blob).hexdigest() == '*sha256 hash of blob #2*':
	print "Prepare to be destroyed!",
else:
	print "Fail"

evil.py

#!/usr/bin/python

# -*- coding: utf-8 -*-

blob = """
*binary blob #2 generated by hashclash, see the file by link*
"""
from hashlib import sha256
if sha256(blob).hexdigest() == '*sha256 hash of blob #1*':
	print "I come in peace."
elif sha256(blob).hexdigest() == '*sha256 hash of blob #2*':
	print "Prepare to be destroyed!"
else:
	print "Fail"

Both of these files have the same md5sum, but different behavior:

$ python good.py
I come in peace.
$ python evil.py 
Prepare to be destroyed!
$ openssl dgst -md5 good.py evil.py 
MD5(good.py)= f03936f224a469b596902303c4258712
MD5(evil.py)= f03936f224a469b596902303c4258712

Ultimately, this could be used against systems, that utilize md5 hash to verify authenticity of files.

Multiple collisions

So, hashclash can generate a pair of collisions. But what if one wants to have N colliding files? Surprisingly, quick search in Google didn’t give me general-purpose script, so I decided to write one. Thankfully, hashclash allows you to pick the prefix before generating a colliding pair, so what you have to do is:

  1. Generate a pair of colliding files, say file0 and file1.
  2. Use one of them, say file0, as prefix to generate new pair: file00 and file01.
  3. Cut the last 128 bytes (i.e. the just generated colliding blobs) of file00 and file01 and append them to file1 to generate file10 and file11. At this point you have 4 colliding files: file00, file01, file10 and file11.
  4. Use file00 as prefix to generate file000 file001.
  5. Cut the last 128 bytes of file000 and file001 and append them to file01, file10 and file11 to generate file010, file011, file100, file101, file110 and file111. At this point you have 8 colliding files.
  6. etc.

Generation of meaningful exploit

Let’s consider a meaningful exploit to be any set of multiple files with same md5 hash, but different behavior. The easiest way to create such an exploit would be to generate the file with following structure:

colliding_file_$i

blob = 'binary blob $i'
if blob == blob1:
	behavior1()
elif blob == blob2:
	behavior2()
...
elif blob == blob_i:
	behavior_i()
...
else:
	some_other_behavior()

Obviously, one can think of better ways to do it this Python (e.g. one could check for a particular sequence of bytes in blob), but this approach is highly generalizable: I tested it in both python (see source code) and in the ~~ugly turing-complete piece of~~ printing language named postscript.

All you have to do for either language is to specify the following:

  • Prefix
    • part before the unique blob(assignment)
    • part after the unique blob
  • if clause:
    • part before the blob1
    • part after the blob1
  • elseif clause:
    • part before the blob$i
    • part after the blob$i
  • else clause:
  • ending
  • function to generate behaviour$i
  • (optional) function to generate final else behaviour

Basically, you have to specify some basic syntax and script will do the rest. Here’s the example config for postscript, that generates 2^N postscript files, each of them prints ‘Behavior $i’:

# Setup for postscript

PREFIX_PRE="%!PS-Adobe-1.0
%%Creator: hax0r
/blob ("
PREFIX_POST=") def\n"

IF_CLAUSE_PRE="blob ("
IF_CLAUSE_POST=") eq\n{"

ELSEIF_CLAUSE_PRE="} if\nblob ("
ELSEIF_CLAUSE_POST=") eq\n{"

ELSE_CLAUSE="} if{\n"

ENDING="}\nshowpage"

function gen_payload_by_id() {
	local COLLISION_ID=$1
	((COLLISION_ID--))
	printf "/Times-Roman findfont 64 scalefont setfont \n
	30 200 moveto\n(Behavior %s) show\n" $COLLISION_ID
}

Source code

The source code lives in this GitHub repo. The repo has python config and generated examples.