Why collisions are dangerous
The md5 hashing algorithm has been broken since 2004. Today, anyone can generate md5 collisions in a matter of seconds using the hashclash tool, written by Marc Stevens. Note, however, that this tool will not let you generate any reasonable text. It also won’t let you to find a text with a prechosen hash (e.g. a preimage attack). Instead, you will just get 2 random colliding binary blobs.
Here is one example why having 2 colliding texts could be dangerous:
Both of these files have the same md5sum, but different behavior:
Ultimately, this could be used against systems, that utilize md5 hash to verify authenticity of files.
So, hashclash can generate a pair of collisions. But what if one wants to have N colliding files? Surprisingly, quick search in Google didn’t give me general-purpose script, so I decided to write one. Thankfully, hashclash allows you to pick the prefix before generating a colliding pair, so what you have to do is:
- Generate a pair of colliding files, say file0 and file1.
- Use one of them, say file0, as prefix to generate new pair: file00 and file01.
- Cut the last 128 bytes (i.e. the just generated colliding blobs) of file00 and file01 and append them to file1 to generate file10 and file11. At this point you have 4 colliding files: file00, file01, file10 and file11.
- Use file00 as prefix to generate file000 file001.
- Cut the last 128 bytes of file000 and file001 and append them to file01, file10 and file11 to generate file010, file011, file100, file101, file110 and file111. At this point you have 8 colliding files.
Generation of meaningful exploit
Let’s consider a meaningful exploit to be any set of multiple files with same md5 hash, but different behavior. The easiest way to create such an exploit would be to generate the file with following structure:
Obviously, one can think of better ways to do it this Python (e.g. one could check for a particular sequence of bytes in blob), but this approach is highly generalizable: I tested it in both python (see source code) and in the ~~ugly turing-complete piece of~~ printing language named postscript.
All you have to do for either language is to specify the following:
- part before the unique blob(assignment)
- part after the unique blob
- if clause:
- part before the blob1
- part after the blob1
- elseif clause:
- part before the blob$i
- part after the blob$i
- else clause:
- function to generate behaviour$i
- (optional) function to generate final else behaviour
Basically, you have to specify some basic syntax and script will do the rest. Here’s the example config for postscript, that generates 2^N postscript files, each of them prints ‘Behavior $i’:
The source code lives in this GitHub repo. The repo has python config and generated examples.