Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
Computer Assignment 1: MD5 Collision Attack
ECE 544/644: Trustworthy Computing
1 SUBMISSION GUIDELINE
You can do this assignment in groups of two. Only one of the team members needs to submit
the work.
For this assignment you need to submit a report, briefly describing the steps you take.
There are also some questions throughout the assignment. You need to include your answers
to those questions in your report. For some of the steps, you need to also show the outcome
of your work using screenshots. Please make sure those screenshots are also included in the
report.
Some of the tasks require developing C programs. For those tasks, include a screenshot
of your programs in the report and also submit the code on Moodle. Please follow the naming
guideline when submitting your code.
Once you have all the materials ready, submit your report along with your C files on
Moodle as individual files. Do not put them in an archive.
1
2 LAB OBJECTIVE
The objective of this lab is for you to investigate the damages that can be caused if a widely-
used one-way hash function’s collision-resistance property is broken. To achieve this goal,
you will be guided through different tasks to launch actual collision attacks against the MD5
hash function.
3 TASKS
To do this lab, you need to set up a 64-bit Linux environment. Alternatively, you can download
the Virtual Machine with all the tools already installed on it from here. If you choose to use
the VM, the require files are in /home/tiny/projects/project1.
TASK1: GENERATING TWO DIFFERENT FILES WITH THE SAME MD5 HASH
In this task, we will use the md5collgen tool to generate two different files with the same
MD5 hash values. The way the tool works is that it receives a prefix file from the user. It
then generates two different sequences of bytes, m1 and m2. These two-byte sequences are
generated with a very interesting property. If you create a file by concatenating the prefix and
m1 and another file, by concatenating the prefix and m2, these two files would have the same
MD5 hash values. However, although the first parts of the two files are the same, the last parts
are clearly different, and therefore, these two files are not identical. Figure 1 shows the way
md5collgen operates.
Figure 1: md5collgen generating two different files with the same MD5 hash.
The following command generates two output files, out1.bin, and out2.bin, with a given
prefix file, prefix.txt:
md5collgen -p prefix.txt -o out1.bin out2.bin
2
Since these files are binary files (sequences of bytes), you cannot open them using text
editors. However, you can use hex editor tools to look at what is inside them. xxd is a hex
editor that you can use and it is available on the VM.
xxd out1.bin
Create a prefix file (named prefix.txt) including the full names of both of the team mem-
bers. Use the md5collgen tool to generate out1.bin and out2.bin. Go over the following steps
and include the required materials in your report:
1. Using the md5sum tool, check the MD5 hash of the binary files. Submit a screenshot
showing the hash values of the two files. You can calculate the MD5 hash value of a file
using the following command:
md5sum out1.bin
2. Use the xxd tool to look inside one of the binary files. Submit a screenshot of the output
of the tool. Annotate the screenshot to show which part is your prefix and which part is
m1/m2.
3. Create another prefix file with exactly 64 bytes in it and regenerate out1.bin and out2.bin.
Re-run the previous two steps. What differences do you see?
TASK2: UNDERSTANDING MD5’S PROPERTIES
Figure 2 shows the iterative structure of MD5. In this algorithm, the input is divided into
blocks of 64 bytes, and then it is fed into a compression function. The compression func-
tion takes two inputs, a 64-byte data block and the outcome of the previous iteration. The
compression function then produces a 128-bit IHV (Intermediate Hash Value). This output
is then fed into the next iteration. The final hash value will be the IHV produced by the last
iteration of the algorithm. The IHV input for the first iteration (IHV0) is a fixed value.
Figure 2: MD5 block diagram.
Given two inputs M and N, if MD5(M) = MD5(N), meaning that the MD5 hashes of M
and N are the same, then for any arbitrary bit sequence, T, MD5(M || T) = MD5(N || T), where ||
3
represents concatenation. In other words, if inputs M and N have the same hash, adding the
same suffix T to them will result in two outputs that have the same hash value. This property
holds for the MD5 hash algorithm and many other hash algorithms, such as the SHA family.
In this part, you need to develop a C code to investigate that property. In this C code,
you should utilize the out1.bin and out2.bin files that you generated in the previous part. You
can open these two files and read them into local buffers. Then, generate a random sequence
of bytes and extend those buffers by appending this random sequence to them. Finally, using
the md5sum tool, check the MD5 hash of the appended buffers and make sure they have the
same hash values.
Also, to investigate whether this property holds for any suffix, you must repeat this pro-
cedure for a large number of different random sequences. You can assume that trying 10000
different sequences is enough. In your code, you need to generate random bytes of variable
lengths, varying from 100 bytes up to 6436 bytes, with increments of 64 bytes (total of 100
different sequence lengths). For each length, you also need to try 100 different random se-
quences.
For this part, take a screenshot of your C code and put it in your report. Also, name the
C code task2.c and include it in your final submission.
TASK3: GENERATING TWO EXECUTABLE FILES WITH THE SAME MD5 HASH
Consider the code shown in Figure 3. You can download the code directly into your VM using
the following command:
wget http://people.umass.edu/ apouraghily/ece644/projects/project1/simpleCode.c
Figure 3: C program for Task3.
In this task, you have to modify the content of the xyz buffer in a way that the resulting
program will have the same MD5 hash value as the original code. You can work either on the
source code of the program directly or you can compile it and modify the binary using the
xxd tool. If you choose to work on the binary, you need to use the xxd code to get the hex
4
representation of the binary into an intermediary file using the following command:
xxd {binary_file_name} > {intermediary_file_name}
You can then open that intermediary file using any text editor such as gedit or vim and
modify the hex representation. Finally you can use xxd to transform your hex representation
back to binary by using the xxd with the -r switch as follows:
xxd -r {intermediary_file_name} > {reversed_binary_file_name}
Don’t forget to add the execution permission to the reversed file before running it (chmod
+x). You can find some hints in Figure 4. Name the modified code, task3.c and include it in
your final submission.
Figure 4: Hint for Task3.
TASK 4 : MAKING TWO PROGRAMS BEHAVE DIFFERENTLY
Assume that you have developed a useful software. You can submit this software to a certi-
fication authority for certification. They analyze the program thoroughly and since it does
not have any vulnerabilities and it does do anything malicious, they will certify this program
by issuing a certificate. The certificate includes the MD5 hash value of the program binary,
signed using the CA’s private key. By publishing the certificate, anyone who downloads the
application binary, can also download the certificate and validate the integrity of the program
by comparing the MD5 hash of the downloaded binary against the MD5 hash of the original
code included in the signed certificate.
Now, if someone can come up with a malicious code that has the same MD5 hash value
as your code, they can use your good name and publish their malicious code under your
program’s name. Now, if anyone downloads this malicious code under the assumption that
you developed it, there is no way for them to realize otherwise. They can verify that program
by checking your program’s certificate, and they will be fooled into running it as if it were your
program certified by the CA.
5
Figure 5: C code template for Task4.
In this task, we will investigate how a bad actor can achieve that if the hash function
used in the certificate is MD5. In the previous task, we proved that it is possible to have
two programs with different binaries but with similar hash values. However, the difference
between those two programs was very subtle. In this task, we want to build on that experience
and create two different programs where one of them does a benign task while the other one
does something malicious.
For this task, you can either use the template shown in Figure 5 or you can design your
own template. If you choose to design your own template, you need to submit both versions.
Name them Task4-1.c and Task4-2.c. If you decide to modify the provided template, you only
need to submit the modified version that does something malicious. Name that Task4.c and
submit it.