FIT 5124 Emerging Topics for Cybersecurity in Practice
Emerging Topics for Cybersecurity in Practice
Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: THEend8_
FIT 5124 Emerging Topics for Cybersecurity in Practice
Assignment 1: Total marks 100
Due on 1st April 11:55:00 pm (Firm!)
1 Overview
The objective of this assignment is to assess the learning outcomes of searching over
encrypted data. Specifically, the tasks will evaluate your knowledge of security analysis of
searchable symmetric encryption (SSE), count attack against SSE, countermeasures for
SSE schemes, and inference attacks against encrypted databases built from
property-preserving encryption.
2 Submission Policy
You need to submit a report (one single PDF file) to describe what you have done and what
you have observed with screenshots whenever necessary; you also need to provide
explanations or codes to the observations that are related to the tasks. In your report, you
are expected to answer all the questions listed in this manual. Typeset your report into .pdf
format (make sure it can be opened with Adobe Reader) and name it as the format:
[Your Name]-[Student ID]-FIT5124-Assignment1,
HarryPotter-12345678-FIT5124-Assignment1.pdf.
All source code if required should be embedded in your report. Please upload the PDF file to
Moodle. Note that the assignment is due on 1 April, Saturday, 11:55 pm (No extension).
Late submission penalty: 10-point deduction per day. Your assignment will not be
marked after a delay of 7 days.
policies/academic-integrity
Generative AI tools are not restricted for this assessment task: In this
assessment, you can use generative artificial intelligence (AI) to assist you in any
way. Any use of generative AI must be appropriately acknowledged. How to cite the
use of generative AI can be found here:
Note that there will be one subtask that requires you to use ChatGPT to
complete (Task 3.2).
FIT 5124 Emerging Topics for Cybersecurity in Practice (S1 2023)
3 Security Analysis of Searchable Symmetric Encryption (30 Marks)
Perform the formal security analysis for the scheme in the paper of “Structured Encryption
and Controlled Disclosure” https://eprint.iacr.org/2011/010.pdf (see the screenshot below).
More detailed construction and notations can be found in the paper.
3.1 Security Definition (10 Marks)
Please give the simulation-based security definition for the above scheme regarding
non-adaptive security.
3.2 Security Analysis (20 Marks)
Please perform the formal security analysis based on the security definition and the
construction. Hint: leakage function definition (5 marks), steps on how to simulate the
security game (5 marks), explanation on how the simulated view is indistinguishable
from the real view (5 marks).
Interaction with ChatGPT (5 marks): please kindly note that you will interact with
ChatGPT to answer this question. Please list your prompts and the responses from
ChatGPT, perform correctness analysis on the ChatGPT response, and show how
you fix the errors or inaccurate statements if any.
FIT 5124 Emerging Topics for Cybersecurity in Practice (S1 2023)
4 Count Attacks against Searchable Symmetric Encryption (40 marks)
4.1 Case study I: given a document set {d1, d2, d3, d4, d5, d6}, and d1 = {w1, w2,
w3, w4, w6}, d2 = {w1, w2, w3, w6}, d3 = {w1, w2, w4, w6}, d4 = {w1, w4, w6}, d5 =
{w4, w5, w6}, d6 = {w4, w6}, and assume that the adversary knows the above
information about the document set. Then the adversary starts to launch count
attacks for the document set encrypted by an SSE scheme. He observed the
following query response from the server.
q eids
q1 e1 e2 e3 e5 e6
q2 e3 e4
q3 e1 e2 e3 e4
q4 e2 e3 e4
q5 e1 e2 e3 e4 e5 e6
q6 e6
Note that e1,..., e6 are the permuted document IDs and the adversary does not know
the mapping between them and the original ones.
Q: Please explain how to recover the underlying keyword of each query step by step,
and write down the mapping between queries and keywords (10 marks).
4.2 Case study II: given a document set {d1, d2, d3, d4, d5, d6}, and d1 = {w1, w2,
w3, w4, w6}, d2 = {w2, w3, w6}, d3 = {w1, w3, w4, w6}, d4 = {w1, w4, w6}, d5 = {w4,
w5, w6}, d6 = {w5}, and assume that the adversary knows the above information
about the document set. Then the adversary starts to launch count attacks for the
document set encrypted by an SSE scheme. He observed the following query
response from the server.
FIT 5124 Emerging Topics for Cybersecurity in Practice (S1 2023)
q eids
q1 e2 e3 e4 e5 e6
q2 e2 e3 e5 e6
q3 e1 e3
q4 e4 e5
q5 e2 e5 e6
q6 e4 e5 e6
Note that e1,..., e6 are the permuted document IDs and the adversary does not know
the mapping between them and the original ones.
Q: Please explain how to recover the underlying keyword of each query step by step,
and write down the mapping between queries and keywords. Can the count attack
recover all queries? If not, please explain the reason. (15 marks)
4.3 Case study III: given a document set {d1, d2, d3, d4, d5, d6}, and d1 = {w1, w2,
w3, w6}, d2 = {w2, w3, w4, w6}, d3 = {w1, w2, w3, w4, w6}, d4 = {w1, w3, w4, w6},
d5 = {w5, w6}, d6 = {w5, w6}, and assume that the adversary knows the above
information about the document set. Then the adversary starts to launch count
attacks for the document set encrypted by an SSE scheme. He observed the
following query response from the server.
q eids
q1 e1 e5 e6
q2 e1 e2 e6
FIT 5124 Emerging Topics for Cybersecurity in Practice (S1 2023)
q3 e3 e4
q4 e1 e2 e3 e4 e5 e6
q5 e2 e5 e6
q6 e1 e2 e5 e6
Q: Please explain how to recover the underlying keyword of each query step by step,
and write down the mapping between queries and keywords. Can the count attack
recover all queries? If not, please explain the reason. (15 marks)
5 Padding Countermeasures in Searchable Symmetric Encryption (20 Marks)
The primary countermeasure for the above count attack is padding. In this task, you
are given the attached source code and data:
● Inverted_index_5000: This binary file contains the inverted index of the top
5000 most frequent keywords in the real Enron email dataset
● frequency_5000.csv: This .csv file details the (keyword, frequency) in the
above index. This .csv file is already sorted in descending order.
● app.py and utilities.py. The current app.py script reads the inverted index into
a Python dictionary data structure, namely inverted_dict. Then, using
inverted_dict[keyword] will give you the list of document identifiers mapping to
that keyword. You can review these two scripts for more details.
The first padding approach (Approach #1) is to add more padding (w,id’) pairs for
each keyword w such that all the keywords have the same number of real and
padding pairs, where id’ is a padding document identifier. In this way, all keywords
will have the result length which is the same as the result length of the most frequent
keyword when the server executes the search operation in the SSE scheme.
The second padding solution (Approach #2) is to add more padding (w,id’) pairs to w
such that the query’s result length of w is a multiplication of an integer. For example,
the current frequency of w is 6, then it needs 4 padding (w,id’) pairs if the
multiplication of 5 is used.
The above two approaches can be found in [4].
FIT 5124 Emerging Topics for Cybersecurity in Practice (S1 2023)
The third padding solution (Approach #3) is to only pad keywords that have similar
frequencies into the same result length. As a result, this approach will minimise the
padding overhead. However, this approach first requires defining the clusters of
keywords in advance. Then, we can identify the most frequent keyword in each
cluster and pad the remaining keywords in the same cluster to that length. More
details about this approach can be found in [5].
Q1: Please develop the padding solution in Approach #1 and identify the padding
overhead. Note that the padding overhead is the ratio between the total number of
real and padding pairs in the inverted index over the real number of pairs. Add the
screenshot of your python code and the result into your report (5 marks)
Q2: Please develop the padding solution in Approach #2 and identify the padding
overhead if the multiplication of 100 is used. Add the screenshot of your python code
and the result into your report (5 marks)
Q3: Please develop the padding solution in Approach #3 and identify the padding
overhead. Note that you are given the cluster configuration in the app.py. Add the
screenshot of your python code and the result into your report (5 marks)
For example,
cluster_points_256 =
[392,648,904,1160,1416,1672,1928,2184,2440,2696,2952,3208,3464,3720
,3976,4232,4488,4744]
denotes that the 1st keyword to the 392nd keyword (inclusive) in
frequency_5000.csv will be in the first cluster, and the 4745th keyword to the
5000th keyword (inclusive) will be in the last cluster. There are 19 clusters in
this setting. With this configuration, there are at least 256 keywords in each
cluster (i.e., cluster size).
Another configuration is that,
cluster_points_512 = [904,1416,1928,2440,2952,3464,3976,4488]
will have at least 512 keywords in the same cluster.
Q4: Compare the padding overhead between the above Approach #1, Approach #2,
and Approach #3 with cluster sizes 256 and 512. Add the screenshot of your python
code and the result into your report (5 marks)
6 Inference Attacks against Order-preserving Encryption (10 Marks)
FIT 5124 Emerging Topics for Cybersecurity in Practice (S1 2023)
As introduced in Week 3, sorting attack is applicable to uncover dense
OPE-encrypted column ( ). In that setting, a ciphertext contains at least aδ = 1
fraction of plaintext(s) to perform a direct mapping on the same sorting order.δ = 1
However, the attack becomes less applicable for low-density OPE-encrypted
columns if the fraction is less than 1. Intuitively, the plaintext space (i.e., auxiliary
dataset) known by the attacker who is launching the inference attack might be
smaller than the underlying plaintext space of the ciphertext space.
To address the limitation of the sorting attack when dealing with low-density
OPE-encrypted columns, Naveed et. al. [1] introduced the cumulative attack. In
short, the attack combines the ordering with the frequencies to increase the ability to
match a plaintext to a ciphertext. For example, if a ciphertext matches 90% of its
ciphertext in the encrypted column, then the mapping should match that ciphertext to
a plaintext that also matches 90% of its plaintext. To do this, the attacker uses an
empirical cumulative distribution function [2] (ECDF, or simply called CDF) to
evaluate the probability of a real-valued ciphertext. Formally, the attack is defined as:
Cumulative-Atk(c, z): Given an OPE-encrypted column c over the ciphertext space
C and the auxiliary plaintext dataset z over the plaintext space M, the adversary
computes:
1. compute andψ←() φ←()
2. compute andπ←() µ←()
3. output
∈
=1
||
∑ (|ψ
−
∙π| + |φ
−
∙µ| )
where and are the histograms of c and z, respectively; and are the CDFs of cψ π φ µ
and z, respectively; is the set of permutation matrices. || × ||
The step 3 can be formulated as a linear sum assignment problem (LSAP) as below:
minimise
=1
∑
=1
∑
subject to //sum of all weights on a column =1
=1
∑
= 1, 1≤ ≤||
//sum of all weights on a row =1
=1
∑
= 1, 1≤ ≤ ||
,
∈{0, 1} 1≤ , ≤ |
|
FIT 5124 Emerging Topics for Cybersecurity in Practice (S1 2023)
where the cost matrix gives the cost for mapping a plaintext to a ciphertext as
the sum of the mismatch in the frequencies plus the mismatch in cumulative
frequencies:
= |ψ
− π
|2 + |φ
− µ
|2
In the attached material, you are given an OPE-encrypted column in
ope_enc_covid_age.csv, and the auxiliary plaintext dataset in auxiliary.csv. The
dataset is extracted from the Mexican government’s COVID19 dataset [3]. The
cumulative_attack.py contains the instructions on how to perform Steps 1-3.
Q1. Complete Step 1 by calculating the histograms of the cipher c and the plaintext
z. Please add your script screenshot to your report (2.5 marks)
Q2. Complete Step 2 by calculating the empirical CDFs of c and z. Please add your
script screenshot to your report. (2.5 marks)
Q3. In Step 3, compute the cost matrix C and apply the LSAP resolver to identify the
optimal weight matrix X. Please add your script screenshot to your report (2.5
marks)