COMP9334 Project
Computing clusters
Version 1.01
Updates to the project, including any corrections and clarifications, will be posted on the
course website. Make sure that you check the course website regularly for updates.
• Version 1.01 (27 March 2024). There was a mistake in the denominators of the two probability
density functions in Section 5.1.1. For g0(t), t should be raised to the power of η0+1; the +1
was missing. A similar error appeared in g1(t), where t should be raised to the power of
η1+1.
• Version 1.00. Issued on 19 March 2024.
1 Introduction and learning objectives
You have learnt in Week 4A’s lecture that a high variability of inter-arrival times or service times
can cause a high response time. Measurements from real computer clusters have found that the
service times in these clusters have very high variability [1]. The reference paper [1] also has a
number of suggestions to deal with this issue. One suggestion is to separate the jobs according
to their service time requirements, and have one set of servers processing jobs with short service
times and another set of servers for jobs with long service times. This arrangement is the same
as supermarkets having express checkouts for customers buying not more than a certain number
of items and other checkouts that do not have a limit on the number of items. You have seen this
theory in action in Week 4A’s revision Problem 1. We also highly recommend that you read the
paper [1].
In this project, you will use simulation to study how to reduce the response time of a server
farm that uses different servers to process jobs with different service time requirements.
In this project, you will learn:
1. To use discrete event simulation to simulate a computer system
2. To use simulation to solve a design problem
3. To use statistically sound methods to analyse simulation outputs
We mentioned a number of times in the lectures that simulation is not simply about writing
simulation programs. While it is important to get your simulation code correct, it is also important
that you use statistically sound methods to analyse simulation outputs. Therefore, roughly half of
the marks for this project are allocated to the simulation program, and the other half to statistical
analysis; see Section 7.2.
[Figure 1 diagram: new jobs submitted by users arrive at the dispatcher, which holds Queue 0 and
Queue 1. Queue 0 feeds the Group 0 servers (Server 0 to Server n0 - 1) and Queue 1 feeds the
Group 1 servers (Server n0 to Server n - 1). Jobs that have completed their processing depart the
system permanently. Jobs killed by servers in Group 0 are sent back to the dispatcher.]
Figure 1: The multi-server system for this project.
2 Support provided and computing resources
If you have problems doing this project, you can post your question on the course forum. We
strongly encourage you to do this as asking questions and trying to answer them is a
great way to learn. Do not be afraid that your question may appear silly; other
students may very well have the same question! Please note that if your forum post
shows part of your solution or code, you must mark that forum post private.
Another way to get help is to attend a consultation (see the Timetable section of the course
website for dates and times).
If you need computing resources to run your simulation program, you can use the VLAB
remote computing facility provided by the School.
3 Multi-server system configuration with job isolation
The configuration of the multi-server system that you will use in this project is shown in
Figure 1. The system consists of a dispatcher and n servers where n ≥ 2. The n servers are
partitioned into 2 disjoint groups, called Groups 0 and 1, with at least one server in each group. The
numbers of servers in Groups 0 and 1 are, respectively, n0 and n1 where n0, n1 ≥ 1 and n0+n1 = n.
The servers in Group 0 are used to process short jobs which require a processing time of no
more than a time limit of Tlimit. The servers in Group 1 do not impose any limit on service time.
The dispatcher has two queues: Queue 0 and Queue 1. The jobs in Queue i (where i = 0, 1)
are destined for servers in Group i. Both queues have infinite queueing spaces.
When a user submits a job to this multi-server system, the user needs to indicate whether the
job is intended for the servers in Group 0 or Group 1. The following general processing steps are
common to all incoming jobs:
• When a job intended for a server in Group i (where i = 0, 1) arrives at the dispatcher, the job
will be sent to a server in Group i if one is available; otherwise, the job will join Queue i.
• When a job departs from a server in Group i, the server will check whether there is a job at
the head of Queue i. If yes, the job will be admitted to the available server for processing.
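The two general processing steps above can be sketched as follows. This is only a minimal illustration, not a reference implementation: the class name `Dispatcher`, its methods `arrive` and `depart`, and the choice to track idle servers as a simple count per group are all assumptions made for this example.

```python
from collections import deque


class Dispatcher:
    """Toy model of the two general processing steps (hypothetical names)."""

    def __init__(self, n0, n1):
        if n0 < 1 or n1 < 1:
            raise ValueError("each group needs at least one server")
        self.idle = [n0, n1]              # idle servers in Group 0 and Group 1
        self.queues = [deque(), deque()]  # Queue 0 and Queue 1, unbounded

    def arrive(self, job, group):
        """Step 1: send the job to an idle server in its group, else queue it."""
        if self.idle[group] > 0:
            self.idle[group] -= 1
            return "server"
        self.queues[group].append(job)
        return "queue"

    def depart(self, group):
        """Step 2: a server in `group` frees up; admit the head-of-queue job if any."""
        if self.queues[group]:
            # The freed server immediately takes the job at the head of the queue.
            return self.queues[group].popleft()
        self.idle[group] += 1
        return None
```

For example, with `d = Dispatcher(1, 1)`, a first Group 0 job goes straight to the server, a second Group 0 job joins Queue 0, and the next call to `d.depart(0)` hands that queued job to the freed server.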
Recall that the servers in Group 0 have a service time limit. The intention is that the users
make an estimate of the service time requirement of their submitted jobs. If a user thinks that
their job should be able to complete within Tlimit, then they submit it to Group 0; otherwise, they
should send it to Group 1.
Unfortunately, the service time estimated by the users is not always correct. It is possible that
a user sends a job which cannot be completed within the time limit to Group 0. We will now
explain how the multi-server system will process such a job. Since the user has indicated that the
job is destined for Group 0, the job will be processed according to the general processing steps
explained earlier. This means the job will receive processing by a server in Group 0. After this
job has been processed for a time of Tlimit, the server says that the service time limit is up and
will kill the job. The server will send the job to the dispatcher and tell it that this is a killed job.
The dispatcher will check whether a server in Group 1 is available. If yes, the job will be sent to
an available server; otherwise, it will join Queue 1 to wait for a server to become available. When
a server in Group 1 is available to work on this job, it will process the job from the beginning,
i.e., all the previous processing in a Group 0 server is lost.
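To make the cost of a wrong estimate concrete, the sketch below computes the departure time of a single job in an otherwise empty system, so that no queueing occurs. The function name `lone_job_departure` and its parameters are assumptions made for this illustration, and all transfers between servers and the dispatcher are taken to be instantaneous.

```python
def lone_job_departure(arrival, service, t_limit, group):
    """Departure time of one job in an otherwise empty system (no queueing).

    arrival: arrival time at the dispatcher
    service: true service time required by the job
    t_limit: service time limit Tlimit of the Group 0 servers
    group:   group indicated by the user (0 or 1)
    """
    if group == 1 or service <= t_limit:
        # The job completes where it was sent: Group 1 has no limit, and a
        # Group 0 job needing at most t_limit finishes without being killed.
        return arrival + service
    # Killed after t_limit in Group 0, then restarted from the beginning in Group 1.
    return arrival + t_limit + service
```

For instance, a job that arrives at time 10 needing 5 time units but is sent to Group 0 with Tlimit = 3 is killed at time 13 and departs from Group 1 at time 18, for a response time of 8 rather than 5.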
If a job has completed its processing at a Group 0 server, which means its service time is less
than or equal to Tlimit, then the job leaves the multi-server system permanently. Similarly, a job
that has completed its processing at a Group 1 server will leave the system permanently.
We make the following assumptions on the multi-server system in Figure 1. First, it takes
the dispatcher negligible time to classify a job and to send a job to an available server. Second,
it takes a negligible time for a server to send a killed job to the dispatcher. Third, it takes a
negligible time for a server to inform the dispatcher of its availability. As a consequence of these
assumptions: (1) If a job arriving at the dispatcher is to be sent to an available
server right away, then its arrival time at the dispatcher is the same as its arrival time at the
chosen server; (2) The departure time of a job from the dispatcher is the same as its arrival time
at the chosen server; and (3) The departure time of a killed job from a server is the same as its
arrival time at the dispatcher. Ultimately, these assumptions imply that the response time of the
system depends only on the queues and the servers.
We have now completed our description of the operation of the system in Figure 1. We will
provide a number of numerical examples to further explain its operation in Section 4.
You will see from the numerical examples in Section 4 that the number of Group 0 servers n0
can be used to influence the mean response time. So, a design problem that you will consider in
this project is to determine the value of n0 to minimise the mean response time.
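One way to frame this design problem is as a search over the feasible values n0 = 1, ..., n − 1, since each group must contain at least one server. The sketch below assumes a hypothetical callable `mean_response_time(n0)` that returns an estimate of the mean response time for a given n0 (for example, from your own simulation); neither that callable nor the name `best_n0` comes from the project specification.

```python
def best_n0(n, mean_response_time):
    """Return (n0, value) minimising mean_response_time over n0 = 1..n-1.

    mean_response_time is a hypothetical callable supplied by the caller,
    e.g. a wrapper around a simulation run for a given n0.
    """
    if n < 2:
        raise ValueError("need n >= 2 so that both groups are non-empty")
    candidates = range(1, n)  # n0 >= 1 and n1 = n - n0 >= 1
    results = {n0: mean_response_time(n0) for n0 in candidates}
    n0_star = min(results, key=results.get)
    return n0_star, results[n0_star]
```

Because simulation estimates are noisy, picking the smallest point estimate is not enough by itself to justify a design choice; the comparison between candidate values of n0 still needs to be backed by statistically sound analysis.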
Remark 1 Some elements in the above description are realistic but some are not. Typically,
users are required to specify a walltime as a service time limit when they submit their jobs to a
computing cluster. If a server has already spent the specified walltime on the job, then the server
will kill the job. All these are realistic.
The re-circulation of a killed job is normally not done. A user will typically have to resubmit
a new job if it has been killed. If a killed job is re-circulated, then it may be given a lower priority,
rather than joining the main queue, as is the case here.
Some programming techniques (e.g., checkpointing) allow a killed or crashed job to be
resurrected from the last saved state rather than from the beginning. However, that may require
a sizeable memory space.
In order to make this project more doable, we have simplified many of the settings. For
example, we do not use lower priority for the re-circulated killed jobs.