EE524/CPTS561 Final exam
Final exam
EE524/CPTS561 Final exam
• Please show your work, justify the claims you make and point to a source if you are using any.
You may lose points if you use a resource without appropriate references.
• Please DO NOT collaborate on your answers. You should work on the exam questions by
yourself.
• DO NOT post the questions to any forum online
Please type or hand-write your answers and upload them to Blackboard. Please convert your files or
images to PDF format and upload the PDF. It makes annotations and returning feedback to you
much easier.
Submit your exams by 8:00 am on Tuesday, 15 December 2020. Late submissions will be
penalized at 10 points/hour for each hour of delay.
Name: EE524/CPTS561, Fall 2020
Question 1 [8/8/9 points]
(a) For each of the following sequence of branches, assume that you are using a correlating branch
predictor. What is the minimum length of global history required in each case for accurate predictions
for the last branch? As usual, assume that if a branch is taken, it means that the code inside if
condition is not executed. Assume that val is randomly chosen. Explain your reasoning for the
answers.
1. i f ( va l % 12 == 0) { /∗ B1 ∗/
sum += val ; /∗ Not taken path f o r B1 ∗/
}
i f ( va l % 4 == 0) { /∗ B2 ∗/
sum += val ; /∗ Not taken path f o r B2 ∗/
}
i f ( va l % 3 == 0) { /∗ B3 ∗/
sum += val ; /∗ Not taken path f o r B3 ∗/
}
2
Name: EE524/CPTS561, Fall 2020
2. i f ( va l % 2 == 0) { /∗ B1 ∗/
sum += val ; /∗ Not taken path f o r B1 ∗/
}
i f ( va l % 5 == 0) { /∗ B2 ∗/
sum += val ; /∗ Not taken path f o r B2 ∗/
}
i f ( va l % 4 == 0) { /∗ B3 ∗/
sum += val ; /∗ Not taken path f o r B3 ∗/
}
i f ( va l % 20 == 0) { /∗ B4 ∗/
sum += val ; /∗ Not taken path f o r B4 ∗/
}
3
Name: EE524/CPTS561, Fall 2020
(b) Consider the following branches. Assume that if a branch is taken, it means that the code inside
if condition is not executed.
i f ( va l > 5) { /∗ B1 ∗/
/∗ Not taken path f o r B1 ∗/
}
i f ( va l > 3) { /∗ B2 ∗/
/∗ Not taken path f o r B2 ∗/
}
i f ( va l > 2) { /∗ B3 ∗/
/∗ Not taken path f o r B3 ∗/
}
i f ( va l < 0) { /∗ B4 ∗/
/∗ Not taken path f o r B4 ∗/
}
We are using a two-level branch predictor with following properties:
• One bit global history
• One bit local history
Now, we know that the whole branch address is not used to index into the local history table.
As a result of this, B2 and B4 map to the same entry in the local history table. That is, the branch
predictor suffers from aliasing. You have to determine if the branch predictor suffers from destructive
aliasing under any conditions. Please state your reasoning. You may ignore any other branches that
are present in the program.
4
Name: EE524/CPTS561, Fall 2020
5
Name: EE524/CPTS561, Fall 2020
Question 2 [5/6/5/9 points]
(a) Consider a 4 way set associative cache with a total size of 256 KB. Assume that the cache is
virtually indexed and physically tagged. Assuming a 32-bit address and 64-byte blocks, complete
the following table. You can use the next page to show your work.
Table 1: Cache parameters
Parameter Value
Tag bits
Block offset bits
Cache index bits
No. of sets
We will use these cache parameters for the rest of this question.
6
Name: EE524/CPTS561, Fall 2020
(b) For this part, assume that the page size is 128 KB for the machine on which the cache is
implemented. The cache parameters remain the same as in Table 1. If the cache is physically-tagged
and virtually-indexed, does the way-predicting cache suffer from aliasing? Why or why not? Explain
your answer.
(c) What is the minimum page size for which the cache does not suffer from aliasing? Explain your
answer.
7
Name: EE524/CPTS561, Fall 2020
(d) For each of the following addresses, give the cache tag, cache index and set number.
1. 0x80864580
2. 0x1f4dcdde
3. 3c71bb71
8
Name: EE524/CPTS561, Fall 2020
Question 3 [8/9/8 points]
The single thread performance of processor generally increases with the area of the processor.
Consider that in the technology node available to us, the single-thread performance of a core
increases with the square root of the area. For example, if you increase the area of a core by four
times, the performance doubles. You are analyzing the performance of three processors (P1, P2, and
P3) that have different area for a single core as follows:
P1: Single core area = A
P2: Single core area = 4A
P3: Single core area = 16A
(a) You are now analyzing a workload that as a serial portion S, and infinitely parallelizable
portion 1− S. You execute the workload on a system that has 8 processor P1s, and observe a
speedup of 5 over the performance on just a single processor P1. What is the value of S for
the workload?
9
Name: EE524/CPTS561, Fall 2020
(b) Now assume that you have an area budget of 32A. You can utilize this area to include multiple
cores of the same type, e.g. 8 cores of P2 or 2 cores of P3. Which of the three processors
will give the maximum speedup in this case? What is the speedup with respect to a single
processor P1? The workload is same as that of part (a), i.e. the serial portion is S.
(c) Finally, you build a heterogeneous system that has one processor P3 and 16 processor P1s
(total area is 32A). Now the serial portion is of the workload is running on P3 while the parallel
portion is running on the 16 smaller cores (P1). What is the speedup of the heterogeneous
system with respect to a single core P1?
10
Name: EE524/CPTS561, Fall 2020
Question 4 (Short answer questions) [5 points each]
(a) Consider the following two cache replacement and insertion policies in a set associative cache:
• Policy 1: Evict the line at the least recently used position and insert the new line at the
most recently used position.
• Policy 2: Evict the line at the least recently used position and insert the new line at the
least recently used position.
Under what conditions does policy 2 perform better (lower miss rates and hence execution
time) than policy 1?
(b) What are the key differences between shared memory and message passing parallel programming
models?
11
Name: EE524/CPTS561, Fall 2020
(c) Describe the reasons why memory consistency is important in processors. What are the
sufficient conditions to achieve sequential consistency in memory operations?
(d) What are the steps needed to enable trace scheduling in VLIW processors? How can the
processor recover when the trace generated ends up being incorrect?
12
Name: EE524/CPTS561, Fall 2020
(e) One of the design guidelines for domain specific architectures is to invest resources into more
arithmetic units or bigger memories. What are the reasons behind having this design guideline?
13