Assume a five-stage single-pipeline microarchitecture (fetch, decode, execute, memory, write back) and the code below. All operations take 1 cycle except LW and SW, which take 1 + 2 cycles, and branches, which take 1 + 1 cycles. There is no forwarding. Show the phases of each instruction per clock cycle for one iteration of the loop.
Loop: LW R3, 0(R0)
LW R1, 0(R3)
ADDI R1, R, #1
SUB R1, R2, R3
SW R1, 0(R3)
BNZ R4, Loop
a. How many clock cycles per loop iteration are lost to branch overhead?
b. Assume a static branch predictor, capable of recognizing a backwards branch in the Decode stage. Now how many clock cycles are wasted on branch overhead?
c. Assume a dynamic branch predictor. How many cycles are lost on a correct prediction?
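One way to check a hand-drawn pipeline diagram is to script the stall rules. The following is a minimal sketch, not a definitive model, and it rests on several assumptions that the problem statement does not fix: the two extra LW/SW cycles are spent in the memory stage, the extra branch cycle is spent in the execute stage, an instruction waits in decode until every source register has been written back, and a register written in a given cycle can be read by decode in that same cycle. Branch (control) penalties are not modeled. The names `simulate` and `render` and the two-instruction program are hypothetical and are not the loop from the exercise.

```python
STAGES = "FDXMW"  # fetch, decode, execute, memory, write back


def simulate(program):
    """program: list of (opcode, dest_register_or_None, [source registers]).

    Returns, per instruction, a list of (first_cycle, last_cycle) per stage.
    """
    ready = {}       # register -> cycle in which its latest producer writes back
    prev_end = None  # last cycle the previous instruction spent in each stage
    schedule = []
    for opcode, dest, sources in program:
        latency = [1, 1, 1, 1, 1]
        if opcode in ("LW", "SW"):
            latency[3] += 2  # assumption: extra memory cycles are spent in M
        if opcode.startswith("B"):
            latency[2] += 1  # assumption: extra branch cycle is spent in X
        # Fetch starts as soon as the previous instruction has left fetch.
        cycle = prev_end[0] + 1 if prev_end else 1
        spans = []
        for s in range(5):
            end = cycle + latency[s] - 1
            if prev_end and s < 4:
                # Structural stall: the next stage must be vacated first.
                end = max(end, prev_end[s + 1])
            if s == 1:
                # No forwarding: wait in decode until sources are written back.
                end = max([end] + [ready.get(r, 0) for r in sources])
            spans.append((cycle, end))
            cycle = end + 1
        if dest:
            ready[dest] = spans[4][1]
        prev_end = [e for _, e in spans]
        schedule.append(spans)
    return schedule


def render(schedule):
    """Return one string per instruction, one character per clock cycle."""
    total = schedule[-1][-1][1]
    rows = []
    for spans in schedule:
        row = ["."] * total
        for name, (start, end) in zip(STAGES, spans):
            for c in range(start, end + 1):
                row[c - 1] = name
        rows.append("".join(row))
    return rows


# Hypothetical two-instruction program with a load-use dependence on R1.
example = [("LW", "R1", ["R2"]), ("ADDI", "R3", ["R1"])]
for line in render(simulate(example)):
    print(line)
```

Under these assumptions the dependent ADDI sits in decode until the load writes back, so the pair finishes in 10 cycles. Changing any of the stated assumptions, for example requiring the register read to happen one cycle after write back, shifts the diagram accordingly.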