Computers spend most of their time in loops, so multiple loop itera- tions are great places to speculatively find more work to keep CPU resources busy. Nothing is ever easy, though: the compiler emitted only one copy of that loop's code, so even though multiple iterations are handling distinct data, they will appear to use the same registers. To keep multiple iterations' register usages from colliding, we rename their registers. Figure 3.49 shows example code that we would like our hardware to rename. A compiler could have simply unrolled the loop and used different registers to avoid conflicts, but if we expect our hard- ware to unroll the loop, it must also do the register renaming. How? Assume your hardware has a pool of temporary registers (call them T registers, and assume that there are 64 of them, TO through T63) that it can substitute for those registers des- ignated by the compiler. This rename hardware is indexed by the src (source) register designation, and the value in the table is the T register of the last destina- tion that targeted that register. (Think of these table values as producers, and the src registers are the consumers; it doesn't much matter where the producer puts its result as long as its consumers can find it.) Consider the code sequence in Fig ure 3.49. Every time you see a destination register in the code, substitute the next available T, beginning with T9. Then update all the src registers accordingly, so that true data dependences are maintained. Show the resulting code. (Hint: See Figure 3.50.)

From the provided Hint in Figure 3.50 and also The baseline performance ( in cycles, per loop iteration) of the code sequence in Figure 2.35 shows that if no new instruction’s execution could be initiated until the previous instruction’s execution had completed, is 40.

Now, you might ask that how did I come up with that number?

Each instruction requires one clock cycle of execution (a clock cycle in which that instruction, and only that instruction, is occupying the execution units; since every instruction must execute, the loop will take at least that many clock cycles). To that base number, we add the extra latency cycles. Don’t forget the branch shadow cycle.

Therefore, the resulting code would be:

Loop: LD F2,0(Rx) 1 + 4

DIVD F8,F2,F01 + 12

MULTD F2,F6,F2 1 + 5

LDF4,0(Ry) 1 + 4

ADDD F4,F0,F4 1 + 1

ADDD F10,F8,F2 1 + 1

ADDI Rx,Rx,#8 1

ADDIRy,Ry,#81

SDF4,0(Ry) 1 + 1

SUB R20,R4,Rx 1

BNZ R20,Loop 1 + 1

Cycle per loop iter 40