`
`219
`
`byte of the instruction is not in main memory-a situation that requires the
`saved PC to point 50 bytes earlier. Imagine the difficulties of restarting an in(cid:173)
`struction with six operands, each of which could be misaligned and thus be
`partially in memory and partially on disk!
`The instructions that are hardest to restart are those that modify some of the
`machine state before it is known whether interrupts can occur. The VAX
`autoincrement and autodecrement addressing modes would naturally modify
`· registers during the addressing phase of execution rather than at the writeback
`phase, and so would be vulnerable to this difficulty. To avoid this problem,
`recent VAXes keep a history queue of the register specifiers and the operations
`on the registers, so that the operations can be reversed on an interrupt. Another
`approach, used on the earlier VAXes, is to record the specifiers and the original
`values of the registers, restoring the original values on interrupt. (The primary
`difference is that it only takes a few bits to record how the address was changed
`due to autoincrement or autodecrement versus the full 32-bit register value.)
`It is not just addressing modes that make the VAX difficult to restart; long(cid:173)
`running instructions mean that interrupts must be checked in the middle of
`execution to prevent long interrupt latency. MOVC3, for example, copies up to
`216 bytes and can take tens of milliseconds to finish-far too long to wait for an
`urgent event. On the other hand, even if there were a way to undo copying in the
`middle of execution so that MOVC 3 could be restarted, interrupts would occur so
`frequently, relative to this long-running instruction (see Figure 5.10 on page
`216), that MOVC3 would be restarted repeatedly under those conditions. Such
`wasted effort from incomplete copies would render MOVC 3 worse than useless.
`DEC divided the problem to conquer it. First, the operands-source address,
`length, and destination address-are fetched from memory and placed into
`general-purpose registers Rl, R2, and R3. If an interrupt occurs during this first
`phase, these registers are restored, and the MOVC 3 is restarted from scratch.
`After this first phase, every time a byte is copied, the length (R2) is decremented
`and addresses (Rl and R3) are incremented. If an interrupt occurs during this
`second phase, MOVC 3 sets the first part done (FPD) bit in the program status
`word. When the interrupt is serviced and the instruction is reexecuted, it first
`checks the FPD bit to see if the operands have already been placed in registers.
`If so, the VAX doesn't fetch the address and length operands, but just continues
`with the current values in the registers, since that is all that remains to be copied.
`This permits more rapid response to interrupts while allowing long-running
`instructions to make progress between interrupts.
`IBM had a similar problem. The 360 included the MVC instruction, which
`copies up to 256 bytes of data. For the early machines without virtual memory,
`the machine simply waited until the instruction was completed before servicing
`interrupts. With the inclusion of virtual memory in the 370, the problem could
`no longer be ignored. Control first tries to access all possible pages, forcing all
`possible virtual memory miss interrupts to occur before moving any data. If any
`interrupts occur in this phase, the instruction is restarted. Control then ignores
`interrupts until the instruction is complete. To allow longer copies, the 370
`
`Ex.1035.251
`
`DELL
`
`
`
`
`
`
`
`
`
`
`
`
`
`Basic Processor Implementation Techniques
`
`225
`
`DLX instructions
`
`Loads
`Stores
`ALU
`Set
`Jumps
`Jump and links
`Branch (taken)
`Branch (not taken)
`
`Minimum
`clock cycles
`
`Memory
`accesses
`
`Total clock
`cycles
`
`6
`5
`5
`6
`3
`5
`4
`3
`
`2
`2
`1
`1
`1
`1
`1
`1
`
`8
`7
`6
`7
`4
`6
`5
`4
`
`FIGURE 5.19 Clock cycles per instruction for DLX categories using the state
`diagram in Figures 5.13 through 5.18. Determining the total clock cycles per category
`requires multiplying the number of memory accesses-including instruction fetches-times
`the average number of wait states, and adding this product to the minimum number of clock
`cycles. We assume an average of 1 clock cycle per memory access. For example, loads
`take eight clock cycles if the average number of wait states is one.
`
`untaken branches need just one. Adding these times to the first portion of
`instruction execution yields the clock cycles per DLX instruction class shown in
`Figure 5.19.
`From Chapter 2, one way to calculate CPI is
`n
`
`~i
`
`1
`
`)
`Instruction count
`
`CPI = ~(CPI· *
`,£..J
`i=l
`Using the DLX instruction mix from Figure C.4 in Appendix C for GCC
`(normalized to 100% ), the percentage of taken branches from Figure 3.22 (page
`107), and one for the average number of wait states per memory access, the
`DLX CPI for this datapath and state diagram is calculated:
`
`Loads
`Stores
`ALU
`Set
`Jumps
`Jump and links
`Branch (taken)
`Branch (not taken)
`
`8 * 21% =
`7 * 12% =
`6 * 37% =
`7 * 6% =
`4 * 2% =
`6 * 0% =
`5 * 12% =
`4 * 11% =
`Total CPI:
`
`1.68
`0.84
`2.22
`0.42
`0.08
`0.00
`0.60
`0.44
`6.28
`
`Thus, the DLX CPI for GCC is about 6.3.
`
`Ex.1035.257
`
`DELL
`
`
`
`
`
`
`
`228
`
`5.7 Putting It All Together: Control for DLX
`
`Example
`
`Answer
`
`Assume a machine with a 10-ns clock cycle (100-MHz clock rate). Suppose that
`on closer inspection the designer discovered that all states could be executed in 9
`ns, except states that use the shifter. Would it be wise to split those states, taking
`two 9-ns clock cycles for shift states and one 9-ns clock for everything else?
`
`Assuming the improvement in the previous example, the average instruction
`execution time for the 100-MHz machine is 5.9*10 ns or 59 ns. The shifter is
`only used in the states of four instructions: SLL, SRL, SRA, and LHI (see Figure
`5.20). In fact, each of these instructions takes 5 clock cycles (including one wait
`state for memory access), and only one of the five original clock cycles need be
`split into two new clock cycles. Thus, the average execution time of these in(cid:173)
`structions changes from 5* 10 ns, or 50 ns, to 6*9 ns, or 54 ns. From Figure C.4
`these 4 instructions are about 11 % of the instructions executed for GCC (after
`normalization), making the average instruction execution time 89% * (5.9*9 ns)
`+ 11 %*54 ns or 53 ns. Thus, splitting the shift state results in a machine that is
`about 10% faster-a wise decision. (See Exercise 5.8 for a more sophisticated
`version of this tradeoff.)
`
`Hardwired control is completed by listing the control signals activated in each
`state, assigning numbers to the states, and finally generating the PLA. Now let's
`implement control using microcode in a ROM.
`
`Microcoded Control for DLX
`
`A custom format such as this is a slave to the architecture of the hardware and
`instruction set which it serves. The format must strike a proper compromise
`between ROM size, ROM-output decoding circuitry size, and machine execution
`rate.
`
`Jim McKevit et al. [ 1977]
`
`Before microprogramming can commence, the microinstruction set must be
`determined. The first step is to list the possible entries for each field of the DLX
`microinstruction format from Figure 5.6 on page 209. Figure 5.7 on page 211
`. lists them for the Destination, Sourcel, and Source2 fields. Figure 5.22 below
`shows the values for the remaining fields.
`Sequencing of microinstructions i:equires further explanation. The
`microprogrammed control includes a microprogram counter to specify th€
`address of the next microinstruction if a branch is not taken, as in Figu_re 5.5 on
`page 208. In addition to the branches using the Jump address field, three tables
`are used to decode the DLX macroinstructions. These tables are indexed with
`the opcodes of the DLX instruction, and supply a microprogram address
`depending on the value in the opcode. Their use will become clear as we
`examine the DLX micr_oprogram.
`
`Ex.1035.260
`
`DELL
`
`
`
`Basic Processor Implementation Techniques
`
`229
`
`Value
`
`ALU
`
`Misc
`
`Cond
`
`0
`
`1
`
`2
`
`3
`
`4
`5
`
`6
`7
`
`8
`
`9
`
`ADD
`
`+
`
`SUB
`
`-
`
`RSUB
`-r
`(reverse sub)
`&
`AND
`
`OR
`XOR
`
`J',
`
`SLL
`SRL
`
`I
`/\
`
`<<
`>>
`
`SRA
`
`>>a
`
`Pass Sl SJ
`
`10
`
`Pass S2 S2
`
`Instr Read
`
`/Rf-
`M[PC]
`DataRead MDRf-
`M[MAR]
`M[MAR]f-
`MDR
`ABf-RF LoadA&B
`from Reg. File
`Write Rd
`Write R31
`(for call)
`
`Write
`
`Rdf-C
`R31f-C
`
`- - -
`
`Go to next sequential microinstruction
`
`Uncond
`
`Always jump
`
`Int?
`
`Pending (between instruction) interrupt?
`
`Mem?
`
`Memory access not complete?
`
`Zero?
`Negative?
`
`Is the ALU output zero?
`ls the ALU output less than zero?
`
`Load?
`Decodel
`(Fig. 5.24)
`Decode2
`(Fig. 5.26)
`Decode3
`(Fig. 5.26)
`
`Is the macroinstruction a DI.x load?
`Address table 1 determines next micro-
`instruction (uses main opcode)
`Address table 2 determines next micro-
`instruction (uses ''func" opcode)
`Address table 3 determines next micro-
`instruction (uses main opcode)
`
`FIGURE 5.22 The options for three fields of the DLX microinstruction format in Figure 5.6 on page 209. The
`possible names are shown on the left of the field name, with an explanation of each field to the right. The real
`microinstruction would contain a bit pattern corresponding to the number in the first column. Combined with Figure 5. 7
`(page 211 ), all the fields are defined except the Constant and Jump address fields, which contain numbers supplied by
`the microprogrammer. »a is an abbreviation for shift right arithmetic and -,means reverse subtract (B -,A= A- B).
`
`Following the lead of the state diagram, the DLX microprogram is divided
`into Figures 5.23, 5.25, 5.27, 5.28, and 5.29, with each section of microcode cor(cid:173)
`responding to one of Figures 5.13 to 5.18 (pages 220-224). The first state in
`Figure 5.13 becomes the first two microinstructions in Figure 5.23. The first
`microinstruction (address 0) branches to microinstruction 3 if there is an
`interrupt pending. Microinstruction 1 fetches an instruction from memory,
`branching back to itself as long as the memory access is not complete.
`Microinstruction 2 increments the PC by 4, loads A and B, and then does the
`first-level decoding. The address of the next microinstruction then depends on
`which macroinstruction is in the instruction register. The microinstruction
`addresses for this first-level macroinstruction decode are specified in Figure
`5.24. (In reality, the table shown in this figure is specified after the
`microprogram is written, as both the number of entries and the corresponding
`locations aren't known until then.)
`
`Ex.1035.261
`
`DELL
`
`
`
`230
`
`5.7 Putting It All Together: Control for DLX
`
`Loe Label Dest ALU
`
`Sl
`
`S2
`
`c Misc
`
`Cond
`
`Jump Comment
`label
`
`0
`
`1
`
`2
`
`3
`
`4
`
`If etch:
`
`Hoop:
`
`Interrupt?
`
`Intrpt
`
`Check interrupt
`
`Instr Read Mem?
`
`Hoop
`
`IR <:-M[PC];
`wait for memory
`
`PC
`
`ADD
`
`PC
`
`Constant
`
`4 AB<:-RF Decodel
`
`Intrpt:
`
`IAR
`
`Pass Sl PC
`
`PC
`
`Pass S2
`
`Constant
`
`0
`
`Uncond
`
`If etch
`
`Interrupt
`
`PC<:-0 & go
`fetch next
`instruction
`
`FIGURE 5.23 The first section of the DLX microprogram, corresponding to the states in Figure 5.13 (page 220).
`The first column contains the absolute address of the microinstruction, followed by a label. The rest of the fields contain
`values from Figures 5. 7 (page 211) and 5.22 for the microinstruction format in Figure 5.6 (page 209). As an example,
`microinstruction 2 corresponds to the second state of Figure 5.13. It sends the output from the ALU into PC, tells the ALU
`to add, puts PC onto the Source1 bus, and a constant from the microinstruction (whose value is 4) onto the Source2 bus.
`In addition, A and Bare loaded from the register file according to the specifiers in IR. Finally, the address of the next
`microinstruction to be executed comes from decode table 1 (Figure 5.24), which depends on the opcode· in the instruction
`register (IR).
`
`Opcodes (symbolically
`specified)
`
`Absolute
`address
`
`Label
`
`Figure
`
`Memory
`Move to special
`Move from special
`S2=B
`S2 = Immediate
`Branch equal zero
`Branch not equal zero
`Jump
`Jump register
`Jump and link
`Jump and link register
`Trap
`
`5
`20
`21
`23
`24
`50
`52
`54
`55
`56
`58
`60
`
`Mem:
`Movl2S:
`MovS2I:
`Reg:
`Imm:
`Beq:
`Bne:
`Jump:
`JReg:
`JAL:
`JALR:
`Trap:
`
`5.25
`5.25
`5.25
`5.27
`S.27
`5.29
`5.29
`5.29
`5.29
`5.29
`5.29
`5.29
`
`FIGURE 5.24 Opcodes and corresponding addresses for decode table 1. The
`opcodes are shown symbolically on the left, followed by the addresses with the absolute
`microinstruction address, a label, and the figure where the microcode can be found. If this
`table were implemented with a ROM it would contain 64 entries corresponding to the 6-bit
`opcode of DLX. As this would clearly result in many redundant or unspecified entries, a
`PLA could be used to minimize hardware.
`
`Figure 5.25 contains the DLX load and store instructions. Microinstruction 5
`calculates the effective address, and branches to microinstruction 9 if the
`
`Ex.1035.262
`
`DELL
`
`
`
`Basic Processor Implementation Techniques
`
`231
`
`macroinstruction in the IR is a load. If not, microinstruction 6 loads MDR with
`the value to be stored, and microinstruction 7 jumps to itself until the memory is
`finished writing the data. Microinstruction 8 then jumps back to microinstruction
`0 (Figure 5.23) to begin the execution cycle all over again. If the macroinstruc(cid:173)
`tion was a load, microinstruction 9 loops until the data has been read. Micro(cid:173)
`instruction 10 then uses decode table 2 (specified in Figure 5.26) to specify the
`address of the next microinstruction. Unlike the first decode table, this table is
`used by other microinstructions. (There is no conflict in multiple uses since the
`opcodes for each instance are different.)
`Suppose the instruction were load halfword. Figure 5.26 shows that the result
`of decode 2 would be to jump to microinstruction 15. This microinstruction
`shifts the contents of MDR to the left 16 bits and stores the result in Temp. The
`following microinstruction shifts Temp right arithmetically 16 bits and puts the
`result in C. C now contains the 16 rightmost bits of MDR, with the upper 16 bits
`containing the extended sign. This microinstruction jumps to location 22, which
`writes C back into the destination register specifier in IR, and then jumps to
`fetch the next macroinstruction starting at location 0 (Figure 5.23).
`
`f',
`
`Loe Label
`
`Dest
`
`ALU
`
`MAR
`MDR
`
`ADD
`Pass S2
`
`5 Mem:
`6 Store:
`7 Dloop:
`8
`9 Load:
`IO
`11 LB:
`
`Sl
`
`A
`
`S2
`
`c Misc
`
`immI6
`B
`
`Cond
`
`Load?
`
`Jump Comment
`label
`Load
`
`M emorv instruct.
`Store
`
`Data write Mem?
`Uncond
`Data read Mem?
`Decode2
`
`Dloop
`If etch
`Load
`
`Fetch next instr.
`LoadMDR
`
`Temp
`
`SLL
`
`MDR Constant 24
`
`I2
`
`c
`
`SRA
`
`Temp Constant 24
`
`Temp
`13 LBU:
`c
`14
`Temp
`IS LH:
`c
`I6
`Temp
`17 LHU:
`c
`18
`c
`I9 LW:
`20 Movl2S:
`IAR
`21 MovS21: c
`22 Write I:
`
`MDR Constant 24
`SLL
`Temp Constant 24
`SRL
`MDR Constant
`I6
`SLL
`Temp Constant
`I6
`SRA
`MDR Constant
`I6
`SLL
`Temp Constant
`I6
`SRL
`Pass Sl MDR
`Pass SI A
`IAR
`Pass SI
`
`Rdf--C
`
`Uncond Write I
`
`Uncond Write I
`
`Uncond Write I
`
`Load byte; shift left to
`remove uvver 24 bits
`Shift right arithmetic
`to sifm extend
`LB unsi1med
`Shift rif~ht lof!.ical
`Load half
`Shift rif!.ht arithmetic
`LH Unsif!.ned
`Uncond Write I
`Shift rif!.ht lof!.ical
`Uncond Write I Load word
`If etch Move to special
`Uncond
`Move from spec.
`If etch Write back & go fetch
`next instruction
`
`Uncond
`
`FIGURE 5.25 The section of the DLX microprogram for loads and stores, corresponding to the states in Figure
`5.14 (page 221). The microcode for bytes and halfwords takes an extra microinstruction to align the data (see Figure
`3.10, page 97). Note that microinstruction 5 loads A from Rd, just in case the instruction is a store. The label lfetch is for
`microinstruction 0 in Figure 5.23 on page 230.
`
`Ex.1035.263
`
`DELL
`
`
`
`232
`
`5.7 Putting It All Together: Control for DLX
`
`Opcode
`
`Load byte
`Load byte unsigned
`Load half
`Load half unsigned
`Load word
`ADD
`SUB
`AND
`OR
`XOR
`SLL
`SRL
`SRA
`LHI
`Set equal
`Set not equal
`Set less than
`Set greater than or equal
`Set greater than
`Set less than or equal
`
`Absolute
`address
`
`Label
`
`Figure
`
`11
`13
`15
`17
`19
`25
`26
`27
`28
`29
`30
`31
`32
`33
`35
`37
`39
`41
`43
`45
`
`LB:
`LBU:
`LH:
`LHU:
`LW:
`ADD/I:
`SUB/I:
`AND/I:
`OR/I:
`XOR/I:
`SLL/I:
`SRL/I:
`SRA/I:
`LHI:
`SEQ/I:
`SNEil:
`SLT/I:
`SGE/I:
`SGT/I:
`SLE/I:
`
`5.25
`5.25
`5.25
`5.25
`5.25
`5.27
`5.27
`5.27
`5.27
`5.27
`5.27
`5.27
`5.27
`5.27
`5.28
`5.28
`5.28
`5.28
`5.28
`5.28
`
`FIGURE 5.26 Opcodes and corresponding addresses for decode tables 2 and 3. The
`opcodes are shown symbolically on the left, followed by the absolute microinstruction
`address, the corresponding label, and the figure where the microcode can be found. Since
`the opcodes are shown symbolically, and they go to the same place in both tables, the
`same information can be used for specifying decode tables 2 and 3. This similarity is
`attributable to the immediate version and register version of the DLX instructions sharing
`the same microcode. If a table were implemented with a ROM, it would contain 64 entries
`corresponding to the 6-bit opcode of DLX. Again, the many redundant or unspecified
`entries suggest the use of a PLA to minimize hardware cost.
`
`The ALU instructions are found in Figure 5.27. The first two microinstruc(cid:173)
`tions correspond to the states at the top of Figure 5.15 (page 222). After loading ..
`Temp with either the register or the immediate, each uses a decode table to
`vector to the microinstruction that executes the ALU instruction. To save
`microcode space, the same microinstruction is used whether the operand is a
`register or an immediate. One of the microinstructions between 25 and 33 is
`executed, storing its result in C. It then jumps to microinstruction 34, which
`stores C into the register specified in the IR, and in turn jumps to fetch the next
`macroinstruction.
`
`Ex.1035.264
`
`DELL
`
`
`
`Basic Processor Implementation Techniques
`
`233
`
`Loe Label
`
`Dest ALU
`
`Sl
`
`S2
`
`c Misc
`
`Cond
`
`Jump Comment
`label
`
`23 Reg:
`24 Imm:
`25 ADD/I:
`26 SUB/I:
`27 AND/I:
`28 OR/I:
`29 XOR/I:
`30 SLL/I:
`31 SRL/I:
`32 SRA/I:
`33 LHI:
`34 Write2:
`
`Temp Pass S2
`Temp Pass S2
`c
`ADD
`c
`SUB
`c
`AND
`c
`OR
`c
`XOR
`c
`SLL
`c
`SRL
`c
`SRA
`c
`SLL
`
`B
`Imm
`Temp
`A
`Temp
`A
`Temp
`A
`Temp
`A
`Temp
`A
`Temp
`A
`Temp
`A
`Temp
`A
`Temp Constant 16
`
`.,
`
`Rdf-C
`
`--
`
`source2 = ref!.
`Decode2
`source2 = imm.
`Decode3
`Uncond Write2 ADD
`Uncond Write2
`SUB
`Uncond Write2 AND
`Uncond Write2 OR
`Uncond Write2 XOR
`Uncond Write2 SLL
`Uncond Write2 SRL
`Uncond Write2
`SRA
`Uncond Write2 LHI
`If etch Write back & go
`Uncond
`fetch next instruction
`
`FIGURE 5.27 .Like the first two states in Figure 5.15 (page 222), microinstructions 23 and 24 load Temp with an
`operand and then vector to the appropriate microinstruction, depending on the opcode in IR. One of the nine
`following microinstructions is executed, leaving its result in C. C is written back into the register specified in the register
`destination field of DLX macroinstruction in IR in microinstruction 34.
`
`Loe Label
`
`Dest ALU
`
`Sl
`
`S2
`
`c Misc
`
`35 SEQ/I:
`36
`37 SNE/I:
`38
`39 SLT/I:
`40
`41 SGE/I:
`42
`43 SGT/I:
`44
`45 SLE/I:
`46
`47 Seto:
`48 Setl:
`49 Write4:
`
`c
`
`c
`
`c
`
`c
`
`c
`
`c
`c
`c
`
`A
`
`A
`
`A
`
`A
`
`A
`
`A
`
`SUB
`Pass S2
`SUB
`Pass S2
`SUB
`Pass S2
`SUB
`Pass S2
`RSUB
`Pass S2
`RSUB
`Pass S2
`Pass S2
`Pass S2
`
`Temp
`Constant 0
`Temp
`Constant 1
`Temp
`Constant 0
`Temp
`Constant 1
`Temp
`Constant 0
`Temp
`Constant 1
`Constant 0
`Constant 1
`
`Rdf-C
`
`Cond
`
`Jump' Comment
`label
`Set equal
`Setl
`Zero?
`Uncond Write4 AR (set to false)
`Zero?
`Seto
`Set not equal
`Uncond Write4 AR (set to true)
`Negative? Setl
`Set less than
`Uncond Write4 A~T (set to false)
`Negative? Seto
`Set GT or equal
`Uncond Write4 A~T (set to true)
`Negative? Setl
`Set weater than
`Uncond Write4 T~ (set to false)
`Negative? Seto
`Set LT or equal
`Uncond Write4 T~ (set to true)
`Uncond Write4 Set to 0 =false
`Set to 1 = true
`If etch Write back &fetch
`· next instruction
`
`Uncond
`
`FIGURE 5.28 Corresponding to Figure 5.16 (pages 222-223), this microcode performs the DLX Set instructions.
`As in the previous figure, to save space these same microinstructions execute either the version of set using registers or
`the version using immediates. The tricky microcode is found in microinstructions 43 and 45, where the subtraction Temp -
`A is unlike the earlier microcode. Remember that A-, Temp= Temp-A (see Figure 5.22 on page 229).
`
`Ex.1035.265
`
`DELL
`
`
`
`234
`
`5.7 Putting It All Together: Control for DLX
`
`Loe Label
`
`Dest
`
`c Misc
`
`Figure 5.28 corresponds to the states in Figure 5.16 (pages 222-223), except
`that the top two states that load Temp are microinstructions 23 and 24 of the pre(cid:173)
`vious figure; the decode tables will either jump to locations 25 to 34 in Figure
`5.27, or 35 to 45 in Figure 5.28, depending on the opcode. The microinstructions
`for Set perform relative tests by having the ALU subtract Temp from A and then
`test the ALU output to see if the result is zero or negative. Depending on the test
`result, C is set to 1 or 0 and written back in the register file before going to fetch
`the next macroinstruction. Tests for A = Temp, A '* Temp, A < Temp, and A;;:::
`Temp are straightforward using these conditions on the ALU output A - Temp.
`A > Temp and A ~ Temp, on the other hand, are not simple, but can be done
`using the negative condition with the subtraction reversed:
`(Temp-A< 0) = (Temp< A) = (A> Temp)
`If the result is negative, then A > Temp, otherwise A ~ Temp. Voila!
`Figure 5.29 contains the last of the DLX microcode and corresponds to the
`states found in Figures 5.17 and 5.18 (pages 222-224). Microinstruction 50,
`corresponding to the macroinstruction branch on equal zero, tests if A equals
`zero. If it does, the macroinstruction branch succeeds, and the microinstruction
`jumps to the microinstruction 53. This microinstruction loads the PC with the
`PC-relative address and then jumps to the microcode that fetches the new
`macroinstruction (location 0). If A does not equal zero, the macroinstruction
`branch fails, so that the next sequential microinstruction (51) executes, jumping
`to location 0 without changing the PC.
`A state usually corresponds to a single microinstruction, although in a few
`cases above two microinstructions were needed. The jump and link instructions
`have the reverse case, with two ptates collapsing into one microinstruction. The
`actions in the last two states of] ump and link in Figure 5 .17 are found in micro(cid:173)
`instruction 57, and similarly for the jump and link register with microinstruction
`59. These microinstructions load the PC with the PC-relative branch address and
`save C into R3 l.
`Sl
`S2
`ALU
`
`SUB
`
`A
`
`Constant 0
`
`Constant 0
`imm16
`imm26
`
`50 Beq:
`51
`52 Bne:
`53 Branch:
`54 Jump:
`55 JReg:
`56 JAL:
`57
`58 JALR:
`59
`60 Trap:
`61
`
`A
`SUB
`PC
`ADD
`PC
`ADD
`Pass Sl A
`Pass Sl PC
`ADD
`PC
`Pass Sl PC
`Pass Sl A
`Pass Sl PC
`Pass S2
`
`PC
`PC
`PC
`c
`PC
`c
`PC
`IAR
`PC
`
`imm26
`
`R31f-C
`
`Uncond
`
`If etch
`
`R31f-C
`
`Uncond
`
`If etch
`
`imm26
`
`Uncond
`
`If etch
`
`Cond
`
`O?
`Uncond
`O?
`Uncond
`Uncond
`Uncond
`
`Jump Comment
`label
`Branch
`If etch
`If etch
`If etch
`If etch
`If etch
`
`Instr is branch =0
`:;t(): not taken
`Instr is branch ;t: 0
`:;t(): taken
`Jump
`Jump reRister
`Jump and link
`Jump & save PC
`Jump & link ref!
`Jump & save PC
`Trap
`
`FIGURE 5.29 The microcode for branch and jump DLX instructions, corresponding to the states i11 Figures 5.17
`and 5.18 on pages 222-224.
`
`Ex.1035.266
`
`DELL
`
`
`
`Basic Processor Implementation Techniques
`
`. 235
`
`Performance of Microcoded Control for DLX
`
`Before trying to improve performance or reduce costs of control, the existing
`performance must be assessed. Again, the process is to count the clock cycles
`for each instruction, but this time there is a larger variety in performance.
`All instructions execute microinstructions 0, 1, and 2 in Figure 5.23 (page
`230), giving a base of 3 clocks plus wait states, depending on the repetition of
`microinstruction 1. The clock cycles for the rest of the categories are:
`4 for stores, plus wait states
`5 for load word, plus wait states
`6 for load byte or load half (signed or unsigned), plus wait states
`3 for ALU
`4 for set
`2 for branch equal zero (taken or untaken)
`2 for branch not equal zero (taken)
`1 for branch not equal zero (untaken)
`1 for jumps
`2 for jump and links
`
`Using the instruction mix for GCC in Figure C.4, and assuming an average of 1
`wait state per memory access, the CPI is 7.68. This is higher than the hardwired
`control CPI, because the test for interrupt takes another clock cycle at the begin(cid:173)
`ning, loads and stores are slower, and branch equal zero is slower for the
`untaken case.
`
`Reducing Cost and Improving Performance of DLX
`When Control Is Microcoded
`
`The size of a completely unencoded version of the DLX microinstruction is
`calculated from the number of entries in Figures 5. 7 (page 211) and 5 .22 (page
`229) phis the size of the Constant and Jump address fields. The largest constant
`in the fields is 24, which requires 5 bits, and the largest address is 61, which
`requires 6. Figure 5.30 shows the microinstruction fields, the unencoded widths,
`and the encoded widths. Encoding almost halves the size of control store.
`
`Dest
`
`ALU
`operation
`
`Sourcel Source2 Constant Misc
`
`Cond
`
`Jump
`address
`
`Total
`
`Unencoded
`Encoded
`
`7
`3
`
`11
`4
`
`9
`4
`
`9
`4
`
`5
`5
`
`6
`3
`
`10
`4
`
`6
`6
`
`= 63 bits
`= 33 bits
`
`FIGURE 5.30 Width of field in bits of unencoded and encoded microinstruction formats. Note that the Constant
`and Jump address fields are not encoded in this example, placing fewer restrictions on the microprogram using the
`encoded format.
`
`Ex.1035.267
`
`DELL
`
`
`
`
`
`Basic Processor Implementation Techniques
`
`237
`
`Loe Label
`
`Type. Dest ALU
`
`Sl
`
`S2
`
`Misc Cond
`
`0 If etch: M!f /B
`1 !loop: M!f /B
`
`---
`---
`
`---
`---
`
`2
`3
`
`PC
`
`A/J
`M{f /B
`
`ADD
`---
`
`PC
`
`Constant
`---
`
`4 In trot: A/J
`5
`A/J
`
`IAR
`PC
`
`..
`
`Pass Sl PC
`SUB
`Temp Temp
`
`Interrupt?
`Instr Mem?
`Read
`---
`---
`ABf- Decodel
`RF
`---
`---
`
`---
`---
`
`Const/ Comment
`Jump
`
`Intrpt
`!loop
`
`4
`
`Check interruJJt
`IR f-M[PC]; wait
`for memory
`Increment PC
`
`5
`If etch
`
`Interrupt
`PC~O (t minus t=O)
`& go fetch next
`instruction
`
`FIGURE 5.32 Version of Figure 5.23 (page 230) using the dual-format microinstruction in Figure 5.31. Note that
`ALU/Jump microinstructions check the 81 and 82 fields for a constant specifier to see if the next address is sequential (as
`in microinstruction 2); otherwise they go to the Jump address (as in microinstructions 4 and 5). The microprogrammer
`changed the last microinstruction to generate a zero by subtracting a register from itself rather than through straight(cid:173)
`forward use of constant 0. Using the constant would have required an additional microinstruction since this format goes to
`the next sequential instruction if a constant is used. (See Figure 5.31.)
`
`Sometimes performance can be improved by finding faster sequences of
`microcode, but normally it requires changes to the hardware. The branch equal
`zero instruction takes one extra clock cycle when the branch is not taken with
`hardwired control, but two with microcoded control; while branch not equal zero
`has the same performance for hardwired and microcoded control. Why would
`the former differ in performance? Figure 5.29 shows that microinstruction 52
`branches on zero to fetch the next microinstruction, which is correct for the
`branch on not equal zero macroinstruction. Microinstruction 50 also tests for
`zero for the branch on zero macroinstruction and branches to the
`microinstruction that loads the new PC. The not zero case is handled by the
`following microinstruction (51 ), which jumps to fetch the next instruction(cid:173)
`hence, one clock cycle for untaken branch on not equal zero and two for untaken
`branch on equal zero. One solution is simply to add "not zero" to the microcode
`branch conditions in Figure 5.22 (page 229) and change the branch on equal
`microcode to the version in Figure 5.33. Since there are only ten branch
`conditions, adding the eleventh would not require more than the four bits needed
`for an encoded version of that field.
`
`Loe Label
`
`Dest
`
`ALU
`
`Sl
`
`S2
`
`c Misc
`
`Cond
`
`Jump Comment
`label
`
`50 Beq:
`51
`
`SUB
`ADD
`
`A
`PC
`
`Constant 0
`imm16
`
`PC
`
`notO?
`Uncond
`
`!fetch
`If etch
`
`Branch =0
`=0: taken
`
`FIGURE 5.33 Branch not equal microcode from Figure 5.29 (page 234) rewritten by using a not zero condition in
`microinstruction 44.
`
`Ex.1035.269
`
`DELL
`
`
`
`238
`
`5.7 Putting It All Together: Control for DLX
`
`This change drops the CPI from 7 .68 to 7 .63 for microcoded control, yet this
`is still higher than the CPI for hardwired control.
`
`Example
`
`Let's improve microcoded control so that the CPI for GCC is closer to the
`original CPI under hardwired control.
`
`Answer
`
`The main performance culprit is the separate test for interrupts in Figure 5.23.
`By modifying the hardware, decodel can kill two birds with one stone: In
`addition to jumping to the appropriate microinstructions corresponding to the
`opcode, it also jumps to the interrupt microcode if an interrupt is pending. Figure
`5.34 shows the revised microcode. This modification saves one clock cycle from
`each instruction, reducing the CPI to 6.63.
`
`Loe Label
`
`Dest
`
`ALU
`
`Sl
`
`S2
`
`c Misc
`
`Cond
`
`!fetch:
`
`Instr Read Mem?
`
`PC
`
`ADD
`
`PC
`
`Constant 4 AB<::-RF
`
`Decode I
`
`Intrpt:
`
`IAR
`
`SUB
`
`PC
`
`Constant 4
`
`0
`
`1
`
`2
`
`3
`
`Jump Comment
`label
`If etch
`
`IR <::-M[PC]; wait
`for memory
`Also go to interrupt
`if vendinR interrupt
`Interrupt: undo PC
`increment
`PC<::-0 & go fetch
`next instruction
`
`PC
`
`Pass S2
`
`Constant 0
`
`Uncond
`
`If etch
`
`FIGURE 5.34 Revised microcode that takes advantage of a change of the hardwar-e1o have decode1 go to
`microinstruction 2 if there is a pending interrupt. This microinstruction must reverse the increment of PC in the prior
`microinstruction so that the correct value is saved.
`
`5.8 I Fallacies and Pitfalls
`
`,
`Pitfall: Microcode implementing a complex instruction may not be faster than
`macrocode.
`
`At one time, microcode had the advantage of being fetched from a much faster
`memory than macrocode. Since caches came into use in 1968, microcode no
`longer has such a consistent edge in fetch time. Microcode does, however, still
`have the advantage of using internal temporary registers in the computation,
`which can be helpful on machines with few general-purpose registers. The
`disadvantage of micn:~code is that the algorithms must be selected before the
`machine is announced and can't be changed until the next model of the archi-
`
`Ex.1035.270
`
`DELL
`
`
`
`Basic Processor Implementation Techniques
`
`239
`
`tecture; macrocode, · on the other hand, can utilize improvements in its
`algorithms at any time during the life of the machine.
`The VAX Index instruction provides an example: The instruction checks to
`see if the index is between two bounds, one of which is usually zero. The VAX-
`11/780 microcode uses two compares and two branches to do this, while
`macrocode can perform the same check in one compare and one branch. The
`macrocode checks the· index against the upper limit using unsigned
`comparisons, rather than two's complement comparisons. This treats a negative
`index (less than zero and so failing the comparison) as if it were a very large
`number, thus exceeding the upper limit. (The algorithm can be used with
`nonzero lower bounds by first subtracting the lower bound from the index.)
`Replacing the index instruction by this VAX macrocode always improves
`performance on the V AX-11/780.
`
`J
`
`Fallacy: If there is space in control store, new instructions are free of cost.
`
`Since the length of control store is usually a power of two, at times there may be
`unused control store available to expand the instruction set. The analogy here is
`that of building a house and discovering, near completion, that you have enough
`land and materials left to add a room. This room wouldn't be free, however,
`since there would be the costs of labor and maintenance for the life of the home.
`The temptation to add "free" instructions can only occur when the instruction set
`is not fixed, as is likely to be the case in the first model of a computer. Because
`instruction set compatibility is a lo