throbber
Basic Processor Implementation Techniques
`
`219
`
`byte of the instruction is not in main memory-a situation that requires the
`saved PC to point 50 bytes earlier. Imagine the difficulties of restarting an in(cid:173)
`struction with six operands, each of which could be misaligned and thus be
`partially in memory and partially on disk!
`The instructions that are hardest to restart are those that modify some of the
`machine state before it is known whether interrupts can occur. The VAX
`autoincrement and autodecrement addressing modes would naturally modify
`· registers during the addressing phase of execution rather than at the writeback
`phase, and so would be vulnerable to this difficulty. To avoid this problem,
`recent VAXes keep a history queue of the register specifiers and the operations
`on the registers, so that the operations can be reversed on an interrupt. Another
`approach, used on the earlier VAXes, is to record the specifiers and the original
`values of the registers, restoring the original values on interrupt. (The primary
`difference is that it only takes a few bits to record how the address was changed
`due to autoincrement or autodecrement versus the full 32-bit register value.)
`It is not just addressing modes that make the VAX difficult to restart; long(cid:173)
`running instructions mean that interrupts must be checked in the middle of
`execution to prevent long interrupt latency. MOVC3, for example, copies up to
`216 bytes and can take tens of milliseconds to finish-far too long to wait for an
`urgent event. On the other hand, even if there were a way to undo copying in the
`middle of execution so that MOVC 3 could be restarted, interrupts would occur so
`frequently, relative to this long-running instruction (see Figure 5.10 on page
`216), that MOVC3 would be restarted repeatedly under those conditions. Such
`wasted effort from incomplete copies would render MOVC 3 worse than useless.
`DEC divided the problem to conquer it. First, the operands-source address,
`length, and destination address-are fetched from memory and placed into
`general-purpose registers Rl, R2, and R3. If an interrupt occurs during this first
`phase, these registers are restored, and the MOVC 3 is restarted from scratch.
`After this first phase, every time a byte is copied, the length (R2) is decremented
`and addresses (Rl and R3) are incremented. If an interrupt occurs during this
`second phase, MOVC 3 sets the first part done (FPD) bit in the program status
`word. When the interrupt is serviced and the instruction is reexecuted, it first
`checks the FPD bit to see if the operands have already been placed in registers.
`If so, the VAX doesn't fetch the address and length operands, but just continues
`with the current values in the registers, since that is all that remains to be copied.
`This permits more rapid response to interrupts while allowing long-running
`instructions to make progress between interrupts.
`IBM had a similar problem. The 360 included the MVC instruction, which
`copies up to 256 bytes of data. For the early machines without virtual memory,
`the machine simply waited until the instruction was completed before servicing
`interrupts. With the inclusion of virtual memory in the 370, the problem could
`no longer be ignored. Control first tries to access all possible pages, forcing all
`possible virtual memory miss interrupts to occur before moving any data. If any
`interrupts occur in this phase, the instruction is restarted. Control then ignores
`interrupts until the instruction is complete. To allow longer copies, the 370
`
`Ex.1035.251
`
`DELL
`
`

`

`

`

`

`

`

`

`

`

`

`

`Basic Processor Implementation Techniques
`
`225
`
`DLX instructions
`
`Loads
`Stores
`ALU
`Set
`Jumps
`Jump and links
`Branch (taken)
`Branch (not taken)
`
`Minimum
`clock cycles
`
`Memory
`accesses
`
`Total clock
`cycles
`
`6
`5
`5
`6
`3
`5
`4
`3
`
`2
`2
`1
`1
`1
`1
`1
`1
`
`8
`7
`6
`7
`4
`6
`5
`4
`
`FIGURE 5.19 Clock cycles per instruction for DLX categories using the state
`diagram in Figures 5.13 through 5.18. Determining the total clock cycles per category
`requires multiplying the number of memory accesses-including instruction fetches-times
`the average number of wait states, and adding this product to the minimum number of clock
`cycles. We assume an average of 1 clock cycle per memory access. For example, loads
`take eight clock cycles if the average number of wait states is one.
`
`untaken branches need just one. Adding these times to the first portion of
`instruction execution yields the clock cycles per DLX instruction class shown in
`Figure 5.19.
`From Chapter 2, one way to calculate CPI is
`n
`
`~i
`
`1
`
`)
`Instruction count
`
`CPI = ~(CPI· *
`,£..J
`i=l
`Using the DLX instruction mix from Figure C.4 in Appendix C for GCC
`(normalized to 100% ), the percentage of taken branches from Figure 3.22 (page
`107), and one for the average number of wait states per memory access, the
`DLX CPI for this datapath and state diagram is calculated:
`
`Loads
`Stores
`ALU
`Set
`Jumps
`Jump and links
`Branch (taken)
`Branch (not taken)
`
`8 * 21% =
`7 * 12% =
`6 * 37% =
`7 * 6% =
`4 * 2% =
`6 * 0% =
`5 * 12% =
`4 * 11% =
`Total CPI:
`
`1.68
`0.84
`2.22
`0.42
`0.08
`0.00
`0.60
`0.44
`6.28
`
`Thus, the DLX CPI for GCC is about 6.3.
`
`Ex.1035.257
`
`DELL
`
`

`

`

`

`

`

`228
`
`5.7 Putting It All Together: Control for DLX
`
`Example
`
`Answer
`
`Assume a machine with a 10-ns clock cycle (100-MHz clock rate). Suppose that
`on closer inspection the designer discovered that all states could be executed in 9
`ns, except states that use the shifter. Would it be wise to split those states, taking
`two 9-ns clock cycles for shift states and one 9-ns clock for everything else?
`
`Assuming the improvement in the previous example, the average instruction
`execution time for the 100-MHz machine is 5.9*10 ns or 59 ns. The shifter is
`only used in the states of four instructions: SLL, SRL, SRA, and LHI (see Figure
`5.20). In fact, each of these instructions takes 5 clock cycles (including one wait
`state for memory access), and only one of the five original clock cycles need be
`split into two new clock cycles. Thus, the average execution time of these in(cid:173)
`structions changes from 5* 10 ns, or 50 ns, to 6*9 ns, or 54 ns. From Figure C.4
`these 4 instructions are about 11 % of the instructions executed for GCC (after
`normalization), making the average instruction execution time 89% * (5.9*9 ns)
`+ 11 %*54 ns or 53 ns. Thus, splitting the shift state results in a machine that is
`about 10% faster-a wise decision. (See Exercise 5.8 for a more sophisticated
`version of this tradeoff.)
`
`Hardwired control is completed by listing the control signals activated in each
`state, assigning numbers to the states, and finally generating the PLA. Now let's
`implement control using microcode in a ROM.
`
`Microcoded Control for DLX
`
`A custom format such as this is a slave to the architecture of the hardware and
`instruction set which it serves. The format must strike a proper compromise
`between ROM size, ROM-output decoding circuitry size, and machine execution
`rate.
`
`Jim McKevit et al. [ 1977]
`
`Before microprogramming can commence, the microinstruction set must be
`determined. The first step is to list the possible entries for each field of the DLX
`microinstruction format from Figure 5.6 on page 209. Figure 5.7 on page 211
`. lists them for the Destination, Sourcel, and Source2 fields. Figure 5.22 below
`shows the values for the remaining fields.
`Sequencing of microinstructions i:equires further explanation. The
`microprogrammed control includes a microprogram counter to specify th€
`address of the next microinstruction if a branch is not taken, as in Figu_re 5.5 on
`page 208. In addition to the branches using the Jump address field, three tables
`are used to decode the DLX macroinstructions. These tables are indexed with
`the opcodes of the DLX instruction, and supply a microprogram address
`depending on the value in the opcode. Their use will become clear as we
`examine the DLX micr_oprogram.
`
`Ex.1035.260
`
`DELL
`
`

`

`Basic Processor Implementation Techniques
`
`229
`
`Value
`
`ALU
`
`Misc
`
`Cond
`
`0
`
`1
`
`2
`
`3
`
`4
`5
`
`6
`7
`
`8
`
`9
`
`ADD
`
`+
`
`SUB
`
`-
`
`RSUB
`-r
`(reverse sub)
`&
`AND
`
`OR
`XOR
`
`J',
`
`SLL
`SRL
`
`I
`/\
`
`<<
`>>
`
`SRA
`
`>>a
`
`Pass Sl SJ
`
`10
`
`Pass S2 S2
`
`Instr Read
`
`/Rf-
`M[PC]
`DataRead MDRf-
`M[MAR]
`M[MAR]f-
`MDR
`ABf-RF LoadA&B
`from Reg. File
`Write Rd
`Write R31
`(for call)
`
`Write
`
`Rdf-C
`R31f-C
`
`- - -
`
`Go to next sequential microinstruction
`
`Uncond
`
`Always jump
`
`Int?
`
`Pending (between instruction) interrupt?
`
`Mem?
`
`Memory access not complete?
`
`Zero?
`Negative?
`
`Is the ALU output zero?
`ls the ALU output less than zero?
`
`Load?
`Decodel
`(Fig. 5.24)
`Decode2
`(Fig. 5.26)
`Decode3
`(Fig. 5.26)
`
`Is the macroinstruction a DI.x load?
`Address table 1 determines next micro-
`instruction (uses main opcode)
`Address table 2 determines next micro-
`instruction (uses ''func" opcode)
`Address table 3 determines next micro-
`instruction (uses main opcode)
`
`FIGURE 5.22 The options for three fields of the DLX microinstruction format in Figure 5.6 on page 209. The
`possible names are shown on the left of the field name, with an explanation of each field to the right. The real
`microinstruction would contain a bit pattern corresponding to the number in the first column. Combined with Figure 5. 7
`(page 211 ), all the fields are defined except the Constant and Jump address fields, which contain numbers supplied by
`the microprogrammer. »a is an abbreviation for shift right arithmetic and -,means reverse subtract (B -,A= A- B).
`
`Following the lead of the state diagram, the DLX microprogram is divided
`into Figures 5.23, 5.25, 5.27, 5.28, and 5.29, with each section of microcode cor(cid:173)
`responding to one of Figures 5.13 to 5.18 (pages 220-224). The first state in
`Figure 5.13 becomes the first two microinstructions in Figure 5.23. The first
`microinstruction (address 0) branches to microinstruction 3 if there is an
`interrupt pending. Microinstruction 1 fetches an instruction from memory,
`branching back to itself as long as the memory access is not complete.
`Microinstruction 2 increments the PC by 4, loads A and B, and then does the
`first-level decoding. The address of the next microinstruction then depends on
`which macroinstruction is in the instruction register. The microinstruction
`addresses for this first-level macroinstruction decode are specified in Figure
`5.24. (In reality, the table shown in this figure is specified after the
`microprogram is written, as both the number of entries and the corresponding
`locations aren't known until then.)
`
`Ex.1035.261
`
`DELL
`
`

`

`230
`
`5.7 Putting It All Together: Control for DLX
`
`Loe Label Dest ALU
`
`Sl
`
`S2
`
`c Misc
`
`Cond
`
`Jump Comment
`label
`
`0
`
`1
`
`2
`
`3
`
`4
`
`If etch:
`
`Hoop:
`
`Interrupt?
`
`Intrpt
`
`Check interrupt
`
`Instr Read Mem?
`
`Hoop
`
`IR <:-M[PC];
`wait for memory
`
`PC
`
`ADD
`
`PC
`
`Constant
`
`4 AB<:-RF Decodel
`
`Intrpt:
`
`IAR
`
`Pass Sl PC
`
`PC
`
`Pass S2
`
`Constant
`
`0
`
`Uncond
`
`If etch
`
`Interrupt
`
`PC<:-0 & go
`fetch next
`instruction
`
`FIGURE 5.23 The first section of the DLX microprogram, corresponding to the states in Figure 5.13 (page 220).
`The first column contains the absolute address of the microinstruction, followed by a label. The rest of the fields contain
`values from Figures 5. 7 (page 211) and 5.22 for the microinstruction format in Figure 5.6 (page 209). As an example,
`microinstruction 2 corresponds to the second state of Figure 5.13. It sends the output from the ALU into PC, tells the ALU
`to add, puts PC onto the Source1 bus, and a constant from the microinstruction (whose value is 4) onto the Source2 bus.
`In addition, A and Bare loaded from the register file according to the specifiers in IR. Finally, the address of the next
`microinstruction to be executed comes from decode table 1 (Figure 5.24), which depends on the opcode· in the instruction
`register (IR).
`
`Opcodes (symbolically
`specified)
`
`Absolute
`address
`
`Label
`
`Figure
`
`Memory
`Move to special
`Move from special
`S2=B
`S2 = Immediate
`Branch equal zero
`Branch not equal zero
`Jump
`Jump register
`Jump and link
`Jump and link register
`Trap
`
`5
`20
`21
`23
`24
`50
`52
`54
`55
`56
`58
`60
`
`Mem:
`Movl2S:
`MovS2I:
`Reg:
`Imm:
`Beq:
`Bne:
`Jump:
`JReg:
`JAL:
`JALR:
`Trap:
`
`5.25
`5.25
`5.25
`5.27
`S.27
`5.29
`5.29
`5.29
`5.29
`5.29
`5.29
`5.29
`
`FIGURE 5.24 Opcodes and corresponding addresses for decode table 1. The
`opcodes are shown symbolically on the left, followed by the addresses with the absolute
`microinstruction address, a label, and the figure where the microcode can be found. If this
`table were implemented with a ROM it would contain 64 entries corresponding to the 6-bit
`opcode of DLX. As this would clearly result in many redundant or unspecified entries, a
`PLA could be used to minimize hardware.
`
`Figure 5.25 contains the DLX load and store instructions. Microinstruction 5
`calculates the effective address, and branches to microinstruction 9 if the
`
`Ex.1035.262
`
`DELL
`
`

`

`Basic Processor Implementation Techniques
`
`231
`
`macroinstruction in the IR is a load. If not, microinstruction 6 loads MDR with
`the value to be stored, and microinstruction 7 jumps to itself until the memory is
`finished writing the data. Microinstruction 8 then jumps back to microinstruction
`0 (Figure 5.23) to begin the execution cycle all over again. If the macroinstruc(cid:173)
`tion was a load, microinstruction 9 loops until the data has been read. Micro(cid:173)
`instruction 10 then uses decode table 2 (specified in Figure 5.26) to specify the
`address of the next microinstruction. Unlike the first decode table, this table is
`used by other microinstructions. (There is no conflict in multiple uses since the
`opcodes for each instance are different.)
`Suppose the instruction were load halfword. Figure 5.26 shows that the result
`of decode 2 would be to jump to microinstruction 15. This microinstruction
`shifts the contents of MDR to the left 16 bits and stores the result in Temp. The
`following microinstruction shifts Temp right arithmetically 16 bits and puts the
`result in C. C now contains the 16 rightmost bits of MDR, with the upper 16 bits
`containing the extended sign. This microinstruction jumps to location 22, which
`writes C back into the destination register specifier in IR, and then jumps to
`fetch the next macroinstruction starting at location 0 (Figure 5.23).
`
`f',
`
`Loe Label
`
`Dest
`
`ALU
`
`MAR
`MDR
`
`ADD
`Pass S2
`
`5 Mem:
`6 Store:
`7 Dloop:
`8
`9 Load:
`IO
`11 LB:
`
`Sl
`
`A
`
`S2
`
`c Misc
`
`immI6
`B
`
`Cond
`
`Load?
`
`Jump Comment
`label
`Load
`
`M emorv instruct.
`Store
`
`Data write Mem?
`Uncond
`Data read Mem?
`Decode2
`
`Dloop
`If etch
`Load
`
`Fetch next instr.
`LoadMDR
`
`Temp
`
`SLL
`
`MDR Constant 24
`
`I2
`
`c
`
`SRA
`
`Temp Constant 24
`
`Temp
`13 LBU:
`c
`14
`Temp
`IS LH:
`c
`I6
`Temp
`17 LHU:
`c
`18
`c
`I9 LW:
`20 Movl2S:
`IAR
`21 MovS21: c
`22 Write I:
`
`MDR Constant 24
`SLL
`Temp Constant 24
`SRL
`MDR Constant
`I6
`SLL
`Temp Constant
`I6
`SRA
`MDR Constant
`I6
`SLL
`Temp Constant
`I6
`SRL
`Pass Sl MDR
`Pass SI A
`IAR
`Pass SI
`
`Rdf--C
`
`Uncond Write I
`
`Uncond Write I
`
`Uncond Write I
`
`Load byte; shift left to
`remove uvver 24 bits
`Shift right arithmetic
`to sifm extend
`LB unsi1med
`Shift rif~ht lof!.ical
`Load half
`Shift rif!.ht arithmetic
`LH Unsif!.ned
`Uncond Write I
`Shift rif!.ht lof!.ical
`Uncond Write I Load word
`If etch Move to special
`Uncond
`Move from spec.
`If etch Write back & go fetch
`next instruction
`
`Uncond
`
`FIGURE 5.25 The section of the DLX microprogram for loads and stores, corresponding to the states in Figure
`5.14 (page 221). The microcode for bytes and halfwords takes an extra microinstruction to align the data (see Figure
`3.10, page 97). Note that microinstruction 5 loads A from Rd, just in case the instruction is a store. The label lfetch is for
`microinstruction 0 in Figure 5.23 on page 230.
`
`Ex.1035.263
`
`DELL
`
`

`

`232
`
`5.7 Putting It All Together: Control for DLX
`
`Opcode
`
`Load byte
`Load byte unsigned
`Load half
`Load half unsigned
`Load word
`ADD
`SUB
`AND
`OR
`XOR
`SLL
`SRL
`SRA
`LHI
`Set equal
`Set not equal
`Set less than
`Set greater than or equal
`Set greater than
`Set less than or equal
`
`Absolute
`address
`
`Label
`
`Figure
`
`11
`13
`15
`17
`19
`25
`26
`27
`28
`29
`30
`31
`32
`33
`35
`37
`39
`41
`43
`45
`
`LB:
`LBU:
`LH:
`LHU:
`LW:
`ADD/I:
`SUB/I:
`AND/I:
`OR/I:
`XOR/I:
`SLL/I:
`SRL/I:
`SRA/I:
`LHI:
`SEQ/I:
`SNEil:
`SLT/I:
`SGE/I:
`SGT/I:
`SLE/I:
`
`5.25
`5.25
`5.25
`5.25
`5.25
`5.27
`5.27
`5.27
`5.27
`5.27
`5.27
`5.27
`5.27
`5.27
`5.28
`5.28
`5.28
`5.28
`5.28
`5.28
`
`FIGURE 5.26 Opcodes and corresponding addresses for decode tables 2 and 3. The
`opcodes are shown symbolically on the left, followed by the absolute microinstruction
`address, the corresponding label, and the figure where the microcode can be found. Since
`the opcodes are shown symbolically, and they go to the same place in both tables, the
`same information can be used for specifying decode tables 2 and 3. This similarity is
`attributable to the immediate version and register version of the DLX instructions sharing
`the same microcode. If a table were implemented with a ROM, it would contain 64 entries
`corresponding to the 6-bit opcode of DLX. Again, the many redundant or unspecified
`entries suggest the use of a PLA to minimize hardware cost.
`
`The ALU instructions are found in Figure 5.27. The first two microinstruc(cid:173)
`tions correspond to the states at the top of Figure 5.15 (page 222). After loading ..
`Temp with either the register or the immediate, each uses a decode table to
`vector to the microinstruction that executes the ALU instruction. To save
`microcode space, the same microinstruction is used whether the operand is a
`register or an immediate. One of the microinstructions between 25 and 33 is
`executed, storing its result in C. It then jumps to microinstruction 34, which
`stores C into the register specified in the IR, and in turn jumps to fetch the next
`macroinstruction.
`
`Ex.1035.264
`
`DELL
`
`

`

`Basic Processor Implementation Techniques
`
`233
`
`Loe Label
`
`Dest ALU
`
`Sl
`
`S2
`
`c Misc
`
`Cond
`
`Jump Comment
`label
`
`23 Reg:
`24 Imm:
`25 ADD/I:
`26 SUB/I:
`27 AND/I:
`28 OR/I:
`29 XOR/I:
`30 SLL/I:
`31 SRL/I:
`32 SRA/I:
`33 LHI:
`34 Write2:
`
`Temp Pass S2
`Temp Pass S2
`c
`ADD
`c
`SUB
`c
`AND
`c
`OR
`c
`XOR
`c
`SLL
`c
`SRL
`c
`SRA
`c
`SLL
`
`B
`Imm
`Temp
`A
`Temp
`A
`Temp
`A
`Temp
`A
`Temp
`A
`Temp
`A
`Temp
`A
`Temp
`A
`Temp Constant 16
`
`.,
`
`Rdf-C
`
`--
`
`source2 = ref!.
`Decode2
`source2 = imm.
`Decode3
`Uncond Write2 ADD
`Uncond Write2
`SUB
`Uncond Write2 AND
`Uncond Write2 OR
`Uncond Write2 XOR
`Uncond Write2 SLL
`Uncond Write2 SRL
`Uncond Write2
`SRA
`Uncond Write2 LHI
`If etch Write back & go
`Uncond
`fetch next instruction
`
`FIGURE 5.27 .Like the first two states in Figure 5.15 (page 222), microinstructions 23 and 24 load Temp with an
`operand and then vector to the appropriate microinstruction, depending on the opcode in IR. One of the nine
`following microinstructions is executed, leaving its result in C. C is written back into the register specified in the register
`destination field of DLX macroinstruction in IR in microinstruction 34.
`
`Loe Label
`
`Dest ALU
`
`Sl
`
`S2
`
`c Misc
`
`35 SEQ/I:
`36
`37 SNE/I:
`38
`39 SLT/I:
`40
`41 SGE/I:
`42
`43 SGT/I:
`44
`45 SLE/I:
`46
`47 Seto:
`48 Setl:
`49 Write4:
`
`c
`
`c
`
`c
`
`c
`
`c
`
`c
`c
`c
`
`A
`
`A
`
`A
`
`A
`
`A
`
`A
`
`SUB
`Pass S2
`SUB
`Pass S2
`SUB
`Pass S2
`SUB
`Pass S2
`RSUB
`Pass S2
`RSUB
`Pass S2
`Pass S2
`Pass S2
`
`Temp
`Constant 0
`Temp
`Constant 1
`Temp
`Constant 0
`Temp
`Constant 1
`Temp
`Constant 0
`Temp
`Constant 1
`Constant 0
`Constant 1
`
`Rdf-C
`
`Cond
`
`Jump' Comment
`label
`Set equal
`Setl
`Zero?
`Uncond Write4 AR (set to false)
`Zero?
`Seto
`Set not equal
`Uncond Write4 AR (set to true)
`Negative? Setl
`Set less than
`Uncond Write4 A~T (set to false)
`Negative? Seto
`Set GT or equal
`Uncond Write4 A~T (set to true)
`Negative? Setl
`Set weater than
`Uncond Write4 T~ (set to false)
`Negative? Seto
`Set LT or equal
`Uncond Write4 T~ (set to true)
`Uncond Write4 Set to 0 =false
`Set to 1 = true
`If etch Write back &fetch
`· next instruction
`
`Uncond
`
`FIGURE 5.28 Corresponding to Figure 5.16 (pages 222-223), this microcode performs the DLX Set instructions.
`As in the previous figure, to save space these same microinstructions execute either the version of set using registers or
`the version using immediates. The tricky microcode is found in microinstructions 43 and 45, where the subtraction Temp -
`A is unlike the earlier microcode. Remember that A-, Temp= Temp-A (see Figure 5.22 on page 229).
`
`Ex.1035.265
`
`DELL
`
`

`

`234
`
`5.7 Putting It All Together: Control for DLX
`
`Loe Label
`
`Dest
`
`c Misc
`
`Figure 5.28 corresponds to the states in Figure 5.16 (pages 222-223), except
`that the top two states that load Temp are microinstructions 23 and 24 of the pre(cid:173)
`vious figure; the decode tables will either jump to locations 25 to 34 in Figure
`5.27, or 35 to 45 in Figure 5.28, depending on the opcode. The microinstructions
`for Set perform relative tests by having the ALU subtract Temp from A and then
`test the ALU output to see if the result is zero or negative. Depending on the test
`result, C is set to 1 or 0 and written back in the register file before going to fetch
`the next macroinstruction. Tests for A = Temp, A '* Temp, A < Temp, and A;;:::
`Temp are straightforward using these conditions on the ALU output A - Temp.
`A > Temp and A ~ Temp, on the other hand, are not simple, but can be done
`using the negative condition with the subtraction reversed:
`(Temp-A< 0) = (Temp< A) = (A> Temp)
`If the result is negative, then A > Temp, otherwise A ~ Temp. Voila!
`Figure 5.29 contains the last of the DLX microcode and corresponds to the
`states found in Figures 5.17 and 5.18 (pages 222-224). Microinstruction 50,
`corresponding to the macroinstruction branch on equal zero, tests if A equals
`zero. If it does, the macroinstruction branch succeeds, and the microinstruction
`jumps to the microinstruction 53. This microinstruction loads the PC with the
`PC-relative address and then jumps to the microcode that fetches the new
`macroinstruction (location 0). If A does not equal zero, the macroinstruction
`branch fails, so that the next sequential microinstruction (51) executes, jumping
`to location 0 without changing the PC.
`A state usually corresponds to a single microinstruction, although in a few
`cases above two microinstructions were needed. The jump and link instructions
`have the reverse case, with two ptates collapsing into one microinstruction. The
`actions in the last two states of] ump and link in Figure 5 .17 are found in micro(cid:173)
`instruction 57, and similarly for the jump and link register with microinstruction
`59. These microinstructions load the PC with the PC-relative branch address and
`save C into R3 l.
`Sl
`S2
`ALU
`
`SUB
`
`A
`
`Constant 0
`
`Constant 0
`imm16
`imm26
`
`50 Beq:
`51
`52 Bne:
`53 Branch:
`54 Jump:
`55 JReg:
`56 JAL:
`57
`58 JALR:
`59
`60 Trap:
`61
`
`A
`SUB
`PC
`ADD
`PC
`ADD
`Pass Sl A
`Pass Sl PC
`ADD
`PC
`Pass Sl PC
`Pass Sl A
`Pass Sl PC
`Pass S2
`
`PC
`PC
`PC
`c
`PC
`c
`PC
`IAR
`PC
`
`imm26
`
`R31f-C
`
`Uncond
`
`If etch
`
`R31f-C
`
`Uncond
`
`If etch
`
`imm26
`
`Uncond
`
`If etch
`
`Cond
`
`O?
`Uncond
`O?
`Uncond
`Uncond
`Uncond
`
`Jump Comment
`label
`Branch
`If etch
`If etch
`If etch
`If etch
`If etch
`
`Instr is branch =0
`:;t(): not taken
`Instr is branch ;t: 0
`:;t(): taken
`Jump
`Jump reRister
`Jump and link
`Jump & save PC
`Jump & link ref!
`Jump & save PC
`Trap
`
`FIGURE 5.29 The microcode for branch and jump DLX instructions, corresponding to the states i11 Figures 5.17
`and 5.18 on pages 222-224.
`
`Ex.1035.266
`
`DELL
`
`

`

`Basic Processor Implementation Techniques
`
`. 235
`
`Performance of Microcoded Control for DLX
`
`Before trying to improve performance or reduce costs of control, the existing
`performance must be assessed. Again, the process is to count the clock cycles
`for each instruction, but this time there is a larger variety in performance.
`All instructions execute microinstructions 0, 1, and 2 in Figure 5.23 (page
`230), giving a base of 3 clocks plus wait states, depending on the repetition of
`microinstruction 1. The clock cycles for the rest of the categories are:
`4 for stores, plus wait states
`5 for load word, plus wait states
`6 for load byte or load half (signed or unsigned), plus wait states
`3 for ALU
`4 for set
`2 for branch equal zero (taken or untaken)
`2 for branch not equal zero (taken)
`1 for branch not equal zero (untaken)
`1 for jumps
`2 for jump and links
`
`Using the instruction mix for GCC in Figure C.4, and assuming an average of 1
`wait state per memory access, the CPI is 7.68. This is higher than the hardwired
`control CPI, because the test for interrupt takes another clock cycle at the begin(cid:173)
`ning, loads and stores are slower, and branch equal zero is slower for the
`untaken case.
`
`Reducing Cost and Improving Performance of DLX
`When Control Is Microcoded
`
`The size of a completely unencoded version of the DLX microinstruction is
`calculated from the number of entries in Figures 5. 7 (page 211) and 5 .22 (page
`229) phis the size of the Constant and Jump address fields. The largest constant
`in the fields is 24, which requires 5 bits, and the largest address is 61, which
`requires 6. Figure 5.30 shows the microinstruction fields, the unencoded widths,
`and the encoded widths. Encoding almost halves the size of control store.
`
`Dest
`
`ALU
`operation
`
`Sourcel Source2 Constant Misc
`
`Cond
`
`Jump
`address
`
`Total
`
`Unencoded
`Encoded
`
`7
`3
`
`11
`4
`
`9
`4
`
`9
`4
`
`5
`5
`
`6
`3
`
`10
`4
`
`6
`6
`
`= 63 bits
`= 33 bits
`
`FIGURE 5.30 Width of field in bits of unencoded and encoded microinstruction formats. Note that the Constant
`and Jump address fields are not encoded in this example, placing fewer restrictions on the microprogram using the
`encoded format.
`
`Ex.1035.267
`
`DELL
`
`

`

`

`

`Basic Processor Implementation Techniques
`
`237
`
`Loe Label
`
`Type. Dest ALU
`
`Sl
`
`S2
`
`Misc Cond
`
`0 If etch: M!f /B
`1 !loop: M!f /B
`
`---
`---
`
`---
`---
`
`2
`3
`
`PC
`
`A/J
`M{f /B
`
`ADD
`---
`
`PC
`
`Constant
`---
`
`4 In trot: A/J
`5
`A/J
`
`IAR
`PC
`
`..
`
`Pass Sl PC
`SUB
`Temp Temp
`
`Interrupt?
`Instr Mem?
`Read
`---
`---
`ABf- Decodel
`RF
`---
`---
`
`---
`---
`
`Const/ Comment
`Jump
`
`Intrpt
`!loop
`
`4
`
`Check interruJJt
`IR f-M[PC]; wait
`for memory
`Increment PC
`
`5
`If etch
`
`Interrupt
`PC~O (t minus t=O)
`& go fetch next
`instruction
`
`FIGURE 5.32 Version of Figure 5.23 (page 230) using the dual-format microinstruction in Figure 5.31. Note that
`ALU/Jump microinstructions check the 81 and 82 fields for a constant specifier to see if the next address is sequential (as
`in microinstruction 2); otherwise they go to the Jump address (as in microinstructions 4 and 5). The microprogrammer
`changed the last microinstruction to generate a zero by subtracting a register from itself rather than through straight(cid:173)
`forward use of constant 0. Using the constant would have required an additional microinstruction since this format goes to
`the next sequential instruction if a constant is used. (See Figure 5.31.)
`
`Sometimes performance can be improved by finding faster sequences of
`microcode, but normally it requires changes to the hardware. The branch equal
`zero instruction takes one extra clock cycle when the branch is not taken with
`hardwired control, but two with microcoded control; while branch not equal zero
`has the same performance for hardwired and microcoded control. Why would
`the former differ in performance? Figure 5.29 shows that microinstruction 52
`branches on zero to fetch the next microinstruction, which is correct for the
`branch on not equal zero macroinstruction. Microinstruction 50 also tests for
`zero for the branch on zero macroinstruction and branches to the
`microinstruction that loads the new PC. The not zero case is handled by the
`following microinstruction (51 ), which jumps to fetch the next instruction(cid:173)
`hence, one clock cycle for untaken branch on not equal zero and two for untaken
`branch on equal zero. One solution is simply to add "not zero" to the microcode
`branch conditions in Figure 5.22 (page 229) and change the branch on equal
`microcode to the version in Figure 5.33. Since there are only ten branch
`conditions, adding the eleventh would not require more than the four bits needed
`for an encoded version of that field.
`
`Loe Label
`
`Dest
`
`ALU
`
`Sl
`
`S2
`
`c Misc
`
`Cond
`
`Jump Comment
`label
`
`50 Beq:
`51
`
`SUB
`ADD
`
`A
`PC
`
`Constant 0
`imm16
`
`PC
`
`notO?
`Uncond
`
`!fetch
`If etch
`
`Branch =0
`=0: taken
`
`FIGURE 5.33 Branch not equal microcode from Figure 5.29 (page 234) rewritten by using a not zero condition in
`microinstruction 44.
`
`Ex.1035.269
`
`DELL
`
`

`

`238
`
`5.7 Putting It All Together: Control for DLX
`
`This change drops the CPI from 7 .68 to 7 .63 for microcoded control, yet this
`is still higher than the CPI for hardwired control.
`
`Example
`
`Let's improve microcoded control so that the CPI for GCC is closer to the
`original CPI under hardwired control.
`
`Answer
`
`The main performance culprit is the separate test for interrupts in Figure 5.23.
`By modifying the hardware, decodel can kill two birds with one stone: In
`addition to jumping to the appropriate microinstructions corresponding to the
`opcode, it also jumps to the interrupt microcode if an interrupt is pending. Figure
`5.34 shows the revised microcode. This modification saves one clock cycle from
`each instruction, reducing the CPI to 6.63.
`
`Loe Label
`
`Dest
`
`ALU
`
`Sl
`
`S2
`
`c Misc
`
`Cond
`
`!fetch:
`
`Instr Read Mem?
`
`PC
`
`ADD
`
`PC
`
`Constant 4 AB<::-RF
`
`Decode I
`
`Intrpt:
`
`IAR
`
`SUB
`
`PC
`
`Constant 4
`
`0
`
`1
`
`2
`
`3
`
`Jump Comment
`label
`If etch
`
`IR <::-M[PC]; wait
`for memory
`Also go to interrupt
`if vendinR interrupt
`Interrupt: undo PC
`increment
`PC<::-0 & go fetch
`next instruction
`
`PC
`
`Pass S2
`
`Constant 0
`
`Uncond
`
`If etch
`
`FIGURE 5.34 Revised microcode that takes advantage of a change of the hardwar-e1o have decode1 go to
`microinstruction 2 if there is a pending interrupt. This microinstruction must reverse the increment of PC in the prior
`microinstruction so that the correct value is saved.
`
`5.8 I Fallacies and Pitfalls
`
`,
`Pitfall: Microcode implementing a complex instruction may not be faster than
`macrocode.
`
`At one time, microcode had the advantage of being fetched from a much faster
`memory than macrocode. Since caches came into use in 1968, microcode no
`longer has such a consistent edge in fetch time. Microcode does, however, still
`have the advantage of using internal temporary registers in the computation,
`which can be helpful on machines with few general-purpose registers. The
`disadvantage of micn:~code is that the algorithms must be selected before the
`machine is announced and can't be changed until the next model of the archi-
`
`Ex.1035.270
`
`DELL
`
`

`

`Basic Processor Implementation Techniques
`
`239
`
`tecture; macrocode, · on the other hand, can utilize improvements in its
`algorithms at any time during the life of the machine.
`The VAX Index instruction provides an example: The instruction checks to
`see if the index is between two bounds, one of which is usually zero. The VAX-
`11/780 microcode uses two compares and two branches to do this, while
`macrocode can perform the same check in one compare and one branch. The
`macrocode checks the· index against the upper limit using unsigned
`comparisons, rather than two's complement comparisons. This treats a negative
`index (less than zero and so failing the comparison) as if it were a very large
`number, thus exceeding the upper limit. (The algorithm can be used with
`nonzero lower bounds by first subtracting the lower bound from the index.)
`Replacing the index instruction by this VAX macrocode always improves
`performance on the V AX-11/780.
`
`J
`
`Fallacy: If there is space in control store, new instructions are free of cost.
`
`Since the length of control store is usually a power of two, at times there may be
`unused control store available to expand the instruction set. The analogy here is
`that of building a house and discovering, near completion, that you have enough
`land and materials left to add a room. This room wouldn't be free, however,
`since there would be the costs of labor and maintenance for the life of the home.
`The temptation to add "free" instructions can only occur when the instruction set
`is not fixed, as is likely to be the case in the first model of a computer. Because
`instruction set compatibility is a lo

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket