`
`Chapter 4 Arithmetic for Computers
`
`4.5 Constructing an Arithmetic Logic Unit
`
`237
`
`11
`
`Binvert
`
`Operation
`
`Carryln
`
`a-+-----+--.--i
`
`Result
`
`Carryout
`
`FIGURE 4.16 A 1-bit ALU that performs AND, OR, and addition on a and b or a and b. By
`selecting b (Binvert = 1) and setting Carry In to 1 in the _least significant bit of the ALU, we get
`two's complement subtraction of b from a instead of add1t1on of b to a.
`
`The adder will then calculate a + b + l. By selecting the inverted version of b,
`we get exactly what we want:
`
`a+b + I= a+(b+ I)= a+(- b) = a-b
`
`The simplicity of the hardware design of a two's complement adder helps
`explain why two's complement representation has become the universal stan(cid:173)
`dard for integer computer arithmetic.
`
`Tailoring the 32-Bit ALU to MIPS
`This set of operations-add, subtract, AND, OR-is found in the ALU of
`almost every computer. If we look at Figure 4.7 on page 228, we see that the
`operations of most MIPS instructions can be performed by this ALU. But the
`design of the ALU is incomplete.
`.
`.
`One instruction that still needs support is the set on less than mstruct10n
`(s l t ). Recall that the operation produces 1 if rs < rt, and 0 otherwise. Conse(cid:173)
`quently, s l t will set all but the least significant bit to 0, with the least signifi(cid:173)
`cant bit set according to the comparison. For the ALU to perform s l t , we first
`need to expand the three-input multiplexor in Figure 4.16 to add an input for
`the s l t result. We call that new input Less, and use it only for s l t .
`
`The top drawing of Figure 4.17 shows the new I-bit ALU with the expanded
`multiplexor. From the description of s l t above, we must connect Oto the Less
`input for the upper 31 bits of the ALU, since those bits are always set to 0. What
`remains to consider is how to compare and set the least significant bit for set on
`less than instructions.
`What happens if we subtract b from a? If the difference is negative, then a <
`b since
`
`(a - b) < 0 (cid:157)
`
`((a - b) + b) < (0 + b)
`a<b
`
`We want the least significant bit of a set on less than operation to be a 1 if a <
`b; that is, a 1 if a - b is negative and a 0 if it's positive. This desired result cor(cid:173)
`responds exactly to the sign-bit values: 1 means negative and 0 means posi(cid:173)
`tive. Following this line of argument, we need only connect the sign bit from
`the adder output to the least significant bit to get set on less than .
`Unfortunately, the Result output from the most significant ALU bit in the
`top of Figure 4.17 for the s l t operation is not the output of the adder; the ALU
`output for the s l t operation is obviously the input value Less.
`Thus, we need a new I-bit ALU for the most significant bit that has an extra
`output bit: the adder output. The bottom drawing of Figure 4.17 shows the de(cid:173)
`sign, with this new adder output line called Set, and used only for s l t. As long
`as we need a special ALU for the most significant bit, we added the overflow
`detection logic since it is also associated with that bit.
`Alas, the test of less than is a little more complicated than just described be(cid:173)
`cause of overflow; Exercise 4.23 on page 326 explores what must be done.
`Figure 4.18 shows the 32-bit ALU.
`Notice that every time we want the ALU to subtract, we set both Carryin
`and Binvert to l. For adds or logical operations, we want both control lines to
`be 0. We can therefore simplify control of the ALU by combining the Carryln
`and Binvert to a single control line called Bnegate.
`To further tailor the ALU to the MIPS instruction set, we must support con(cid:173)
`ditional branch instructions. These instructions branch either if two registers
`are equa l or if they are unequal. The easiest way to test equality with the ALU
`is to subtract b from a and then test to see if the result is 0 since
`(a - b = 0) (cid:157)
`a = b
`Thus, if we add hardware to test if the result is 0, we can test for equality.
`The simplest way is to OR all the outputs together and then send that signal
`through an inverter:
`Zero = (Result31 + Result30 + ... + Result2 + Resultl + Result0)
`Figure 4.19 shows the revised 32-bit ALU. We can think of the combination
`of the I-bit Bnegate line and the 2-bit Operation lines as 3-bit control lines for
`
`INTEL - 1012
`
`(cid:157)
`
`
`238
`
`Chapter 4 Arithmetic for Computers
`
`4.5 Constructing an Arithmetic Logic Unit
`
`239
`
`Binvert
`
`Operation
`Carryln
`
`a - - i - - - - - - - t - - - - -...
`
`Result
`
`Less -+------------+---+t3
`
`Carryout
`
`Binvert
`
`Operation
`Carryln
`
`a -+-----+----11>1
`
`0
`
`1
`
`2
`
`Result
`
`1
`
`Less -+------+-+-t-----t---t--i3
`
`------r-- Set
`
`Overflow
`detection
`
`r----"T""+ Overflow
`
`(Top) A 1-blt ALU that performs AND, OR, and addition on a and b orb, and
`FIGURE 4.17
`(bottom) a 1-blt ALU for the most significant bit. The top drawing includes a direct input that
`is connected to perform the set on less than operation (see Figure 4.18); the bottom has a direct out(cid:173)
`put from the adder for the less than comparison called Set. (Refer to Exercise 4.42 to see how to cal(cid:173)
`culate overflow with fewer inputs.)
`
`Binvert
`
`Carryln
`
`Operation
`
`aO
`bO
`
`al
`bl
`0
`
`a2
`b2
`0
`
`r-------r--- ResultO
`
`r-----"T""-. Resu ltl
`
`Carryln
`ALU2
`Less
`Carryout
`
`r-----"T""-. Result2
`
`Carryln
`
`a31
`b31
`0
`
`r - - - - - - - . Result31
`
`r------1--. Overflow
`
`FIGURE 4.18 A 32-bit ALU constructed from the 31 copies of the 1-bit ALU in the top of
`Figure 4.17 and one 1-bit ALU in the bottom of that figure. The Less inputs are connected to
`0 except for the least significant bit, and that is connected to the Set output of the most significant
`bit. If the ALU performs a - band w e select the input 3 in the multiplexor in Figure 4.17, then
`Result = 0 . .. 001 if a < b, and Result = 0 ... 000 otherwise.
`
`the ALU, telling it to perform add, subtract, AND, OR, or set on less than.
`Figure 4.20 shows the ALU control lines and the corresponding ALU opera(cid:173)
`tion.
`Finally, now that we have seen what is inside a 32-bit ALU, we will use the
`universal symbol for a complete ALU, as shown in Figure 4.21.
`
`INTEL - 1012
`
`
`
`240
`
`Chapter 4 Arithmetic for Computers
`
`4.5 Constructing an Arithmetic Logic Unit
`
`241
`
`Bnegate
`
`Operation
`
`ao
`bO
`
`a1
`b1
`0
`
`a2
`b2
`0
`
`a31
`b31
`0
`
`ResultO
`
`Less
`Carryout
`
`Result1 Dv--z,rn
`
`Result2
`
`Less
`Carryout
`
`Result31
`
`1 - - - - - - - , Set
`1-------+- - - - - - - - - - - - - Overflow
`
`Less
`
`FIGURE 4.19 The final 32-bit ALU. This adds a Zero detector to Figure 4.18.
`
`ALU control lines
`
`Function
`
`000
`001
`010
`110
`111
`
`and
`or
`add
`subtract
`set on less than
`
`FIGURE 4.20 The values of the three ALU control lines Bnegate and Operation and the
`corresponding ALU operations.
`
`ALU operation
`
`a
`
`b
`
`Carryout
`
`Zero
`Result
`Overflow
`
`FIGURE 4.21 The symbol commonly used to represent an ALU, as shown in Figure 4.19.
`This symbol is also used to represent an adder, so it is normally labeled either with ALU or
`Adder.
`
`Carry Lookahead
`
`The next question is, How quickly can this ALU add two 32-bit operands? We
`can determine the a and b inputs, but the Carry In input depends on the oper(cid:173)
`ation in the adjacent 1-bit adder. If we trace all the way through the chain of
`dependencies, we connect the most significant bit to the least significant bit,
`so the most significant bit of the sum must wait for the sequential evaluation of
`all 32 1-bit adders. This sequential chain reaction is too slow to be used in
`time-critical hardware.
`There are a variety of schemes to anticipate the carry so that the worst-case
`scenario is a function of the log2 of the number of bits in the adder. These an(cid:173)
`ticipatory signals are faster because they go through fewer gates in sequence,
`but it takes many more gates to anticipate the proper carry.
`A key to understanding fast carry schemes is to remember that, unlike soft(cid:173)
`ware, hardware executes in parallel whenever inputs change.
`
`Fast Carry Using "Infinite" Hardware
`Appendix B mentions that any equation can be represented in two levels of
`logic. Since the only external inputs are the two operands and the Carryln to
`the least significant bit of the adder, in theory we could calculate the Carryln
`values to all the remaining bits of the adder in just two levels of logic.
`For example, the Carry In for bit 2 of the adder is exactly the CarryOut of bit
`1, so the formula is
`Carryln2 = (b I · Carry In I)+ (a I · Carrylnl) + (a I· b I)
`Similarly, Carrylnl is defined as
`Carrylnl = (b0 · Carryln0) + (a0 · Carryln0) + (a0 · 60)
`
`INTEL - 1012
`
`
`
`242
`
`Chapter 4 Arithmetic for Computers
`
`Using the shorter and more traditional abbreviation of ci for Carrylni we can
`rewrite the formulas as
`'
`
`c2 = ( b 1 · c 1 ) + ( a 1 · c 1) + ( a 1 · b 1 )
`cl= (b0-c0)+(a0-c0)+(aO-bO)
`
`Substituting the definition of cl for the first equation results in this formula:
`c2 = (al ·a0-b0)+(al-a0-c0)+(al -b0-cO)
`+(bl -a0-b0)+(bl ·a0-cO)+(bl -b0-cO)+(al -bl)
`
`You ca~ imagine how the equation expands as we get to higher bits in the
`adder; rt _grows exponentially with the number of bits. This complexity is
`refle~t~~ m the cost of the hardware for fast carry, making this simple scheme
`proh1b1hvely expensive for wide adders.
`
`Fast Carry Using the First Level of Abstraction: Propagate and
`Generate
`
`Most fast carr~ sch~mes li~it the complexity of the equations to simplify the
`hardware, while still makmg substantial speed improvements over ripple
`carry. One such scheme is a carry-lookahead adder. In Chapter 1, we said com(cid:173)
`puter systems cope _with complexity by using levels of abstraction. A carry(cid:173)
`lookahead adder relies on levels of abstraction in its implementation.
`Let's factor our original equation as a first step:
`
`ci+l = (bi· ci) + (ai • ci) + (ai. bi)
`= (ai ·bi)+ (ai +bi)· ci
`
`If we were to rewrite the equation for c2 using this formula, we would see
`some repeated patterns:
`
`c2= (al -bl)+(al +bl)·((a0-b0)+(aO+bO)-cO)
`
`Note the repeated appearance of (ai • bi) and (ai + bi) in the formula above.
`T~ese two important factors are traditionally called generate (gi) and propagate
`(pz):
`
`I
`I
`I I
`
`Using them to define ci+ 1, we get
`
`gi = ai · bi
`pi = ai + bi
`
`ci+ 1 = gi + pi • ci
`To see where the signals get their names, suppose gi is 1. Then
`
`ci+l = gi +pi· ci = I+ pi• ci = 1
`
`4.5 Constructing an Arithmetic Logic Unit
`
`243
`
`That is, the adder generates a CarryOut (ci+ 1) independent of the value of Car(cid:173)
`ry In (ci). Now suppose that gi is O and pi is 1. Then
`
`ci+l = gi +pi· ci = 0 + I · ci = ci
`
`That is, the adder propagates Carryln to a CarryOut. Putting the two together,
`Carrylni+l is a 1 if either gi is 1 or both pi is 1 and Carrylni is 1.
`As an analogy, imagine a row of dominoes set on edge. The end domino can
`be tipped over by pushing one far away provided there are no gaps between
`the two. Similarly, a carry out can be made true by a generate far away provid(cid:173)
`ed all the propagates between them are true.
`Relying on the definitions of propagate and generate as our first level of ab(cid:173)
`straction, we can express the Carryln signals more economically. Let's show it
`for 4 bits:
`
`cl = go+ (pO • cO)
`
`c2 = gl + (pl · gO) + (pl · pO · cO)
`
`c3 = g2 + (p2 · gl) + (p2 · pl · gO) + (p2 · pl · pO · c0)
`
`c4 = g3 + (p3 · g2) + (p3 · p2 · gl) + (p3 · p2 · pl · g0)
`+ (p3 · p2 · pl · pO · cO)
`
`These equations just represent common sense: Carrylni is a 1 if some earlier
`adder generates a carry and all intermediary adders propagate a carry. Figure
`4.22 uses plumbing to try to explain carry lookahead.
`Even this simplified form leads to large equations and, hence, considerable
`logic even for a 16-bit adder. Let's try moving to two levels of abstraction.
`
`Fast Carry Using the Second Level of Abstraction
`First we consider this 4-bit adder with its carry-lookahead logic as a single
`building block. If we connect them in ripple carry fashion to form a 16-bit
`adder, the add will be faster than the original with a little more hardware.
`To go faster, we'll need carry lookahead at a higher level. To perform carry
`lookahead for 4-bit adders, we need propagate and generate signals at this
`higher level. Here they are for the four 4-bit adder blocks:
`
`PO = p3 · p2 · p 1 · pO
`Pl = p7 · p6 · pS · p4
`P2 = p 11 · p 10 · p9 · p8
`P3 = p15 · p14 · p13 · p12
`
`That is, the "super" propagate signal for the 4-bit abstraction (Pi) is true only
`if each of the bits in the group will propagate a carry.
`
`INTEL - 1012
`
`
`
`244
`
`Chapter 4 Arithmetic for Computers
`
`4.5 Constructing an Arithmetic Logic Unit
`
`245
`
`CO = g3 + (p3 · g2) + (p3 · p2 · gl) + (p3 · p2 · p1 · gO)
`GI = g7 + (p7 · g6) + (p7 · p6 · g5) + (p7 · p6 · p5 · g4)
`C2 = gll + (p11 · glO) + (pll · p10 · g9) + (pll · p10 · p9 · g8)
`C3 = g15 + (p15 · g14) + (p15 · p14 · g13) + (p15 · p14 · p13 · g12)
`Figure 4.23 updates our plumbing analogy to show PO and GO.
`
`c4
`
`FIGURE 4.22 A plumbing analogy for carry lookahead for 1 bit, 2 bits, and 4 bits using
`water, pipes, and valves. The wrenches are turned to open and close valves. Water is shown in
`color. The output of the pipe (ci+l) will be full if either the nearest generate value (gi) is turned on
`or if the i propagate value (pi) is on and there is water further upstream, either from an earlier
`generate, or propagate with water behind it. Carryln (cO) can result in a carry out without the
`help of any generates, but with the help of nil propagates.
`
`For the "super" generate signal (Gi), we care only if there is a carry out of
`the most significant bit of the 4-bit group. This obviously occurs if generate is
`true for that most significant bit; it also occurs if an earlier generate is true and
`all the intermediate propagates, including that of the most significant bit, are
`also true:
`
`GO
`
`FIGURE 4.23 A plumbing analogy for the next-level carry-lookahead signals PO and GO.
`PO i, open only if all four propagates (pi) are open, while water flows in GO only if at least one
`generate (gi) is open and all the propagates downstream from that generate arc open.
`
`INTEL - 1012
`
`
`
`246
`
`Chapter 4 Arithmetic for Computers
`
`4.5 Constructing an Arithmetic Logic Unit
`
`247
`
`Carryln
`I
`!
`
`Carryln
`
`ALUO
`PO
`GO
`
`!
`Carryln
`
`ALUl
`Pl
`Gl
`
`l
`
`Carryln
`
`ALU2
`P2
`G2
`
`l
`
`Carryln
`
`ALU3
`P3
`G3
`
`ao (cid:173)
`bo (cid:173)
`al (cid:173)
`bl (cid:173)
`a2 -
`b2 -
`a3 -
`b3 -
`
`a4 -
`b4 -
`a5 -
`b5 -
`a6 -
`b6 -
`a7 -
`b7 -
`
`a8 -
`b8 -
`a9 -
`b9 -
`a1O (cid:173)
`b10 -
`a11 -
`bll -
`
`a12 -
`b12 -
`a13 -
`b13 -
`a14 -
`b14 -
`a15 -
`b15 -
`
`pi
`gi
`
`Cl
`
`ci + 1
`
`pi+ 1
`gi+ 1
`
`C2
`
`ci+ 2
`
`pi+ 2
`gi+ 2
`
`C3
`
`Ci+ 3
`
`pi+ 3
`gi + 3
`
`ci + 4
`
`C4
`
`ResultO- 3
`
`C arry-lookahead unit
`
`Result4-7
`
`Result8- 11
`
`Result12- 15
`
`Carryout
`
`FIGURE 4.24 Four 4-blt ALUs using carry lookahead to form a 16-blt adder. Note that the
`carries come from the carry-lookahead unit, not from the 4-bit AL Us.
`
`Then the equations at this higher level of abstraction for the carry in for each
`4-bit group of the 16-bit adder (Cl, C2, C3, C4 in Figure 4.24) are very similar
`to the carry out equations for each bit of the 4-bit adder (cl, c2, c3, c4) on page
`243:
`
`Cl = GO+ (PO · c0 )
`
`C2 = Gl + (Pl · GO ) + (Pl · PO· c0 )
`C3 = G2 + (P2 · Gl) + (P2 · Pl · GO)+ (P2 · Pl · PO· c0)
`
`C4 = G3 + (P3 · G2) + (P3 · P2 · Gl) + (P3 · P2 ·Pl · GO)
`+ (P3 · P2 · Pl · PO · c0 )
`
`Figure 4.24 shows 4-bit adders connected with such a carry lookahead unit.
`Exercises 4.44 through 4.48 explore the speed differences between these carry
`schemes, different notations for multibit propagate and generate signals, and
`the design of a 64-bit adder.
`
`Example
`
`Answer
`
`Both Levels of the Propagate and Generate
`
`Determine the gi, pi, Pi, and Gi values of these two 16-bit numbers:
`a :
`0001 1010 0011 OOlltwo
`b:
`1110 0101 1110 lOlltwo
`Also, what is CarryOutlS (C4)?
`
`Aligning the bits makes it easy to see the values of generate gi (ai · bi) and
`propagate pi (ai +bi) :
`a :
`0001 1010 0011 0011
`b:
`1110 0101 1110 1011
`0000 0000 0010 0011
`gi :
`1111 1111 1111 1011
`pi :
`where the bits are numbered 15 to 0 from left to right. Next, the "super"
`propagates (P3, P2, Pl , PO) are simply the AND of the lower-level propa(cid:173)
`gates:
`P3 = 1 . I. I. I =
`I =
`P2 = 1 . I
`I
`I =
`Pl = I
`I
`I = 0
`PO = l · 0 · I
`
`I
`
`INTEL - 1012
`
`
`
`248
`
`Chapter 4 Arithmetic for Computers
`
`4.5 Constructing an Arithmetic Logic Unit
`
`249
`
`The "super" generates are more complex, so use the following equations:
`GO = g3 + (p3 · g2) + (p3 · p2 · gl) + (p3 · p2 · pl · g0)
`= 0 + (1 · 0) + (1 · 0 · 1) + (1 · 0 · 1 · 1) = 0 + 0 + 0 + 0 = 0
`GI = g7 + (p7 · g6) + (p7 · p6 · g5) + (p7 · p6 · p5 · g4)
`= 0 + (1 · 0) + (1 · 1 · 1) + (1 · 1 · 1 · 0) = 0 + 0 + 1 + 0 = 1
`G2 = gll + (pll · glO) + (pll · plO · g9) + (pll · plO · p9 · g8)
`0 + (1 · 0) + (1 · 1 · 0) + (1 · 1 · 1 · 0) = 0 + 0 + 0 + 0 = 0
`G3 = g15 + (p15 · g14) + (p15 · p14 · g13) + (p15 · p14 · p13· g12)
`= 0 + (1 · 0) + (1 · 1 · 0) + (1 · 1 · 1 · 0) = 0 + 0 + 0 + 0 = 0
`Finally, CarryOut15 is
`C4 = G3 + (P3 · G2) + (P3 · P2 · Gl) + (P3 · P2 · Pl · GO)
`+ (P3 · P2 · Pl ·PO· c0)
`= 0 + (1 · 0) + (1 · 1 · 1) + (1 · 1 · 1 · 0) + (1 · 1 · 1 · 0 · 0)
`= 0+0+1+0+0=1
`Hence there is a carry out when adding these two 16-bit numbers.
`
`The reason carry lookahead can make carries faster is that all logic begins
`evaluating the moment the clock cycle begins, and the result will not change
`once the output of each gate stops changing. By taking a shortcut of going
`through fewer gates to send the carry in signal, the output of the gates will
`stop changing sooner, and hence the time for the adder can be less.
`To appreciate the importance of carry lookahead, we need to calculate the
`relative performance between it and ripple carry adders.
`
`Example
`
`Speed of Ripple Carry versus Carry Lookahead
`
`One simple way to model time for logic is to assume each AND or OR gate
`takes the same time for a signal to pass through it. Time is estimated by
`simply counting the number of gates along the longest path through a
`piece of logic. Compare the number of gate delays for the critical paths of
`two 16-bit adders, one using ripple carry and one using two-level carry
`lookahead.
`
`Figure 4.13 on page 233 shows that the carry out signal takes two gate de(cid:173)
`lays per bit. Then the number of gate delays between a carry in to the least
`significant bit and the carry out of the most significant is 16 x 2 = 32.
`
`For carry lookahead, the carry out of the most significant bit is just C4,
`defined in the example. It takes two levels of logic to specify C4 in terms
`of Pi and Gi (the OR of several AND terms) . Pi is specified in one level of
`logic (AND) using pi, and Gi is specified in two levels using pi and gi, so
`the worst case for this next level of abstraction is two levels of logic. pi and
`gi are each one level of logic, defined in terms of ai and bi. If we assume
`one gate delay for each level of logic in these equations, the worst case is
`2 + 2 + I = 5 gate delays.
`Hence for 16-bit addition a carry-lookahead adder is six times faster,
`using this simple estimate of hardware speed.
`
`Summary
`
`The primary point of this section is that the traditional ALU can be con(cid:173)
`structed from a multiplexor and a few gates that are replicated 32 times. To
`make it more useful to the MIPS architecture, we expand the traditional ALU
`with hardware to test if the result is 0, detect overflow, and perform the basic
`operation for set on less than.
`Carry lookahead offers a faster path than waiting for the carries to ripple
`through all 32 1-bit adders. This faster path is paved by two signals, generate
`and propagate. The former creates a carry regardless of the carry input, and the
`other passes a carry along. Carry lookahead also gives another example of how
`abstraction is important in computer design to cope with complexity.
`
`Elaboration: We have now accounted for all but one of the arithmetic and logical
`operations for the core MIPS instruction set: the ALU in Figure 4 .21 omits support of
`shift instructions. It would be possible to widen the ALU multiplexor to include a left
`shift by 1 bit or right shift by 1 bit. But hardware designers have created a circuit called
`a barrel shifter, which can shift from 1 to 31 bits in no more time than it takes to add
`two 32-bit numbers, so shifting is normally done outside the ALU.
`
`Elaboration: The logic equation for the Sum output of the full adder on page 234 can
`be expressed more simply by using a more powerful gate than AND and OR. An exclu(cid:173)
`sive OR gate is true if the two operands disagree; that is,
`x ic- y (cid:157)
`1 and x == y (cid:157) 0
`In some technologies , exclusive OR is more efficient than two levels of AND and OR
`gates. Using the symbol EB to represent exclusive OR, here is the new equation:
`
`Sum= a EB b EB Carryln
`
`Also, we have drawn the ALU the traditional way, using gates. Computers are
`designed today in CMOS transistors , which are basically switches. CMOS ALU and bar(cid:173)
`rel shifters take advantage of these switches and have many fewer multiplexors than
`shown in our designs, but the design principles are similar.
`
`INTEL - 1012
`
`
`
`250
`
`Chapter 4 Arithmetic for Computers
`
`II Multiplication
`
`Multiplication is vexation,
`Division is as bad;
`The rule of three doth puzzle me,
`And practice drives me mad.
`
`Anonymous, Elizabethan manuscript, 1570
`
`With the construction of the ALU and explanation of addition, subtraction,
`and shifts, we are ready to build the more vexing operation of multiply.
`But first let's review the multiplication of decimal numbers in longhand to
`remind ourselves of the steps and the names of the operands. For reasons that
`will become clear shortly, we limit this decimal example to using only the dig(cid:173)
`its O and 1. Multiplying lOOOten by lOOlten:
`
`X
`
`Multiplicand
`Multiplier
`
`l000ten
`l00lten
`1000
`0000
`0000
`1000
`Product
`1001 000ten
`The first operand is called the multiplicand and the second the multiplier. The
`final result is called the product. As you may recall, the algorithm learned in
`grammar school is to take the digits of the multiplier one at a time from right
`to left, multiplying the multiplicand by the single digit of the multiplier and
`shifting the intermediate product one digit to the left of the earlier inter(cid:173)
`mediate products.
`The first observation is that the number of digits in the product is consider(cid:173)
`ably larger than the number in either the multiplicand or the multiplier. In fact,
`if we ignore the sign bits, the length of the multiplication of an n-bit multipli(cid:173)
`cand and an m-bit multiplier is a product that is n + m bits long. That is, n + m
`bits are required to represent all possible products. Hence, like add, multiply
`must cope with overflow because we frequently want a 32-bit product as the
`result of multiplying two 32-bit numbers.
`In this example we restricted the decimal digits to O and 1. With only two
`choices, each step of the multiplication is simple:
`
`1. Just place a copy of the multiplicand (1 x multiplicand) in the proper
`place if the multiplier digit is a 1, or
`
`2. Place O (0 x multiplicand) in the proper place if the digit is 0.
`
`4.6 Multiplication
`
`251
`
`Although the decimal example above happened to use only O and 1, multipli(cid:173)
`cation of binary numbers must always use O and 1, and thus always offers
`only these two choices.
`Now that we have reviewed the basics of multiplication, the traditional next
`step is to provide the highly optimized multiply hardware. We break with tra(cid:173)
`dition in the belief that you will gain a better understanding by seeing the evo(cid:173)
`lution of the multiply hardware and algorithm through three generations. The
`rest of this section presents successive refinements of the hardware and the
`algorithm until we have a version used in some computers. For now, let's
`assume that we are multiplying only positive numbers.
`
`First Version of the Multiplication Algorithm and Hardware
`The initial design mimics the algorithm we learned in grammar school; the
`hardware is shown in Figure 4.25. We have drawn the hardware so that data
`flows from top to bottom to more closely resemble the paper-and-pencil
`method.
`Let's assume that the multiplier is in the 32-bit Multiplier register and that
`the 64-bit Product register is initialized to 0. From the paper-and-pencil exam(cid:173)
`ple above, it's clear that we will need to move the multiplicand left one digit
`each step as it may be added to the intermediate products. Over 32 steps a
`
`-Multiplicand
`
`Shift left
`
`-Multiplier
`
`Shift right
`
`32 bits
`
`Write,__ .....
`
`Control test
`
`64-bit ALU
`
`Product
`
`64 bits
`
`FIGURE 4.25 First version of the multiplication hardware. The Multiplicand register, ALU,
`and Product register are all 64 bits wide, with only the Multiplier register containing 32 bits. The
`32-bit multiplicand starts in the right half of the Multiplicand register, and is shifted left 1 bit on
`each step. The multiplier is shifted in the opposite direction at each step. The algorithm starts
`with the product initialized to 0. Control decides when to shift the Multiplicand and Multiplier
`registers and when to write new values into the Product register.
`
`INTEL - 1012
`
`
`
`252
`
`Chapter 4 Arithmetic for Computers
`
`4.6 Multiplication
`
`253
`
`32-bit multiplicand would move 32 bits to the left. Hence we need a 64-bit Mul(cid:173)
`tiplicand register, initialized with the 32-bit multiplicand in the right half and
`0 in the left half. This register is then shifted left 1 bit each step to align the mul(cid:173)
`tiplicand with the sum being accumulated in the 64-bit Pro~uct register. .
`.
`Figure 4.26 shows the three basic steps needed for each bit. The l~ast s1gm~(cid:173)
`icant bit of the multiplier (Multiplier0) determines whether the multiplicand 1s
`
`added to the Product register. The left shift in step 2 has the effect of moving
`the intermediate operands to the left, just as when multiplying by hand. The
`shift right in step 3 gives us the next bit of the multiplier to examine in the fol(cid:173)
`lowing iteration. These three steps are repeated 32 times to obtain the
`product.
`
`Start
`
`First Multiply Algorithm
`
`Using 4-bit numbers to save space, multiply 2ten x 3tew or 0010two
`X 0011two·
`
`Figure 4.27 shows the value of each register for each of the steps labeled
`according to Figure 4.26, with the final value of 0000 0110two or 6ten· Color
`is used to indicate the register values that change on that step, and the bit
`circled is the one examined to determine the operation of the next step.
`
`~•~1~M~- ~a~.1~.•~~~~~~~~~~~~~~~~~~~~~~~•~x~M~1~™~·~1m~•~~~~~~~-MMM
`1----o __ t--ln_it_ia_l v_a_lu_e_s ______ --+ __ 0_0_1_(JJ _+--_0_0_0_0_0_0_1_0~
`0000 0~
`la: 1 (cid:157)
`Prod = Prod + Mcand
`0000 0010
`1
`0011
`0000 0010
`j
`f-2- :_S_h-ift- le_ft_M_u-lt-ip-lic_a_n_d _ ____ 0_0_1_1_-+-~00~0~0- 0~1~0~0--+-' 0000 0010
`
`1j
`
`
`
`Step
`
`0000 0010
`00000110~
`
`J
`I
`
`MultiplierO = 1
`
`MultiplierO = 0
`
`la. Add multiplicand to product and
`place the result in Product register
`
`2 . Shift the Multiplicand register left 1 bit
`
`3 . Shift the Multiplier register right 1 bit
`
`No: < 32 repetitions
`
`32 repetitions
`
`(_oo_ne )
`
`FIGURE 4.26 The first multiplication algorithm, using the hardware shown in Figure 4.25.
`If the least significant bit of the multiplier is 1, add the multiplicand to the product. If not, go to
`the next step. Shift the multiplicand left and the multiplier right in the next two steps. These three
`steps are repeated 32 times.
`
`OOLQJ
`3: Shift right Multiplier
`0000 0100
`~ - - - t - - - - - - - - - - - - - - - t - - - - - -+ - - - - - -~
`la: 1 (cid:157) Prod = Prod + Mcand
`2
`0001
`0000 0100
`2: Shift left Multiplicand
`0001
`_ 0000 1000
`/
`0000 0110
`0000 ~~~~ ~-
`0000 0110
`0000 ~
`0000 0110
`0001 0000
`0000 0110
`0000 0110
`0000 0110
`0000 0110
`0000 0110
`
`-
`
`3
`
`4
`
`3: Shift right Multiplier
`1: 0 (cid:157)
`no operation
`2: Shift left Multiplicand
`3: Shift right Multiplier
`1: O (cid:157)
`no operation
`2: Shift left Multiplicand
`3: Shift right Multiplier
`
`00~
`0000
`0000
`OOl\;)
`
`0000
`0000
`0000
`
`-~
`
`I
`I
`
`0001 0000
`
`00010~~
`001000~ I
`1
`0010 0000
`- - - I
`
`FIGURE 4.27 Multiply example using first algorithm in Figure 4.26. The bit exilmined to
`determine the next step is circled in color.
`
`If each step took a clock cycle, this algorithm would require almost 100 clock
`cycles to multiply two 32-bit numbers. The relative importance of arithmetic
`operations like multiply varies with the program, but addition and subtraction
`may be anywhere from 5 to 100 times more popular than multiply. According(cid:173)
`ly, in many applications, multiply can take multiple clock cycles without sig(cid:173)
`nificantly affecting performance. Yet Amdahl's law (see Chapter 2, page 75)
`reminds us that even a moderate frequency for a slow operation can limit
`performance.
`
`INTEL - 1012
`
`
`
`254
`
`Chapter 4 Arithmetic for Computers
`
`4.6 Multiplication
`
`255
`
`Multiplicand
`
`Shift right ,_ _ _
`Write ..---...
`
`Product
`
`64 bits
`
`-Multiplier
`
`Shift right
`
`32 bits
`
`FIGURE 4.28 Second version of the multiplication hardware. Compare with the first ver(cid:173)
`sion in Figure 4.25. The Multiplicand register, ALU, and Multiplier register are all 32 bits wide,
`with only the Product register left at 64 bits. Now the product is shifted right. These changes are
`highlighted in color.
`
`Second Version of the Multiplication Algorithm
`and Hardware
`Computer pioneers recognized that half of the bits of the multiplicand in the
`first algorithm were always 0, so only half could contain useful bit values. A
`full 64-bit ALU thus seemed wasteful and slow since half of the adder bits
`were adding Oto the intermediate sum.
`The original algorithm shifts the multiplicand left with Os inserted in the
`new positions, so the multiplicand cannot affect the least significant bits of the
`product after they settle down. Instead of shifting the multiplicand left, they
`wondered, what if we shift the product right? Now the multiplicand would be
`fixed relative to the product, and since we are adding only 32 bits, the adder
`need be only 32 bits wide. Figure 4.28 shows how this change halves the
`widths of both the ALU and the multiplicand.
`Figure 4.29 shows the multiply algorithm inspired by this observation. This
`algorithm starts with the 32-bit Multiplicand and 32-bit Multiplier registers set
`to their named values and the 64-bit Product register set to 0. This algorithm
`only forms a 32-bit sum, so only the left half of the 64-bit Product register is
`changed by the addition.
`
`Start
`
`MultiplierO = 1
`
`MultiplierO = 0
`
`1a. Add multiplicand to the left half of
`the product and place the result in
`the left half of the Product register
`
`2. Shift the Product register right 1 bit
`
`3. Shift the Multiplier register right 1 bit
`
`Done
`
`FIGURE 4.29 The second multiplication algorithm, using the hardware In Figure 4.28. Jn
`this vers10n, the Product register is shifted right instead of shifting the multiplicand. Color type
`shows the changes from Figure 4.26.
`
`INTEL - 1012
`
`
`
`256
`
`Chapter 4 Arithmetic for Computers
`
`4.6 Multiplication
`
`257
`
`Second Multiply Algorithm
`
`Multiply 0010two X 001ltwousing the algorithm in Figure 4.29.
`
`Figure 4.30 shows the revised 4-bit example, again giving a product of
`0000 0110two·
`
`Multiplicand
`
`32 bits
`
`-Product
`
`64 bits
`
`Shift right .._ _ _,
`Write - - -
`
`Control
`test
`
`FIGURE 4.31 Third version of the multiplication hardware. Comparing with the second ver(cid:173)
`sion in Figure 4.28 on page 254, the separate Multiplier register has disappeared. The multiplier
`is placed instead in the right half of the Product register.
`
`Third Multiply Algorithm
`
`Multiply 0010two x 0011two using the algorithm in Figure 4.32.
`
`Figure 4.33 shows the revised 4-bit example for the final algorithm.
`
`Signed Multiplication
`
`Sofar we have dealt with positive numbers. The easiest way to understand
`how to deal with signed numbers is to first convert the multiplier and multi(cid:173)
`plicand to positive numbers and then remember the original signs. The algo(cid:173)
`rithms should then be run for 31 iterations, leaving the signs out of the
`calculation. As we learned in grammar school, we need negate the product
`only if the original signs disagree.
`It turns out that the last algorithm will work for signed numbers provided
`that we remember that th