`Blomgren et al.
`
`I 1111111111111111 11111 111111111111111 IIIII IIIII IIIII 111111111111111 11111111
`US006334183Bl
`US 6,334,183 Bl
`Dec. 25, 2001
`
`(10) Patent No.:
`(45) Date of Patent:
`
`(54) METHOD AND APPARATUS FOR
`HANDLING PARTIAL REGISTER ACCESSES
`
`(75)
`
`Inventors: James S. Blomgren; Anthony M.
`Petro, both of Austin, TX (US)
`
`(73) Assignee: Intrinsity, Inc., Austin, TX (US)
`
`( * ) Notice:
`
`Subject to any disclaimer, the term of this
`patent is extended or adjusted under 35
`U.S.C. 154(b) by O days.
`
`(21) Appl. No.: 09/195,757
`
`(22)
`
`Filed:
`
`Nov. 18, 1998
`
`Related U.S. Application Data
`Provisional application No. 60/065,643, filed on Nov. 18,
`1997.
`Int. Cl.7 ...................................................... G06F 15/00
`U.S. Cl. ........................... 712/221; 712/223; 708/236
`Field of Search ..................................... 712/221, 222,
`712/223, 224; 708/236, 495, 525, 523,
`490
`
`References Cited
`
`(60)
`
`(51)
`(52)
`(58)
`
`(56)
`
`U.S. PATENT DOCUMENTS
`5,144,576 * 9/1992 Briggs et al. ........................ 708/493
`
`5,204,828 * 4/1993 Kohn .................................... 708/501
`5,524,263
`6/1996 Griffith et al. .
`5,835,394 * 11/1998 Wong ................................... 708/670
`5,880,983 * 3/1999 Elliott et al. ......................... 708/501
`5,930,159 * 7/1999 Wong ................................... 708/550
`5,935,198 * 8/1999 Blomgren ............................. 708/290
`* cited by examiner
`
`Primary Examiner-Eddie Chan
`Assistant Examiner-Gautam R. Patel
`(74) Attorney, Agent, or Firm-Booth & Wright, L.L.P.;
`Matthew J. Booth
`
`(57)
`
`ABSTRACT
`
`The present invention includes a partial register write han(cid:173)
`dler. The write handler receives either two or three operands.
`An execution unit operates on portions of two operands,
`rather than on full operands. The result of the execution unit
`has fewer bits than an "additional" operand, which may be
`any of the two or three operands received by the write
`handler. An output multiplexer receives all of the bits of an
`execution unit result and selected bits of the additional
`operand, and produces an output that has as many bits as the
`additional operand. If the output of the multiplexer is a string
`of bits, the string of bits contains the execution unit result as
`a substring of bits. The remaining bits of the output of the
`multiplexer are selected from the additional operand.
`
`36 Claims, 2 Drawing Sheets
`
`-------------------------------------------------------------------,
`A<l:0> 8<7:0>
`A<31:16> 8<31:16> A< 15:8> 8< 15:8>
`
`168
`
`TOP
`
`166
`
`164
`
`162
`
`-------- ----------'(_160 _______ ---------------------- ------------
`RESULT<31:16>
`RESULT<15:8>
`RESULT<7:0>
`
`ADI) BH,AL- >AL
`ADD BL,AL->AL
`ADD BH,AH->AH
`ADD BL,AH->AH
`ADD BX,AX->AX
`ADD EBX,EAX->EAX
`
`TOP
`pass
`pass
`pass
`pass
`pass
`add
`
`HIGH
`pass
`pass
`add
`don•tcare
`add
`add
`
`MIX-H
`don•tcare
`don•tcare
`don't care
`add
`don•tcare
`don•tcare
`
`MIX-L
`add
`don•tcare
`don•tcare
`don't care
`don•tcare
`don•tcare
`
`LOW
`don•tcare
`add
`pass
`pass
`add
`add
`
`Oracle-1008 p. 1
`Oracle v. Teleputers
`IPR2021-00078
`
`
`
`U.S. Patent
`
`Dec. 25, 2001
`
`Sheet 1 of 2
`
`US 6,334,183 Bl
`
`Register
`File
`
`~
`
`/
`
`~
`
`C
`
`A
`
`buffer ~/
`
`ALU, or
`Arithmetic Logic Unit
`(instruction unit)
`boolean--.
`
`~shifter
`---~adder
`/
`
`"'
`
`160_/
`
`FIG. 1
`
`bypass
`
`/
`'-150
`
`B
`
`\. 102
`
`Oracle-1008 p. 2
`Oracle v. Teleputers
`IPR2021-00078
`
`
`
`I■-
`~
`~
`~
`I■-
`_,.,l;,..
`~
`~
`O'I
`rJ'J.
`e
`
`N
`0 ....,
`N
`~ ....
`'JJ. =(cid:173)~
`
`'"""'
`0
`0
`N
`~Ul
`N
`ri
`~
`~
`
`~ = ......
`~ ......
`~
`•
`r:JJ.
`d •
`
`add
`add
`pass
`pass
`add
`don't care
`LOW
`
`don't care
`don't care
`don't care
`don't care
`don't care
`add
`MIX-L
`
`don't care
`don't care
`add
`don't care
`don't care
`don't care
`MIX-H
`
`FIG. 2
`
`add
`add
`don't care
`add
`pass
`pass
`HIGH
`
`add
`pass
`pass
`pass
`pass
`pass
`TOP
`
`ADD EBX,EAX->EAX
`ADD BX,AX->AX
`ADD BL,AH->AH
`ADD BH,AH->AH
`ADD BL,AL->AL
`ADD BH,AL->AL
`
`RESULT<l:0>
`
`RESULT<15:8>
`
`RESULT<31:16>
`
`~--------~---------""'--160-------~----------------------+-------------
`
`.
`
`I
`
`162 i
`
`/
`
`172
`
`174
`
`164
`
`166
`
`170
`
`TOP
`
`-------------------------------------------------------------------,
`
`A<7:0> 8<7:0>
`
`A<31:16> 8<31:16> A< 15:8> 8< 15:8>
`
`Oracle-1008 p. 3
`Oracle v. Teleputers
`IPR2021-00078
`
`
`
`US 6,334,183 Bl
`
`1
`METHOD AND APPARATUS FOR
`HANDLING PARTIAL REGISTER ACCESSES
`
`This application claims the benefit of the earlier filed
`U.S. Provisional Application Ser. No. 60/065,643, filed Nov.
`18, 1997, which is incorporated by reference for all purposes
`into this application.
`
`BACKGROUND OF THE INVENTION
`
`1. Field of the Invention
`The present invention relates to the field of computer
`operation, and more specifically to register files in a com(cid:173)
`puter.
`2. Description of the Related Art
`The original 8086 used eight general purpose registers in
`a sequential processor. Each instruction was received and
`was processed one at a time. However, as processors have
`become pipelined to achieve higher clock rates, dealing with
`instructions has become more problematic. Pipelined
`instructions have added complexity in dealing with bypassed
`operands, and have made handling partial registers more
`difficult. Pipelined processors have been able to process
`instructions out of order, "shelving" instructions until all
`operands needed for the operation are valid.
`When an instruction that requires operands is received,
`the pipelined processor has typically responded by examin(cid:173)
`ing the register file to determine whether all of the operands
`are available. If some operands are not available, the instruc(cid:173)
`tion is "shelved" until the operands become available. An
`instruction may remain shelved for several clock cycles. For
`example, this could occur when operands missing from an
`instruction must be read from a memory having a high
`latency. While an instruction is shelved, however, later
`instructions received by the processor may attempt to update
`the values of the operands that were available in the shelved
`instruction. Consequently, performing operations out of
`order can be a complicated task, as updates to operand
`values require complicated register file management.
`Similarly, instructions have been allowed to produce
`intermediate values, which have not been ready to be written
`to the register file. "Intermediate" values refer to those
`values generated by an execution unit such an arithmetic
`logic unit (ALU), but which may not be complete. When an
`instruction that has generated a result is not yet known to be
`on the execution path, the instruction is not yet known to be
`complete. For example, consider an instruction to add the
`values of AX and BX and place the result in AX. The value
`of BX may be available, but the original value of AX may 50
`not be available. In such a case, the instruction is not
`complete; an operand is still necessary to complete the
`operation. The result, therefore, is "intermediate" in that it
`depends on a yet-to-be-determined value. Such intermediate
`values may not be written to the register file until the 55
`instruction is known to be complete.
`Instructions are not known to be complete until all older
`instructions in the instruction path have completed and have
`not encountered any exceptional conditions. However, pipe(cid:173)
`lined processors have generally been permitted to execute 60
`instructions on intermediate data, and then later to determine
`that the intermediate data is valid and may be written to the
`register file. Pipelined processors have therefore achieved
`much of their speed performance by allowing results to be
`used before those results are committed to the register file. 65
`When a processor has received updates for some, but not
`all, of the operands needed for a particular operation, the
`
`30
`
`2
`pipelined processor has required additional logic to prevent
`use of "old" values for the missing operands. Handling
`instructions out of order can greatly complicate this task.
`Moreover, when partial register handling has been
`5 supported, updates to a portion of a register have compli(cid:173)
`cated handling the register, where portions of the register
`have been valid and portions of the register have been
`invalid.
`Intermediate results have therefore been held in a pending
`10 state until written to the register file. The pending state has
`been implemented as an additional "pending" file somewhat
`structurally similar to, and in some implementations larger
`than, the register file. Bypassing, or reading intermediate
`values from the pending state, has allowed pipelined pro-
`15 cessors to execute instructions before determining that the
`instructions that had generated those intermediate values are
`complete.
`Typically, the pending state has been implemented as a
`number of stages along a pipeline. Data has been copied
`20 from one stage to the next until, at the final stage, that data
`has been written to the register file. At each point along the
`pipeline, data has been available to "younger" instructions,
`with more recent updates to the register value being written
`to the first stage of the pipeline. Instructions have been able
`25 to select values from the various stages of the pipeline, and
`from the register file.
`To handle the relatively large number of sources (the
`various stages of the pipeline and the register file) from
`which an instruction can read values of an operand, a bypass
`multiplexer ("bypass multiplexer") has typically been
`included. The implementation of the bypass multiplexer,
`however, is different in many CISC architectures than it is in
`RISC architectures, because the former supports partial
`register write operations, while the latter does not.
`The x86 instruction set originally supported eight 16-bit
`general purpose registers, four of which could be divided
`into two 8-bit general purpose registers. Division of registers
`AX, BX, CX, and DX allowed byte-access (8-bit access) to
`40 the upper and lower bytes of these registers. As a result, not
`only have registers AX, BX, CX, and DX been supported,
`but also registers AH, AL, BH, BL, CH, CL, DH, and DL,
`referring to the high-order byte and low-order byte within
`each 16-bit register. The register count effectively increased
`45 to 16 registers: registers AX, BX, CX, DX,AH,AL, BH, BL,
`CH, CL, DH, DL, SP, BP, SI, and DI. Registers SP, BP, SI,
`and DI have not been divided.
`Moreover, with the introduction of the 386, the x86
`architecture grew to support 32-bit registers. However, the
`prevalence of code using the original 16-bit register set
`instruction set necessitated support of both 16-bit and 32-bit
`register sizes. The 386 instruction set allowed access to all
`the aforementioned registers, as well as allowing access to
`"extended" (32-bit) registers. The 16-bit registers were
`considered partial registers of the new 32-bit registers. Each
`instruction was provided with four partial register options:
`the instruction could select the full "extended" register, or
`the "lower" 16-bits of the extended register, or the high(cid:173)
`order byte or the low-order byte of the lower 16-bit register.
`For example, in the 386 instruction set, an instruction was
`permitted to access an extended register, for example EAX.
`Such an access would access bits [31:0] of a 32-bit register.
`Another instruction could access a 16-bit register, for
`example AX. Such an access would access bits [15:0] of the
`same 32-bit register. Another instruction could access an
`8-bit register, for example, AL. Such an access would access
`bits [7:0] of the same 32-bit register. Another instruction
`
`35
`
`Oracle-1008 p. 4
`Oracle v. Teleputers
`IPR2021-00078
`
`
`
`US 6,334,183 Bl
`
`3
`could access another 8-bit register, for example, AX. Such
`an access would access bits [15:8] of the same 32-bit
`register.
`In part due to the problem of handling partial registers in
`a pipelined processor, and in part due to the low cost of 5
`registers, RISC microprocessors have not supported partial
`register write operations. Adding registers has become far
`less expensive than dividing existing registers, particularly
`in light of the added complexity. Therefore, the RISC
`processor historically has operated on registers in their 10
`entirety. Consequently, the bypass multiplexer in a pipelined
`RISC processor configuration has been required only to
`select one of the sources (the various stages of the pipeline
`or the register file) from which an instruction can read values
`of an operand.
`On the other hand, CISC architectures allow a variety of
`portions of the registers to be altered. A pipelined CISC
`processor configuration that allows portions of registers to
`be operated upon has typically required an additional field to
`indicate which portion of the register has been updated
`between one stage and the next. The same result stage
`registers have been used for the result, and the additional
`field has been used to indicate how to write ( commit) the
`result to the register file at the end of the pipeline. Since
`CISC architecture allows a variety of portions of registers to 25
`be altered, the bypass multiplexer must select more than one
`place that an operand might be found, since portions of that
`operand may be generated by different instructions.
`These approaches have proven unsatisfactory. The imple(cid:173)
`mentation of the bypass multiplexer has proven extremely
`complicated in CISC architectures. A particularly problem(cid:173)
`atic example is the situation in which multiple instructions
`have written to different portions of a register, and then an
`instruction requiring the value of the register is encountered.
`In such a case, various stages of the pipeline contain
`different portions of the register value to be used. According
`to one approach, the instruction requiring the value of the
`register has caused the bypass multiplexer to select from the
`relatively large number of sources of values. According to
`another approach, the instruction requiring the value of the
`register has been stalled until enough of the results have
`been written that the operand comes from a single result
`register or from the register file itself. According to still
`another approach, a combination of complex multiplexing 45
`and stalling has been used.
`
`20
`
`4
`to receive a second operand having a second plurality of bits.
`The third input is configured to receive a third operand
`having a third plurality of bits.
`The exemplary embodiment of the present invention also
`includes an execution unit. The execution unit is configured
`to perform an operation upon a portion of the first operand
`and upon a portion of the second operand. The execution
`unit is also configured to provide a result having a number
`of bits no greater than the number of bits of the third
`operand.
`The exemplary embodiment of the present invention also
`includes a multiplexer, configured to receive all of the bits
`of the execution unit result and all of the bits of the third
`operand. The multiplexer is also configured to select a
`15 portion of the third operand, and to provide a multiplexer
`output. The multiplexer output comprises the selected por(cid:173)
`tion of the third operand and the execution unit result. Once
`the multiplexer output is produced, it replaces the third
`operand.
`In some embodiments, the first operand and the third
`operand are the same operand. The execution unit is also
`configured to provide a result having a number of bits no
`greater than the number of bits of the third operand. The
`multiplexer output comprises the selected portion of the first
`operand and the execution unit result. Once the multiplexer
`output is produced, it replaces the first operand.
`In some embodiments, the first and second operands ( and,
`when a third operand is included, the third operand) all have
`30 equal numbers of bits. Moreover, the bits of each of the
`operands are numbered, and there is a one-to-one correspon(cid:173)
`dence between the bits of the various operands. However, in
`some embodiments, the selected portion of the first operand
`and the selected portion of the second operand contain bits
`35 that correspond to one another; for example, the selected
`portion of the first operand may include bits O to 7 of the first
`operand, and the selected portion of the second operand may
`include bits O to 7 of the second operand. In other
`embodiments, no bit in the selected portion of the first
`40 operand corresponds to any bit in the selected portion of the
`second operand; for example, the selected portion of the first
`operand may include bits O and 7 of the first operand, and the
`selected portion of the second operand may include bits 8 to
`15 of the second operand.
`Although there is a one-to-one correspondence between
`the bits of the third operand and the bits of the first and
`second operands, the selected portion of the third operand
`includes all of the bits of the third operand except those that
`correspond to a bit in either of the first and second operands.
`For example, when the selected portion of the first operand
`includes bits O to 7 of the first operand, and the selected
`portion of the second operand includes bits O to 7 of the
`second operand, the selected portion of the third operand
`includes bits 8 to 31 of the third operand. When the selected
`55 portion of the first operand includes the bits O to 7 of the first
`operand, and the selected portion of the second operand
`includes bits 8 to 15 of the second operand, then the bits
`selected from the third operand are the bits which are not
`provided from the execution unit.
`In some embodiments, when the execution unit is con-
`figured to perform an operation, the operation performed is
`an ADD operation performed by the arithmetic logic unit
`within the execution unit. The ADD operation produces a
`sum of the selected portion of the first operand and the
`65 selected portion of the second operand. In some such
`embodiments, the multiplexer is configured to provide bits
`0 to 7 of the third operand as bits O to 7 of the multiplexer
`
`SUMMARY OF THE INVENTION
`
`The present invention includes a partial register write
`handler. The write handler receives either two or three 50
`operands. An execution unit operates on portions of two
`operands, rather than on full operands. The result of the
`execution unit has fewer bits than an "additional" operand,
`which may be any of the two or three operands received by
`the write handler. An output multiplexer receives all of the
`bits of an execution unit result and selected bits of the
`additional operand, and produces an output that has as many
`bits as the additional operand. If the output of the multi(cid:173)
`plexer is a string of bits, the string of bits can contain the
`execution unit result as a substring of bits, and the remaining 60
`bits of the output of the multiplexer are selected from the
`additional operand.
`The present invention includes, according to an exem(cid:173)
`plary embodiment, an apparatus that comprises a first input
`for receiving a first operand, a second input, and a third
`input. The first input is configured to receive a first operand
`having a first plurality of bits. The second input is configured
`
`Oracle-1008 p. 5
`Oracle v. Teleputers
`IPR2021-00078
`
`
`
`US 6,334,183 Bl
`
`6
`register may be located. Tables 1-5 illustrate how one
`embodiment of the present invention handles a read instruc(cid:173)
`tion that attempts to read at least some of the bits of a
`register, but which follows a previous instruction that has
`written to at least some of the bits of a register.
`Table 1 describes a write-then-read situation. The "B" in
`Table 1 indicates that the value can be bypassed. In Table 1,
`a register value is written to a register by an execution unit
`performing an instruction, and then immediately read from
`10 the first pipeline stage by the immediately following instruc(cid:173)
`tion. In other words, the most recently generated instance of
`the register value is read. As shown in Table 1, in such a
`situation, the register file can be bypassed. Bypassing the
`register file indicates that the instruction that generated the
`15 value is complete, and that other instructions may use the
`value.
`
`TABLE 1
`
`Partial register written-to
`by the first instruction
`
`EAX
`
`AX
`
`AH
`
`AL
`
`Partial register
`read by the second
`instruction
`
`EAX
`AX
`AH
`AL
`
`B
`
`B
`
`B
`
`B
`
`Bypassed values are still written to the register file, as
`intermediate values before they are written to the register
`30 file.
`Table 2 describes a different write-then-read situation. In
`Table 2, a portion of the register value is generated by a first
`instruction, and then a portion of that portion is needed by
`a second instruction. In such a case, the number of bits being
`written and the number of bits thereafter being read are
`unequal, but the bits being read by the second instruction are
`assuredly the updated value. In other words, Table 2 illus(cid:173)
`trates a situation in which the read instruction "wants" a
`smaller portion of the data that was produced by the write.
`For example, if an instruction that reads AX follows an
`instruction that writes all of EAX, then the read instruction
`receives only updated bits.
`Table 2 encompasses the Table 1 case, i.e., where the same
`register is written and then read. Table 2 also adds the
`situation in which a portion of the register value is generated
`by a first instruction, and then a portion of that portion is
`needed by a second instruction. Write operations to EAX or
`AX cannot be followed by a read operation from AH, due to
`bit alignment uncertainties.
`
`45
`
`5
`output, the sum as bits 8 to 15 of the multiplexer output, and
`bits 16 to 31 of the third operand as bits 16 to 31 of the
`multiplexer output. In other such embodiments, the multi(cid:173)
`plexer is configured to provide the sum as bits O to 7 of the
`multiplexer output, and bits 8 to 31 of the third operand as 5
`bits 8 to 31 of the multiplexer output.
`In still other embodiments of the present invention, a
`system includes a register file containing a first plurality of
`registers. The first plurality of registers includes a first
`register having a first number of bits, and a second register
`having a second number of bits, and a third register having
`a third number of bits. The system also includes a pending
`file containing a second plurality of registers. The second
`plurality of registers includes at least one register corre(cid:173)
`sponding to the first register, at least one register corre(cid:173)
`sponding to the second register, and at least one register
`corresponding to the third register.
`In such embodiments, the system also includes a bypass
`multiplexer configured to select a first operand either from
`the first register or from one of the at least one register 20
`corresponding to the first register in the pending file. The
`bypass multiplexer configured to select a second operand
`either from the second register or from one of the at least one
`register corresponding to the second register in the pending
`file. The system also includes an execution unit, configured 25
`to perform an operation upon a portion of the first operand
`and upon a portion of the second operand. The execution
`unit is further configured to provide a result having a number
`of bits no greater than the number of bits of the third register.
`In such embodiments, the system also includes an output
`multiplexer that receives all of the bits of the execution unit
`result, and selects a third operand either from the third
`register or from one of the at least one register corresponding
`to the third register in the pending file. However, in some of 35
`these embodiments, the third register is actually the first
`register. The output multiplexer also selects a portion of the
`third operand having a number of bits equal to the difference
`between the number of bits in the third register and the
`number of bits in the execution unit result. Accordingly, the 40
`multiplexer provides a multiplexer output comprising the
`selected portion of the third operand and the execution unit
`result, and provides the multiplexer output to either the third
`register or to one of the registers corresponding to the third
`register in the pending file.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`FIG. 1 shows a partial write handler having both a bypass
`multiplexer and an output multiplexer, according to one
`embodiment of the present invention.
`FIG. 2 shows a destination merging adder residing within
`a write handler according to the second embodiment of the
`present invention.
`
`50
`
`TABLE 2
`
`Partial register written-to
`by the first instruction
`
`EAX
`
`AX
`
`AH
`
`AL
`
`Partial register
`read by the second
`instruction
`
`EAX
`AX
`AH
`AL
`
`B
`B
`
`B
`
`B
`
`B
`
`B
`
`B
`
`DETAILED DESCRIPTION OF THE
`INVENTION
`
`55
`
`The present invention comprises partial register access
`handler. This disclosure describes numerous specific details
`that include specific structures, circuits, and logic functions 60
`in order to provide a thorough understanding of the present
`invention. One skilled in the art will appreciate that one may
`practice the present invention without these specific details.
`According to one embodiment of the present invention, a
`partial write handler allows the bypass multiplexer to com(cid:173)
`bine a variety of portions of the registers, despite the large
`number of locations in which various pipelined stages of the
`
`Table 3 shows the write-then-read situation where either
`the write operation writes data to register AH, or the read
`operation reads from AH, or both. When AH is written, data
`65 is provided to bits [15:8] of register EAX. However, the
`remaining bits, bits [7:0] and bits [31:16], are not up
`Consequently, reading from EAX or AX would require
`
`Oracle-1008 p. 6
`Oracle v. Teleputers
`IPR2021-00078
`
`
`
`US 6,334,183 Bl
`
`7
`obtaining bits [15:8] from the updated value provided by the
`write command, and also obtaining other bits from another
`location. While this could be done by a complex multiplexer,
`generally, this is performed by stalling until all the necessary
`bits have been updated. Stalling, or interlocking, is indicated 5
`by an "I" in Table 3.
`
`TABLE 3
`
`Partial register written-to
`by the first instruction
`
`EAX
`
`AX
`
`AH
`
`AL
`
`Partial register
`read by the second
`instruction
`
`EAX
`AX
`AH
`AL
`
`B
`B
`
`B
`
`B
`
`B
`
`B
`
`B
`
`Referring now to Table 4, when an entire different portion
`of a register is written than is being requested, then the
`operand is obtained from the register file rather than from the
`updated bits. This is indicated by the letter "R" in Table 4.
`The updating of the bits that are not requested is irrelevant
`to the read instruction.
`
`TABLE 4
`
`Partial register written-to
`by the first instruction
`
`EAX
`
`AX
`
`AH
`
`AL
`
`Partial register
`read by the second
`instruction
`
`EAX
`AX
`AH
`AL
`
`B
`B
`
`B
`
`B
`
`B
`
`B
`R
`
`R
`B
`
`Referring now to Table 5, when more of the register is
`read than had been written, the bypass multiplexer must
`combine bits from different sources, or stall. Writing to AX
`and then reading EAX requires the read instruction to obtain
`bits [15:0] from the updated AX, and the remaining bits
`[31:16] from a pipeline stage representing a previous write
`instance of the register value. Similarly, writing to AL and
`then reading from AX or EAX requires the read instruction
`to obtain bits [7:0] from the updated AL, and the remaining
`bits [15:8] (in the case of reading AX) or [31:8] (in the case
`of reading EAX) from a pipeline stage representing a
`previous write instance of the register value. While this
`could be done by a complex bypass multiplexer, generally
`pipeline processors merely stall on such instructions until all
`the necessary bits have been updated. Stalling, or
`interlocking, is indicated by an "I" in Table 5.
`
`TABLE 5
`
`Partial register written-to
`by the first instruction
`
`EAX
`
`AX
`
`AH
`
`AL
`
`Partial register
`read by the second
`instruction
`
`EAX
`AX
`AH
`AL
`
`B
`B
`
`B
`
`B
`
`B
`
`B
`R
`
`R
`B
`
`When complex bypass multiplexers are used in a pipeline
`processor, the additional internal interconnect requirements
`of the control hardware increase by a factor of three, since
`there are three portions of a register which can be indepen(cid:173)
`dently written.
`
`8
`Mapping Register File Bits to an Output Multiplexer
`FIG. 1 shows one embodiment of the present invention
`comprising a partial write handler having both a bypass
`multiplexer 150 and an output multiplexer 160. The bypass
`multiplexer 150 provides an input to an execution unit 102
`such as an arithmetic logic unit (ALU). The execution unit
`102 receives and executes an instruction. The instruction is
`selected from an instruction set that allows access to partial
`registers AH, AL, BH, BL, CH, CL, DH, and DL, referring
`10 to the high-order byte and low-order byte within each 16-bit
`register, as well as to the full 16-bit registers AX, BX, CX,
`and DX, and their 32-bit extended versions. The output of
`the execution unit 102 contains a number of bits that
`depends on the instruction being performed.
`According to the embodiment of the present invention
`shown in FIG. 1, the partial register write handler also
`includes output multiplexer 160. Output multiplexer 160
`receives the output of the execution unit 102 as well as
`additional bits, and then provides a full 32-bit output upon
`20 every clock cycle. The use of the output multiplexer 160 to
`produce a 32-bit output obviates the need for the bypass
`multiplexer 150 to select bits from multiple locations in the
`cases shown above in Tables 3-5.
`The output multiplexer produces results that appear as if
`the entire 32-bit registers were updated upon each clock
`cycle. In effect, a substitute instruction set is created, in
`which instructions that write to partial registers are replaced
`with instructions that not only write to partial registers, but
`also read enough bits from the register file to generate a full
`32-bit register value. A, B, and C in FIG. 1 represent
`operands residing in partial registers that are provided to
`ALU 102 and/or the output multiplexer 160. If, for example,
`the bottom (least significant) bits of register EAX are
`incremented (AL) then the upper 24 bits are set to the value
`35 that would be in the register file entry for EAX after the
`increment completes. Since the increment does not modify
`the upper 24 bits, the value after the increment is the same
`as the value before the increment.
`According to a first embodiment of the present invention,
`the output multiplexer and the execution unit 102 perform
`their respective operations in parallel. To accomplish this,
`the output multiplexer 160 determines the destination reg(cid:173)
`ister of the instruction being executed in the execution unit
`102 before the execution unit 102 begins executing the
`instruction. The output multiplexer 160 also determines the
`number of bits the instruction being executed in the execu-
`tion unit 102 can generate, also before the execution unit 102
`begins executing the instruction. The output multiplexer 160
`then accesses the register file, and reads the prior value of the
`destination register. The output multiplexer 160 then merges
`the bits of the destination register's prior value with the
`destination register's update to provide a 32-bit register
`value.
`Assuming the Destination Register is One of the Source
`55 Registers
`Since the number of instructions within the x86 instruc(cid:173)
`tion set is quite extensive, providing an output multiplexer
`160 that maps bits of all of the registers in the register file
`would be an enormous task. However, most instructions in
`the x86 instruction set select and operate on two (source)
`operands, and then write the result back to one of the two
`source operands rather than to a third operand. The output
`multiplexer 160 of the present invention takes advantage of
`this feature of the x86 instruction set. For example, all of the
`65 x86 instruction set ADD operations fall into six categories,
`all of which overwrite at least part of one of the source
`registers. In other words, for any two registers, identified in
`
`15
`
`25
`
`30
`
`40
`
`45
`
`50
`
`60
`
`Oracle-1008 p. 7
`Oracle v. Teleputers
`IPR2021-00078
`
`
`
`US 6,334,183 Bl
`
`9
`a particular order, there are six ADD operations in the x86
`instruction set, all of