Papworth et al. [19]

US005584038A

[11] Patent Number: 5,584,038
[45] Date of Patent: Dec. 10, 1996
`
`[54] ENTRY ALLOCATION IN A CIRCULAR
`BUFFER USING WRAP BITS INDICATING
`WHETHER A QUEUE OF THE CIRCULAR
`BUFFER HAS BEEN TRAVERSED
`
`[75]
`
`Inventors: David B. Papworth, Beaverton;
`Andrew F. Glew; Michael A.
`Fetterman, both of Hillsboro; Glenn J.
`Hinton; Robert P. Colwell, both of
`Portland; Steven J. Griffith, Aloha;
`Shantanu R. Gupta; Narayan Hegde,
`both of Beaverton, all of Oreg.
`
`[73] Assignee: Intel Corporation, Santa Clara, Calif.
`
`[21] Appl. No.: 633,905
`
`[22] Filed:
`
`Apr. 17, 1996
`
`Related U.S. Application Data
`
`[62] Division of Ser. No. 204,760, Mar. 1, 1994.
`
[51] Int. Cl.6 ................ G06F 12/00
[52] U.S. Cl. ................ 395/800; 364/251; 364/244.4; 364/DIG. 1; 395/437; 395/250
[58] Field of Search ......... 395/800, 375, 437, 250
`
`[56]
`
`References Cited
`U.S. PATENT DOCUMENTS
`
`4,616,338 10/1986 Helen
`5,226,126
`7/1993 McFarland
`1/1995 Brunelle
`5,381,528
`9/1995 Nguyen
`5,450,547
`
`395/250
`3951800
`395/250
`395/250
`
`5,450,560
`9/1995. Bridges
`5,471,598
`11/1995 Quattromani
`5,473,756 12/1995 Traylor
`
`395/410
`395/449
`395/250
`
`OTHER PUBLICATIONS
`
`A Parallel, High Speed Circular Queue Structure by A. E.
`Barbour et al., 1990 IEEE, Circuits and Systems.
`
Primary Examiner - Eric Coleman
Attorney, Agent, or Firm - Blakely, Sokoloff, Taylor & Zafman
`
`[57]
`
`ABSTRACT
`
An allocator assigns entries for a circular buffer. The allocator receives requests for storing data in entries of the circular buffer, and generates a head pointer to identify a starting entry in the circular buffer for which circular buffer entries are not allocated. In addition to pointing to an entry in the circular buffer, the head pointer includes a wrap bit. The allocator toggles the wrap bit each time the allocator traverses the linear queue of the circular buffer. A tail pointer is generated, including the wrap bit, to identify an ending entry in the circular buffer for which circular buffer entries are allocated. In response to the requests for entries, the allocator sequentially assigns entries for the requests located between the head pointer and the tail pointer. The allocator has application for use in a microprocessor performing out-of-order dispatch and speculative execution. The allocator is coupled to a reorder buffer, configured as a circular buffer, to permit allocation of entries. The allocator utilizes an all-or-nothing allocation policy, such that either all or no incoming instructions are allocated during an allocation period.
`
`8 Claims, 8 Drawing Sheets
`
`
`4
`EXHIBIT
`Witness:
`R. COLWELL, PhD
`Date: 2 -27 -17
`Sudan Magee CSR No 17661
`
`PUMA Exhibit 2011
`APPLE v. PUMA, IPR 2016-01114
`Page 1 of 18
`
`
`
U.S. Patent    Dec. 10, 1996    Sheet 1 of 8    5,584,038

[Figure 1: block diagram of Microprocessor 100]

External Bus 101 and Memory 160; Bus Interface 102; Memory Subsystem 155 with MOB 154.
In-Order Issue Cluster 103: Instruction Fetch Unit 105, Instruction Decoder 109, Allocator 112, RAT 111, Branch Target Buffer Circuit 107 (Branch Target Buffer Cache 108), Branch Instruction Pointer Table 162, Branch Address Calculator 164.
Out-of-Order Cluster 110: Reservation Station 115, Reorder Buffer 120, RRF 128, Retirement Logic 140.
Superscalar Execution Cluster 130: MIU 145, FEU 147, IEU 149, AGU 150.

Figure 1
`
`
`
`
U.S. Patent    Dec. 10, 1996    Sheet 2 of 8    5,584,038

[Figure 2: Allocator 112 and associated units. ROB 120 receives headptr, tailptr, and Pdsts, and the signals RONuke, ROClear, and JEClear. Allocator 112 contains Uop Decode 210 and ROB Allocation Control 220; signals ALstall and ALROBstall. Branch Execution Unit 230 asserts JEClear. Uops arrive from Instruction Decoder 109.]

Figure 2
`
`
`
`
U.S. Patent    Dec. 10, 1996    Sheet 3 of 8    5,584,038

[Figure 3: logical diagram of the circular buffer in ROB 120, showing Tailptr and Headptr]

Figure 3
`
`
`
`
U.S. Patent    Dec. 10, 1996    Sheet 4 of 8    5,584,038

[Figure 4: ROB allocation unit, showing intheadptr; tailptr; Reset; Compare and latch 415; AlRobstall; adders +1, +2, +3; Latches 410 (Pdst+1, Pdst+2, Pdst+3); Latches 425; Mux 440 supplying Pdst(1), Pdst(2), Pdst(3) to ROB 120; new intheadptr; Stall; Clk; block 450; and headptr to ROB 120]

Figure 4
`
`
`
`
U.S. Patent    Dec. 10, 1996    Sheet 5 of 8    5,584,038

Branch 1 (executed first & correctly predicted)    Branch 2 (executed second & mispredicted; assert JEClear, flush pipeline)

Figure 5a

Branch 1 (executed second & mispredicted; assert JEClear, flush pipeline)    Branch 2 (executed first, predicted correctly)

Figure 5b
`
`
`
`
U.S. Patent    Dec. 10, 1996    Sheet 6 of 8    5,584,038

Branch 1 (executed second & mispredicted; assert JEClear, flush pipeline)    Branch 2 (executed first & mispredicted, flush pipeline)

Figure 5c

Branch 1 (executed first & mispredicted; assert JEClear, flush pipeline)    Branch 2 (executed second & mispredicted, do not flush pipeline)

Figure 5d
`
`
`
`
U.S. Patent    Dec. 10, 1996    Sheet 7 of 8    5,584,038

[Figure 6: pipeline flush logic in the branch execution unit]

(Start)
610: Mispredicted branch detected?
  Yes: 620: Past mispredicted branches outstanding?
    No: 625: Assert JEClear
    Yes: 630: Current mispredicted branch issued subsequent to past mispredicted branch?
      Yes: 640: Do not assert JEClear
      No: 625: Assert JEClear
(End)

Figure 6
`
`
`
`
U.S. Patent    Dec. 10, 1996    Sheet 8 of 8    5,584,038

[Figure 7: method for determining relative branch age; decision blocks 720, 730 (Result), 760 (exclusive OR past Pdst and current Pdst wrap bits), 770, 785, and 790 (past mispredicted branch older)]

Figure 7
`
`
`
`
`5,584,038
`
`1
`ENTRY ALLOCATION IN A CIRCULAR
`BUFFER USING WRAP BITS INDICATING
`WHETHER A QUEUE OF THE CIRCULAR
`BUFFER HAS BEEN TRAVERSED
`
`This is a divisional of application Ser. No. 08/204,760,
`filed Mar. 1, 1994.
`
`FIELD OF THE INVENTION
`
`The present invention relates to resource allocation in a
`buffer, and more specifically to methods and apparatus for
`allocating entries in a circular buffer for operation in a
`microprocessor.
`
BACKGROUND OF THE INVENTION

In general, data processing systems of all types employ registers or buffers to store data. The buffers are constructed in a variety of ways depending upon the particular application for the data processing system. For example, buffers utilized for data processing systems include random access memory (RAM) and first-in-first-out (FIFO) buffers. In addition to containing buffers for storing data, data processing systems typically employ apparatus for allocating entries in the buffer to store data. Because buffer resources are finite, data processing systems deallocate buffer entries when data stored in a corresponding buffer entry is no longer needed. Although FIFO memory devices provide easy entry allocation, FIFO memory devices are not suitable for all data processing applications.

Buffers are utilized in data processing systems to store input data, intermediate result data and output data. For example, in a digital processing system for audio or video compression, random access memory (RAM) may be employed to store input uncompressed data, intermediate data and final compressed output data. In such a digital processing system, it is necessary to reallocate buffer entries no longer needed in order to provide a continual flow of resources required for subsequent data processing. In addition, it is necessary to reallocate entries in the buffer so as to preserve data entries still required by the data processing system. Consequently, allocation of buffer entries requires both preservation of certain existing data entries, and resource assignment to new buffer entries for incoming data.

Microprocessors utilize buffers to implement file registers for use in conjunction with executing instructions. Typically, file registers store data for access by an execution unit, such as an arithmetic logic unit (ALU), in the microprocessor. The complexity of the register file and allocation required for operation of the microprocessor is dependent upon the architecture of the microprocessor. For example, in a microprocessor performing parallel execution, some instructions are executed out of the original program order. Consequently, allocation and deallocation of entries in the register file becomes complex.

SUMMARY AND OBJECTS OF THE INVENTION

Therefore, it is an object of the present invention to allocate entries in a circular buffer without the need for complex logic.

It is a further object of the present invention to allocate entries in a circular buffer without overwriting valid data currently stored in the buffer.

It is another object of the present invention to provide entry allocation for a reorder buffer in a microprocessor performing out-of-order execution.

It is another object of the present invention to provide entry allocation for a reorder buffer in a microprocessor performing speculative execution.

These and other objects of the present invention are realized in an arrangement that includes a circular buffer and an allocator for allocating entries in the circular buffer. The allocator receives requests for storing data in entries of the circular buffer. The allocator generates a head pointer to identify a starting entry in the circular buffer for which circular buffer entries are not allocated. In addition to pointing to an entry in the circular buffer, the head pointer includes a wrap bit. The allocator toggles the wrap bit each time the allocator traverses the linear queue of the circular buffer. A tail pointer is generated, including the wrap bit, to identify an ending entry in the circular buffer for which circular buffer entries are allocated. In response to the request for entries, the allocator sequentially assigns entries for the requests located between the head pointer and the tail pointer.

In order to allocate entries in the circular buffer, the allocator generates speculative addresses for each of the requests by utilizing the head pointer as a base address. The speculative addresses identify potential allocation entries in the circular buffer. The allocator compares the speculative addresses to the tail pointer. If a match between one of the speculative addresses and the tail pointer occurs, then the circular buffer does not contain enough unallocated entries to support all of the requests. Alternatively, if no match occurs between one of the speculative addresses and the tail pointer, then the circular buffer contains resources to support all of the requests. In either case, the appropriate head pointer is stored for reference in a subsequent allocation period.

The allocator of the present invention has application for use in a microprocessor performing out-of-order execution and speculative execution. The microprocessor contains instruction fetch and decoder circuits for issuing and decoding instructions, respectively, in the program order. The microprocessor also includes a superscalar execution cluster, containing a plurality of execution units, and an out-of-order cluster for performing out-of-order dispatch and execution. The out-of-order cluster contains a reorder buffer configured as a circular buffer. The reorder buffer stores data for use in execution in the superscalar execution unit. The allocator is coupled to the reorder buffer to permit allocation of entries.

In a preferred embodiment, the allocator utilizes an all-or-nothing allocation policy. The allocator receives a plurality of instructions, entitled micro-ops, for execution in the microprocessor. The allocator determines whether the reorder buffer contains enough unallocated entries to support each of the incoming micro-ops. If the reorder buffer contains enough entries, the allocator sequentially allocates physical destinations in the reorder buffer, including a wrap bit, for each micro-op so as to retain the original program order. Alternatively, if the reorder buffer does not contain enough entries, the allocator does not allocate any of the micro-ops in the allocation period.

In order to support speculative execution in the microprocessor, the microprocessor contains a branch execution unit. In part, the branch execution unit determines the outcome for a branch instruction previously predicted. Under certain circumstances, if the branch instruction is
`
`
`
`
`5,584,038
`
`3
`mispredicted, then the branch execution unit flushes the
`pipeline of incoming instructions. In order to determine
`whether to flush the pipeline, the branch execution unit
`determines the relative age of two branch instructions by
`comparing the physical destinations and wrap bits of a first
`and second branch instruction. If the wrap bits are equal,
`then the branch execution unit designates the instruction
`comprising the smallest physical destination value as the
`oldest branch instruction. If the wrap bits are not equal,
`indicating the linear queue has been traversed, then the to
`branch execution unit designates the instruction comprising
`the largest physical destination value as the oldest branch
`instruction.
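The wrap-bit scheme summarized above can be sketched in a few lines of C. This is an illustrative sketch only: the buffer size, the type and function names, and the full-buffer convention (equal entry index with differing wrap bits) are assumptions made for the sketch, not details taken from the patented circuit.

```c
#include <stdbool.h>

/* Illustrative buffer size; the patent's buffer holds n entries. */
#define ROB_SIZE 8

typedef struct {
    unsigned idx;   /* entry index, 0 .. ROB_SIZE-1 */
    unsigned wrap;  /* toggled each time the linear queue is traversed */
} robptr;

/* Advance one entry, toggling the wrap bit on wraparound. */
static robptr advance(robptr p) {
    if (++p.idx == ROB_SIZE) { p.idx = 0; p.wrap ^= 1u; }
    return p;
}

/* All-or-nothing allocation: speculative addresses are generated from
 * the head pointer and compared against the tail pointer; if the buffer
 * lacks room for every request, nothing is allocated and the head
 * pointer is left unchanged. */
static bool allocate(robptr *head, robptr tail, int nreq, robptr out[]) {
    robptr spec = *head;
    for (int i = 0; i < nreq; i++) {
        if (spec.idx == tail.idx && spec.wrap != tail.wrap)
            return false;            /* full: stall the allocation */
        out[i] = spec;
        spec = advance(spec);
    }
    *head = spec;                    /* commit all entries at once */
    return true;
}

/* Relative branch age from entry index and wrap bit: equal wrap bits
 * mean the smaller Pdst is older; unequal wrap bits mean the queue
 * wrapped between the allocations, so the larger Pdst is older. */
static bool is_older(robptr a, robptr b) {
    if (a.wrap == b.wrap)
        return a.idx < b.idx;
    return a.idx > b.idx;
}
```

Note how a single extra bit per pointer distinguishes a full buffer from an empty one (same index, different wrap bit) and orders any two outstanding entries.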
Other objects, features and advantages of the present invention will be apparent from the accompanying drawings, and from the detailed description that follows below.
`
`5
`
BRIEF DESCRIPTION OF THE DRAWINGS

The objects, features, and advantages of the present invention will be apparent from the following detailed description of the preferred embodiment of the invention with references to the following drawings.

FIG. 1 illustrates a high level block diagram of an out-of-order execution and speculative execution microprocessor incorporating the teachings of the present invention.

FIG. 2 illustrates a high level block diagram of an allocator and associated units configured in accordance with the present invention.

FIG. 3 illustrates a logical diagram of the ROB 120 buffer configured in accordance with the present invention.

FIG. 4 illustrates a ROB allocation unit configured in accordance with the present invention.

FIGS. 5a, 5b, 5c, and 5d illustrate examples of how a branch execution unit executes branches both in and out of order in accordance with the present invention.

FIG. 6 illustrates pipeline flush logic implemented in the branch execution unit of the present invention.

FIG. 7 illustrates a method for determining the relative branch age configured in accordance with the present invention.
`
DETAILED DESCRIPTION

Methods and apparatus for allocating reorder buffer resources in a microprocessor performing out-of-order execution are disclosed. In the following description, for purposes of explanation, specific nomenclature is set forth to provide a thorough understanding of the present invention. However, it will be apparent to one skilled in the art that these specific details are not required to practice the present invention. In other instances, well known circuits and devices are shown in block diagram form to avoid obscuring the present invention unnecessarily.

The present invention has application for use in a microprocessor performing out-of-order execution and speculative execution. FIG. 1 illustrates a high level block diagram of an out-of-order execution and speculative execution microprocessor incorporating the teachings of the present invention. The high level block diagram of FIG. 1 illustrates functional blocks of a pipelined microprocessor incorporating the teachings of the present invention. The microprocessor 100 contains an in-order issue cluster 103, an out-of-order cluster 110, and a superscalar execution cluster 130. In addition, microprocessor 100 contains a bus interface 102, coupled to an external bus 101, and a memory subsystem
`
`4
`155 for interfacing the microprocessor 100 to external
`memory.
`The bus interface 102 interfaces the microprocessor 100
`to peripheral components, including memory, via the exter-
`nal bus 101. The memory subsystem 155 is coupled to the
`bus interface 102 and provides a memory interface to cache
`memory and main memory. In one embodiment, the bus
`interface 101 attempts to load or store data from a high speed
`cache memory. Alternatively, the bus interface 101 accesses
`a main memory over the external bus 101. The bus interface
`102 and memory subsystem 155 are intended to represent a
`broad category of interface devices which are well known in
`the art and will not be described further.
The bus interface 102 is coupled to an instruction fetch unit 105 located in the in-order issue cluster 103. The instruction fetch unit 105 retrieves microprocessor instructions, known as macro instructions, and operands for execution in the microprocessor 100. In a preferred embodiment, the microprocessor 100 is implemented as a pipelined processor overlapping instruction fetch, instruction decode and instruction execution functions. The instruction fetch unit 105 continually fetches macro instructions for the pipeline in the microprocessor 100. However, simple unconditional branch instructions within the instruction stream prevent the instruction fetch unit 105 from retrieving instructions in a purely sequential path. Furthermore, conditional branch instructions, within the instruction stream, prevent the instruction fetch unit 105 from retrieving instructions along a predetermined path because the condition requires resolution to ascertain the direction of the path.
In order to continually input macro instructions into the pipeline of microprocessor 100, the microprocessor 100 includes a branch prediction unit 107. The branch prediction unit 107 predicts the execution path of an instruction stream. In general, the branch prediction unit 107 predicts the existence of branch instructions within the instruction stream, and predicts the outcome of the branch. Consequently, as the macro instructions input to the pipeline proceed down the pipeline stages, the macro instructions are "speculatively" executed because of the uncertainty that the branch was properly predicted. The reorder buffer allocation of the present invention supports the operation of speculative execution in the microprocessor 100 as is described more fully below.
The in-order cluster 103 includes a branch target buffer circuit 107. The branch target buffer circuit 107 receives a current instruction pointer (IP) from the instruction fetch unit 105. The branch target buffer circuit 107 contains the branch target buffer cache 108. The branch target buffer cache 108 stores information pertaining to the history of branch instructions previously executed. By utilizing the current instruction pointer (IP) and the internal cache, the branch target buffer circuit 107 predicts both the existence of branch instructions within the instruction stream, and a target address for the branch instructions predicted. After a branch instruction is resolved by the microprocessor 100, the branch target buffer circuit 107 records the branch target address and the outcome of the branch instruction in the branch target buffer cache 108 for future reference.

Each entry in the branch target buffer cache 108 is indexed by the address of the last byte of the branch instruction to allow the branch target buffer circuit 107 to reference previous branch instructions utilizing the instruction pointer. When the branch target buffer circuit 107 finds an upcoming branch instruction in the branch target buffer cache 108, the branch target buffer circuit 107 predicts a
`
`
`
`
`5,584,038
`
`5
`
`1s
`
`5
`taken or not -taken branch outcome for the branch instruc-
`tion. If the branch target buffer circuit 107 predicts a taken
`branch outcome, then the branch target buffer circuit 107
`also predicts a branch target address.
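The lookup described above can be sketched as a table indexed by the branch's last-byte address. This is a minimal sketch under stated assumptions: the table size, the direct-mapped organization, and the one-bit outcome history are illustrative, not the organization of the branch target buffer cache 108.

```c
#include <stdbool.h>

#define BTB_ENTRIES 64   /* illustrative size, not from the patent */

typedef struct {
    bool     valid;
    unsigned tag;        /* last-byte address of a previously seen branch */
    bool     taken;      /* recorded outcome (simplified to one bit) */
    unsigned target;     /* recorded branch target address */
} btb_entry;

/* Predict for the instruction at `ip`: on a hit that predicts taken,
 * supply the recorded target address and return true. */
static bool btb_predict(const btb_entry btb[], unsigned ip, unsigned *target) {
    const btb_entry *e = &btb[ip % BTB_ENTRIES];
    if (e->valid && e->tag == ip && e->taken) {
        *target = e->target;
        return true;     /* predicted taken, target supplied */
    }
    return false;        /* no prediction, or predicted not-taken */
}
```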
The macro instructions retrieved are input to an instruction decoder 109. In general, the instruction decoder 109 decodes an operation code and source data for the corresponding macro instructions. In a preferred embodiment, the instruction decoder 109 receives Intel® architecture compatible macro instructions, and determines the type of instruction received. The instruction decoder 109 breaks down the macro instruction into one or more micro-operations (micro-ops) with associated micro-operands. The one or more micro-ops corresponding to the decoded macro instruction specify the equivalent function.

If the instruction decoder 109 determines that a received microprocessor instruction is a branch instruction, then the instruction decoder 109 passes certain information about the branch instruction to a branch address calculator 164 for special treatment. The branch address calculator 164 further processes each branch instruction based on information provided in the instruction and the operand. For example, if the branch target buffer circuit 107 made a branch prediction, the branch address calculator 164 attempts to verify the branch prediction.
The instruction decoder 109 is coupled to an allocator 112, also located within the in-order issue cluster 103. The micro-ops generated in the instruction decoder 109 are input to the allocator 112. In general, the allocator 112 allocates resources necessary to execute each micro-op. In the preferred embodiment, the microprocessor 100 performs out-of-order execution, wherein micro-ops may be executed out of the original program order. During retirement of the micro-ops, the original program order is restored. The allocation of resources to the out-of-order cluster is described below.
The out-of-order cluster 110 contains a reservation station (RS) 115, a reorder buffer (ROB) 120, a real register file (RRF) 128, and retirement logic 140. The ROB 120 includes a reorder buffer and corresponding reorder logic. The ROB 120 provides capabilities for speculative execution, register renaming and out-of-order execution for the microprocessor 100. In a preferred embodiment of the present invention, the ROB 120 is implemented as a multi-port register file. The ROB 120 is managed as a first in first out (FIFO) register file. Both source reads and reorder buffer write-backs operate on the ROB 120 as a register file. The RRF 128 comprises the architectural registers of the microprocessor 100.
The ROB 120 contains a circular buffer to store n entries, wherein each entry stores the results of executed micro-ops. Because each entry in the ROB 120 provides a destination for a micro-op result, each ROB 120 entry is referred to as a physical destination (Pdst). The Pdsts, within the ROB 120, are numbered 0 through n-1. Each Pdst in the ROB 120 contains a valid bit, that indicates whether or not the micro-op result is valid, a micro-op result, a set of flags and a corresponding mask, a code, and fault data.
If the allocator 112 allocates an entry in the ROB 120 for a micro-op associated with a branch instruction, the branch target buffer circuit 107 allocates a matching entry in a branch instruction-pointer table (BIT) 162. The branch target buffer circuit 107 stores a "fall-through" address for the branch instruction, and processor state information into the branch instruction-pointer table (BIT) 162. The fall-through address identifies the address of the instruction immediately following the branch instruction. The information stored in the branch instruction-pointer table (BIT) 162 is later utilized by the instruction fetch unit 105, the branch address calculator 164 and the branch target buffer circuit 107.
The ROB 120 supports out-of-order execution by allowing the superscalar execution unit 130 to complete execution of instructions, and write back the results without regard to other instructions that use the same logical register. Therefore, as far as the superscalar execution unit 130 is concerned, micro-ops complete out-of-order. Subsequently, the retirement logic 140, also contained within the out-of-order cluster 110, reorders the executed micro operations into the original sequence issued by the in-order issue cluster 103. To support register renaming and out-of-order execution, a register alias table (RAT) 111, located in the in-order issue cluster 103, maintains a mapping of logical registers, located in the real register file 128, to physical registers allocated in the ROB 120. In addition, the ROB 120 supports speculative execution by buffering the results of the superscalar execution cluster 130 before committing the results to an architecturally visible state in the RRF 128.
The ROB 120 is utilized to support register renaming. In general, register renaming involves allocating a new physical register from a logical register, as the destination for a predefined architectural register. In microprocessor 100, register renaming renames logical registers associated with the RRF 128 and allocates physical registers in the ROB 120. Consequently, by renaming the registers, the superscalar execution cluster 130 executes different instructions in overlapping clock cycles even though the instructions utilize the same architectural register because different physical registers are allocated in the ROB 120 for each micro-op.
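The renaming step can be sketched as a small mapping table. This is an illustrative sketch only: the table size, the flat Pdst counter, and the function names are assumptions for the sketch, not the structure of the RAT 111 or the allocator 112.

```c
#define NUM_LOGICAL 8          /* illustrative architectural register count */

/* The RAT maps each logical register to the ROB entry (Pdst) most
 * recently allocated as its destination; -1 means the committed value
 * in the RRF is current. */
typedef struct {
    int map[NUM_LOGICAL];
} rat_t;

static int next_pdst;          /* next free ROB entry (greatly simplified) */

/* Allocate a fresh Pdst for a micro-op's destination and update the RAT,
 * so later readers of `logical` are directed to this micro-op's result. */
static int rename_dest(rat_t *rat, int logical) {
    int pdst = next_pdst++;
    rat->map[logical] = pdst;
    return pdst;
}
```

Because each write to the same architectural register receives a distinct physical entry, two micro-ops naming the same register no longer conflict, which is exactly what lets the execution cluster overlap them.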
The allocator 112 allocates an entry in ROB 120 for each micro-op. The allocator 112 allocates and deallocates entries in the ROB 120 in a FIFO manner. Upon allocation of a micro-op to a reorder buffer entry, the allocator 112 provides the reorder unit 120 with physical destination addresses to identify the allocation. Each physical destination in the ROB 120 contains micro-op result data, flags, a code for the result data, fault data, and a valid bit, which indicates whether or not the corresponding micro-op is valid. During the high phase of the system clock, the allocator 112 provides the three physical destination addresses to the reorder unit 120. In a subsequent low phase of the clock cycle, the in-order fetch and issue cluster 103 provides information to write entries into the ROB 120. Also, on the low phase of the clock cycle, ROB 120 entries receive data. In a preferred embodiment, up to four micro-ops are allocated in the ROB 120 in any given clock.
For each micro-op, the allocator 112 allocates an entry in the reservation station (RS) 115. Each entry in the RS 115 stores a valid bit, indicating validity of the corresponding entry, the micro-op instruction code, two source data entries and corresponding source data valid bits. In addition, the RS 115 stores two physical source fields identifying the location of the source data if the entry is not valid, and a physical destination for the result of the micro-op. Upon allocation of entries in the RS 115 and ROB 120, each micro-op waits in the RS 115 for both available source data and an available execution unit in the superscalar execution cluster 130. When the source data and the appropriate execution unit are ready, the RS 115 dispatches the micro-op to the appropriate execution unit in the superscalar execution cluster 130.
The out-of-order cluster 110 is coupled to the superscalar execution cluster 130. The superscalar execution cluster 130 executes instructions utilizing source data stored in the ROB 120 and the RRF 128. For the present embodiment, the superscalar execution cluster 130 comprises four execution units (AGU 150, IEU 149, FEU 147 and MIU 145). Specifically, the superscalar execution cluster 130 contains an address generation unit (AGU) 150, an integer execution unit (IEU) 149, a floating point execution unit (FEU) 147, and a memory interface unit (MIU) 145. Upon execution of the micro-op in the superscalar execution unit 130, the corresponding execution unit writes the result data, the architectural flags, and any fault information in the appropriate physical destination entry in the ROB 120.
The retirement logic 140 retires the write-back results stored in the ROB 120 for each executed micro-op. In general, the retirement logic 140 retires the ROB 120 entries by evaluating the physical destination entries in the ROB 120 in the order allocated. The retirement logic 140 retires the physical destination entries by transferring write-back data into a corresponding logical register in the RRF 128 so as to commit the write-back data to the current architectural state of the microprocessor 100. Because the allocator 112 allocates the physical destination entries in the ROB 120 in the original program order, and the retirement logic 140 retires the physical destination entries in the same order, the original program order is maintained.

In order to properly retire a micro-op, the retirement logic 140 tests a valid bit in the corresponding ROB 120 entry to ascertain whether the corresponding entry contains a valid executed micro-op result. If the micro-op result is valid, as indicated by the valid bit, then the retirement logic 140 checks the fault field of the corresponding Pdst to ascertain whether special handling is required. If the ROB 120 Pdst entry contains a valid executed micro-op result and no fault exists, then the executed micro-op result is committed to permanent architectural state in the RRF 128. The tail pointer is incremented for each micro-op retired. When the retirement logic 140 attempts to retire a branch micro-op, the retirement logic 140 tests the fault field of the corresponding Pdst entry to determine whether the branch micro-op was mispredicted. If the retirement logic 140 detects that the micro-op was mispredicted, the retirement logic 140 flushes all additional micro-ops issued subsequent to the mispredicted branch.
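The retirement walk described above can be sketched as a loop over the ROB from the tail pointer. This is a simplified sketch: the entry fields are reduced to a valid bit, a one-bit fault field for a mispredicted branch, and a result, and the commit to the RRF is elided; none of this is the patent's circuit.

```c
#include <stdbool.h>

typedef struct {
    bool valid;        /* executed result present */
    bool mispredict;   /* fault field flags a mispredicted branch */
    int  result;
} rob_entry;

/* Retire entries in allocation order starting at *tail. Stops at the
 * first entry whose result is not yet valid. A mispredicted branch
 * itself retires, but retirement then stops and *flush signals that
 * younger micro-ops must be flushed. Returns the number of entries
 * committed; the tail pointer is incremented per micro-op retired. */
static int retire(rob_entry rob[], int size, int *tail, bool *flush) {
    int committed = 0;
    *flush = false;
    while (rob[*tail].valid) {
        if (rob[*tail].mispredict)
            *flush = true;
        rob[*tail].valid = false;      /* entry deallocated */
        *tail = (*tail + 1) % size;
        committed++;
        if (*flush)
            break;
    }
    return committed;
}
```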
As described above, the allocator 112 is contained within the in-order issue cluster 103 on the microprocessor 100. The allocator 112 interacts with both the in-order and out-of-order sections of the microprocessor 100. Specifically, the allocator 112 interacts with the instruction decoder 109, register alias table (RAT) 111, branch target buffer (BTB) 107, the integer execution unit 149, ROB 120, RS 115, and MOB 154. During each clock cycle of microprocessor 100, the allocator 112 prepares to allocate up to three ROB 120, RS 115, and load buffer (LB) entries within MOB 154. In addition, the allocator 112 prepares to allocate up to two store buffer (SB) entries within MOB 154.

In order to allocate the appropriate resources, the allocator 112 generates pointers to the appropriate resources by decoding the micro-ops input from the instruction decoder 109. The decoded micro-ops permit the allocator 112 to ascertain specific resources required for the micro-ops. In addition, the decoded micro-ops indicate a specific RS 115 dispatch port. The decoded micro-ops contain a micro-op valid bit that permits the allocator 112 to further qualify resources required. Based on the resources required and validity of the micro-ops, the allocator 112 ascertains the availability of resources for the micro-ops.
FIG. 2 illustrates a high level block diagram incorporating the teachings of the present invention. In part, FIG. 2 illustrates a portion of the allocator 112 utilized to allocate resources to the ROB 120. In addition, FIG. 2 illustrates ROB 120 and a branch execution unit 230. In order to allocate resources for the ROB 120, the allocator 112 contains a micro-op decoder 210 and ROB allocation unit 220. The micro-op decoder 210 analyzes the incoming stream of micro-ops in order to determine whether RS 115, LB or SB entries are required. In a preferred embodiment, each micro-op requires one ROB 120 buffer entry.

The entries in the ROB 120 are entitled physical destination addresses (Pdst), as opposed to logical addresses that are the source and destination fields of the m