Complexity-Effective Superscalar Processors

Subbarao Palacharla
Computer Sciences Department
University of Wisconsin-Madison
Madison, WI 53706, USA
subbarao@cs.wisc.edu

Norman P. Jouppi
Western Research Laboratory
Digital Equipment Corporation
Palo Alto, CA 94301, USA
jouppi@pa.dec.com

J. E. Smith
Dept. of Electrical and Computer Engg.
University of Wisconsin-Madison
Madison, WI 53706, USA
jes@ece.wisc.edu
Abstract

The performance tradeoff between hardware complexity and clock speed is studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice-simulated for feature sizes of 0.8µm, 0.35µm, and 0.18µm. Performance results and trends are expressed in terms of issue width and window size. Our analysis indicates that window wakeup and selection logic as well as operand bypass logic are likely to be the most critical in the future.

A microarchitecture that simplifies wakeup and selection logic is proposed and discussed. This implementation puts chains of dependent instructions into queues, and issues instructions from multiple queues in parallel. Simulation shows little slowdown as compared with a completely flexible issue window when performance is measured in clock cycles. Furthermore, because only instructions at queue heads need to be awakened and selected, issue logic is simplified and the clock cycle is faster; consequently, overall performance is improved. By grouping dependent instructions together, the proposed microarchitecture will help minimize performance degradation due to slow bypasses in future wide-issue machines.
1 Introduction

For many years, a major point of contention among microprocessor designers has revolved around complex implementations that attempt to maximize the number of instructions issued per clock cycle, and much simpler implementations that have a very fast clock cycle. These two camps are often referred to as "brainiacs" and "speed demons", terms taken from an editorial in Microprocessor Report [7]. Of course the tradeoff is not a simple one, and through innovation and good engineering, it may be possible to achieve most, if not all, of the benefits of complex issue schemes while still allowing a very fast clock in the implementation; that is, to develop microarchitectures we refer to as complexity-effective. One of two primary objectives of this paper is to propose such a complexity-effective microarchitecture. The proposed microarchitecture achieves high performance, as measured by instructions per cycle (IPC), yet it permits a design with a very high clock frequency.

Supporting the claim of high IPC with a fast clock leads to the second primary objective of this paper. It is commonplace to measure the effectiveness (i.e., IPC) of a new microarchitecture, typically by using trace-driven simulation. Such simulations count clock cycles and can provide IPC in a fairly straightforward manner. However, the complexity (or simplicity) of a microarchitecture is much more difficult to determine; to be very accurate, it requires a full implementation in a specific technology. What is very much needed are fairly straightforward measures of complexity that can be used by microarchitects at a fairly early stage of the design process. Such methods would allow the determination of complexity-effectiveness. It is the second objective of this paper to take a step in the direction of characterizing complexity and complexity trends.

Before proceeding, it must be emphasized that while complexity can be variously quantified in terms such as number of transistors, die area, and power dissipated, in this paper complexity is measured as the delay of the critical path through a piece of logic, and the longest critical path through any of the pipeline stages determines the clock cycle.

The two primary objectives given above are covered in reverse order: first, sources of pipeline complexity are analyzed; then, a new complexity-effective microarchitecture is proposed and evaluated. In the next section we describe those portions of a microarchitecture that tend to have complexity that grows with increasing instruction-level parallelism. Of these, we focus on instruction dispatch and issue logic, and data bypass logic. We analyze potential critical paths in these structures and develop models for quantifying their delays. We study the variation of these delays with the microarchitectural parameters of window size (the number of waiting instructions from which ready instructions are selected for issue) and issue width (the number of instructions that can be issued in a cycle). We also study the impact of the technology trend towards smaller feature sizes. The complexity analysis shows that logic associated with the issue window and data bypasses is likely to be a key limiter of clock speed, since smaller feature sizes cause wire delays to dominate overall delay [20, 3].

Taking sources of complexity into account, we propose and evaluate a new microarchitecture. This microarchitecture is called dependence-based because it focuses on grouping dependent instructions rather than independent ones, as is often the case in superscalar implementations. The dependence-based microarchitecture simplifies issue window logic while exploiting levels of parallelism similar to those achieved by current superscalar microarchitectures using more complex logic.

The rest of the paper is organized as follows. Section 2 describes the sources of complexity in a baseline microarchitecture. Section 3 describes the methodology we use to study the critical pipeline
`Patent Owner Saint Regis Mohawk Tribe
`Ex. 2044, p. 1
• Wakeup logic. This logic is part of the issue window and is responsible for waking up instructions waiting for their source operands to become available.

• Selection logic. This logic is another part of the issue window and is responsible for selecting instructions for execution from the pool of ready instructions.

• Bypass logic. This logic is responsible for bypassing operand values from instructions that have completed execution, but have not yet written their results to the register file, to subsequent instructions.

There are other important pieces of pipeline logic that are not considered in this paper, even though their delay is a function of dispatch/issue width. In most cases, their delay has been considered elsewhere. These include register files and caches. Farkas et al. [6] study how the access time of the register file varies with the number of registers and the number of ports. The access time of a cache is a function of the size of the cache and the associativity of the cache. Wada et al. [18] and Wilton and Jouppi [21] have developed detailed models that estimate the access time of a cache given its size and associativity.
2.2 Current Implementations

The structures identified above were presented in the context of the baseline superscalar model shown in Figure 1. The MIPS R10000 [22] and the DEC 21264 [10] are real implementations that directly fit this model. Hence, the structures identified above apply to these two processors.

On the other hand, the Intel Pentium Pro [9], the HP PA-8000 [12], the PowerPC 604 [16], and the HAL SPARC64 [8] do not completely fit the baseline model. These processors are based on a microarchitecture where the reorder buffer holds non-committed, renamed register values. In contrast, the baseline microarchitecture uses the physical register file for both committed and non-committed values. Nevertheless, the point to be noted is that the basic structures identified earlier are present in both types of microarchitectures. The only notable difference is the size of the physical register file.

Finally, while the discussion about potential sources of complexity is in the context of an out-of-order baseline superscalar model, it must be pointed out that some of the critical structures identified apply to in-order processors, too. For example, part of the register rename logic (to be discussed later) and the bypass logic are present in in-order superscalar processors.
3 Methodology

The key pipeline structures were studied in two phases. In the first phase, we selected a representative CMOS circuit for the structure. This was done by studying designs published in the literature (e.g. ISSCC* proceedings) and by collaborating with engineers at Digital Equipment Corporation. In cases where there was more than one possible design, we did a preliminary study of the designs to decide in favor of the one that was most promising. By basing our circuits on designs published by microprocessor vendors, we believe the studied circuits are similar to circuits used in microprocessor designs. In practice, many circuit tricks could be employed to optimize critical paths. However, we believe that the relative delays between different structures should be more accurate than the absolute delays.

*International Solid-State Circuits Conference.
structures identified in Section 2. Section 4 presents a detailed analysis of each of the structures and shows how their delays vary with microarchitectural parameters and technology parameters. Section 5 presents the proposed dependence-based microarchitecture and some preliminary performance results. Finally, we draw conclusions in Section 6.
2 Sources of Complexity

In this section, specific sources of pipeline complexity are considered. We realize that it is impossible to capture all possible microarchitectures in a single model, and any results therefore have some obvious limitations. We can only hope to provide a fairly straightforward model that is typical of most current superscalar processors, and suggest that analyses similar to those used here can be extended to other, more advanced techniques as they are developed.
[Figure: block diagram of the baseline superscalar model. Pipeline stages: FETCH, DECODE, RENAME, WAKEUP/SELECT, REG READ, EXECUTE/BYPASS, DCACHE ACCESS, REG WRITE/COMMIT. Datapath blocks: fetch, decode, rename, issue window, register file, bypass, data-cache.]

Figure 1: Baseline superscalar model.
Figure 1 shows the baseline model and the associated pipeline. The fetch unit reads multiple instructions every cycle from the instruction cache, and branches encountered by the fetch unit are predicted. Next, instructions are decoded and their register operands are renamed. Renamed instructions are dispatched to the instruction window, where they wait for their source operands and the appropriate functional unit to become available. As soon as these conditions are satisfied, instructions are issued and executed in the functional units. The operand values of an instruction are either fetched from the register file or are bypassed from earlier instructions in the pipeline. The data cache provides low-latency access to memory operands.
2.1 Basic Structures

As mentioned earlier, probably the best way to identify the primary sources of complexity in a microarchitecture is to actually implement the microarchitecture in a specific technology. However, this is extremely time consuming and costly. Instead, our approach is to select certain key structures for study, and develop relatively simple delay models that can be applied in a straightforward manner without relying on detailed design.

Structures to be studied were selected using the following criteria. First, we consider structures whose delay is a function of issue window size and/or issue width; these structures are likely to become cycle-time limiters in future wide-issue superscalar designs. Second, we are interested in dispatch and issue-related structures because these structures form the core of a microarchitecture and largely determine the amount of parallelism that can be exploited. Third, some structures tend to rely on broadcast operations over long wires and hence their delays might not scale as well as logic-intensive structures in future technologies with smaller feature sizes.

The structures we consider are:

• Register rename logic. This logic translates logical register designators into physical register designators.
[Figure: block diagram of the register rename logic. Logical source and destination register designators index the map table; dependence check logic (one slice shown, for logical source register R) compares each logical source register against the logical destination registers of earlier instructions in the group and controls output MUXes that select between the physical register read from the map table and the physical destination registers assigned within the group.]

Figure 2: Register rename logic.
The number of entries in the map table is equal to the number of logical registers.
• CAM scheme

An alternate scheme for register renaming uses a CAM (content-addressable memory) [19] to store the current mappings. Such a scheme is implemented in the HAL SPARC [2] and the DEC 21264 [10]. The number of entries in the CAM is equal to the number of physical registers. Each entry contains two fields: the logical register designator that is mapped to the physical register represented by the entry, and a valid bit that is set if the current mapping is valid. Renaming is accomplished by matching on the logical register designator field.
In general, the CAM scheme is less scalable than the RAM scheme because the number of CAM entries, which is equal to the number of physical registers, tends to increase with issue width. Also, for the design space we are interested in, the performance of the two schemes was found to be comparable. Consequently, we focus on the RAM method below. A more detailed discussion of the trade-offs involved can be found in [15].
The dependence check logic proceeds in parallel with the map table access. Every logical register designator being renamed is compared against the logical destination register designators of earlier instructions in the current rename group. If there is a match, then the physical register assigned to the result of the earlier instruction is used instead of the one read from the map table. In the case of multiple matches, the register corresponding to the latest (in dynamic order) match is used. Dependence check logic for issue widths of 2, 4, and 8 was implemented. We found that for these issue widths, the delay of the dependence check logic is less than the delay of the map table, and hence the check can be hidden behind the map table access.
4.1.2 Delay Analysis

As the name suggests, the RAM scheme operates like a standard RAM. Address decoders drive word lines; an access stack at the addressed cell pulls a bitline low. The bitline changes are sensed by a sense amplifier which in turn produces the output. Symbolically, the rename delay can be written as:

    Trename = Tdecode + Twordline + Tbitline + Tsenseamp
The analysis presented here and in following subsections focuses on those parts of the delay that are a function of the issue width and window size. All sources of delay are considered in detail in [15]. In the rename logic, the window size is not a factor, and the issue width affects delay through its impact on wire lengths.
In the second phase we implemented the circuit and optimized it for speed. We used the Hspice circuit simulator [14] from Meta-Software to simulate the circuits. Primarily, static logic was used. However, in situations where dynamic logic helped in boosting the performance significantly, we used dynamic logic. For example, in the wakeup logic, a dynamic 7-input NOR gate is used for comparisons instead of a static gate. A number of optimizations were applied to improve the speed of the circuits. First, all the transistors in the circuit were manually sized so that overall delay improved. Second, logic optimizations like two-level decomposition were applied to reduce fan-in requirements. We avoided using static gates with a fan-in greater than four. Third, in some cases transistor ordering was modified to shorten the critical path. Wire parasitics were added at appropriate nodes in the Hspice model of the circuit. These parasitics were computed by calculating the length of the wires based on the layout of the circuit and using the values of Rmetal and Cmetal, the resistance and parasitic capacitance of metal wires per unit length.

To study the effect of reducing the feature size on the delays of the structures, we simulated the circuits for three different feature sizes: 0.8µm, 0.35µm, and 0.18µm. Layouts for the 0.35µm and 0.18µm processes were obtained by appropriately shrinking the layouts for the 0.8µm process. The Hspice models used for the three technologies are tabulated in [15].
4 Pipeline Complexity

In this section, we analyze the critical pipeline structures. The presentation for each structure begins with a description of the logical function being implemented. Then, possible implementation schemes are discussed, and one is chosen. Next, we summarize our analysis of the overall delay in terms of the microarchitectural parameters of issue width and issue window size; a much more detailed version of the analysis appears in [15]. Finally, Hspice circuit simulation results are presented, and trends are identified and compared with the earlier analysis.
4.1 Register Rename Logic

Register rename logic translates logical register designators into physical register designators by accessing a map table with the logical register designator as the index. The map table holds the current logical-to-physical mappings and is multi-ported because multiple instructions, each with multiple register operands, need to be renamed every cycle. The high-level block diagram of the rename logic is shown in Figure 2. In addition to the map table, dependence check logic is required to detect cases where the logical register being renamed is written by an earlier instruction in the current group of instructions being renamed. The dependence check logic detects such dependences and sets up the output MUXes so that the appropriate physical register designators are selected. At the end of every rename operation, the map table is updated to reflect the new logical-to-physical mappings created for the result registers written by the current rename group.
4.1.1 Structure

The mapping and checkpointing functions of the rename logic can be implemented in at least two ways. These two schemes, called the RAM scheme and the CAM scheme, are described next.

• RAM scheme

In the RAM scheme, implemented in the MIPS R10000 [22], the map table is a register file where the logical register designator directly accesses an entry that contains the physical register to which it is mapped.
[Figure: stacked bar graph of rename delay (ps, 0 to 1600) versus issue width (2, 4, 8) for the 0.8µm, 0.35µm, and 0.18µm technologies, with each bar broken into decoder, wordline, bitline, and sense-amp delay components.]

Figure 3: Rename delay versus issue width.
to indicate this. The issue window is a CAM array holding one instruction per entry. Buffers, shown at the top of the figure, are used to drive the result tags tag1 to tagIW, where IW is the issue width. Each entry of the CAM has 2 × IW comparators to compare each of the result tags against the two operand tags of the entry. The OR logic ORs the comparator outputs and sets the rdyL/rdyR flags.
[Figure: an issue-window array of entries inst0 through instN-1. Result tags tag1 through tagIW are broadcast over tag lines; in each entry, comparators match them against the opd tagL and opd tagR operand tag fields, and OR gates set the rdyL and rdyR flags.]

Figure 4: Wakeup logic.
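The broadcast-and-compare behavior of the wakeup logic can be sketched as follows. This is a behavioral model of what the CAM array computes, not the circuit; the class and field names are assumptions for the example.

```python
# Illustrative sketch of wakeup: every result tag broadcast in a cycle is
# compared against both operand tags of every window entry, and a match
# sets the corresponding ready flag (the OR of the per-tag comparators).

class WindowEntry:
    def __init__(self, tag_l, tag_r, rdy_l=False, rdy_r=False):
        self.tag_l, self.tag_r = tag_l, tag_r
        self.rdy_l, self.rdy_r = rdy_l, rdy_r

    @property
    def ready(self):
        # Instruction is ready once both operands are available.
        return self.rdy_l and self.rdy_r

def wakeup(window, result_tags):
    """Broadcast result_tags (at most IW of them) to all window entries."""
    for entry in window:
        # One comparator per (result tag, operand tag) pair; OR the matches.
        if any(tag == entry.tag_l for tag in result_tags):
            entry.rdy_l = True
        if any(tag == entry.tag_r for tag in result_tags):
            entry.rdy_r = True
```

Note that every broadcast touches every entry, which is why the tag drive and tag match delays below grow with both window size and issue width.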
4.2.2 Delay Analysis

The delay consists of three components: the time taken by the buffers to drive the tag bits, the time taken by the comparators in a pull-down stack corresponding to a mismatching bit position to pull the matchline low, and the time taken to OR the individual match signals (matchlines). Symbolically,

    Twakeup = Ttagdrive + Ttagmatch + TmatchOR
The time taken to drive the tags depends on the length of the tag lines and the number of comparators on the tag lines. Increasing the window size increases both these terms. For a given window size,

(We assume that only one pull-down stack is turned on since we are interested in the worst-case delay.)
Increasing the issue width increases the number of bitlines and wordlines in each cell, thus making each cell bigger. This in turn increases the length of the predecode, wordline, and bitline wires and the associated wire delays. The net effect is the following relationship for each of the delay components:

    Tdecode, Twordline, Tbitline = c0 + c1 × IW + c2 × IW²

where IW is the issue width and c0, c1, and c2 are constants that are fixed for a given technology and instruction set architecture; derivation of the constants for each component is given in [15]. In each case, the quadratic component, resulting from the intrinsic RC delay of wires, is relatively small for the design space and technologies we explored. Hence, the decode, wordline, and bitline delays are effectively linear functions of the issue width.
For the sense amplifier, we found that even though its structural constitution is independent of the issue width, its delay is a function of the slope of the input (the bitline delay) and therefore varies linearly with issue width.
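The per-component delay model above is easy to tabulate with a small helper. The coefficients below are made-up placeholders for illustration only; the actual technology-dependent constants are derived in [15].

```python
# Sketch of the delay model: each component varies as c0 + c1*IW + c2*IW^2
# in the issue width IW, and the total rename delay is the sum of the
# decoder, wordline, bitline, and sense-amp components.

def component_delay(iw, c0, c1, c2):
    return c0 + c1 * iw + c2 * iw * iw

def rename_delay(iw, coeffs):
    """Sum the four component delays for issue width iw."""
    return sum(component_delay(iw, *c) for c in coeffs)

# Placeholder coefficients (ps) for the four components; c2 is kept small,
# matching the observation that the quadratic term is minor in this
# design space, so the total is effectively linear in issue width.
COEFFS = [(50, 20, 0.5), (40, 30, 0.8), (60, 45, 1.0), (30, 10, 0.2)]
```

With fitted constants in place of these placeholders, such a model lets a microarchitect estimate rename delay for a candidate issue width without a full circuit simulation.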
4.1.3 Spice Results

For our Hspice simulations, Figure 3 shows how the delay of the rename logic varies with the issue width, i.e., the number of instructions being renamed every cycle, for the three technologies. The graph includes the breakdown of delay into the components discussed in the previous section.

A number of observations can be made from the graph. The total delay increases linearly with issue width for all the technologies. This is in conformance with our analysis, summarized in the previous section. Furthermore, each of the components shows a linear increase with issue width. The increase in the bitline delay is larger than the increase in the wordline delay as issue width is increased because the bitlines are longer than the wordlines in our design. The bitline length is proportional to the number of logical registers (32 in most cases) whereas the wordline length is proportional to the width of the physical register designator (less than 8 for the design space we explored).

Another important observation that can be made from the graph is that the relative increase in wordline delay, bitline delay, and hence total delay as a function of issue width worsens as the feature size is reduced. For example, as the issue width is increased from 2 to 8, the percentage increase in bitline delay shoots up from 37% to 53% as the feature size is reduced from 0.8µm to 0.18µm. Logic delays in the various components are reduced in proportion to the feature size, while the presence of wire delays in the wordline and bitline components causes those components to fall at a slower rate. In other words, wire delays in the wordline and bitline structures will become increasingly important as feature sizes are reduced.
4.2 Wakeup Logic

Wakeup logic is responsible for updating source dependences for instructions in the issue window waiting for their source operands to become available.

4.2.1 Structure

Wakeup logic is illustrated in Figure 4. Every time a result is produced, a tag associated with the result is broadcast to all the instructions in the issue window. Each instruction then compares the tag with the tags of its source operands. If there is a match, the operand is marked as available by setting the rdyL or rdyR flag. Once all the operands of an instruction become available (both rdyL and rdyR are set), the instruction is ready to execute, and the ready flag is set
expected, the delay increases as window size and issue width are increased. The quadratic dependence of the total delay on the window size results from the quadratic increase in tag drive time as discussed in the previous section. This effect is clearly visible for an issue width of 8 and is less significant for an issue width of 4. We found similar curves for the 0.35µm and 0.8µm technologies. The quadratic dependence of delay on window size was more prominent in the curves for the 0.18µm technology than in the case of the other two technologies. Also, issue width has a greater impact on the delay than window size because increasing issue width increases all three components of the delay. On the other hand, increasing window size only lengthens the tag drive time and, to a small extent, the tag match time. Overall, the results show that the delay increases by almost 34% going from 2-way to 4-way and by 46% going from 4-way to 8-way for a window size of 64 instructions. In reality, the increase in delay is going to be even worse because in order to sustain a wider issue width, a larger window is required to find independent instructions.

Figure 6 shows the effect of reducing feature sizes on the various components of the wakeup delay for an 8-way, 64-entry window processor. The tag drive and tag match delays do not scale as well as the match OR delay. This is expected since tag drive and tag match delays include wire delays whereas the match OR delay only consists of logic delays. Quantitatively, the fraction of the total delay contributed by tag drive and tag match delay increases from 52% to 65% as the feature size is reduced from 0.8µm to 0.18µm. This shows that the performance of the broadcast operation will become more crucial in future technologies.
[Figure: stacked bar graph of wakeup delay (ps, 0 to 1500) versus feature size (0.8µm, 0.35µm, 0.18µm), with each bar broken into tag drive, tag match, and match OR delay components.]

Figure 6: Wakeup delay versus feature size.
4.3 Selection Logic

Selection logic is responsible for choosing instructions for execution from the pool of ready instructions in the issue window. Some form of selection logic is required because the number and types of ready instructions may exceed the number and types of functional units available to execute them.

Inputs to the selection logic are request (REQ) signals, one per instruction in the issue window. The request signal of an instruction is raised when the wakeup logic determines that all its operands are available. The outputs of the selection logic are grant (GRANT) signals, one per request signal. On receipt of the GRANT signal, the associated instruction is issued to the functional unit.

A selection policy is used to decide which of the requesting instructions is granted. An example selection policy is oldest first: the ready instruction that occurs earliest in program order is granted
increasing issue width also increases both the terms in the following way. Increasing issue width increases the number of matchlines in each cell and hence increases the height of each cell. Also, increasing issue width increases the number of comparators in each cell. Note that we assume the maximum number of tags produced per cycle is equal to the maximum issue width.

In simplified form (see [15] for a more detailed analysis), the time taken to drive the tags is:

    Ttagdrive = c0 + (c1 + c2 × IW) × WINSIZE + (c3 + c4 × IW + c5 × IW²) × WINSIZE²
The above equation shows that the tag drive time is a quadratic function of the window size. The weighting factor of the quadratic term is a function of the issue width. The weighting factor becomes significant for issue widths beyond 2. For a given window size, the tag drive time is also a quadratic function of the issue width. For current technologies (0.35µm and longer) the quadratic component is relatively small and the tag drive time is largely a linear function of issue width. However, as the feature size is reduced to 0.18µm, the quadratic component also increases in significance. The quadratic component results from the intrinsic RC delay of the tag lines.

In reality, both issue width and window size will be simultaneously increased because a larger window is required for finding more independent instructions to take advantage of wider issue. Hence, the tag drive time will become significant in future designs with wider issue widths, bigger windows, and smaller feature sizes.
The tag match time is primarily a function of the length of the matchline, which varies linearly with the issue width. The match OR time is the time taken to OR the match lines, and the number of matchlines is a linear function of issue width. Both of these delays (refer to [15]) have the form:

    Ttagmatch, TmatchOR = c0 + c1 × IW + c2 × IW²

However, in both cases the quadratic term is very small for the design space we consider, so these delays are linear functions of issue width.
[Figure: line graph of wakeup delay (ps, 0 to 350) versus window size (8 to 64 entries) for 2-way, 4-way, and 8-way issue widths.]

Figure 5: Wakeup logic delay versus window size.
4.2.3 Spice Results

The graph in Figure 5 shows how the delay of the wakeup logic varies with window size and issue width for the 0.18µm technology. As
The selection delay can be written as

    Tselection = c0 + c1 × log4(WINSIZE)

where c0 and c1 are constants determined by the propagation delays of a single arbiter. We found the optimal number of arbiter inputs to be four in our case, so the logarithm is base 4. The selection logic in the MIPS R10000, described in [17], is also based on four-input arbiter cells.
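The tree of four-input arbiter cells can be sketched behaviorally: request signals propagate up as an "anyreq" per subtree, and the grant propagates back down through the highest-priority requesting child. Leftmost-first priority here is a stand-in for the position-based policy; the function name and the recursive formulation are illustrative assumptions, not the circuit.

```python
# Sketch of tree selection logic built from 4-input arbiter cells.
# Requests propagate to the root ("anyreq"); a single grant propagates
# back down, each cell enabling its highest-priority requesting child.

FANIN = 4  # four-input arbiter cells, as in the MIPS R10000

def select(requests):
    """Return the index of the granted request, or None if none are raised.

    Assumes len(requests) is a power of FANIN.
    """
    if len(requests) <= FANIN:            # a single arbiter cell
        for i, req in enumerate(requests):
            if req:
                return i                  # priority encoder: leftmost wins
        return None
    # Split the window among child subtrees; each raises anyreq upward.
    size = len(requests) // FANIN
    for child in range(FANIN):
        sub = requests[child * size:(child + 1) * size]
        if any(sub):                           # anyreq from this subtree
            return child * size + select(sub)  # grant (enable) flows down
    return None
```

The recursion depth is log4 of the window size, mirroring the logarithmic delay term above: doubling the window adds at most one more level of arbitration, while the root-cell delay stays fixed.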
4.3.3 Spice Results

Figure 8 shows the delay of the selection logic for various window sizes and for the three feature sizes, assuming a single functional unit is being scheduled. The delay is broken down into the three components. From the graph we can see that for all three technologies, the delay increases logarithmically with window size. Also, the increase in delay is less than 100% when the window size is increased from 16 instructions to 32 instructions (or from 64 instructions to 128 instructions) since one of the components of the total delay, the delay at the root cell, is independent of the window size.
[Figure: stacked bar graph of selection delay (ps, 0 to 3000) versus window size (16, 32, 64, 128 entries) for the 0.8µm, 0.35µm, and 0.18µm technologies, with each bar broken into request propagation, root, and grant propagation delay components.]

Figure 8: Selection delay versus window size.
The various components of the total delay scale well as the feature size is reduced. This is not surprising since all the delays are logic delays. It must be pointed out that we do not consider the wires in the circuit, so the selection delays presented here are optimistic, especially if the request signals (the ready flags discussed in the wakeup logic) originate from the CAM entries in which the instructions reside. On the other hand, it might be possible to minimize the effect of these wire delays if the ready signals are stored in a smaller, more compact array.
4.4 Data Bypass Logic

Data bypass logic is responsible for forwarding result values from completing instructions to dependent instructions, bypassing the register file. The number of bypass paths required is determined by the depth of the pipeline and the issue width of the microarchitecture. As pointed out in [1], if IW is the issue width, and if there are S pipestages after the first result-producing stage, then a fully bypassed design would require 2 × IW² × S bypass paths, assuming 2-input functional units. In other words, the number of bypass paths grows quadratically with issue width. This is of critical importance, given the current trends toward deeper pipelines and wider issue.
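The path count just given is easy to tabulate. With IW the issue width and S the number of pipestages after the first result-producing stage, a fully bypassed design with 2-input functional units needs 2 × IW² × S paths; the helper name below is ours, for illustration.

```python
# Bypass paths for a fully bypassed design with 2-input functional units:
# each of IW results in each of S stages can feed either input of each of
# the IW functional units.

def bypass_paths(iw, s):
    return 2 * iw * iw * s

# Growth is quadratic in issue width:
# 4-wide, 2 stages -> 64 paths; 8-wide, 2 stages -> 256 paths.
```

Doubling the issue width quadruples the path count at a fixed pipeline depth, which is why bypass wiring is singled out as a key limiter for future wide-issue machines.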
Bypass logic consists of two components: datapath and control. The datapath comprises result busses that are used to broadcast by-
the functional unit. Butler and Patt [5] studied various policies for scheduling ready instructions and found that overall performance is largely independent of the selection policy. The HP PA-8000 uses a selection policy that is based on the location of the instruction in the window. We assume the same selection policy in our study.
[Figure: selection logic for the issue window, built as a tree of arbiter cells. Each arbiter cell takes four request (req) inputs, raises an "anyreq" signal to its parent, and, when enabled, uses a priority encoder to assert exactly one grant output; the root cell's enable starts grant propagation back down the tree.]

Figure 7: Selection logic.
4.3.1 Structure

The basic structure of selection logic is shown in Figure 7. Modifications to this scheme for handling multiple functional units of th

This document is available on Docket Alarm but you must sign up to view it.


Or .

Accessing this document will incur an additional charge of $.

After purchase, you can access this document again without charge.

Accept $ Charge
throbber

Still Working On It

This document is taking longer than usual to download. This can happen if we need to contact the court directly to obtain the document and their servers are running slowly.

Give it another minute or two to complete, and then try the refresh button.

throbber

A few More Minutes ... Still Working

It can take up to 5 minutes for us to download a document if the court servers are running slowly.

Thank you for your continued patience.

This document could not be displayed.

We could not find this document within its docket. Please go back to the docket page and check the link. If that does not work, go back to the docket and refresh it to pull the newest information.

Your account does not support viewing this document.

You need a Paid Account to view this document. Click here to change your account type.

Your account does not support viewing this document.

Set your membership status to view this document.

With a Docket Alarm membership, you'll get a whole lot more, including:

  • Up-to-date information for this case.
  • Email alerts whenever there is an update.
  • Full text search for other cases.
  • Get email alerts whenever a new case matches your search.

Become a Member

One Moment Please

The filing “” is large (MB) and is being downloaded.

Please refresh this page in a few minutes to see if the filing has been downloaded. The filing will also be emailed to you when the download completes.

Your document is on its way!

If you do not receive the document in five minutes, contact support at support@docketalarm.com.

Sealed Document

We are unable to display this document, it may be under a court ordered seal.

If you have proper credentials to access the file, you may proceed directly to the court's system using your government issued username and password.


Access Government Site

We are redirecting you
to a mobile optimized page.





Document Unreadable or Corrupt

Refresh this Document
Go to the Docket

We are unable to display this document.

Refresh this Document
Go to the Docket