FIGURE 5.24 Disk I/O cache placement options (processor, memory, channel, string controller, and string of disks, with numbered cache placement locations along the I/O path).
load. However, Smith [SMIT85] showed that skew is dynamic and changes from disk to disk over time. One placement, therefore, may be far from ideal for a varying workload on the system. A cache would then be required for all disks. For disk manufacturers, incorporating the cache into the disk permits the manufacturer to sell the cache as well as the disk. However, this is not necessarily the best for overall system cost and/or performance. As discussed below, for a case where there is a fixed amount of memory available for a cache, the memory should be placed at the processor.

Smith's paper on disk caches gives miss ratio data derived from trace data on an IBM S/360/91, 370/165, 370/168s, and an Amdahl 470V/6 [SMIT85]. The trace data is then used to evaluate disk cache miss rates for various cache sizes, organizations, and locations in the system. Of the five possible locations for caches shown in Figure 5.24, three locations are evaluated. The results of the simulation are shown in Table 5.17.

Global. One cache at the main memory (location 1).
Controller. Multiple caches, one with each I/O channel (location 2).
Device. Multiple caches, one with each disk (location 5).
The results of these simulations indicate that for the evaluated disk cache sizes, the best miss ratio is obtained with global disk caches. However, as would be expected, there can be significant variations in the miss ratios for different workloads. Ousterhout [OUST85] reports that for measurements on a VAX with a 2-Mbyte disk cache, only 17.7% of all requests are actually served by the disk. Goldstein [GOLD87] uses a miss ratio of 0.1 to justify the use of disk caches in the future, a figure that is consistent with this data. Therefore, it seems that a miss ratio in the range of 0.1 constitutes a reasonable estimate for design purposes.
Total Cache      Crocker Bank                         SLAC
Capacity         Business Data Processing             Scientific Calculations
(Mbytes)      Global  Controller  Device        Global  Controller  Device

   1          0.316   0.475       0.610         0.330   0.601       0.630
   2          0.259   0.365       0.450         0.226   0.414       0.496
   4          0.225   0.275       0.330         0.146   0.326       0.370
   8          0.197   0.233       0.266         0.099   0.227       0.271
  16          0.172   0.203       0.224         0.070   0.136       0.174
  32          0.150   0.177       0.199         0.050   0.085       0.109
  64          0.139   0.155       0.175         0.033   0.061       0.071

Note. The total cache capacity is evenly distributed over the controllers or devices.

TABLE 5.17 Disk cache miss ratios.
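Taken together, these results suggest that a miss ratio near 0.1 is a reasonable planning number. As a minimal sketch of how such an estimate is used, the Python fragment below computes the average service time seen by requests for a given disk cache miss ratio; the hit and miss service times are illustrative assumptions, not measurements from the cited studies.

```python
def avg_access_time(miss_ratio: float, hit_ms: float, miss_ms: float) -> float:
    """Average service time when hits are served by the disk cache
    and misses go to the disk: (1 - m) * hit + m * miss."""
    return (1 - miss_ratio) * hit_ms + miss_ratio * miss_ms

# Assumed service times: 1 ms for a cache hit, 30 ms for a disk access.
for m in (0.316, 0.139, 0.1, 0.033):
    print(f"miss ratio {m:.3f}: {avg_access_time(m, 1.0, 30.0):5.2f} ms")
```

Under these assumed times, even the pessimistic miss ratios of Table 5.17 cut the average access time well below the raw disk access time, which is the economic argument for the cache.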
There are a number of system issues concerning disk caches. In general, these are the same issues addressed in Chapter 2 on multilevel caches. A disk cache should be write back because of the disk latency, and multilevel inclusion should be enforced. In addition, there are operating system issues such as operating system control of the disk cache and error recovery. Additional issues are problems in coherency and reliability with the disk cache operating in write-back mode. Most of these design issues are discussed at length in [SMIT85]. It is interesting to note that the software-controlled disk cache of MS-DOS places such considerations under user control.

Smith presents an interesting discussion on alternatives to disk caches. Today, with large main memories and large buffers, the disk cache may not be the most effective method of reducing disk accesses. Alternatives discussed by Smith include large cache block sizes that benefit from spatial locality, larger main memory to reduce the need to page temporary files, solid-state drums and disks, using the virtual memory system to map the disk address space into the program address space, and others. If a designer has a fixed amount of additional memory, how should this memory be allocated: as a disk cache, as more main memory, or as program-managed buffers? The conclusion seems to be that system simplification issues are more important than performance issues in answering this design question.

While not a disk cache, an extended memory used as a paging device (solid-state drum) has been employed with a number of supercomputers. These devices have been very effective in reducing the time required for page-out and page-in when a task is swapped out in a multiprogramming environment.
6

Pipelined Processors

6.0 Overview
The design process of early computers began with the instruction set and progressed to implementation. While instruction set design and implementation were not completely disjoint, they were reasonably so. Once an instruction set was designed, with some forethought of its implementation, the specified functions were hardwired into a controller. The controller could be a decoded counter or a tapped delay line that produced the timing pulses to enable the transfer gates.

Wilkes [WILK53] wrote the seminal paper on microprogramming in which he suggested that there is a better implementation technique for sequential machines than hardwiring. He suggested that the instruction set of a processor could be interpreted by another processor. Thus, the program of the second processor is the controller for the first processor. The name given to this technique is microprogramming.

The technology of the 1950s did not permit the use of microprogramming because Read Only Memory (ROM), in the form of diodes, was just too expensive. A single germanium diode sold for as much as $10 in 1953! However, with time, the cost of ROM for microprogram stores decreased to the point that the IBM S/360 family was designed, with a very complex instruction set, for microprogram implementation. This family of processors used a number of ROM technologies to store the microprogram, except for the very high performance members of the family, which are hardwire controlled.

Microprogramming technology provided a convenient way to exploit concurrency or function overlap because a long micro-instruction permitted a number of transfer paths to be enabled simultaneously. However, the programming task required to accomplish this was, and still is, daunting because identifying concurrency is a manual task and is subject to procedural error. Microprogramming gave the instruction set designer almost unlimited freedom, which supported the design of the so-called
CISC architectures. These early processors followed the serial execution model.

A serial execution model processor is one in which a program counter sequences through instructions one by one, finishing one instruction before starting the next (paraphrased from [HWU87]).

Most of the parallelism controlled by the microprogram was within the execution of a single instruction. However, there are exceptions to the serial execution model where portions of two instructions are performed together. For example, instruction i + 1 could be fetched while instruction i is completing. Recall that the von Neumann architecture fetched two instructions at once. Nevertheless, if for some reason the flow of control is changed, no state has been changed and execution continues to follow the serial execution model.

Pipelines do not follow the serial execution model. At any one time only one portion of a given instruction is being processed in the pipeline. However, other portions of a number of different instructions are being simultaneously processed. This major and fundamental characteristic of pipelines, that is, the concurrent execution of portions of a number of instructions, leads to most of the pipeline design problems. Pipelining formalizes the identification of parallelism and replaces the asynchronous interfaces between functions with synchronizing clock pulses. A task is broken up into its constituent parts, and a special-purpose hardware function is developed for each part. If the workload is a long vector, the vector elements can be streamed down the pipeline, producing one result per clock.

A pipeline is "a structure that consists of a sequence of stages through which a computation flows with the property that new operations can be initiated at the start of the pipeline while other operations are in progress through the pipeline" [STON93].
The IBM 7094 (1963) is an example of an early overlapped architecture and is the successor to the nonoverlapped IBM 704 (1956). The 7094 had three functional units that could be operated concurrently: the instruction fetch unit, the decode and data fetch unit, and the execution unit [CHEN80]. Additional concurrency is provided by fetching two instructions (as with the von Neumann architecture) and two operands at once. Complex interlocks controlled the execution of an instruction because the three functional units operated asynchronously. Another example of overlap is found later in the Intel 8086. The designers considered a number of overlap schemes selected to enhance the performance of this processor. The issue was to provide reasonable performance with a
reasonable die size [MCKE79]. The selection of overlap options and their resulting performance is described in Section 6.4.

During World War II, code-breaking computers were developed at Bletchley Park by Alan Turing. This work was followed in the early 1960s, at the National Security Agency, with the development of special-purpose computers for breaking codes and ciphers using pipelined processors. Researchers at the National Security Agency, with assistance from their contractors, IBM and UNIVAC, played a significant role in the development of pipelining techniques in general and of circuit techniques for pipelining in particular.

Concurrent with the development of microprogramming and simple overlapped processors, researchers were investigating methods for speeding up frequently used functions and algorithms by using attached processors [ESTR60]. Many of these special-purpose processors were directed toward various forms of signal processing and seismic data processing. FFT processors and the more general convolver processors were produced [TEX65]. The special-purpose processors have evolved into the coprocessors (attached processors) used with many microprocessors today. In the main, attached processors only addressed the problem of high-speed execution of arithmetic operations. The main processor handled other functions such as instruction fetching and processing. Vector processors (discussed in Chapter 11) are direct descendants of the pipelined convolver boxes of the 1960s.
In 1955 the UNIVAC Corporation and IBM began designing a new generation of general scientific computers for the Lawrence Livermore Laboratory. These machines were conceived to push the state of the art in architecture and achieve a greater performance increase than could be obtained from faster circuit and memory technology alone.

The UNIVAC machine, named LARC, was delivered in 1959. This processor used a four-stage pipeline that is clocked from a common source [ECKE59] and is the precursor of the technique of pipelining as practiced today. The design goal of the IBM Stretch, delivered in 1961, was a 100x increase in performance with a memory speed increase of 6x and a logic speed increase of 10x [BLOC59, BUCH62, ROSE69]. These numbers indicate that a concurrency level of 10 to 15 would be required to meet the design goals. The Stretch architecture explored several ad hoc concurrency techniques and used a number of functional stages operating asynchronously. By means of interlocks between these stages, work is passed down the pipeline as completed.

The LARC and Stretch are important precursor supercomputers, but they were not commercial successes and were followed by the first generation of pipelined supercomputers. These supercomputers are pipelined for instruction processing as well as execution. The dates of first
installation of these machines are:

    CDC 6600        1964 [MATI80]
    IBM S/360/91    1967 [CASE78]
    CDC 7600        1969 [MATI80]
    TI ASC          1972 [CRAG89]
    CDC STAR 100    1974 [RIGA84]
    Cray 1          1976 [MATI80]
Two early pseudo-pipelined processors deserve notice: the VAX-11/780 and the Intel 8008. The VAX-11/780 employed a three-stage pipeline [EMER84] that is reminiscent of the Stretch pipeline. That is, each of the three stages is an autonomous processor that executes its own microcode instruction stream to perform the function of the stage. Interlocking between the stages controls the flow of an instruction down the pipeline.

Microprocessor design has followed the path of the supercomputers. The first microprocessors, the Intel and Texas Instruments 8008 [NOYC81], are hardwired processors. As the 8008 evolved, its serial execution model was replaced by pipelining: the i486 is pipelined with five stages, for example [FU89, CRAW90, INTE90]. Processors such as the IBM S/360 were initially microprogrammed sequential machines, but contemporary models are implemented as pipelines. Researchers at IBM have published many of the "tricks" of pipelining a CISC architecture in the form of patents and in the IBM Technical Disclosure Bulletin. Pipelining is now such an accepted implementation approach that all new architectures after 1985, such as the MIPS, SPARC, PowerPC, and others, are based on pipelining as a first consideration, followed by the design of an instruction set that is amenable to pipeline execution, a procedure that is the reverse of that used for the early computers. The Intel Pentium represents an architecture that was not intended for pipelining but is now being pipelined, a significant challenge.
6.1 Pipeline Performance Models

The first performance models for pipelines were developed by Cotton [COTT65]; he investigated the tradeoffs between gate delay, latch delay, pipeline length, and, significantly, the issue of clock skew and latch setup time. Clock skew plagues pipeline designers with problems of clock distribution even today.

Davidson [DAVI71] formulated a graphical representation of the temporal behavior of pipelines named reservation tables. Similar graphical representations of the space-time relationship of pipelines, which evolved from Gantt charts, are described by [CHEN71]. Tables of this type, in their original and modified forms, serve today to help explain pipeline operation.
Time →       1    2    3    4    5    6    7    8
Stage 1     I1   I2   I3   I4
Stage 2          I1   I2   I3   I4
Stage 3               I1   I2   I3   I4
Stage 4                    I1   I2   I3   I4

FIGURE 6.1 Reservation table by stage.
A reservation table is shown in Figure 6.1. Time steps, in clocks, are shown on the horizontal axis, and the pipeline stages are shown on the vertical axis. The vertical lines between time steps signify the clock pulses. This reservation table illustrates a pipeline of four stages and a four-instruction sequence: I1 to I4. The first result, I1, can be clocked out of the pipeline at the end of t4 and the last result at the end of t7. The instruction sequence, for this example, is broken with instruction 4 and cannot be resumed until time 8, after the pipeline is emptied, which is referred to as flushing. This form of reservation table is similar to the presentation of a logic analyzer with a probe on each of the pipeline stages.

A common variation of the reservation table is to show time on the horizontal axis and the sequence of instructions on the vertical axis. The same pipeline and workload illustrated in Figure 6.1 are shown in Figure 6.2.
Note that for both of these reservation tables, at t3 for example, instructions I1, I2, and I3 are active in stages S3, S2, and S1, respectively. An objection to this form of reservation table is that it expands in two dimensions as the number of instructions increases; the first form expands in only one dimension.
Time →       1    2    3    4    5    6    7    8
Inst. 1     S1   S2   S3   S4
Inst. 2          S1   S2   S3   S4
Inst. 3               S1   S2   S3   S4
Inst. 4                    S1   S2   S3   S4

FIGURE 6.2 Reservation table by instruction.
It is a matter of choice as to which form of reservation table best conveys the desired information. However, both forms are frequently used in reference manuals of microprocessors, while the first form is used exclusively in this book.
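Because the by-stage form grows in only one dimension, it is also easy to generate mechanically. The sketch below (a minimal illustration; the function name and formatting are my own) prints a Figure 6.1 style reservation table for an n-stage pipeline and a run of k instructions.

```python
def reservation_table(n_stages: int, k_instructions: int) -> None:
    """Print a by-stage reservation table: instruction i occupies
    stage s during clock i + s - 1 (both 1-indexed)."""
    t_max = n_stages + k_instructions - 1       # last busy clock: n + k - 1
    print("Time   " + "".join(f"{t:>4}" for t in range(1, t_max + 1)))
    for s in range(1, n_stages + 1):
        row = ""
        for t in range(1, t_max + 1):
            i = t - s + 1                       # instruction in stage s at clock t
            cell = f"I{i}" if 1 <= i <= k_instructions else ""
            row += f"{cell:>4}"
        print(f"Stage {s}" + row)

reservation_table(4, 4)   # reproduces Figure 6.1 (clocks 1 through 7)
```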
A processor pipeline can be viewed as a single entity. However, many early pipelines are decomposed into two major functions:

1. Instruction Processor Unit (IPU),
2. Execution Unit(s) (EU).

This subdivision is useful in explaining the operation of a pipeline. For example, low-density circuit technology often requires a functional partition into physical cabinets. In some cases, the instruction processor was in one cabinet and the execution unit resided in another. The interconnections between the two cabinets had to be a pipeline stage that provided no logic function. This technique is beginning to be used again for communication between chips.

The performance models developed below consider the pipeline as a unit; the division into IPU and EU is not made. The unit of time in these models is a clock period, and the unit of work is an instruction. Note that only one instruction can be issued in one clock period. Processor implementations that can issue more than one instruction in a clock period are discussed in Chapter 10.
The simplest possible scheduling strategy is assumed for this model. That is, a break in the flow of instructions causes the pipeline to stall, introducing delays, until the last instruction in the sequence exits the pipeline; this strategy is illustrated in Figure 6.1. Other scheduling strategies are discussed in Chapters 7 and 8.

For this model, known as the pipeline model, the number of clocks required to process a sequence of k instructions is

    number of clocks = clocks for the first result to exit
                         + (length of a run of instructions - 1)
                     = n + k - 1

where n is the number of pipeline stages and k is the length of an instruction sequence ending in a taken branch. The clocks per instruction (CPI) is then

    CPI = (n + k - 1)/k = 1 + (n - 1)/k.

As k → 1, CPI → n, and as k → ∞, CPI → 1. In other words, 1 ≤ CPI ≤ n.

For the limit of CPI = 1, all of the overhead (delays) needed to fill
the pipeline have been prorated over the large number of executed instructions. For the limit of CPI = n, there is only one instruction and it must be clocked down all of the pipeline stages.

For a serial execution model implementation, n clocks would be required to execute a single instruction: that is, using a time-shared common block of logic, the instructions make n passes through the logic to complete the execution of an instruction. Note that the number of passes is the same as the number of pipeline stages. If the clock period for both the serial execution model and the pipelined implementations is the same, the speedup of a pipelined implementation over a serial execution model implementation is

    S = serial execution model time (in clocks) / pipeline time (in clocks)
      = nk/(n + k - 1).

As k → ∞, S → n.

These limits tell us that for pipelines to be effective, the length (k) of the run of instructions processed must be as long as possible. However, there can be an n-fold increase in the hardware of the pipeline over the simple serial execution model. The major goal of a processor pipeline design is to reduce the value of CPI by increasing the apparent value of k. Methods for achieving large values of k are the subject of Chapters 6 to 8.
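These relations are easy to tabulate. The sketch below (function names are my own) evaluates CPI and speedup for a 4-stage pipeline over a range of run lengths, showing CPI falling toward 1 and speedup approaching n.

```python
def cpi(n: int, k: int) -> float:
    """Clocks per instruction: CPI = (n + k - 1)/k = 1 + (n - 1)/k."""
    return (n + k - 1) / k

def speedup(n: int, k: int) -> float:
    """Speedup over a serial execution model: S = nk/(n + k - 1)."""
    return n * k / (n + k - 1)

for k in (1, 2, 4, 16, 256, 4096):
    print(f"k = {k:4d}: CPI = {cpi(4, k):6.3f}, S = {speedup(4, k):6.3f}")
```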
When the stream of instructions is broken, a pipeline delay, stall, or break is said to have occurred that produces bubbles in the pipeline. For example, Figure 6.1 has a delay of four clocks. These delays will reduce the performance by increasing CPI to a value greater than 1. The model for CPI has two terms: the steady-state term, which indicates that the pipeline will produce one result every clock, and the delay term, which indicates the increase in clocks due to delays:

    CPI = steady state + delays = 1 + delays.

Pipeline delays occur for three major reasons.

1. Control. Changes of flow control caused by branches (Chapter 7), interrupts (Chapter 9), traps (Chapter 9), and exceptions (Chapter 9).
2. Data dependencies (Chapter 8), particularly true dependencies, antidependencies, and output dependencies.
3. Structural delays, pertaining to logic resources (Chapter 8) and memory delays (Chapters 2, 3, 4, and 5).
Note that the delays added to the steady-state CPI = 1 are the probabilities that a delay will occur (such as a taken branch) times the delay that results from the event (such as the taken branch delay). The magnitude of the delay is of primary concern to the processor designer and is discussed in the noted chapters. The probability that an event will occur is partially a function of the instruction set architecture and the workload, subjects not directly addressed in this book.
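The delay term, in other words, is an expected value: the sum over delay events of the probability of the event times the clocks lost to it. A sketch of this composition follows; the event probabilities and penalties are illustrative assumptions, not figures from the text.

```python
# Each entry: (probability an instruction incurs the event,
#              delay in clocks when it does). Values are assumed.
delay_events = {
    "taken branch":    (0.15, 3),
    "data dependency": (0.20, 1),
    "cache miss":      (0.05, 10),
}

# CPI = 1 (steady state) + expected delay per instruction.
cpi = 1.0 + sum(p * d for p, d in delay_events.values())
print(f"CPI = {cpi:.2f}")   # 1 + 0.45 + 0.20 + 0.50 = 2.15
```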
6.2 Pipeline Design Considerations

Selecting the number of pipeline stages is a significant design problem because of the following considerations. For a given logic path, the greater the subdivision into stages, the shorter the logic path per stage, the faster the clock, and, since performance has been stated in CPI, the faster the execution of a program. Also, recall that the speedup of a pipelined processor compared to a serial execution model processor approaches the number of pipeline stages as k → ∞. However, as the length of the pipeline increases, the delays due to flushing the pipeline increase because a very large k is not realistic. Thus there is an optimum number of pipeline stages.

The logic path is the number of levels of logic required if the processor executed an instruction in one clock, typically in excess of 100 gate levels or delays. Figure 6.3 shows a one-stage pipeline with L logic levels between two registers clocked from the same source. If the pipeline is subdivided into two stages, each stage has L/2 of the gate levels, and so on.

In addition to the logic path, each stage of the pipeline requires a fixed time for clock skew and latch setup. If the subdivision of the logic path into stages is overdone (that is, there are too many stages), the clock skew and latch setup time will dominate the clock period determination. The design question is: What is the optimum number of pipeline stages? Larson and Davidson [LARS73] formulated the first known model that relates the processing rate to clock skew, the value of k, and the number
FIGURE 6.3 Pipeline subdivision of stages (a block of L logic levels between clocked registers, and the same logic split across additional register stages).
of pipeline stages. They showed that as the number of stages increases, performance increases until a maximum is reached, at which point performance starts decreasing. Consider a simple pipeline with delays due only to program control, as shown in Figure 6.1. More complex control flow situations are discussed in Chapter 7.

The number of clocks to process a simple sequence is

    number of clocks = n + k - 1.

The clock period is

    clock period = L/n + t

where L is the logic path length in gate delays, t is the clock skew plus latch setup time in gate delays, and n is the number of pipeline stages. Thus

    processing time = number of clocks × clock period
                    = (n + k - 1)(L/n + t)    (in gate delays).

The first derivative of the processing time with respect to n is set to zero and solved for the optimum number of pipeline stages, n_opt:

    n_opt = √(L(k - 1)/t);

for large values of k,

    n_opt ≈ √(Lk/t).

As an example, consider a pipeline with L = 128, t = 2 and 4, and an instruction sequence of k = 4. Figure 6.4 shows a plot of the processing time, in gate delays, for this workload as n is varied between 1 and 128. For t = 2, there is a minimum processing time for values of n between 8 and 16. For t = 4 the minimum occurs around n = 8. Solving the equation for n_opt gives n_opt = 13.86 and 9.8 for the two cases of t. As the number of stages in a pipeline must be an integer, n could be either 13 or 14 for t = 2 and n = 10 for t = 4. This example illustrates the performance increase that can be achieved by reducing t as much as possible.
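The minimization is easy to verify numerically. The sketch below (a check under the stated parameters; function names are my own) evaluates the processing-time model over integer stage counts and compares the result with the closed-form n_opt.

```python
import math

def processing_time(n: int, L: float, t: float, k: int) -> float:
    """Gate delays to process a k-instruction run on an n-stage
    pipeline: (n + k - 1) clocks of (L/n + t) gate delays each."""
    return (n + k - 1) * (L / n + t)

def n_opt(L: float, t: float, k: int) -> float:
    """Closed-form optimum stage count: sqrt(L(k - 1)/t)."""
    return math.sqrt(L * (k - 1) / t)

L, k = 128, 4
for t in (2, 4):
    best = min(range(1, 129), key=lambda n: processing_time(n, L, t, k))
    print(f"t = {t}: n_opt = {n_opt(L, t, k):5.2f}, best integer n = {best}, "
          f"time = {processing_time(best, L, t, k):.1f} gate delays")
```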
FIGURE 6.4 Pipeline performance: processing time in gate delays versus log2 of the number of pipeline stages, for k = 4, L = 128, and t = 2 and 4.
The optimum length of the pipeline is the length that provides the minimum execution time for a sequence of length k. Note, from Figure 6.4, that if clock skew and setup time are not controlled, the optimum length of the pipeline can change significantly. Furthermore, because a longer pipeline gives a higher speedup over a serial execution model processor, the minimization of t is a paramount design issue.

Clock skew plus latch setup, normalized to gate delays, is on the order of 2 to 3. It can, however, go to 10 in some extreme designs. VLSI technology has not eliminated the problem of clock skew; while the clock distribution paths are on the chip, the gates are faster and the problem remains. As an example of the lengths taken to balance clock delay and control skew, consider the clock distribution scheme of the DEC Alpha microprocessor shown in Figure 6.5. The clock is generated by a common generator and distributed over equal-length paths to each of the clocked registers. Note that the total delay from the clock pulse generator is not important, only the skew at each of the destinations. A similar scheme is described by Cotton [COTT65] and used with the TI ASC [CRAG89]. Recent research on clock distribution to nonuniform destinations is reported by Kahng et al. [KAHN90].
FIGURE 6.5 DEC Alpha clock distribution (a common clock input drives equal-length paths to the clocked loads).
Early pipelined computers had pipeline lengths of 12 to 15 stages, primarily because of the use of core memory and relatively slow, low-density logic such as that used with the IBM S/360/91. Modern microprocessors such as the MIPS and SPARC, however, have 4 to 6 pipeline stages. The reduction in pipeline length is an attempt to reduce, for example, the branch delays discussed in Chapter 7. Superpipelining (to be discussed in Chapter 9) explores the possibility that pipelines have become too short and the clock period too long, a continuing research topic. A model for optimum pipeline length can be developed that attaches a cost to the latches required when the logic path is partitioned into stages. The cost can be in terms of chip area and power.

The latch or flip-flop setup time is included in the parameter t used above. A discussion of the design of latches for pipelines is outside the scope of this book. However, a few comments are in order. There is the potential for race conditions in a pipeline, with two solutions. First, JK flip-flops are, in effect, dual-rank latches; because the two ranks are triggered at different times, they can be used to eliminate race conditions. The second solution for this problem is the use of balanced logic paths and fast latches. The Earl latch, for ECL, is an example of such a latch and is described by Kogge [KOGG81].

VLSI CMOS implementations of pipeline registers have the same problems of setup time, clock skew, and clock distribution as bipolar circuits. A thorough discussion of these problems, along with various flip-flop designs for pipelines, can be found in [WEST93].
6.3 Other Pipeline Performance Models

Kunkel and Smith modeled the optimal pipeline length by considering a number of factors in one model [KUNK86]. Their paper looks not only at circuit design considerations, such as clocking, but also at dependencies that limit the performance of pipelines. However, I believe that they look only at the execution pipelines, not at the total pipeline, which includes the IPU, as branch delays are not considered. Their results show that the number of useful gate delays per pipeline segment is in the range of 6 to 8. These results are consistent with the results discussed above only if there is buffering between the IPU and EU that eliminates branch delays.
Another approach to pipeline performance modeling is found in the work of Dubey and Flynn [DUBE90]. Their model considers the utilization of the pipeline, not the length of an instruction sequence. Pipeline utilization varies from 0 to 1 and is roughly equivalent to 1/CPI. Pipeline utilization is defined as

    U = U_max - rs² - vs

where U_max denotes the upper limit on utilization, independent of pipeline length, cache misses, page faults, and the like; s the number of stages in the pipeline; r the dependency delays; and v an experimentally derived coefficient.

If this model is used to evaluate only the pipeline (for example, without cache misses), U_max is equal to 1. Note that the dependency delays (I believe this includes branch delays as well) reduce utilization by the square of the pipeline length. Dubey and Flynn compare their model to the simulation results of [KUNK86] and find agreement. The conclusion of their modeling is that the optimum pipeline length is in the range of 6 to 8 stages and is limited by clock skew and dependencies, which is the same result as given by the other models.
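A small sketch of this utilization model is given below; the coefficient values are illustrative assumptions (the paper derives its coefficients from simulation data), chosen only to show the quadratic penalty of pipeline length.

```python
def utilization(s: int, r: float, v: float, u_max: float = 1.0) -> float:
    """Dubey/Flynn-style utilization: U = U_max - r*s**2 - v*s,
    where s is the number of pipeline stages."""
    return u_max - r * s**2 - v * s

r, v = 0.002, 0.01   # assumed coefficients, for illustration only
for s in (2, 4, 6, 8, 12, 16):
    print(f"s = {s:2d}: U = {utilization(s, r, v):.3f}")
```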
[LANG79] gives another approach to pipeline modeling. The authors of this paper have used queuing theory to model the performance of a pipeline that has a multiplicity of execution units. As described in Chapter 8, various delays can build up in the processor, and these delays can be viewed as queues. This model gives results that are within ±10% of measured performance for two benchmarks on three computers: the Cray-1 and two configurations of the TI ASC. A Markov chain form of modeling is reported by [HSIE85]. This modeling method allows a close examination of the benchmark parameters and their relationship to the pipeline parameters.

Emma and Davidson [EMMA87] investigated the optimum length of pipelines under the constraints of branch and dependency delays.
They examined a number of execution traces and noted the interference between branches and dependencies, which precludes accurate analytical models that treat these two delays as orthogonal.

For this model, the pipeline is divided into two sections: setup and execute (equivalent to the IPU and EU discussed previously). The total length of the pipeline determines branch delays, while only the execute section determines dependency delays. Thus there is interest in the ratio of the lengths of these two functional pipelines.

Let

    I = number of stages in the setup pipeline,
    E = number of stages in the execution pipeline,
    R = E/I,
    n = E + I.

It follows that n = I(R + 1).

Based upon their workload models, Emma and Davidson found that I_opt = √(0.33L/t). Thus, the optimum length of the pipeline is

    n_opt = (1 + R)√(0.33L/t).
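As a sketch (the function name and example values are my own; L = 128 and t = 2 are borrowed from the example in Section 6.2), the two-section model can be evaluated directly:

```python
import math

def two_section_n_opt(R: float, L: float, t: float) -> float:
    """Optimum total pipeline length n = I + E for an execute/setup
    ratio R = E/I: n_opt = (1 + R) * sqrt(0.33 * L / t)."""
    return (1 + R) * math.sqrt(0.33 * L / t)

for R in (1.0, 2.0, 4.0):
    print(f"R = {R}: n_opt = {two_section_n_opt(R, 128, 2):.1f}")
```

A larger R (a longer execute section relative to setup) raises the optimum total length, which is the point made in the text.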
This model suggests that for a given technology, which sets the value of t, R should be as large as possible. That is, the execution units should be long compared to the setup stage. For integers, this ratio is difficult to
