[Figure 5.24 shows five possible disk I/O cache placements along the path
from the processor and memory, through the channels and string
controllers, to the strings of disks.]

FIGURE 5.24 Disk I/O cache placement options.
`
`PATENT OWNER DIRECTSTREAM, LLC
`EX. 2142, p. 301
`
`
`
`5.2 Disk Systems
`
`285
`
load. However, Smith [SMIT85] showed that skew is dynamic and
changes from disk to disk over time. One placement, therefore, may be
far from ideal for a varying workload on the system. A cache would then
be required for all disks. For disk manufacturers, incorporating the cache
into the disk permits the manufacturer to sell the cache as well as the
disk. However, this is not necessarily the best for overall system cost
and/or performance. As discussed below, for a case where there is a fixed
amount of memory available for a cache, the memory should be placed
at the processor.
Smith's paper on disk caches gives miss ratio data derived from
trace data on an IBM S/360/91, 370/165, 370/168s, and an Amdahl
470V/6 [SMIT85]. The trace data is then used to evaluate disk cache
miss rates for various cache sizes, organizations, and locations in the
system. Of the five possible locations for caches shown in Figure 5.24,
three locations are evaluated. The results of the simulation are shown
in Table 5.17.
`
`Global. One cache at the main memory (location 1).
`Controller. Multiple caches, one with each I/O Channel (location 2).
`Device. Multiple caches, one with each disk (location 5).
`
The results of these simulations indicate that for the evaluated disk
cache sizes, the best miss ratio is obtained with global disk caches.
However, as would be expected, there can be significant variations in
the miss ratios for different workloads. Ousterhout [OUST85] reports
that for measurements on a VAX with a 2-Mbyte disk cache, only 17.7%
of all requests are actually served by the disk. Goldstein [GOLD87] uses
a miss ratio of 0.1 to justify the use of disk caches in the future, a figure
`
Total Cache        Crocker Bank                   SLAC
Capacity           Business Data Processing       Scientific Calculations
(Mbytes)         Global  Controller  Device     Global  Controller  Device

     1           0.316     0.475     0.610      0.330     0.601     0.630
     2           0.259     0.365     0.450      0.226     0.414     0.496
     4           0.225     0.275     0.330      0.146     0.326     0.370
     8           0.197     0.233     0.266      0.099     0.227     0.271
    16           0.172     0.203     0.224      0.070     0.136     0.174
    32           0.150     0.177     0.199      0.050     0.085     0.109
    64           0.139     0.155     0.175      0.033     0.061     0.071

Note. The total cache capacity is evenly distributed over the controllers
or devices.

TABLE 5.17 Disk cache miss ratios.
`
`
`
`
`286
`
`Interleaved Memory and Disk Systems
`
that is consistent with this data. Therefore, it seems that a miss ratio
in the range of 0.1 constitutes a reasonable estimate for design purposes.
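As a quick check of how such a design estimate is used, the sketch below computes the mean service time of a cached disk subsystem; the cache and disk access times are illustrative assumptions, not figures from the text.

```python
# Mean request time for a disk with a cache in front of it.
# Every request pays the cache lookup; misses also pay the disk access.

def effective_access_time(miss_ratio, cache_time, disk_time):
    """Mean time per request, in seconds."""
    return cache_time + miss_ratio * disk_time

# Assumed figures: 0.1 ms cache access, 20 ms disk access, and the
# miss ratio of 0.1 suggested above for design purposes.
t = effective_access_time(0.1, 0.1e-3, 20e-3)
print(f"{t * 1e3:.1f} ms per request")  # 2.1 ms per request
```

With a 0.1 miss ratio the subsystem appears nearly an order of magnitude faster than the raw disk, which is the argument such a design figure supports.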
There are a number of system issues concerning disk caches. In
general, these are the same issues addressed in Chapter 2 on multilevel
caches. A disk cache should be write back because of the disk latency, and
multilevel inclusion should be enforced. In addition, there are operating
system issues such as operating system control of the disk cache and
error recovery. Additional issues are problems in coherency and
reliability with the disk cache operating in write-back mode. Most of
these design issues are discussed at length in [SMIT85]. It is interesting
to note that the software-controlled disk cache of MS-DOS places such
considerations under user control.
Smith presents an interesting discussion on alternatives to disk
caches. Today, with large main memories and large buffers, the disk
cache may not be the most effective method of reducing disk accesses.
Alternatives discussed by Smith include large cache block sizes that
benefit from spatial locality, larger main memory to reduce the need to
page temporary files, solid state drums and disks, using the virtual
memory system to map the disk address space into the program address
space, and others. If a designer has a fixed amount of additional memory,
how should this memory be allocated: as a disk cache, as more main
memory, or as program-managed buffers? The conclusion seems to be
that system simplification issues are more important than performance
issues in answering this design question.
`While not a disk cache, an extended memory as a paging device
`(solid state drum) has been used with a number of supercomputers.
`These devices have been very effective in reducing the time required for
`page-out and page-in when a task is swapped out in a multiprogramming
`environment.
`
`
`
`
`6
`
Pipelined Processors
`
`6.0 Overview
`The design process of early computers began with the instruction set
`and progressed to implementation. While instruction set design and
`implementation were not completely disjoint, they were reasonably so.
Once an instruction set was designed, with some forethought of its
implementation, the specified functions were hardwired into a controller.
`The controller could be a decoded counter or a tapped delay line that
`produced the timing pulses to enable the transfer gates.
Wilkes [WILK53] wrote the seminal paper on microprogramming in
which he suggested that there is a better implementation technique for
sequential machines than hardwiring. He suggested that the instruction
set of a processor could be interpreted by another processor. Thus, the
program of the second processor is the controller for the first processor.
The name given to this technique is microprogramming.
The technology of the 1950s did not permit the use of
microprogramming because Read Only Memory (ROM), in the form of
diodes, was just too expensive. A single germanium diode sold for as
much as $10 in 1953! However, with time, the cost of ROM for
microprogram stores decreased to the point that the IBM S/360 family
was designed, with a very complex instruction set, for microprogram
implementation. This family of processors used a number of ROM
technologies to store the microprogram, except for the very high
performance members of the family that are hardwire controlled.
Microprogramming technology provided a convenient way to exploit
concurrency or function overlap because a long micro-instruction
permitted a number of transfer paths to be enabled simultaneously.
However, the programming task required to accomplish this was, and
still is, daunting because identifying concurrency is a manual task and
is subject to procedural error. Microprogramming gave the instruction
set designer almost unlimited freedom, which supported the design of
the so-called CISC architectures. These early processors followed the
serial execution
`model.
`A serial execution model processor is one in which a program counter
`sequences through instructions one by one, finishing one instruction
`before starting the next (paraphrased from [HWU87]).
Most of the parallelism controlled by the microprogram was within
the execution of a single instruction. However, there are exceptions to
the serial execution model where portions of two instructions are
performed together. For example, instruction i + 1 could be fetched while
instruction i is completing. Recall that the von Neumann architecture
fetched two instructions at once. Nevertheless, if for some reason the
flow of control is changed, no state has been changed and execution
continues to follow the serial execution model.
Pipelines do not follow the serial execution model. At any one time
only one portion of a given instruction is being processed in the pipeline.
However, other portions of a number of different instructions are being
simultaneously processed. This major and fundamental characteristic of
pipelines (the concurrent execution of portions of a number of
instructions) leads to most of the pipeline design problems. Pipelining
formalizes the identification of parallelism and replaces the
asynchronous interface between functions with synchronizing clock
pulses. A task is broken up into its constituent parts, and a special
purpose hardware function is developed for each part. If the workload is
a long vector, the vector AUs can be streamed down the pipeline,
producing one result per clock.
`
A pipeline is "a structure that consists of a sequence of stages
through which a computation flows with the property that new
operations can be initiated at the start of the pipeline while other
operations are in progress through the pipeline" [STON93].
`
The IBM 7094 (1963) is an example of an early overlapped
architecture and is the successor to the nonoverlapped IBM 704 (1956).
The 7094 had three functional units that could be operated concurrently:
the instruction fetch unit, the decode and data fetch unit, and the
execution unit [CHEN80]. Additional concurrency is provided by
fetching two instructions (as with the von Neumann architecture) and
two operands at once. Complex interlocks controlled the execution of an
instruction because the three functional units operated asynchronously.
Another example of overlap is found later in the Intel 8086. The
designers considered a number of overlap schemes selected to enhance
the performance of this processor. The issue was to provide reasonable
performance with a reasonable die size [MCKE79]. The selection of
overlap options and their resulting performance is described in Section
6.4.
During World War II, code-breaking computers were developed at
Bletchley Park by Alan Turing. This work was followed in the early
1960s, at the National Security Agency, with the development of
special-purpose computers for breaking codes and ciphers using
pipelined processors. Researchers at the National Security Agency, with
assistance from their contractors, IBM and UNIVAC, played a significant
role in the development of pipelining techniques in general and of circuit
techniques for pipelining in particular.
Concurrent with the development of microprogramming and simple
overlapped processors, researchers were investigating methods for
speeding up frequently used functions and algorithms by using attached
processors [ESTR60]. Many of these special purpose processors were
directed toward various forms of signal processing and seismic data
processing. FFT processors and the more general convolver processors
were produced [TEX65]. The special purpose processors have evolved
into the coprocessors (attached processors) used with many
microprocessors today. In the main, attached processors only addressed
the problem of high-speed execution of arithmetic operations. The main
processor handled other functions such as instruction fetching and
processing. Vector processors (discussed in Chapter 11) are direct
descendants of the pipelined convolver boxes of the 1960s.
In 1955 the UNIVAC Corporation and IBM began designing a new
generation of general scientific computers for the Lawrence Livermore
Laboratory. These machines were conceived to push the state-of-the-art
in architecture and achieve a greater performance increase than could
be obtained from faster circuit and memory technology alone.
The UNIVAC machine, named LARC, was delivered in 1959. This
processor used a four-stage pipeline that is clocked from a common source
[ECKE59] and is the precursor of the technique of pipelining as practiced
today. The IBM Stretch, delivered in 1961, had a design goal of a 100x
increase in performance with a memory speed increase of 6x and a logic
speed increase of 10x [BLOC59, BUCH62, ROSE69]. These numbers
indicate that a concurrency level of 10 to 15 would be required to meet
the design goals. The Stretch architecture explored several ad hoc
concurrency techniques and used a number of functional stages operating
asynchronously. By means of interlocks between these stages, work is
passed down the pipeline as completed.
The LARC and Stretch are important precursor supercomputers,
but they were not commercial successes and were followed by the first
generation of pipelined supercomputers. These supercomputers are
pipelined for instruction processing as well as execution. The dates of
first installation of these machines are:
`
CDC 6600        1964 [MATI80]
IBM S/360/91    1967 [CASE78]
CDC 7600        1969 [MATI80]
TI ASC          1972 [CRAG89]
CDC STAR 100    1974 [RIGA84]
Cray 1          1976 [MATI80]
`
Two early pseudo-pipelined processors deserve notice: the VAX
11/780 and the Intel 8008. The VAX-11/780 employed a three-stage
pipeline [EMER84] that is reminiscent of the Stretch pipeline. That is,
each of the three stages is an autonomous processor that executes its own
microcode instruction stream to perform the function of the stage.
Interlocking between the stages controls the flow of an instruction down
the pipeline.
Microprocessor design has followed the path of the supercomputers.
The first microprocessors, the Intel and Texas Instruments 8008
[NOYC81], are hardwired processors. As the 8008 evolved, its serial
execution model was replaced by pipelining: the i486 is pipelined with
five stages, for example [FU89, CRAW90, INTE90]. Processors such as
the IBM S/360 were initially microprogrammed sequential machines, but
contemporary models are implemented as pipelines. Researchers at IBM
have published many of the "tricks" of pipelining a CISC architecture
in the form of patents and in the IBM Technical Disclosure Bulletin.
Pipelining is now such an accepted implementation approach that all
new architectures after 1985, such as the MIPS, SPARC, PowerPC, and
others, are based on pipelining as a first consideration, followed by the
design of an instruction set that is amenable to pipeline execution, a
procedure that is the reverse of that used for the early computers.
The Intel Pentium represents an architecture that was not intended for
pipelining but is now being pipelined, a significant challenge.
`
`6.1 Pipeline Performance Models
The first performance models for pipelines were developed by Cotton
[COTT65]; he investigated the tradeoffs between gate delay, latch delay,
pipeline length, and, significantly, the issue of clock skew and latch
setup time. Clock skew plagues pipeline designers with problems of clock
distribution even today.
Davidson [DAVI71] formulated a graphical representation of the
temporal behavior of pipelines named reservation tables. Similar
graphical representations of the space-time relationship of pipelines,
which
`
`
          Time →
           t1   t2   t3   t4   t5   t6   t7   t8
Stage 1    I1   I2   I3   I4
Stage 2         I1   I2   I3   I4
Stage 3              I1   I2   I3   I4
Stage 4                   I1   I2   I3   I4

FIGURE 6.1 Reservation table by stage.
`
evolved from Gantt charts, are described by [CHEN71]. Tables of this
type, in their original and modified forms, serve today to help explain
pipeline operation.
A reservation table is shown in Figure 6.1. Time steps, in clocks,
are shown on the horizontal axis, and the pipeline stages are shown on
the vertical axis. The vertical lines between time steps signify the clock
pulses. This reservation table illustrates a pipeline of four stages and a
four-instruction sequence: I1 to I4. The first result, I1, can be clocked
out of the pipeline at the end of t4 and the last result at the end of t7.
The instruction sequence, for this example, is broken at instruction I4
and cannot be resumed until time 8, after the pipeline is emptied, which
is referred to as flushing. This form of reservation table is similar to the
presentation of a logic analyzer with a probe on each of the pipeline
stages.
A common variation of the reservation table is to show time on the
horizontal axis and the sequence of instructions on the vertical axis. The
same pipeline and workload illustrated in Figure 6.1 are shown in
Figure 6.2.
`
Note that for both of these reservation tables, at t3 for example,
instructions I1, I2, and I3 are active in stages S3, S2, and S1,
respectively. An objection to this form of reservation table is that it
expands in two dimensions as the number of instructions increases; the
first form
`
          Time →
           t1   t2   t3   t4   t5   t6   t7
Inst. 1    S1   S2   S3   S4
Inst. 2         S1   S2   S3   S4
Inst. 3              S1   S2   S3   S4
Inst. 4                   S1   S2   S3   S4

FIGURE 6.2 Reservation table by instruction.
`
`
`expands in only one dimension. It is a matter of choice as to which form
`of reservation table best conveys the desired information. However,
`both forms are frequently used in reference manuals of microprocessors,
`while the first form is used exclusively in this book.
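The first form of reservation table can be generated mechanically. The sketch below assumes an ideal pipeline: one-clock stages, one instruction issued per clock, and no stalls.

```python
def reservation_table(n_stages, k_instructions):
    """Rows of a stage-by-time reservation table for an ideal pipeline.

    Stage s holds instruction i at clock t when t - s = i (0-based),
    so each instruction marches down one stage per clock."""
    total_clocks = n_stages + k_instructions - 1
    rows = []
    for s in range(n_stages):
        row = []
        for t in range(total_clocks):
            i = t - s  # instruction occupying stage s+1 at clock t+1
            row.append(f"I{i + 1}" if 0 <= i < k_instructions else "..")
        rows.append(row)
    return rows

# The four-stage, four-instruction example of Figure 6.1:
for s, row in enumerate(reservation_table(4, 4), start=1):
    print(f"Stage {s}: " + " ".join(row))
# Stage 1: I1 I2 I3 I4 .. .. ..
# Stage 2: .. I1 I2 I3 I4 .. ..
# Stage 3: .. .. I1 I2 I3 I4 ..
# Stage 4: .. .. .. I1 I2 I3 I4
```

I1 leaves stage 4 at the end of clock 4 and I4 at the end of clock 7, matching the n + k - 1 = 7 clocks of the example.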
`A processor pipeline can be viewed as a single entity. However,
`many early pipelines are decomposed into two major functions:
`
`1. Instruction Processor Unit (IPU),
`2. Execution Unit(s) (EU).
`
This subdivision is useful in explaining the operation of a pipeline.
For example, low-density circuit technology often requires a functional
partition into physical cabinets. In some cases, the instruction processor
was in one cabinet and the execution unit resided in another. The
interconnections between the two cabinets had to be a pipeline stage that
provided no logic function. This technique is beginning to be used again
for communication between chips.
`The performance models developed below consider the pipeline as a
`unit; the division into IPU and EU is not made. The unit of time in
`these models is a clock period, and the unit of work is an instruction.
`Note that only one instruction can be issued in one clock period. Processor
`implementations that can issue more than one instruction in a clock
`period are discussed in Chapter 10.
The simplest possible scheduling strategy is assumed for this model.
That is, a break in the flow of instructions causes the pipeline to stall,
introducing delays, until the last instruction in the sequence exits the
pipeline; this strategy is illustrated in Figure 6.1. Other scheduling
strategies are discussed in Chapters 7 and 8.
For this model, known as the pipeline model, the number of clocks
required to process a sequence of k instructions is

number of clocks = clocks for the first result to exit
                 + length of a run of instructions - 1
                 = n + k - 1

where n is the number of pipeline stages and k is the length of an
instruction sequence ending in a taken branch. The clocks per
instruction (CPI) is

CPI = (n + k - 1)/k = 1 + (n - 1)/k.

As k → 1, CPI → n, and as k → ∞, CPI → 1. In other words, 1 ≤ CPI ≤ n.
For the limit of CPI = 1, all of the overhead (delays) needed to fill
the pipeline has been prorated over the large number of executed
instructions. For the limit of CPI = n, there is only one instruction and
it must be clocked down all of the pipeline stages.
For a serial execution model implementation, n clocks would be
required to execute a single instruction. That is, using a time-shared
common block of logic, the instructions make n passes through the logic
to complete the execution of an instruction. Note that the number of
passes is the same as the number of pipeline stages. If the clock period
for both the serial execution model processor and the pipeline
implementations is the same, the speedup of a pipelined implementation
over a sequential model implementation is

S = serial execution model time (in clocks) / pipeline time (in clocks)
  = nk/(n + k - 1).

As k → ∞, S → n.
`These limits tell us that for pipelines to be effective, the length (k)
`of the stream of AUs processed must be as long as possible. However,
`there can be an n-fold increase in the hardware of the pipeline over the
`simple serial execution model. The major goal of a processor pipeline
`design is to reduce the value of CPI by increasing the apparent value
of k. Methods for achieving large values of k are the subject of Chapters
`6 to 8.
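The limits above are easy to verify numerically. A minimal sketch of the pipeline model, using an assumed 5-stage pipeline:

```python
def clocks(n, k):
    """Clocks to process a run of k instructions in an n-stage pipeline."""
    return n + k - 1

def cpi(n, k):
    """Clocks per instruction: (n + k - 1)/k = 1 + (n - 1)/k."""
    return clocks(n, k) / k

def speedup(n, k):
    """Speedup over a serial execution model: nk/(n + k - 1)."""
    return n * k / clocks(n, k)

# n = 5: a single instruction costs n clocks; a long run approaches
# CPI = 1 and speedup = n.
print(cpi(5, 1))                   # 5.0
print(round(cpi(5, 1000), 3))      # 1.004
print(round(speedup(5, 1000), 3))  # 4.98
```

The fill overhead dominates short runs and vanishes for long ones, which is why the chapters that follow work so hard to keep k large.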
When the stream of instructions is broken, a pipeline delay, stall, or
break is said to have occurred that produces bubbles in the pipeline. For
example, Figure 6.1 has a delay of four clocks. These delays will reduce
the performance by increasing CPI to a value greater than 1. The model
for CPI has two terms: the steady-state term, which indicates that the
pipeline will produce one result every clock, and the delay term, which
indicates the increase in clocks due to delays:

CPI = steady state + delays = 1 + delays.
`Pipeline delays occur for three major reasons.
`1. Control. Change of flow control caused by branches (Chapter 7),
`interrupts (Chapter 9), traps (Chapter 9), and exceptions (Chapter 9).
2. Data dependencies (Chapter 8), particularly true dependencies,
antidependencies, and output dependencies.
3. Structural delays, pertaining to logic resources (Chapter 8) and
memory delays (Chapters 2, 3, 4, and 5).
`
`
Note that the delays added to the steady-state CPI = 1 are the
probabilities that a delay will occur (such as a taken branch) times the
delay that results from the event (such as the taken branch delay). The
magnitude of the delay is of primary concern to the processor designer
and is discussed in the noted chapters. The probability that an event
will occur is partially a function of the instruction set architecture and
the workload, subjects not directly addressed in this book.
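The probability-times-penalty accounting can be sketched directly; the probabilities and penalties below are illustrative assumptions, not measured workload data.

```python
def cpi_with_delays(events):
    """CPI = 1 + sum of (probability of event x delay in clocks).
    `events` maps an event name to a (probability, delay) pair."""
    return 1.0 + sum(p * d for p, d in events.values())

# Hypothetical workload: 20% of instructions are taken branches costing
# 3 clocks each, and 10% stall 1 clock on a data dependency.
events = {
    "taken branch":    (0.20, 3),
    "data dependency": (0.10, 1),
}
print(round(cpi_with_delays(events), 3))  # 1.7
```

Each term in the sum corresponds to one of the three delay sources listed above; reducing either the probability or the penalty of an event lowers CPI.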
`
`6.2 Pipeline Design Considerations
Selecting the number of pipeline stages is a significant design problem
because of the following considerations. For a given logic path, the
greater the subdivision into stages, the shorter the logic path per stage,
the faster the clock, and, since performance has been stated in CPI, the
faster the execution of a program. Also, recall that the speedup of a
pipelined processor compared to a serial execution model processor
approaches the number of pipeline stages as k → ∞. However, as the
length of the pipeline increases, the delays due to flushing the pipeline
increase because very large k is not realistic. Thus there is an optimum
number of pipeline stages.
The logic path is the number of levels of logic required if the
processor executed an instruction in one clock, typically in excess of 100
gate levels or delays. Figure 6.3 shows a one-stage pipeline with L logic
levels between two registers clocked from the same source. If the pipeline
is subdivided into two stages, each stage has L/2 gate levels, and so on.
`In addition to the logic path, each stage of the pipeline requires a
`fixed time for clock skew and latch setup. If the subdivision of the logic
`path into stages is overdone (that is, it has too many stages), the clock
`skew and latch setup time will dominate the clock period determination.
`The design question is: What is the optimum number of pipeline stages?
`Larson and Davidson [LARS73] formulated the first known model that
`relates the processing rate to clock skew, the value of k, and the number
`
[Figure 6.3 shows registers R clocked from a common source with the
logic path between them; subdividing the logic splits it into pipeline
stages, each bounded by registers.]

FIGURE 6.3 Pipeline subdivision of stages.
`
`
of pipeline stages. They showed that as the number of stages increases,
performance increases until a maximum is reached, at which point
performance starts decreasing. Consider a simple pipeline with delays
due only to program control, as shown in Figure 6.1. More complex
control flow situations are discussed in Chapter 7.
`The number of clocks to process a simple sequence is
`
`number of clocks = n + k - 1.
`
The clock period is

clock period = L/n + t

where L is the logic path length in gate delays, t is the clock skew plus
latch setup in gate delays, and n is the number of pipeline stages. Then

processing time = number of clocks x clock period
                = (n + k - 1)(L/n + t)  (in gate delays).

The first derivative of the processing time with respect to n is set
to zero and solved for the optimum number of pipeline stages, nopt:

nopt = √(L(k - 1)/t);

for large values of k,

nopt = √(Lk/t).
`
As an example, consider a pipeline with L = 128, t = 2 and 4, and an
instruction sequence of k = 4. Figure 6.4 shows a plot of the processing
time, in gate delays, for this workload as n is varied between 1 and 128.
For t = 2, there is a minimum processing time for values of n between
8 and 16. For t = 4 the minimum occurs around n = 8. Solving the
equation for nopt shows that nopt = 13.86 and 9.8 for the two cases of t.
As the number of stages in a pipeline must be an integer, n could be
either 13 or 14 for t = 2 and n = 10 for t = 4. This example illustrates
`
`
[Figure 6.4 plots processing time, in gate delays, against log2 of the
number of pipeline stages for k = 4 and L = 128.]

FIGURE 6.4 Pipeline performance.
`
`the performance increase that can be achieved by reducing t as much
`as possible.
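The example can be reproduced with a few lines; the sketch below evaluates the model for the text's values L = 128 and k = 4.

```python
import math

def processing_time(n, k, L, t):
    """Gate delays for k instructions: (n + k - 1) clocks of (L/n + t)."""
    return (n + k - 1) * (L / n + t)

def n_opt(k, L, t):
    """Stage count minimizing processing time: sqrt(L(k - 1)/t)."""
    return math.sqrt(L * (k - 1) / t)

print(round(n_opt(4, 128, 2), 2))  # 13.86
print(round(n_opt(4, 128, 4), 2))  # 9.8
# Integer choices near the optimum for t = 2:
print(round(processing_time(13, 4, 128, 2), 1))  # 189.5
print(round(processing_time(14, 4, 128, 2), 1))  # 189.4
```

Both integer choices give nearly the same processing time, consistent with the flat minimum visible in Figure 6.4.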
The optimum length of the pipeline is the length that provides the
minimum execution time for a sequence of length k. Note, from Figure
6.4, that if clock skew and setup time are not controlled, the optimum
length of the pipeline can change significantly. Furthermore, because a
longer pipeline gives a higher speedup over a serial execution model
processor, the minimization of t is a paramount design issue.
Clock skew plus latch setup, normalized to gate delays, is on the
order of 2 to 3. It can, however, go to 10 in some extreme designs.
VLSI technology has not eliminated the problem of clock skew; while the
clock distribution paths are on the chip, the gates are faster and the
problem remains. As an example of the lengths taken to balance clock
delays and control skew, consider the clock distribution scheme of the
DEC Alpha microprocessor shown in Figure 6.5. The clock is generated
by a common generator and distributed over equal-length paths to each
of the clocked registers. Note that the total delay from the clock pulse
generator is not important, only the skew at each of the destinations. A
similar scheme is described by Cotton [COTT65] and used with the TI
ASC [CRAG89]. Recent research on clock distribution to nonuniform
destinations is reported by Kahng et al. [KAHN90].
Early pipeline computers had pipeline lengths of 12 to 15 stages,
primarily because of the use of core memory and relatively slow,
`
`
[Figure 6.5 shows a clock input distributed over equal-length paths
fanning out to the loads.]

FIGURE 6.5 DEC Alpha clock distribution.
`
low-density logic such as that used with the IBM S/360/91. Modern
microprocessors such as the MIPS and SPARC, however, have 4 to 6
pipeline stages. The reduction in pipeline length is an attempt to reduce,
for example, the branch delays discussed in Chapter 7. Superpipelining
(to be discussed in Chapter 9) explores the possibility that pipelines have
become too short and the clock period too long, a continuing research
topic. A model for optimum pipeline length can be developed that
attaches a cost to the latches required when the logic path is partitioned
into stages. The cost can be in terms of chip area and power.
The latch or flip-flop setup time is included in parameter t used
above. A discussion of the design of latches for pipelines is outside the
scope of this book. However, a few comments are in order. There is the
potential for race conditions in a pipeline, with two solutions. First, JK
flip-flops, which are in effect dual-rank latches triggered at different
times, can be used to eliminate race conditions. The second solution for
this problem is the use of balanced logic paths and fast latches. The Earl
latch, for ECL, is an example of such a latch and is described by Kogge
[KOGG81].
VLSI CMOS implementations of pipeline registers have the same
problems of setup time, clock skew, and clock distribution as with bipolar
circuits. A thorough discussion of these problems along with various
flip-flop designs for pipelines can be found in [WEST93].
`
`
6.3 Other Pipeline Performance Models
Kunkel and Smith modeled the optimal pipeline length by considering a
number of factors in one model [KUNK86]. Their paper looks not only at
circuit design considerations, such as clocking, but also at dependencies
that limit the performance of pipelines. However, I believe that they
look only at the execution pipelines, not at the total pipeline, which
includes the IPU, as branch delays are not considered. Their results
show that the number of useful gate delays per pipeline segment is in
the range 6 to 8. These results are consistent with the results discussed
above only if there is buffering between the IPU and EU that eliminates
branch delays.
Another approach to pipeline performance modeling is found in the
work of Dubey and Flynn [DUBE90]. Their model considers the
utilization of the pipeline, not the length of an instruction sequence.
Pipeline utilization varies from 0 to 1 and is roughly equivalent to
1/CPI. Pipeline utilization is defined as

U = Umax - rs² - vs

where Umax denotes the upper limit on utilization, independent of
pipeline length, cache misses, page faults, and the like; s the number of
stages in the pipeline; r the dependency delays; and v an experimentally
derived coefficient.
If this model is used to evaluate only the pipeline (for example,
without cache misses), Umax is equal to 1. Note that the dependency
delays (I believe this includes branch delays as well) reduce utilization
by the square of the pipeline length. Dubey and Flynn compare their
model to the simulation results of [KUNK86] and find agreement. The
conclusion of their modeling is that the optimum pipeline length is in
the range of 6 to 8 stages and is limited by clock skew and dependencies,
which is the same result as given by the other models.
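The shape of this result can be illustrated with the model. The coefficients r and v below are hypothetical values chosen only for illustration (they are not from [DUBE90]), and relative performance is approximated here as stage count times utilization, a rough proxy for a faster clock at lower utilization.

```python
def utilization(s, u_max=1.0, r=0.005, v=0.02):
    """Dubey-Flynn style utilization: U = Umax - r*s**2 - v*s.
    r and v are hypothetical illustrative coefficients."""
    return u_max - r * s ** 2 - v * s

def relative_performance(s):
    """Rough proxy: a deeper pipeline runs a faster clock (factor s)
    but at lower utilization."""
    return s * utilization(s)

best = max(range(1, 21), key=relative_performance)
print(best)  # 7
```

With these assumed coefficients the proxy peaks at 7 stages, inside the 6-to-8 range the models converge on; different values of r and v move the peak.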
[LANG79] gives another approach to pipeline modeling. The authors
of this paper have used queuing theory to model the performance of a
pipeline that has a multiplicity of execution units. As described in
Chapter 8, various delays can build up in the processor, and these delays
can be viewed as queues. This model gives results that are within ±10%
of measured performance for two benchmarks on three computers: the
Cray-1 and two configurations of the TI ASC. A Markov chain form of
modeling is reported by [HSIE85]. This modeling method allows a close
examination of the benchmark parameters and their relationship to the
pipeline parameters.
`Emma and Davidson [EMMA87] investigated the optimum length
`of pipelines under the constraints of branch and dependency delays.
`
`
They examined a number of execution traces and noted the interference
between branches and dependencies that precludes accurate analytical
models treating these two delays as orthogonal.
`For this model, the pipeline is divided into two sections: setup and
`execute (equivalent to the IPU and EU discussed previously). The total
`length of the pipeline determines branch delays while only the execute
`section determines dependency delays. Thus there is interest in the ratio
`of the lengths of these two function pipelines.
Let

I = number of stages in the setup pipeline,
E = number of stages in the execution pipeline,
R = E/I,
n = E + I.

It follows that n = I(R + 1).
Based upon their workload models, Emma and Davidson found that
Iopt = √(0.33L/t). Thus, the optimum length of the pipeline is

nopt = (1 + R)√(0.33L/t).
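A small sketch of these relations; the values of L, t, and R below are illustrative assumptions, not figures from the paper.

```python
import math

def i_opt(L, t):
    """Optimum setup-pipeline length: sqrt(0.33*L/t)."""
    return math.sqrt(0.33 * L / t)

def n_opt(L, t, R):
    """Total optimum length n = I(R + 1) = (1 + R)*sqrt(0.33*L/t)."""
    return (1 + R) * i_opt(L, t)

# Assumed: L = 128 gate delays, t = 2, and an execute/setup ratio R = 2.
print(round(i_opt(128, 2), 2))     # 4.6
print(round(n_opt(128, 2, 2), 2))  # 13.79
```

Note that the total optimum length grows with R while the setup section Iopt depends only on the technology parameters L and t.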
`
This model suggests that for a given technology that sets the value
of t, R should be as large as possible. That is, the execution units should
be long compared to the setup stage. For integers, this ratio is difficult
to