`
`
`,,
`,_
`.
`Attorney Docket N0.: SRCOZS PRO
`Client/Matter N0.: 80404.0033
`Express Mail N0.: EV335405698USo
`
`:3
`gig
`Kg“
`5% .c
`
`
`
`
`I;’—’-—-
`
PTO/SB/16 (10-01)

PROVISIONAL APPLICATION FOR PATENT COVER SHEET

This is a request for filing a PROVISIONAL APPLICATION FOR PATENT under 37 CFR 1.53(c).
INVENTOR(S)

Given Name (first & middle, if any)   Family Name or Surname   Residence (City and State or Country)
Jon M.                                Huppenthal               Colorado Springs, CO USA
Daniel                                Poznanovic               Colorado Springs, CO USA
Jeffrey                               Hammes                   Colorado Springs, CO USA
Lisa                                  Krause                   Minneapolis, MN USA
Jon                                   Steidel                  Minneapolis, MN USA
David E.                              Caliga                   Colorado Springs, CO USA
Thomas R.                             Seeman                   Colorado Springs, CO USA
Lee A.                                Burton                   Divide, CO USA
Jeffrey Paul                          Brooks                   St. Louis, MN USA
[ ] Additional inventors are named on the ___ separately numbered sheets attached hereto
`
`
TITLE OF THE INVENTION (500 characters max)
`BANDWIDTH EFFICIENCY AND UTILIZATION USING DIRECT EXECUTION LOGIC
`
CORRESPONDENCE ADDRESS

Direct all correspondence to:

[X] Customer Number: 25235
    (Type Customer Number here or place Customer Number Bar Code Label here)

[ ] Firm or Individual Name

City:                    Country:
`
ENCLOSED APPLICATION PARTS (Check all that apply)

[X] Specification        Number of Pages: 23
[ ] CD(s), Number:
[ ] Drawing(s)           Number of Sheets:
[ ] Other (specify):
[ ] Application Data Sheet. See 37 CFR 1.76

METHOD OF PAYMENT OF FILING FEES FOR THIS PROVISIONAL APPLICATION (check one)

[ ] Applicant claims small entity status. See 37 CFR 1.27
[X] A check or money order is enclosed to cover the filing fees
[X] The Commissioner is hereby authorized to charge any additional filing fees
    or credit any overpayment to Deposit Account No. 50-1123

FILING FEE AMOUNT ($): $160.00
`
The invention was made by an agency of the U.S. Government or under a contract with an agency of the U.S. Government.

[X] No.
[ ] Yes, the name of the agency and contract no. are: ____________
William J. Kubida
Telephone: 719-448-5909

Date: ____________
Reg. No.: 9 664
Docket No.: SRC028 PRO
`
`
`
`
`
Attorney Docket No. SRC028 PRO
Client/Matter No. 80404.0033
`
`Express Mail Label No. EV335405698US
`
`IN THE UNITED STATES PATENT AND TRADEMARK OFFICE
`
`In re Provisional Application of:
`
`Jon M. Huppenthal, Daniel Poznanovic, Jeffrey Hammes,
`Lisa Krause, Jon Steidel, David E. Caliga, Thomas R.
`Seeman, Lee A. Burton, and Jeffrey Paul Brooks
`
Serial No.: ___________
`
`Filed: Herewith
`
`For: BANDWIDTH EFFICIENCY AND UTILIZATION USING
`DIRECT EXECUTION LOGIC
`
`
`CERTIFICATE OF MAILING BY EXPRESS MAIL
`
`MAIL STOP PROVISIONAL PATENT APPLICATION
`Commissioner for Patents
P.O. Box 1450
Alexandria, VA 22313-1450
`
`Sir:
`
The undersigned hereby certifies that the attached:

1. Provisional Application for Patent Cover Sheet;
2. Specification - 23 pages;
3. Check in the Amount of $160.00;
4. Certificate of Mailing by Express Mail; and
5. Return Card

relating to the above application, were deposited as "Express Mail," Mailing Label No. EV335405698US, with the United States Postal Service, addressed to MAIL STOP PROVISIONAL PATENT APPLICATION, Commissioner for Patents, P.O. Box 1450, Alexandria, VA 22313-1450, on ____________.

____________________
Date
`
`
HOGAN & HARTSON LLP
One Tabor Center
1200 17th Street, Suite 1500
Denver, Colorado 80202
(719) 448-5909 Tel
(303) 899-7333 Fax
`
`
`
`
Bandwidth Efficiency and Utilization
`Using Direct Execution Logic
`
`SRC Computers, Inc.
`
`
Introduction
`
In a previous paper, Petaflop Computing Using Direct Execution Logic, we showed that Teraflop-class processors will be available to direct execution logic (DEL) programmers in 2008. The challenge with this kind of computational capacity, of course, is how to keep it fed with operands. One key facet of this is the ability to exploit all available locality in a program or algorithm, or bandwidth efficiency. A second facet of this is the ability to run calculations at such a rate as to use all available memory bandwidth, or bandwidth utilization.

In this paper, we will show that the implicit features in modern microprocessor-based systems fail to achieve full bandwidth efficiency or utilization, even for fully optimized and tuned codes. By contrast, we will show that a programmer can use a high level language to directly manipulate the memory hierarchy and calculation pipeline via DEL, achieving both maximum bandwidth efficiency and utilization across a wide variety of algorithms. The end result is a programmable, computational system that delivers the maximum possible performance for any given memory bus speed. This combination of efficiency and utilization yields an orders of magnitude performance benefit even when utilizing an equivalent memory bus. A diverse set of examples presented in this paper demonstrates between 6X and 216X performance gain through DEL.
`
`The ability of DEL to use 100% of available bandwidth with near 100% efficiency is exciting today, but
`becomes even more exciting in the future as new interconnect technologies such as stacked die promise to
substantially increase available memory bandwidth. We believe that performance via DEL will continue to
`scale perfectly into these bandwidth-rich environments.
`
Microprocessors: Reaching the Limits of Implicit Techniques
`
`Over the past 30 years, microprocessors have enjoyed annual performance improvements of about 50%.
`Designers have been taking advantage of higher frequencies, and have added a great deal of complexity
`primarily in two areas- memory hierarchies and searching for instruction level parallelism (ILP). For a
`short history of the microprocessor and these implicit design optimizations, see the Appendix.
`
Instruction Level Parallelism
Techniques that automatically exploit ILP at execution time have reached the point of diminishing returns. Of the 10 million transistors in an Intel Pentium III processor, for example, more than half are dedicated to instruction cache, branch prediction, out-of-order execution and superscalar logic. In addition, complex ILP techniques require chip-wide wires. In a few short years, it will take several clocks to travel across a microprocessor die, so these complex techniques do not scale with integrated circuit technology. It appears that viable techniques to extract implicit parallelism from a serial instruction stream have simply run out of steam. According to Bill Dally, Pat Hanrahan and Ron Fedkiw at Stanford University:
`
`
`
`
    The performance of microprocessors is no longer scaling at the historic
    rate of 50% per year. Microprocessors have reached a point of diminishing
    returns in terms of gates per clock and clocks per instruction. As we enter
    an era of billion transistor chips, there is not enough explicit parallelism
    in conventional programs to efficiently use these resources... It is
    expected that without new innovations in parallel processor designs,
    microprocessor performance will only increase with the increase in gate
    speed, at a rate of about 20% per year. Such a change would have a major
    effect on the computer business, and the entire economy.¹
`
`
`Attaining high performance on microprocessors is beginning to require more explicit techniques from the
compiler and programmer. For example, on the Intel Itanium processor, the compiler must carefully
`arrange instructions to attain 6-way instruction issue. Future microprocessor chips will have multiple
`processors on the same die and programmers and compilers will have to modify coding techniques to take
`advantage of parallelism at this level.
`
Attaining high performance using direct execution logic is much more explicit. A programmer and
compiler explicitly identify parallelism and the compiler builds the necessary parallel logic pipelines. This
technique has proven to be very effective. We have seen simple examples constructed in DEL where many
hundreds of operations are executed in parallel each clock cycle. The available parallelism via DEL will
increase linearly with integrated circuit density and so the technique will continue to scale.
`
`DEL also allows the programmer to explicitly manipulate the memory hierarchy, avoiding the tradeoffs and
`complexities that are common in cache designs today. This allows the programmer to make the absolute
`most efficient use of on-chip storage and fully exploit available algorithm locality.
`In the next two
`sections, we will compare the implicit memory hierarchies found in microprocessors with the explicit
`memory hierarchy available to programmers through DEL.
`
`Data and Instruction Cache
The design goal of any cache memory system is to minimize memory bus transactions by taking advantage
`of application locality. Microprocessors are fixed logic devices. As such, decisions have to be made at
`design time that optimize for the “average” case, if such a thing exists. Of course, the corollary of this is
`that the design will not be optimal for most applications.
`
`Typical microprocessors exploit data locality implicitly via a rather complex data cache structure fraught
`with tradeoffs. Some of the design parameters include cache size (which has an inverse relationship to
`cache latency), the number of cache levels, cache-line width, the associativity and replacement policy, the
`write policy, the number of physical registers and the instruction cache design. A cache hierarchy that is
`optimized for the SPEC benchmark, for example, may not be a good choice for database applications or
`sparse matrix calculations. No cache design works well with irregular access patterns and strided memory
`references. For a full discussion of cache design tradeoffs, see Appendix B.
`
`Because there are so many design choices when constructing caches, it is difficult to write portable, optimal
`code that will be optimized across cache designs. An example of an attempt in this regard is the
`Automatically Tuned Linear Algebra Software (ATLAS) project. This software configures its blocking
`strategy based on what it can learn about the underlying cache hierarchy of a processor.
`
`Just as the limits of effective ILP are being reached, the limits of implicit data locality are also being
`reached. We are seeing more explicit data motion constructs creeping into microprocessor instruction sets
in the form of pre-fetch instructions and cache "hints".
`In the next section, we will see that the tradeoffs
`made in today’s implicit devices can become programming choices in a DEL environment, essentially
`creating optimized hardware tuned for individual programs.
`
¹ B. Dally, P. Hanrahan, R. Fedkiw, A Streaming Supercomputer, September 18, 2001.
`
Copyright © 2003 SRC Computers, Inc. ALL RIGHTS RESERVED
`
`
`
`
`
Memory Hierarchies in a DEL Environment
`
`In this section, we’ll look at the explicit memory management constructs available in SRC’s DEL
`environment. These constructs allow a user to essentially instantiate locality when and where it is needed
`within an algorithm. Unlike conventional data caches, these mechanisms are parallel and extensible in
`nature, with no fixed bandwidth limitation or sequential access restrictions.
`
`The SRC-6E is a general purpose computing system with reconfigurable processing elements, called MAP
`processors. The SRC compilation system is designed to turn user codes and algorithms into direct
`execution logic within the MAP processor. The memory hierarchy on the SRC-6E processor is described
`below:
`
`1.
`
`System Memory (DDR DRAM): The MAP processor can access memory that is shared with the
`microprocessor.
`
`2. On-board memory (SRAM): The MAP processor is configured with 6 banks of SRAM memory,
`each capable of delivering one word per clock cycle to the MAP (4.8 GB/sec). Each of the 6
`banks is 4 Mbytes in size for a total of 24 Mbytes of storage. This memory can be randomly
`addressed with gather and scatter type operations.
`
`3. On—chip Block RAM (BRA M). The MAP processor has 288 sections of Block RAM memory.
`Each section is 36 bits wide and 512 elements deep and can deliver two words per cycle. When
`configured to deliver 64-bit operands, the 288 banks of Block RAM can deliver up to 288 64-bit
`operands per cycle (210.4 GB/see). Note that this memory can be randomly addressed.
`
`If the Configurable Logic Blocks (CLB) within the MAP are configured for
`4. Register Storage:
`storage, 10005 of registers can be configured. These can be used to pass operands from one
`computational unit to the next, or to construct other storage elements such as FIFO queues and
`look-up tables. Registers are created as needed on the SRC MAP processor and available
`bandwidth in this construct can reach into the 1005 of GB/sec.
`
`On the SRC MAP processor, memory is managed explicitly using constructs in either C or Fortran. The
`following table shows how this is expressed in these high level languages and how these constructs are
`built in hardware:
`
`Table 1.‘ [Memory hierarchy programming construe/s on MAP.
`
`
`
`
`
`
`
`Storage Construct
`Source Code Construct
`Builtfi'om
`Pre-fetch subroutine or function
`
`
`System Memory
`DDR DRAM chips.
`call.
`
`
`
`
`
`
`
`
`
`and internally On-Chip routing resources
`Scalar Variables
`Wires
`used by hardware macros.
`
`
`Copyright© 2003 SRC Computers, Inc.
`3
`ALL RIGHTS RESERVED
`
`On-Board Memory
`
`COMMON blocks in FORTRAN
`Static Structures in C
`
`On-Board SRAM chips.
`
`Addressable on-chip Memory
`
`Local Arrays within DEL routine
`
`On-Chip BRAM
`
`FIFO Queues. Etc.
`
`Called as subroutine or function
`within DEL routine.
`
`On-Chip CLBs or On—Chip BRAM
`
`Registers
`
`and internally On-Chip CLBs.
`Scalar Variables
`used by hardware macros.
`
`5
`
`
`
`
`
`Instruction Bandwidth
`
`Logic for DEL is downloaded into the MAP processor before execution can begin. There is no need to
`provide complex facilities for instruction bandwidth during execution because no “instructions” are issued
`during execution. All the bandwidth facilities on the MAP processor can be dedicated to program data
`requirements.
`
System Memory
In cache-based systems, the only way to move data into the higher levels of the memory hierarchy is to
issue a memory load. DEL allows the programmer to instantiate a data pre-fetch functional unit. This
functional unit can run autonomously from the rest of the computational logic; hence pre-fetching can be
explicitly overlapped with computation. Pre-fetch functional units do not have any implied cache line
structure. As such, pre-fetch units can also effectively perform gather/scatter type memory operations as
well as strided memory access.
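The overlap described here can be modeled in ordinary C as double buffering: one buffer is filled while the previous one is consumed. The sketch below is a conceptual model only; fetch_block() is a hypothetical stand-in for the autonomous pre-fetch unit, and in real DEL hardware the fetch and the computation proceed concurrently rather than in alternation:

    #include <string.h>

    #define BLOCK 1024

    /* Hypothetical stand-in for a pre-fetch functional unit: moves one
       block from system memory into fast on-chip storage. */
    static void fetch_block(double *dst, const double *src)
    {
        memcpy(dst, src, BLOCK * sizeof(double));
    }

    double sum_all(const double *sysmem, int nblocks)
    {
        double buf[2][BLOCK];           /* two on-chip buffers */
        double sum = 0.0;
        int b, i, cur = 0;

        if (nblocks <= 0)
            return 0.0;
        fetch_block(buf[0], sysmem);    /* prime the pipeline */
        for (b = 0; b < nblocks; b++) {
            if (b + 1 < nblocks)        /* fetch the next block ...     */
                fetch_block(buf[1 - cur], sysmem + (long)(b + 1) * BLOCK);
            for (i = 0; i < BLOCK; i++) /* ... while consuming this one */
                sum += buf[cur][i];
            cur = 1 - cur;              /* swap buffers */
        }
        return sum;
    }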
`
On-Board Memory
On-board memory is referenced in 64-bit words with a simple array reference. There is a 4 clock latency to
access on-board memory, but the resulting data access functional units are fully pipelined and gather/scatter
operations run at full speed. The 6 banks are explicitly managed via COMMON block names in Fortran or
static structures in C. This feature allows a programmer to make up to 6 memory references per clock
cycle to on-board memory.
`
On-Chip BRAM
The SRC MAP processor has over 600KB of on-chip, addressable memory, or BRAM, arranged in 288
banks, each 512 elements deep and 36 bits wide. Unlike conventional caches, the parallel nature of explicit
BRAM allows programmers to create on-chip storage that is both large and fast. This storage is allocated
via local arrays in C or Fortran. Simply declaring and accessing multiple local arrays will obtain
parallel access to this memory structure.
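As a sketch, splitting data across several local arrays places each in its own BRAM bank, so one loop iteration can read all of them in the same clock. The array names and sizes below are illustrative:

    #define N 512

    /* Four local arrays -> four BRAM banks -> four reads per clock. */
    void sum4(const double *in, double *out)
    {
        double a0[N], a1[N], a2[N], a3[N];
        int i;

        for (i = 0; i < N; i++) {       /* stage the data on chip */
            a0[i] = in[i];
            a1[i] = in[i + N];
            a2[i] = in[i + 2 * N];
            a3[i] = in[i + 3 * N];
        }
        for (i = 0; i < N; i++)         /* all four banks read in parallel */
            out[i] = a0[i] + a1[i] + a2[i] + a3[i];
    }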
`
Fixed Length FIFO Queues
FIFO queues are simple and effective storage structures designed to delay a stream of operands in time by
the number of stages in the queue. They are the ideal structure for "remembering" a previous column in a
matrix calculation, for example. They are accessed in C or Fortran by explicitly calling a FIFO function or
subroutine. The macro itself can be built out of logic blocks or BRAM, depending on the length.
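In C, a fixed-length FIFO can be modeled as a circular delay line; on MAP the equivalent structure would be instantiated by the FIFO subroutine call described above. The helper below is illustrative, not SRC's macro:

    /* An N-stage delay line: each call returns the value that was
       pushed STAGES calls earlier (zero for the first STAGES calls).
       With STAGES equal to a row length, this "remembers" the
       previous row of a 2-D stencil. */
    #define STAGES 600

    static double fifo_step(double in)
    {
        static double q[STAGES];
        static int head = 0;
        double out = q[head];
        q[head] = in;
        head = (head + 1) % STAGES;
        return out;
    }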
`
Registers
An arbitrary number of registers can be used to connect computational functional units and to hold
temporary variables. This allows the compiler to connect localized kernels directly, without a store
followed by a load sequence. Because the registers can be placed and routed near the functional units, and
because each register has a limited number of inputs and outputs, they are much easier to implement
compared with the register set on a microprocessor. There is no negative pressure on clock frequency from
the number of registers. On the contrary, adding registers where appropriate actually helps increase clock
frequency with DEL. Finally, the logic blocks that are used to create registers are flexible. If they are not
needed for storage, they can be used to construct computational units and vice versa.
`
`Wires
`
`When possible, the compiler will simply use wires to connect functional units. For the programmer, local
`variables that are assigned and subsequently used in a loop can become wires in a DEL circuit.
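For instance, in the sketch below the temporary t needs no storage at all; it can simply become the wire joining the multiplier's output to the adder's input (our illustration, consistent with the rule above):

    /* A local temporary assigned and used inside a loop becomes a wire. */
    void madd(const double *a, const double *b,
              const double *d, double *c, int n)
    {
        for (int i = 0; i < n; i++) {
            double t = a[i] * b[i];  /* t: a wire from multiplier to adder */
            c[i] = t + d[i];
        }
    }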
`
`
`
`
`
`
Giving the programmer direct access to the memory hierarchy through DEL has several advantages
compared with the implicit nature of memory traffic found on microprocessors:

1. Overhead is reduced. There is no register spill traffic, no cache coherency traffic, no cache write-
   allocate traffic, no cache conflict traffic, no instruction traffic, etc.

2. The programmer can create exactly the kind of storage structure that is necessary and sufficient.
   This may be on-chip addressable memory, FIFO queues, registers, wires, etc.

3. Local data structures can be created to facilitate parallel access. There is no fixed on-chip
   bandwidth limitation such as is inherent in fixed-logic microprocessors.

4. Explicitly coded algorithms can take advantage of 100% of the available algorithm locality.

5. Much of the on-chip storage resource is flexible because it is constructed of logic blocks. If it is
   not needed for storage, it can be used for computation.

6. The programmer has the ability to create and use pre-fetch functional units, which are much more
   latency tolerant than the "load and use" model found in implicit devices.

7. There is no reliance on cache lines to hide memory latency. DEL architectures can be designed to
   efficiently address the irregular access and strided reference patterns that run poorly on
   microprocessors.
`
`The best way to illustrate how locality and peak bandwidth are fully exploited with DEL is with a few
`small programming examples. In the next section, we compare and contrast the performance of several
`example programs on an implicit, microprocessor-based system and on SRC’s DEL architecture.
`
`Pér'féirmahCe Measurements
`
In order to highlight the overall bandwidth efficiency and utilization of DEL-based programs, it is useful to
consider a set of programs and make some measurements on an existing, implicit architecture.

A set of codes was selected and optimized for the Intel Pentium 4 and for SRC's DEL architecture. For
each example, the problem size was specified in such a manner as to eliminate cache size as a
performance-limiting factor. In addition, highly tuned performance libraries were used where possible on
the Intel system to show the implicit architecture in the best possible light.

In each experiment, a sample code was compiled with the Intel Fortran or C compiler and run using the
Intel VTune performance analysis software. This package allowed us to measure the memory bus traffic for
each benchmark. This value can be compared with the algorithmic minimum to get a measure of
bandwidth efficiency. In addition, the raw bandwidth speed was measured on the memory bus during the
calculation. Comparing this value with the peak memory bandwidth of our specific system (3.2
GB/second) gives us a measure of bandwidth utilization. Combining these two factors yields an overall
efficiency value for the microprocessor.
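The two measures multiply into an overall efficiency, and the expected DEL gain is its reciprocal. As a worked example using the matrix multiplication row of Table 2 (16% bandwidth efficiency, 20% bus utilization):

\[
\text{DEL gain} \approx \frac{1}{\text{efficiency} \times \text{utilization}}
= \frac{1}{0.16 \times 0.20} \approx 31\times
\]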
`
In each case, programming with the constructs available with DEL resulted in codes that used an
algorithmically minimum amount of memory bandwidth, or perfect bandwidth efficiency. In addition,
DEL allows for a bandwidth utilization level of 100% since computation capacity can be added until the
available memory bandwidth is completely exhausted. The overall performance improvement that is
attainable with DEL is hence inversely related to the overall efficiency value observed on the
microprocessor. The improvement in performance when compared with the implicit computation carried
out on an Intel Pentium 4 ranged from 6X for the median filter code to 216X for a triple DES encryption
code. Note that performance comparisons assume a similar memory bus structure for both the implicit and
explicit models; hence the difference in performance is purely associated with the difference in the
computational model. A complete description of these programming examples as coded with SRC's DEL
architecture can be found in Appendix C. The performance of these codes is presented in Table 2.
The columns of Table 2 give, for each benchmark: the problem size; the data transferred as measured on the
Intel Pentium 4 (Mbytes); the algorithmic requirement (Mbytes); the percent of data motion that was
algorithmically necessary; the percent of bus peak utilized; and the performance improvement with DEL
using the same memory bus (e.g., for matrix multiplication, 1/.16 x 1/.20 = 31).

Benchmark                                  Problem Size              Measured   Algorithmic   Necessary   Bus peak   DEL
                                                                     (Mbytes)   (Mbytes)      motion      utilized   improvement
64-bit Matrix Multiplication               M=1024, N=512, K=64       31         5.0           16%         20%        31X
  (Intel MKL library)
Triple DES                                 N = 10^6                  24         16            66%         0.7%       216X
Image Processing Edge Detection            600 x 472 bytes           7.8        .57           7%          28%        51X
  (2-D 9-point stencil)
Image Processing Median Filter             600 x 472 bytes           .93        .57           61%         28%        6X
  (2-D 9-point stencil)
2-D Stencil Grid Generation (TOMCATV)      512 x 512                 22.9       8.4           37%         20%        14X
3-D 27-point stencil                       300 x 300 x 300           1125       432           38%         24%        11X
Median filter + Edge detector              600 x 472 bytes           8.82       .57           6%          22%        76X
  (pipelined 2-D stencils)
Irregular Updates (64-bit elements)        Loop length: 10^6;        205        32            15%         22%        30X
                                           Table size: 10^6

Table 2: Microprocessor / DEL performance comparison.
`
Matrix Multiplication
Matrix multiplication was measured using the Intel MKL library. This algorithm is generally considered to
run very efficiently on microprocessor-based systems, yet in this case we see that an additional factor of
30X can be achieved with DEL. Note that this figure can be pushed higher simply by running larger
problems which exhibit higher data reuse.
`
`
`
`
`
`
`.
`Triple DES
`For the Triple DES example, we used a highly optimized, assembly coded routine for the Intel processor.
`The bandwidth efficiency was 1.5x higher compared with the DEL algorithm. This is most likely due to
`the unnecessary cache-line fill traffic that is generated during the store operation. The Intel processor falls
`short in memory utilization however, because the computational resources within the processor are fixed
`and cannot keep up with available memory bandwidth. As such, the Intel processor utilized a tiny fraction
`of peak memory bandwidth (less than 1%). Since DEL can run this algorithm at full memory bandwidth
`speed, the overall performance advantage of a DEL-based implementation is 2|6X.
`
2-D Stencils
We looked at 3 different 2-D stencils. Performance improvement of the DEL algorithms ranged from 6X
to 51X. Both image-processing examples were coded using the optimized Intel graphics libraries and had a
single array for input and output. The TOMCATV code had two input arrays and four output arrays.
`
3-D Stencils
The 3-D stencil code is based on an isotropic stencil algorithm. It contained a single source array and a
single destination array. The computation consists of 26 floating-point additions and 4 floating-point
multiplications. Performance improvement for the DEL implementation was 11X.
`
Pipelined Stencils
The pipelined stencil example used the median filter stencil followed by an edge detection stencil. On the
Intel system, the Intel libraries were used for each of these steps. The DEL algorithm requires no
additional bandwidth since the computation can be chained together without using an intermediate storage
array. The DEL version is 76X faster using the same memory bus.
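The bandwidth saving comes from fusing the two passes so the intermediate image never touches memory. The 1-D sketch below illustrates the idea with hypothetical filters; it is not the Intel library code or SRC's 2-D implementation:

    #define W 600

    /* Median of three values (simple sorting network). */
    static double median3(double a, double b, double c)
    {
        double t;
        if (a > b) { t = a; a = b; b = t; }   /* a <= b          */
        if (b > c) { t = b; b = c; c = t; }   /* c is the max    */
        return (a > b) ? a : b;               /* middle of three */
    }

    /* Fused median filter + gradient edge detector: each median result
       is carried in a register and consumed immediately, with no
       intermediate array (boundary elements are left untouched). */
    void fused(const double *in, double *out)
    {
        double prev_m = 0.0;
        for (int i = 1; i < W - 1; i++) {
            double m = median3(in[i - 1], in[i], in[i + 1]);
            out[i] = m - prev_m;
            prev_m = m;
        }
    }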
`
Irregular Updates
The irregular update benchmark was 30X more efficient with DEL. This benchmark is somewhat unique
among this group of codes in that it is not computationally intensive, but data intensive. The Intel
implicit pre-fetch hardware is not effective in this case, and so the full Intel bus bandwidth cannot be
utilized. The DEL version is able to effectively pre-fetch these irregular memory patterns and achieves a
30X performance advantage.
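The kernel in question is a gather/scatter update; stated generically in C (this is the shape of the computation, not the benchmark source):

    /* Irregular update kernel: data-intensive, almost no computation.
       Each iteration is a random 64-bit read-modify-write, which defeats
       implicit cache-line pre-fetching. */
    void irregular_update(double *table, const long *idx,
                          const double *val, long n)
    {
        long i;
        for (i = 0; i < n; i++)
            table[idx[i]] += val[i];
    }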
`
Conclusions
`
For the past 20 years, microprocessors have employed increasingly complex techniques to implicitly
exploit instruction level parallelism and locality to gain in performance. The performance improvement
trend in these devices has slowed, because the implicit techniques available have reached a point of
diminishing returns. Implicit constructs are simply running out of steam.

The locality-exploiting mechanisms within a microprocessor are fixed and implicit and designed with many
tradeoffs considered. They are designed for the average case, and frequently do not exhibit the desired
behavior. Bandwidth is frequently wasted on unnecessary data motion, resulting in low bandwidth
efficiency, even for fully optimized code such as was illustrated in our examples. In addition, the fixed
computational and locality-exploiting features within the implicit microprocessor limit the speed, and hence
the bandwidth utilization, observed with most programs.

By contrast, the scalable locality mechanisms available to the DEL programmer can be used to exploit all
available algorithm locality and hence achieve 100% bandwidth efficiency. In addition, the scalable
computational resources available to the DEL programmer can be leveraged to attain 100% bandwidth
utilization. The end result is a programmable, computational system that delivers the maximum possible
performance for any given memory bus speed. This combination of efficiency and utilization yields an
orders of magnitude performance benefit compared with implicit models when using an equivalent memory
bus. More importantly, however, DEL will continue to scale with increased memory bus performance into
the future. We believe that this combination of explicit programming with denser DEL devices and
memory bandwidth enhancements will yield Teraflop-class processors by 2008.
`
`
`
`
`
`
Appendix A: A Brief History of the Microprocessor
`
`In 1971, the Intel 4004 had about 2300 transistors, supported 45 instructions and ran at 740 kHz. Today, the
`latest Itanium 2 processor from Intel has about 200 million transistors running at over a gigahertz and the
`hottest topic at microprocessor design conferences is what to do with the billion transistors that are likely to
be available to chip designers in a few years' time. What follows is a summary of the key performance
`innovations that have taken place in microprocessors over the last 30 years.
`
Pipelining
The earliest microprocessors had no pipelining. Instructions were issued and completed before a new
instruction was issued. Instruction pipelining had been around for years in mainframe systems such as the
Control Data 6600. The feature allowed multiple instructions to be in process, thereby increasing
instruction throughput. This feature did not appear in microprocessors until the Intel 8086 processor in
19xx. The pipelining was fairly rudimentary: a second instruction could be fetched and decoded while the
first was executing. Pipelining proved to be a very effective performance technique, but it also increased
the processor's appetite for instruction and data bandwidth.
`
Caches
For the first few years of the microprocessor, instruction and data loads were satisfied from main memory,
and logic speeds were compatible with memory access times. Logic performance was
increasing at a faster rate than main memory latency and bandwidth. To respond to this, microprocessor
designers adopted cache memories. Caches are designed to keep copies of recently used instructions and
data in nearby SRAM instead of far away DRAM. In 1984 Motorola introduced the 68020, the first
microprocessor with an on-chip cache, a mere 256 bytes for instructions only. The 68030 introduced
separate 256-byte caches, one for instructions and one for data.
`
Cache memories can either be fast or large, but not both, so designers began to build cache hierarchies
starting with small, fast caches near the processor core and adding larger, slower caches between the fast
cache and memory. The DEC Alpha 21164 was the first microprocessor with 2 levels of on-chip cache, a
1024-word direct-mapped cache and a 96KB secondary cache. As we shall see in the next section,
specifics of cache design are full of tradeoffs and compromises.
`
Multiple Instruction Issue
In 1989, Intel introduced the first superscalar microprocessor, the i860. It was capable of issuing three
instructions per clock cycle. This was followed by the DEC Alpha 21064, which was capable of two
instructions per cycle. Later processors would add out-of-order execution capability. Superscalar
techniques increased the demand on bandwidth for both instructions and data.
`
`Multiple Data Per Instruction
`Another technique used by designers to speed up processing is to have instructions perform more work.
`Good examples of this are the Single Instruction Multiple Data (SIMD) instructions that Intel has added to
`microprocessors over the years. The initial SIMD extensions, called MMX, were primarily for accelerating
graphics. Later, SSE and SSE2 extensions were added to accelerate a wide variety of operations. This
`technique increases the demand for operand bandwidth and reduces the instruction bandwidth required for
`a given calculation.
`
Speculative Execution, Branch Prediction
The Pentium Pro was among the first microprocessors that supported speculative execution and branch
prediction. Speculative execution allows instructions that probably will be executed to start early based on
branch prediction. Results are stored in a special cache called a reorder buffer and are retired if the branch
was correctly predicted. Mis-predictions can and do occur, and this can waste instruction and data
bandwidth.
`
Appendix B: Cache Design Tradeoffs
`
Microprocessors are fixed logic devices. As such, decisions have to be made at design time that optimize
for the "average" case, if such a thing exists. Of course, the corollary of this is that the design will not be
optimal for most applications. In this section, we discuss some of the design tradeoffs that have to be made
when constructing cache systems on microprocessors.
`
Size / Latency Tradeoff
Data caches can be small with low-latency access, or large with comparatively longer latency access. It is
not possible to design a large capacity, low latency cache; hence microprocessor designers have introduced
multiple levels of caching into their designs. Each level out from the registers is larger, but takes longer to
access. At each level, some decision is made with regard to capacity and latency.
`
`Line Width Tradeoff
`
`In order to hide memory latency and exploit spatial locality, caches are arranged in lines. When a load
`suffers a cache miss, a cache line is loaded from main memory into the data cache. The idea is that there is
`a high degree of spatial locality in programs, and it is quite likely that the other memory locations in the
`cache line will also be required.
`
For programs with a high degree of spatial locality, or stride-one access, wide cache lines are best because
`they minimize the number of times the processor suffers the latency of a memory access. For programs
`with a low degree of spatial locality, or random access, narrow lines are best as this minimizes the wasted
`bandwidth used to bring in the unused neighbors in the cache line.
`
`The widest cache lines are up to 256 bytes long on IBM Power 3 processors. Needless to say, this machine
is not well known for its gather/scatter performance. At the other extreme, the Cray SV1 had a cache line
`that was only one word, or 8 bytes wide. On this architecture, it was assumed that explicit vector loads
`would exploit spatial locality and so the data cache could be designed exc