Attorney Docket No.: SRC028 PRO
Client/Matter No.: 80404.0033
Express Mail No.: EV335405698US
PROVISIONAL APPLICATION FOR PATENT COVER SHEET
This is a request for filing a PROVISIONAL APPLICATION FOR PATENT under 37 CFR 1.53(c)

PTO/SB/16 (10-01)
INVENTOR(S)

Given Name (first & middle, if any)   Family Name or Surname   Residence
Jon M.                                Huppenthal               Colorado Springs, CO USA
Daniel                                Poznanovic               Colorado Springs, CO USA
Jeffrey                               Hammes                   Colorado Springs, CO USA
Lisa                                  Krause                   Minneapolis, MN USA
Jon                                   Steidel                  Minneapolis, MN USA
David E.                              Caliga                   Colorado Springs, CO USA
Thomas R.                             Seeman                   Colorado Springs, CO USA
Lee A.                                Burton                   Divide, CO USA
Jeffrey Paul                          Brooks                   St. Louis, MN USA

[ ] Additional inventors are named on the ___ separately numbered sheets attached hereto
TITLE OF THE INVENTION (500 characters max):

BANDWIDTH EFFICIENCY AND UTILIZATION USING DIRECT EXECUTION LOGIC
CORRESPONDENCE ADDRESS

[X] Customer Number: 25235 (Place Customer Number Bar Code Label Here)
[ ] Firm or Individual Name:
    Address:
    City:
    Country:
ENCLOSED APPLICATION PARTS (check all that apply)
[X] Specification, Number of Pages: 23
[ ] CD(s), Number:
[ ] Drawing(s), Number of Sheets:
[ ] Other (specify):
[ ] Application Data Sheet. See 37 CFR 1.76
METHOD OF PAYMENT OF FILING FEES FOR THIS PROVISIONAL APPLICATION (check one)
[ ] Applicant claims small entity status. See 37 CFR 1.27
[X] A check or money order is enclosed to cover the filing fees    FILING FEE AMOUNT ($): $160.00
[X] The Commissioner is hereby authorized to charge any additional filing fees or credit any overpayment to Deposit Account No. 50-1123

Direct all correspondence to: [X] Customer Number

The invention was made by an agency of the U.S. Government or under a contract with an agency of the U.S. Government.
[X] No.
[ ] Yes, the name of the agency and contract no. are: ___

William J. Kubida, Reg. No. 39,664
Telephone: 719-448-5909
Date: ___

NAACS - 80404/0001 - 61483 v1
Docket No. SRC028 PRO

XILINX 1017
`
`
`
`
`
Attorney Docket No. SRC028 PRO
Client/Matter No. 80404.0033
Express Mail Label No. EV335405698US
IN THE UNITED STATES PATENT AND TRADEMARK OFFICE

In re Provisional Application of:

Jon M. Huppenthal, Daniel Poznanovic, Jeffrey Hammes,
Lisa Krause, Jon Steidel, David E. Caliga, Thomas R.
Seeman, Lee A. Burton, and Jeffrey Paul Brooks

Serial No. -----------

Filed: Herewith

For: BANDWIDTH EFFICIENCY AND UTILIZATION USING
DIRECT EXECUTION LOGIC

CERTIFICATE OF MAILING BY EXPRESS MAIL

MAIL STOP PROVISIONAL PATENT APPLICATION
Commissioner for Patents
P.O. Box 1450
Alexandria, VA 22313-1450

Sir:
`
The undersigned hereby certifies that the attached:
1. Provisional Application for Patent Cover Sheet;
2. Specification - 23 pages;
3. Check in the Amount of $160.00;
4. Certificate of Mailing by Express Mail; and
5. Return Card
relating to the above application, were deposited as "Express Mail," Mailing Label No. EV335405698US, with the United States Postal Service, addressed to MAIL STOP PROVISIONAL PATENT APPLICATION, Commissioner for Patents, P.O. Box 1450, Alexandria, VA 22313-1450, on ___.

Date: ___
`
William J. Kubida, Reg. No. 39,664

HOGAN & HARTSON LLP
One Tabor Center
1200 17th Street, Suite 1500
Denver, Colorado 80202
(719) 448-5909 Tel
(303) 899-7333 Fax
`
`
`
`
`
Bandwidth Efficiency and Utilization
Using Direct Execution Logic

SRC Computers, Inc.

Introduction
In a previous paper, Petaflop Computing Using Direct Execution Logic, we showed that Teraflop-class processors will be available to direct execution logic (DEL) programmers in 2008. The challenge with this kind of computational capacity, of course, is how to keep it fed with operands. One key facet of this is the ability to exploit all available locality in a program or algorithm, or bandwidth efficiency. A second facet is the ability to run calculations at such a rate as to use all available memory bandwidth, or bandwidth utilization.

In this paper, we will show that the implicit features in modern microprocessor-based systems fail to achieve full bandwidth efficiency or utilization, even for fully optimized and tuned codes. By contrast, we will show that a programmer can use a high level language to directly manipulate the memory hierarchy and calculation pipeline via DEL, achieving both maximum bandwidth efficiency and utilization across a wide variety of algorithms. The end result is a programmable, computational system that delivers the maximum possible performance for any given memory bus speed. This combination of efficiency and utilization yields an orders-of-magnitude performance benefit even when utilizing an equivalent memory bus. A diverse set of examples presented in this paper demonstrates between 6X and 216X performance gain through DEL.
`
The ability of DEL to use 100% of available bandwidth with near 100% efficiency is exciting today, but becomes even more exciting in the future as new interconnect technologies such as stacked die promise to substantially increase available memory bandwidth. We believe that performance via DEL will continue to scale perfectly into these bandwidth-rich environments.
`
Microprocessors: Reaching the Limits of Implicit Methods
`
Over the past 30 years, microprocessors have enjoyed annual performance improvements of about 50%. Designers have been taking advantage of higher frequencies, and have added a great deal of complexity, primarily in two areas: memory hierarchies and searching for instruction level parallelism (ILP). For a short history of the microprocessor and these implicit design optimizations, see the Appendix.
`
Instruction Level Parallelism
Techniques that automatically exploit ILP at execution time have reached the point of diminishing returns. Of the 10 million transistors in an Intel Pentium III processor, for example, more than half are dedicated to instruction cache, branch prediction, out-of-order execution and superscalar logic. In addition, complex ILP techniques require chip-wide wires. In a few short years, it will take several clocks to travel across a microprocessor die, so these complex techniques do not scale with integrated circuit technology. It appears that viable techniques to extract implicit parallelism from a serial instruction stream have simply run out of steam. According to Bill Dally, Pat Hanrahan and Ron Fedkiw at Stanford University:
`
`
`
`
`
`
`
`
`
`
    The performance of the microprocessors ... is no longer scaling at the historic rate of 50% per year. Microprocessors have reached a point of diminishing returns in terms of gates per clock and clocks per instruction. As we enter an era of billion transistor chips, there is not enough explicit parallelism in conventional programs to efficiently use these resources... It is expected that without new innovations in parallel processor designs, microprocessor performance will only increase with the increase in gate speed, at a rate of about 20% per year. Such a change would have a major effect on the computer business, and the entire economy.¹
`
`
Attaining high performance on microprocessors is beginning to require more explicit techniques from the compiler and programmer. For example, on the Intel Itanium processor, the compiler must carefully arrange instructions to attain 6-way instruction issue. Future microprocessor chips will have multiple processors on the same die, and programmers and compilers will have to modify coding techniques to take advantage of parallelism at this level.
`
Attaining high performance using direct execution logic is much more explicit. A programmer and compiler explicitly identify parallelism, and the compiler builds the necessary parallel logic pipelines. This technique has proven to be very effective. We have seen simple examples constructed in DEL where many hundreds of operations are executed in parallel each clock cycle. The available parallelism via DEL will increase linearly with integrated circuit density, and so the technique will continue to scale.
`
DEL also allows the programmer to explicitly manipulate the memory hierarchy, avoiding the tradeoffs and complexities that are common in cache designs today. This allows the programmer to make the most efficient use of on-chip storage and fully exploit available algorithm locality. In the next two sections, we will compare the implicit memory hierarchies found in microprocessors with the explicit memory hierarchy available to programmers through DEL.
`
Data and Instruction Cache
The design goal of any cache memory system is to minimize memory bus transactions by taking advantage of application locality. Microprocessors are fixed logic devices. As such, decisions have to be made at design time that optimize for the "average" case, if such a thing exists. Of course, the corollary of this is that the design will not be optimal for most applications.
`
Typical microprocessors exploit data locality implicitly via a rather complex data cache structure fraught with tradeoffs. Some of the design parameters include cache size (which has an inverse relationship to cache latency), the number of cache levels, cache-line width, the associativity and replacement policy, the write policy, the number of physical registers, and the instruction cache design. A cache hierarchy that is optimized for the SPEC benchmark, for example, may not be a good choice for database applications or sparse matrix calculations. No cache design works well with irregular access patterns and strided memory references. For a full discussion of cache design tradeoffs, see Appendix B.
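As a concrete illustration of why strided references defeat any fixed cache design, the following C sketch (ours, not SRC code; the line and element sizes are assumptions) counts the distinct cache lines a strided walk touches:

```c
#include <stddef.h>

/* Count the distinct cache lines touched when reading n elements of
 * elem_bytes each at a given stride (in elements), assuming the array
 * starts on a line boundary.  Illustrative model only. */
size_t lines_touched(size_t n, size_t stride, size_t elem_bytes, size_t line_bytes)
{
    size_t last_line = (size_t)-1;
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        size_t line = (i * stride * elem_bytes) / line_bytes;
        if (line != last_line) {
            count++;
            last_line = line;
        }
    }
    return count;
}
```

With 64-byte lines, reading 4096 stride-one doubles touches 512 lines (32 KB moved, all of it useful), while a stride of 8 doubles touches 4096 lines: the bus moves 256 KB to deliver the same 32 KB of operands, an 8x loss of bandwidth efficiency.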
`
Because there are so many design choices when constructing caches, it is difficult to write portable code that will be optimal across cache designs. An example of an attempt in this regard is the Automatically Tuned Linear Algebra Software (ATLAS) project. This software configures its blocking strategy based on what it can learn about the underlying cache hierarchy of a processor.
`
Just as the limits of effective ILP are being reached, the limits of implicit data locality are also being reached. We are seeing more explicit data motion constructs creeping into microprocessor instruction sets in the form of pre-fetch instructions and cache "hints". In the next section, we will see that the tradeoffs made in today's implicit devices can become programming choices in a DEL environment, essentially creating optimized hardware tuned for individual programs.
`
¹ B. Dally, P. Hanrahan, R. Fedkiw, A Streaming Supercomputer, September 18, 2001.

Copyright © 2003 SRC Computers, Inc.
ALL RIGHTS RESERVED
`
`
`
`
`
Memory Hierarchies in a DEL Environment
`
In this section, we'll look at the explicit memory management constructs available in SRC's DEL environment. These constructs allow a user to essentially instantiate locality when and where it is needed within an algorithm. Unlike conventional data caches, these mechanisms are parallel and extensible in nature, with no fixed bandwidth limitation or sequential access restrictions.
`
The SRC-6E is a general purpose computing system with reconfigurable processing elements, called MAP processors. The SRC compilation system is designed to turn user codes and algorithms into direct execution logic within the MAP processor. The memory hierarchy on the SRC-6E processor is described below:
`
1. System Memory (DDR DRAM): The MAP processor can access memory that is shared with the microprocessor.

2. On-board memory (SRAM): The MAP processor is configured with 6 banks of SRAM memory, each capable of delivering one word per clock cycle to the MAP (4.8 GB/sec). Each of the 6 banks is 4 Mbytes in size, for a total of 24 Mbytes of storage. This memory can be randomly addressed with gather and scatter type operations.

3. On-chip Block RAM (BRAM): The MAP processor has 288 sections of Block RAM memory. Each section is 36 bits wide and 512 elements deep, and can deliver two words per cycle. When configured to deliver 64-bit operands, the 288 banks of Block RAM can deliver up to 288 64-bit operands per cycle (230.4 GB/sec). Note that this memory can be randomly addressed.

4. Register Storage: If the Configurable Logic Blocks (CLB) within the MAP are configured for storage, 1000s of registers can be configured. These can be used to pass operands from one computational unit to the next, or to construct other storage elements such as FIFO queues and look-up tables. Registers are created as needed on the SRC MAP processor, and available bandwidth in this construct can reach into the 100s of GB/sec.
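The bandwidth figures in the list above follow from simple arithmetic once a clock rate is assumed; the 100 MHz MAP clock used below is our inference from the 4.8 GB/sec SRAM figure (6 banks x 8 bytes x clock), not a number stated in this paper:

```c
/* Peak bandwidth of a banked memory: words delivered per clock across
 * all banks, times word size, times clock rate.  The 100 MHz clock used
 * in the examples is inferred from the 4.8 GB/sec on-board figure. */
double peak_bytes_per_sec(double words_per_clock, double word_bytes, double clock_hz)
{
    return words_per_clock * word_bytes * clock_hz;
}
```

peak_bytes_per_sec(6, 8, 100e6) reproduces the 4.8 GB/sec on-board figure; the same arithmetic for 288 banks of 64-bit operands gives 230.4 GB/sec.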
`
On the SRC MAP processor, memory is managed explicitly using constructs in either C or Fortran. The following table shows how this is expressed in these high level languages and how these constructs are built in hardware:
`
Storage Construct             Source Code Construct                     Built From
System Memory                 Pre-fetch subroutine or function call     DDR DRAM chips
On-Board Memory               COMMON blocks in FORTRAN;                 On-board SRAM chips
                              static structures in C
Addressable on-chip memory    Local arrays within DEL routine           On-chip BRAM
FIFO queues, etc.             Called as subroutine or function          On-chip CLBs or on-chip BRAM
                              within DEL routine
Registers                     Scalar variables and internally           On-chip CLBs
                              used by hardware macros
Wires                         Scalar variables and internally           On-chip routing resources
                              used by hardware macros

Table 1: Memory hierarchy programming constructs on MAP.
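The constructs in Table 1 are ordinary C; what differs on MAP is where the compiler places them. The sketch below is plain C that compiles anywhere: the comments and the bank/array names are our illustrations of the Table 1 mappings, not SRC syntax.

```c
#define N 512

/* Static structures map to on-board SRAM (Table 1).  Two separate
 * structures let the compiler place the data in different banks so
 * both can be read in the same clock; bank_a/bank_b are illustrative
 * names, not SRC keywords. */
static struct { double a[N]; } bank_a;
static struct { double b[N]; } bank_b;

void map_init(void)
{
    for (int i = 0; i < N; i++) {
        bank_a.a[i] = 1.0;
        bank_b.b[i] = 2.0;
    }
}

double map_kernel(void)
{
    double buf[N];     /* local array  -> on-chip BRAM (Table 1)     */
    double sum = 0.0;  /* scalar       -> register or wire (Table 1) */

    for (int i = 0; i < N; i++)
        buf[i] = bank_a.a[i] + bank_b.b[i];  /* two on-board reads per clock */

    for (int i = 0; i < N; i++)
        sum += buf[i];                       /* BRAM feeds the reduction     */

    return sum;
}
```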
`
`
`
`
`
Instruction Bandwidth
Logic for DEL is downloaded into the MAP processor before execution can begin. There is no need to provide complex facilities for instruction bandwidth during execution because no "instructions" are issued during execution. All the bandwidth facilities on the MAP processor can be dedicated to program data requirements.
`
System Memory
In cache-based systems, the only way to move data into the higher levels of the memory hierarchy is to issue a memory load. DEL allows the programmer to instantiate a data pre-fetch functional unit. This functional unit can run autonomously from the rest of the computational logic; hence, pre-fetching can be explicitly overlapped with computation. Pre-fetch functional units do not have any implied cache-line structure. As such, pre-fetch units can also effectively perform gather/scatter type memory operations as well as strided memory access.
`
On-Board Memory
On-board memory is referenced in 64-bit words with a simple array reference. There is a 4-clock latency to access on-board memory, but the resulting data access functional units are fully pipelined, and gather/scatter operations run at full speed. The 6 banks are explicitly managed via COMMON block names in Fortran or static structures in C. This feature allows a programmer to make up to 6 memory references per clock cycle to on-board memory.
`
On-Chip BRAM
The SRC MAP processor has over 600 KB of on-chip, addressable memory, or BRAM, arranged in 288 banks, each 36 bits wide and 512 elements deep. Unlike conventional caches, the parallel nature of explicit BRAM allows programmers to create on-chip storage that is both large and fast. This storage is allocated via local arrays in C or Fortran. Simply declaring and accessing multiple local arrays will obtain parallel access to this memory structure.
`
Fixed Length FIFO Queues
FIFO queues are simple and effective storage structures designed to delay a stream of operands in time by the number of stages in the queue. They are the ideal structure for "remembering" a previous column in a matrix calculation, for example. They are accessed in C or Fortran by explicitly calling a FIFO function or subroutine. The macro itself can be built out of logic blocks or BRAM, depending on the length.
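A fixed-length FIFO is easy to model in C. The sketch below is a host-side stand-in for the hardware macro (the function names are ours, not SRC's): each push returns the value pushed FIFO_LEN calls earlier, which is exactly the "previous column" behavior described above when FIFO_LEN equals the column height.

```c
#include <string.h>

#define FIFO_LEN 16   /* delay depth; set to the column height in practice */

typedef struct {
    double q[FIFO_LEN];
    int head;
} fifo_t;

void fifo_init(fifo_t *f)
{
    memset(f, 0, sizeof *f);  /* queue starts primed with zeros */
}

/* Push x and return the value pushed FIFO_LEN calls earlier
 * (0.0 until the queue has been filled once). */
double fifo_push(fifo_t *f, double x)
{
    double out = f->q[f->head];
    f->q[f->head] = x;
    f->head = (f->head + 1) % FIFO_LEN;
    return out;
}
```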
`
Registers
An arbitrary number of registers can be used to connect computational functional units and to hold temporary variables. This allows the compiler to connect localized kernels directly, without a store followed by a load sequence. Because the registers can be placed and routed near the functional units, and because each register has a limited number of inputs and outputs, they are much easier to implement than the register set on a microprocessor. There is no negative pressure on clock frequency from the number of registers. On the contrary, adding registers where appropriate actually helps increase clock frequency with DEL. Finally, the logic blocks that are used to create registers are flexible. If they are not needed for storage, they can be used to construct computational units, and vice versa.
`
Wires
When possible, the compiler will simply use wires to connect functional units. For the programmer, local variables that are assigned and subsequently used in a loop can become wires in a DEL circuit.
`
`
`
`
`
Giving the programmer direct access to the memory hierarchy through DEL has several advantages compared with the implicit nature of memory traffic found on microprocessors:

1. Overhead is reduced. There is no register spill traffic, no cache coherency traffic, no cache write-allocate traffic, no cache conflict traffic, no instruction traffic, etc.

2. The programmer can create exactly the kind of storage structure that is necessary and sufficient. This may be on-chip addressable memory, FIFO queues, registers, wires, etc.

3. Local data structures can be created to facilitate parallel access. There is no fixed on-chip bandwidth limitation such as is inherent in fixed-logic microprocessors.

4. Explicitly coded algorithms can take advantage of 100% of the available algorithm locality.

5. Much of the on-chip storage resource is flexible because it is constructed of logic blocks. If it is not needed for storage, it can be used for computation.

6. The programmer has the ability to create and use pre-fetch functional units, which are much more latency tolerant than the "load and use" model found in implicit devices.

7. There is no reliance on cache lines to hide memory latency. DEL architectures can be designed to efficiently address the irregular access and strided reference patterns that run poorly on microprocessors.
`
The best way to illustrate how locality and peak bandwidth are fully exploited with DEL is with a few small programming examples. In the next section, we compare and contrast the performance of several example programs on an implicit, microprocessor-based system and on SRC's DEL architecture.
`
Performance Measurements
`
In order to highlight the overall bandwidth efficiency and utilization of DEL-based programs, it is useful to consider a set of programs and make some measurements on an existing, implicit architecture.
`
A set of codes was selected and optimized for the Intel Pentium 4 and for SRC's DEL architecture. For each example, the problem size was specified in such a manner as to eliminate cache size as a performance-limiting factor. In addition, highly tuned performance libraries were used where possible on the Intel system to show the implicit architecture in the best possible light.
`
In each experiment, a sample code was compiled with the Intel Fortran or C compiler and run using the Intel VTune performance analysis software. This package allowed us to measure the memory bus traffic for each benchmark. This value can be compared with the algorithmic minimum to get a measure of bandwidth efficiency. In addition, the raw bandwidth speed was measured on the memory bus during the calculation. Comparing this value with the peak memory bandwidth of our specific system (3.2 GB/second) gives us a measure of bandwidth utilization. Combining these two factors yields an overall efficiency value for the microprocessor.
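The two measures, and the way they combine, can be written down directly. This is a restatement of the methodology just described, not SRC tooling; crediting DEL with 100% on both axes is the argument made in the next paragraph.

```c
/* Bandwidth efficiency: fraction of the measured bus traffic that the
 * algorithm actually required. */
double bandwidth_efficiency(double required_mbytes, double measured_mbytes)
{
    return required_mbytes / measured_mbytes;
}

/* Bandwidth utilization: fraction of the bus's peak rate sustained
 * during the calculation (peak is 3.2 GB/sec on our Pentium 4 system). */
double bandwidth_utilization(double measured_gb_per_sec, double peak_gb_per_sec)
{
    return measured_gb_per_sec / peak_gb_per_sec;
}

/* If DEL reaches 100% on both axes over the same bus, its advantage is
 * the reciprocal of the microprocessor's combined efficiency. */
double del_improvement(double efficiency, double utilization)
{
    return 1.0 / (efficiency * utilization);
}
```

For the matrix multiplication row of Table 2, del_improvement(0.16, 0.20) gives about 31, matching the 31X table entry.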
`
In each case, programming with the constructs available with DEL resulted in codes that used an algorithmically minimum amount of memory bandwidth, or perfect bandwidth efficiency. In addition, DEL allows for a bandwidth utilization level of 100%, since computation capacity can be added until the available memory bandwidth is completely exhausted. The overall performance improvement that is attainable with DEL is hence inversely related to the overall efficiency value observed on the microprocessor. The improvement in performance when compared with the implicit computation carried out on an Intel Pentium 4 ranged from 6X for the median filter code to 216X for a triple DES encryption code. Note that performance comparisons assume a similar memory bus structure for both the implicit and explicit models; hence the difference in performance is purely associated with the difference in the
`
`
`
computational model. A complete description of these programming examples as coded with SRC's DEL architecture can be found in Appendix C. The performance of these codes is presented in Table 2.
Algorithm                         Problem Size           Data transferred,   Algorithmic   Percent of data    Percent of   Performance improvement
                                                         Intel Pentium 4     requirement   motion that was    bus peak     with DEL using the
                                                         measurement (MB)    (MB)          algorithmically    utilized     same memory bus
                                                                                           necessary
64-bit Matrix Multiplication      M=1024, N=512, K=64    31                  --            16%                20%          31X (1/.16 x 1/.20 = 31)
(Intel MKL library)
Triple DES                        N=10^6                 24                  --            66%                0.7%         216X
Image Processing Edge Detection   600 x 472 bytes        78                  --            7%                 28%          51X
(2-D 9-point stencil)
Image Processing Median Filter    600 x 472 bytes        93                  --            61%                28%          6X
(2-D 9-point stencil)
2-D Stencil Grid Generation       512 x 512              22.9                8.4           37%                20%          14X
(TOMCATV)
3-D 27-point stencil              300 x 300 x 300        1125                432           38%                24%          11X
Median filter + edge detector     600 x 472 bytes        8.82                0.57          6%                 22%          76X
(pipelined 2-D stencils)
Irregular Updates                 Loop length=10^6,      57                  --            15%                22%          30X
(64-bit elements)                 Table size=10^6

Table 2: Microprocessor / DEL performance comparison
`
Matrix Multiplication
Matrix multiplication was measured using the Intel MKL library. This algorithm is generally considered to run very efficiently on microprocessor-based systems, yet in this case we see that an additional factor of 30X can be achieved with DEL. Note that this figure can be pushed higher simply by running larger problems, which exhibit higher data reuse.
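The reuse argument can be made quantitative. Assuming, as a lower bound of our own construction, that each 8-byte element of A, B and C crosses the bus exactly once, the compulsory traffic and the flops-per-byte reuse for C = A x B are:

```c
/* Compulsory data motion for an (m x k)(k x n) multiply, assuming each
 * 8-byte element of A, B and C crosses the memory bus exactly once. */
double matmul_min_mbytes(double m, double n, double k)
{
    return (m * k + k * n + m * n) * 8.0 / 1e6;
}

/* Flops per byte of compulsory traffic: 2mnk flops over that traffic.
 * This ratio grows with problem size, which is why larger problems
 * push the DEL advantage higher. */
double matmul_reuse(double m, double n, double k)
{
    return (2.0 * m * n * k) / ((m * k + k * n + m * n) * 8.0);
}
```

For the Table 2 problem size (M=1024, N=512, K=64) the bound is about 5 MB and the reuse about 13 flops/byte; matmul_reuse(2048, 2048, 2048) is roughly 170 flops/byte.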
`
`
`
`
`
`
Triple DES
For the Triple DES example, we used a highly optimized, assembly coded routine for the Intel processor. The bandwidth efficiency was 1.5% higher compared with the DEL algorithm. This is most likely due to the unnecessary cache-line fill traffic that is generated during the store operation. The Intel processor falls short in memory utilization, however, because the computational resources within the processor are fixed and cannot keep up with available memory bandwidth. As such, the Intel processor utilized a tiny fraction of peak memory bandwidth (less than 1%). Since DEL can run this algorithm at full memory bandwidth speed, the overall performance advantage of a DEL-based implementation is 216X.
`
2-D Stencils
We looked at 3 different 2-D stencils. Performance improvement of the DEL algorithms ranged from 6X to 51X. Both image-processing examples were coded using the optimized Intel graphics libraries and had a single array for input and output. The TOMCATV code had two input arrays and four output arrays.
`
3-D Stencils
The 3-D stencil code is based on an anisotropic stencil algorithm. It contained a single source array and a single destination array. The computation consists of 26 floating-point additions and 4 floating-point multiplications. Performance improvement for the DEL implementation was 11X.
`
Pipelined Stencils
The pipelined stencil example used the median filter stencil followed by an edge detection stencil. On the Intel system, the Intel libraries were used for each of these steps. The DEL algorithm requires no additional bandwidth since the computation can be chained together without using an intermediate storage array. The DEL version is 76X faster using the same memory bus.
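A 1-D sketch shows the chaining idea in plain C (our own reduction of the 2-D case; med3 and the delay registers stand in for the DEL FIFO structure): the input is read once, and the median results flow straight into the edge detector through two scalar delay registers instead of an intermediate array.

```c
#include <math.h>

/* 3-point median via a small sorting network. */
static double med3(double a, double b, double c)
{
    double t;
    if (a > b) { t = a; a = b; b = t; }
    if (b > c) { t = b; b = c; c = t; }
    if (a > b) { t = a; a = b; b = t; }
    return b;
}

/* out[i] = |med[i+1] - med[i-1]|, where med is the 3-point median of in.
 * The median values are never stored to an array: m1 and m2 act as the
 * delay registers a DEL pipeline would synthesize. */
void fused_median_edge(const double *in, double *out, int n)
{
    double m1 = 0.0, m2 = 0.0;  /* med[i-1], med[i-2] */
    for (int i = 1; i <= n - 2; i++) {
        double m0 = med3(in[i - 1], in[i], in[i + 1]);  /* med[i]    */
        if (i >= 3)
            out[i - 1] = fabs(m0 - m2);                 /* edge[i-1] */
        m2 = m1;
        m1 = m0;
    }
}
```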
`
Irregular Updates
The irregular update benchmark was 30X more efficient with DEL. This benchmark is somewhat unique among this group of codes, as it is not computationally intensive but data intensive. The Intel implicit pre-fetch hardware is not effective in this case, and so the full Intel bus bandwidth cannot be utilized. The DEL version is able to effectively pre-fetch these irregular memory patterns and achieves a 30X performance advantage.
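The benchmark's access pattern is the classic gather/scatter update, sketched here in plain C (the array names are ours). Each iteration touches an unpredictable table location, so a cache-line-based machine moves a full line per 8-byte update, while a DEL pre-fetch unit can stream exactly the words the index vector calls for.

```c
/* Irregular update kernel: random 8-byte read-modify-writes driven by
 * an index vector.  Cache lines and hardware pre-fetch gain nothing
 * from this pattern. */
void irregular_update(double *table, const int *idx, const double *val, int n)
{
    for (int i = 0; i < n; i++)
        table[idx[i]] += val[i];
}
```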
`
Conclusions
`
`
`
For the past 20 years, microprocessors have employed increasingly complex techniques to implicitly exploit instruction level parallelism and locality to gain in performance. The performance improvement trend in these devices has slowed, because the implicit techniques available have reached a point of diminishing returns. Implicit constructs are simply running out of steam.
`
The locality-exploiting mechanisms within a microprocessor are fixed and implicit, and designed with many tradeoffs considered. They are designed for the average case, and frequently do not exhibit the desired behavior. Bandwidth is frequently wasted on unnecessary data motion, resulting in low bandwidth efficiency, even for fully optimized code, as was illustrated in our examples. In addition, the fixed computational and locality-exploiting features within the implicit microprocessor limit the speed, and hence the bandwidth utilization, observed with most programs.
`
By contrast, the scalable locality mechanisms available to the DEL programmer can be used to exploit all available algorithm locality and hence achieve 100% bandwidth efficiency. In addition, the scalable computational resources available to the DEL programmer can be leveraged to attain 100% bandwidth utilization. The end result is a programmable, computational system that delivers the maximum possible performance for any given memory bus speed. This combination of efficiency and utilization yields an orders-of-magnitude performance benefit compared with implicit models when using an equivalent memory
`
bus. More importantly, however, DEL will continue to scale with increased memory bus performance into the future. We believe that this combination of explicit programming with denser DEL devices and memory bandwidth enhancements will yield Teraflop-class processors by 2008.
`
`
`
`
`
`
Appendix A: A Brief History of the Microprocessor
`
In 1971, the Intel 4004 had about 2300 transistors, supported 45 instructions and ran at 740 kHz. Today, the latest Itanium 2 processor from Intel has about 200 million transistors running at over a gigahertz, and the hottest topic at microprocessor design conferences is what to do with the billion transistors that are likely to be available to chip designers in a few years' time. What follows is a summary of the key performance innovations that have taken place in microprocessors over the last 30 years.
`
Pipelining
The earliest microprocessors had no pipelining. Instructions were issued and completed before a new instruction was issued. Instruction pipelining had been around for years in mainframe systems such as the Control Data 6600. The feature allowed multiple instructions to be in process, thereby increasing instruction throughput. This feature did not appear in microprocessors until the Intel 8086 processor in 19xx. The pipelining was fairly rudimentary: a second instruction could be fetched and decoded while the first was executing. Pipelining proved to be a very effective performance technique, but it also increased the processor's appetite for instruction and data bandwidth.
`
Caches
For the first few years of the microprocessor, instruction and data loads were satisfied from main memory, and logic speeds were compatible with memory access time. But logic performance was increasing at a faster rate than main memory latency and bandwidth. To respond to this, microprocessor designers adopted cache memories. Caches are designed to keep copies of recently used instructions and data in nearby SRAM instead of far away DRAM. In 1984, Motorola introduced the 68020, the first microprocessor with an on-chip cache, a mere 256 bytes for instructions only. The 68030 introduced a separate 256-byte cache for instructions and one for data.
`
Cache memories can either be fast or large, but not both, so designers began to build cache hierarchies, starting with small, fast caches near the processor core and adding larger, slower caches between the fast cache and memory. The DEC Alpha 21164 was the first microprocessor with 2 levels of on-chip cache: a 1024-word direct-mapped cache and a 96 KB secondary cache. As we shall see in the next section, the specifics of cache design are full of tradeoffs and compromises.
`
Multiple Instruction Issue
In 1989, Intel introduced the first superscalar microprocessor, the i860. It was capable of issuing three instructions per clock cycle. This was followed by the DEC Alpha 21064, which was capable of two instructions per cycle. Later processors would add out-of-order execution capability. Superscalar techniques increased the demand on bandwidth for both instructions and data.
`
Multiple Data Per Instruction
Another technique used by designers to speed up processing is to have instructions perform more work. Good examples of this are the Single Instruction Multiple Data (SIMD) instructions that Intel has added to microprocessors over the years. The initial SIMD extensions, called MMX, were primarily for accelerating graphics. Later, SSE and SSE2 extensions were added to accelerate a wide variety of operations. This technique increases the demand for operand bandwidth and reduces the instruction bandwidth required for a given calculation.
`
Speculative Execution, Branch Prediction
The Pentium Pro was among the first microprocessors that supported speculative execution and branch prediction. Speculative execution allows instructions that probably will be executed to start early based on branch prediction. Results are stored in a special cache called a reorder buffer and are retired if the branch
`
was correctly predicted. Mis-predictions can and do occur, and this can waste instruction and data bandwidth.
`
Appendix B: Cache Design Tradeoffs
`
Microprocessors are fixed logic devices. As such, decisions have to be made at design time that optimize for the "average" case, if such a thing exists. Of course, the corollary of this is that the design will not be optimal for most applications. In this section, we discuss some of the design tradeoffs that have to be made when constructing cache systems on microprocessors.
`
Size / Latency Tradeoff
Data caches can be small with low-latency access, or large with comparatively longer latency access. It is not possible to design a large capacity, low latency cache; hence microprocessor designers have introduced multiple levels of caching into their designs. Each level out from the registers is larger, but takes longer to access. At each level, some decision is made with regard to capacity and latency.
`
Line Width Tradeoff
In order to hide memory latency and exploit spatial locality, caches are arranged in lines. When a load suffers a cache miss, a cache line is loaded from main memory into the data cache. The idea is that there is a high degree of spatial locality in programs, and it is quite likely that the other memory locations in the cache line will also be required.
`
For programs with a high degree of spatial locality, or stride-one access, wide cache lines are best because they minimize the number of times the processor suffers the latency of a memory access. For programs with a low degree of spatial locality, or random access, narrow lines are best, as this minimizes the wasted bandwidth used to bring in the unused neighbors in the cache line.
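For random single-word access, the wasted fraction is simply (line - word) / line. A two-line helper makes the two extremes below concrete (an illustrative calculation of ours, using only the published line sizes):

```c
/* Fraction of bus traffic wasted when each miss uses only one
 * word_bytes element out of a line_bytes cache line. */
double wasted_fraction(double line_bytes, double word_bytes)
{
    return (line_bytes - word_bytes) / line_bytes;
}
```

With 256-byte lines and 8-byte words (the Power3 case discussed next), nearly 97% of the traffic is wasted on a random access pattern; with 8-byte lines (the Cray SV1 case), none is.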
`
The widest cache lines are up to 256 bytes long, on IBM Power3 processors. Needless to say, this machine is not well known for its gather/scatter performance. At the other extreme, the Cray SV1 had a cache line that was only one word, or 8 bytes, wide. On this architecture, it was assumed that explicit vector loads would exploit spatial locality, and so the data cache could be designed exclusively to exploit temporal locality.
`
Associativity Tradeoff
Cache associativity refers to the mapping between memory locations and cache sectors. The two extreme design points are direct-mapped caches and fully associative caches. In a direct-mapped cache, a specific memory location can be mapped to only a single cache line. Direct-map