Attorney Docket No.: SRC028 PRO
Client/Matter No.: 80404.0033
Express Mail No.: EV335405698US

PROVISIONAL APPLICATION FOR PATENT COVER SHEET (PTO/SB/16 (10-01))

This is a request for filing a PROVISIONAL APPLICATION FOR PATENT under 37 CFR 1.53(c).
INVENTOR(S)

Given Name (first & middle, if any)   Family Name or Surname   Residence
Jon M.                                Huppenthal               Colorado Springs, CO USA
Daniel                                Poznanovic               Colorado Springs, CO USA
Jeffrey                               Hammes                   Colorado Springs, CO USA
Lisa                                  Krause                   Minneapolis, MN USA
Jon                                   Steidel                  Minneapolis, MN USA
David E.                              Caliga                   Colorado Springs, CO USA
Thomas R.                             Seeman                   Colorado Springs, CO USA
Lee A.                                Burton                   Divide, CO USA
Jeffrey Paul                          Brooks                   St. Louis, MN USA

[ ] Additional inventors are named on the ___ separately numbered sheets attached hereto.
TITLE OF THE INVENTION (500 characters max):
BANDWIDTH EFFICIENCY AND UTILIZATION USING DIRECT EXECUTION LOGIC
CORRESPONDENCE ADDRESS

Direct all correspondence to:

[X] Customer Number: 25235 (place Customer Number bar code label here)

[ ] Firm or Individual Name:
    City:
    Country:
ENCLOSED APPLICATION PARTS (check all that apply)

[X] Specification          Number of Pages: 23
[ ] CD(s), Number:
[ ] Drawing(s)             Number of Sheets:
[ ] Other (specify):
[ ] Application Data Sheet. See 37 CFR 1.76.

METHOD OF PAYMENT OF FILING FEES FOR THIS PROVISIONAL APPLICATION (check one)

[ ] Applicant claims small entity status. See 37 CFR 1.27.
[X] A check or money order is enclosed to cover the filing fees.

FILING FEE AMOUNT ($): $160.00

[X] The Commissioner is hereby authorized to charge any additional filing fees or credit any
overpayment to Deposit Account No. 50-1123.

The invention was made by an agency of the U.S. Government or under a contract with an agency
of the U.S. Government.

[X] No.
[ ] Yes, the name of the agency and contract no. are: __________
Date: __________

William J. Kubida
Reg. No.: 9,664
Telephone: 719-448-5909
Docket No.: SRC028 PRO
Attorney Docket No.: SRC028 PRO
Client/Matter No.: 80404.0033
Express Mail Label No.: EV335405698US
IN THE UNITED STATES PATENT AND TRADEMARK OFFICE

In re Provisional Application of:

Jon M. Huppenthal, Daniel Poznanovic, Jeffrey Hammes,
Lisa Krause, Jon Steidel, David E. Caliga, Thomas R.
Seeman, Lee A. Burton, and Jeffrey Paul Brooks

Serial No.: __________

Filed: Herewith

For: BANDWIDTH EFFICIENCY AND UTILIZATION USING DIRECT EXECUTION LOGIC

CERTIFICATE OF MAILING BY EXPRESS MAIL

MAIL STOP PROVISIONAL PATENT APPLICATION
Commissioner for Patents
P.O. Box 1450
Alexandria, VA 22313-1450

Sir:
The undersigned hereby certifies that the attached:

1. Provisional Application for Patent Cover Sheet;
2. Specification - 23 pages;
3. Check in the Amount of $160.00;
4. Certificate of Mailing by Express Mail; and
5. Return Card

relating to the above application, were deposited as "Express Mail," Mailing Label
No. EV335405698US, with the United States Postal Service, addressed to MAIL STOP
PROVISIONAL PATENT APPLICATION, Commissioner for Patents, P.O. Box 1450,
Alexandria, VA 22313-1450, on
Date: __________
HOGAN & HARTSON LLP
One Tabor Center
1200 17th Street, Suite 1500
Denver, Colorado 80202
(719) 448-5909 Tel
(303) 899-7333 Fax
Bandwidth Efficiency and Utilization
Using Direct Execution Logic

SRC Computers, Inc.
Introduction
In a previous paper, Petaflop Computing Using Direct Execution Logic, we showed that Teraflop-class
processors will be available to direct execution logic (DEL) programmers in 2008. The challenge with this
kind of computational capacity, of course, is how to keep it fed with operands. One key facet of this is the
ability to exploit all available locality in a program or algorithm, or bandwidth efficiency. A second facet
is the ability to run calculations at such a rate as to use all available memory bandwidth, or bandwidth
utilization.

In this paper, we will show that the implicit features in modern microprocessor-based systems fail to
achieve full bandwidth efficiency or utilization, even for fully optimized and tuned codes. By contrast, we
will show that a programmer can use a high level language to directly manipulate the memory hierarchy
and calculation pipeline via DEL, achieving both maximum bandwidth efficiency and utilization across a
wide variety of algorithms. The end result is a programmable, computational system that delivers the
maximum possible performance for any given memory bus speed. This combination of efficiency and
utilization yields an orders-of-magnitude performance benefit even when utilizing an equivalent memory
bus. A diverse set of examples presented in this paper demonstrates between 6X and 216X performance
gain through DEL.
The ability of DEL to use 100% of available bandwidth with near 100% efficiency is exciting today, but
becomes even more exciting in the future as new interconnect technologies such as stacked die promise to
substantially increase available memory bandwidth. We believe that performance via DEL will continue to
scale perfectly into these bandwidth-rich environments.
Microprocessors: Reaching the Limits of Implicit Techniques
Over the past 30 years, microprocessors have enjoyed annual performance improvements of about 50%.
Designers have been taking advantage of higher frequencies, and have added a great deal of complexity,
primarily in two areas: memory hierarchies and searching for instruction level parallelism (ILP). For a
short history of the microprocessor and these implicit design optimizations, see Appendix A.
Instruction Level Parallelism

Techniques that automatically exploit ILP at execution time have reached the point of diminishing
returns. Of the 10 million transistors in an Intel Pentium III processor, for example, more than half are
dedicated to instruction cache, branch prediction, out-of-order execution and superscalar logic. In addition,
complex ILP techniques require chip-wide wires. In a few short years it will take several clocks for a signal
to travel across a microprocessor die, so these complex techniques do not scale with integrated circuit
technology. It appears that viable techniques to extract implicit parallelism from a serial instruction stream
have simply run out of steam. According to Bill Dally, Pat Hanrahan and Ron Fedkiw at Stanford
University:
    The performance of microprocessors is no longer scaling at the historic rate of 50% per
    year. Microprocessors have reached a point of diminishing returns in terms of gates per
    clock and clocks per instruction. As we enter an era of billion transistor chips, there is
    not enough explicit parallelism in conventional programs to efficiently use these
    resources... It is expected that without new innovations in parallel processor designs,
    microprocessor performance will only increase with the increase in gate speed, at a rate
    of about 20% per year. Such a change would have a major effect on the computer business,
    and the entire economy.¹
Attaining high performance on microprocessors is beginning to require more explicit techniques from the
compiler and programmer. For example, on the Intel Itanium processor, the compiler must carefully
arrange instructions to attain 6-way instruction issue. Future microprocessor chips will have multiple
processors on the same die, and programmers and compilers will have to modify coding techniques to take
advantage of parallelism at this level.
Attaining high performance using direct execution logic is much more explicit. A programmer and
compiler explicitly identify parallelism, and the compiler builds the necessary parallel logic pipelines. This
technique has proven to be very effective. We have seen simple examples constructed in DEL where many
hundreds of operations are executed in parallel each clock cycle. The available parallelism via DEL will
increase linearly with integrated circuit density, and so the technique will continue to scale.
DEL also allows the programmer to explicitly manipulate the memory hierarchy, avoiding the tradeoffs and
complexities that are common in cache designs today. This allows the programmer to make the most
efficient use of on-chip storage and fully exploit available algorithm locality. In the next two sections, we
will compare the implicit memory hierarchies found in microprocessors with the explicit memory
hierarchy available to programmers through DEL.
Data and Instruction Cache

The design goal of any cache memory system is to minimize memory bus transactions by taking advantage
of application locality. Microprocessors are fixed logic devices. As such, decisions have to be made at
design time that optimize for the "average" case, if such a thing exists. Of course, the corollary of this is
that the design will not be optimal for most applications.
Typical microprocessors exploit data locality implicitly via a rather complex data cache structure fraught
with tradeoffs. Some of the design parameters include cache size (which has an inverse relationship to
cache latency), the number of cache levels, cache-line width, the associativity and replacement policy, the
write policy, the number of physical registers and the instruction cache design. A cache hierarchy that is
optimized for the SPEC benchmark, for example, may not be a good choice for database applications or
sparse matrix calculations. No cache design works well with irregular access patterns and strided memory
references. For a full discussion of cache design tradeoffs, see Appendix B.
Because there are so many design choices when constructing caches, it is difficult to write portable code
that is optimal across cache designs. An example of an attempt in this regard is the Automatically Tuned
Linear Algebra Software (ATLAS) project. This software configures its blocking strategy based on what it
can learn about the underlying cache hierarchy of a processor.
Just as the limits of effective ILP are being reached, the limits of implicit data locality are also being
reached. We are seeing more explicit data motion constructs creeping into microprocessor instruction sets
in the form of pre-fetch instructions and cache "hints". In the next section, we will see that the tradeoffs
made in today's implicit devices can become programming choices in a DEL environment, essentially
creating optimized hardware tuned for individual programs.
¹ B. Dally, P. Hanrahan, R. Fedkiw, A Streaming Supercomputer, September 18, 2001.
Copyright © 2003 SRC Computers, Inc. ALL RIGHTS RESERVED

Memory Hierarchies in a DEL Environment
In this section, we'll look at the explicit memory management constructs available in SRC's DEL
environment. These constructs allow a user to essentially instantiate locality when and where it is needed
within an algorithm. Unlike conventional data caches, these mechanisms are parallel and extensible in
nature, with no fixed bandwidth limitation or sequential access restrictions.
The SRC-6E is a general purpose computing system with reconfigurable processing elements, called MAP
processors. The SRC compilation system is designed to turn user codes and algorithms into direct
execution logic within the MAP processor. The memory hierarchy on the SRC-6E processor is described
below:
1. System Memory (DDR DRAM): The MAP processor can access memory that is shared with the
   microprocessor.
2. On-Board Memory (SRAM): The MAP processor is configured with 6 banks of SRAM memory,
   each capable of delivering one word per clock cycle to the MAP (4.8 GB/sec). Each of the 6
   banks is 4 Mbytes in size, for a total of 24 Mbytes of storage. This memory can be randomly
   addressed with gather and scatter type operations.
3. On-Chip Block RAM (BRAM): The MAP processor has 288 sections of Block RAM memory.
   Each section is 36 bits wide and 512 elements deep and can deliver two words per cycle. When
   configured to deliver 64-bit operands, the 288 banks of Block RAM can deliver up to 288 64-bit
   operands per cycle (230.4 GB/sec). Note that this memory can be randomly addressed.
4. Register Storage: If the Configurable Logic Blocks (CLBs) within the MAP are configured for
   storage, 1000s of registers can be created. These can be used to pass operands from one
   computational unit to the next, or to construct other storage elements such as FIFO queues and
   look-up tables. Registers are created as needed on the SRC MAP processor, and available
   bandwidth in this construct can reach into the 100s of GB/sec.
On the SRC MAP processor, memory is managed explicitly using constructs in either C or Fortran. The
following table shows how this is expressed in these high level languages and how these constructs are
built in hardware:
Table 1: Memory hierarchy programming constructs on MAP.

Storage Construct              Source Code Construct                    Built from
System Memory                  Pre-fetch subroutine or function call    DDR DRAM chips
On-Board Memory                COMMON blocks in FORTRAN;                On-board SRAM chips
                               static structures in C
Addressable on-chip memory     Local arrays within DEL routine          On-chip BRAM
FIFO queues, etc.              Called as subroutine or function         On-chip CLBs or on-chip BRAM
                               within DEL routine
Registers                      Scalar variables, and internally         On-chip CLBs
                               used by hardware macros
Wires                          Scalar variables, and internally         On-chip routing resources
                               used by hardware macros
Instruction Bandwidth

Logic for DEL is downloaded into the MAP processor before execution begins. There is no need to
provide complex facilities for instruction bandwidth during execution, because no "instructions" are issued
during execution. All the bandwidth facilities on the MAP processor can be dedicated to program data
requirements.
System Memory

In cache-based systems, the only way to move data into the higher levels of the memory hierarchy is to
issue a memory load. DEL allows the programmer to instantiate a data pre-fetch functional unit instead.
This functional unit can run autonomously from the rest of the computational logic, so pre-fetching can be
explicitly overlapped with computation. Pre-fetch functional units do not have any implied cache line
structure. As such, pre-fetch units can also effectively perform gather/scatter type memory operations as
well as strided memory access.
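The shape of this technique is sketched below in C. This is a minimal illustration only:
stream_prefetch() is a hypothetical stand-in for the pre-fetch subroutine call described above (given a
software body here so the sketch compiles), not SRC's actual API.

    #include <string.h>

    #define N     1048576
    #define CHUNK 4096

    /* Hypothetical stand-in: on MAP, a call like this would instantiate an
       autonomous pre-fetch functional unit that copies 'count' 64-bit words
       from system memory toward the computation. */
    static void stream_prefetch(const double *src, double *dst, int count)
    {
        memcpy(dst, src, count * sizeof(double));
    }

    double sum_stream(const double *sys_mem, double *buf_a, double *buf_b)
    {
        double sum = 0.0;
        stream_prefetch(sys_mem, buf_a, CHUNK);            /* prime first chunk */
        for (int base = 0; base < N; base += CHUNK) {
            /* Fetch the NEXT chunk while computing on the current one; the
               pre-fetch unit runs autonomously, with no cache-line structure. */
            if (base + CHUNK < N)
                stream_prefetch(sys_mem + base + CHUNK, buf_b, CHUNK);
            for (int i = 0; i < CHUNK; i++)
                sum += buf_a[i];
            double *t = buf_a; buf_a = buf_b; buf_b = t;   /* swap buffers */
        }
        return sum;
    }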
On-Board Memory

On-board memory is referenced in 64-bit words with a simple array reference. There is a 4-clock latency
to access on-board memory, but the resulting data access functional units are fully pipelined, and
gather/scatter operations run at full speed. The 6 banks are explicitly managed via COMMON block names
in Fortran or static structures in C. This feature allows a programmer to make up to 6 memory references
per clock cycle to on-board memory.
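A sketch of the C idiom follows. That each static array lands in its own SRAM bank is our assumption
of how the tools apply the rule above; the source states only that static structures in C manage the banks.

    /* Six static arrays, assumed placed one per on-board SRAM bank, so the
       loop body below can issue six on-board references per clock. */
    #define M 4096
    static double a[M], b[M], c[M], d[M], e[M], f[M];

    void six_way(void)
    {
        for (int i = 0; i < M; i++)
            /* five loads + one store = six on-board memory references, fully
               pipelined once the 4-clock access latency is filled */
            f[i] = a[i] + b[i] + c[i] + d[i] + e[i];
    }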
On-Chip BRAM

The SRC MAP processor has over 600 KB of on-chip, addressable memory, or BRAM, arranged in 288
banks, each 36 bits wide and 512 elements deep. Unlike conventional caches, the parallel nature of explicit
BRAM allows programmers to create on-chip storage that is both large and fast. This storage is allocated
via local arrays in C or Fortran; simply declaring and accessing multiple local arrays obtains parallel
access to this memory structure.
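For example (a sketch; the one-array-per-bank mapping is our assumption about the compiler's behavior,
not a documented guarantee):

    /* Two local arrays inside a DEL routine, assumed to map to separate BRAM
       banks, so the coefficient read and the histogram update can proceed in
       the same clock cycle instead of serializing. */
    void bram_demo(const double *in, double *out, int n)
    {
        double coeff[512];
        double hist[512];

        for (int i = 0; i < 512; i++) { coeff[i] = 1.0 / (i + 1); hist[i] = 0.0; }

        for (int i = 0; i < n; i++) {
            int k = (int)in[i] & 511;
            out[i]  = in[i] * coeff[k];   /* bank 1: random read         */
            hist[k] = hist[k] + 1.0;      /* bank 2: random read + write */
        }
    }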
Fixed Length FIFO Queues

FIFO queues are simple and effective storage structures designed to delay a stream of operands in time by
the number of stages in the queue. They are the ideal structure for "remembering" a previous column in a
matrix calculation, for example. They are accessed in C or Fortran by explicitly calling a FIFO function or
subroutine. The macro itself can be built out of logic blocks or BRAM, depending on the length.
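A software sketch of the idiom: fifo_delay() below is a hypothetical stand-in for the FIFO subroutine the
text describes, sized to remember exactly one row of a 512-wide grid.

    #define W 512                      /* row width = FIFO depth */

    static double fifo_buf[W];         /* static, so zero-initialized */
    static int    fifo_head;

    /* Returns the value pushed W calls ago: a fixed-length delay line. */
    static double fifo_delay(double x)
    {
        double out = fifo_buf[fifo_head];
        fifo_buf[fifo_head] = x;
        fifo_head = (fifo_head + 1) % W;
        return out;
    }

    /* Stream a grid row by row; 'above' is the element one full row back,
       available without re-loading it from memory. */
    void row_pairs(const double *stream, double *out, int n)
    {
        for (int i = 0; i < n; i++) {
            double above = fifo_delay(stream[i]);
            out[i] = (i >= W) ? stream[i] + above : stream[i];
        }
    }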
Registers

An arbitrary number of registers can be used to connect computational functional units and to hold
temporary variables. This allows the compiler to connect localized kernels directly, without a store
followed by a load. Because the registers can be placed and routed near the functional units, and because
each register has a limited number of inputs and outputs, they are much easier to implement than the
register set on a microprocessor. There is no negative pressure on clock frequency from the number of
registers; on the contrary, adding registers where appropriate actually helps increase clock frequency with
DEL. Finally, the logic blocks that are used to create registers are flexible: if they are not needed for
storage, they can be used to construct computational units, and vice versa.
Wires

When possible, the compiler will simply use wires to connect functional units. For the programmer, local
variables that are assigned and subsequently used in a loop can become wires in a DEL circuit.
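A small illustration of both constructs; which temporaries become registers versus wires is the compiler's
decision, and the comments reflect our reading of the rules above rather than actual compiler output.

    void smooth(const double *in, double *out, int n)
    {
        double prev = 0.0;                  /* carried across iterations: a register */
        for (int i = 0; i < n; i++) {
            double scaled = in[i] * 0.5;    /* assigned then used in the loop: a wire */
            out[i] = scaled + prev * 0.5;
            prev   = in[i];                 /* no store/load sequence is ever needed  */
        }
    }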
Giving the programmer direct access to the memory hierarchy through DEL has several advantages
compared with the implicit nature of memory traffic found on microprocessors:

1. Overhead is reduced. There is no register spill traffic, no cache coherency traffic, no cache
   write-allocate traffic, no cache conflict traffic, no instruction traffic, etc.

2. The programmer can create exactly the kind of storage structure that is necessary and sufficient.
   This may be on-chip addressable memory, FIFO queues, registers, wires, etc.

3. Local data structures can be created to facilitate parallel access. There is no fixed on-chip
   bandwidth limitation such as is inherent in fixed-logic microprocessors.

4. Explicitly coded algorithms can take advantage of 100% of the available algorithm locality.

5. Much of the on-chip storage resource is flexible because it is constructed of logic blocks. If it is
   not needed for storage, it can be used for computation.

6. The programmer has the ability to create and use pre-fetch functional units, which are much more
   latency tolerant than the "load and use" model found in implicit devices.

7. There is no reliance on cache lines to hide memory latency. DEL architectures can be designed to
   efficiently address the irregular access and strided reference patterns that run poorly on
   microprocessors.
The best way to illustrate how locality and peak bandwidth are fully exploited with DEL is with a few
small programming examples. In the next section, we compare and contrast the performance of several
example programs on an implicit, microprocessor-based system and on SRC's DEL architecture.
Performance Measurements
In order to highlight the overall bandwidth efficiency and utilization of DEL-based programs, it is useful to
consider a set of programs and make some measurements on an existing, implicit architecture.

A set of codes was selected and optimized for the Intel Pentium 4 and for SRC's DEL architecture. For
each example, the problem size was specified in such a manner as to eliminate cache size as a
performance-limiting factor. In addition, highly tuned performance libraries were used where possible on
the Intel system to show the implicit architecture in the best possible light.

In each experiment, a sample code was compiled with the Intel Fortran or C compiler and run using the
Intel VTune performance analysis software. This package allowed us to measure the memory bus traffic
for each benchmark. That value can be compared with the algorithmic minimum to get a measure of
bandwidth efficiency. In addition, the raw bandwidth was measured on the memory bus during the
calculation. Comparing this value with the peak memory bandwidth of our specific system (3.2
GB/second) gives us a measure of bandwidth utilization. Combining these two factors yields an overall
efficiency value for the microprocessor.
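Restated symbolically (our notation, following the worked figure given for matrix multiplication in
Table 2):

    \[
    \text{efficiency} = \frac{B_{\text{min}}}{B_{\text{moved}}}, \qquad
    \text{utilization} = \frac{R_{\text{measured}}}{R_{\text{peak}}}, \qquad
    \text{speedup}_{\text{DEL}} \approx \frac{1}{\text{efficiency}} \times \frac{1}{\text{utilization}}
    \]

where B_min is the algorithmic minimum data motion, B_moved is the data actually transferred,
R_measured is the observed bus rate, and R_peak is 3.2 GB/second here. For matrix multiplication,
1/.16 x 1/.20 = 31X.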
In each case, programming with the constructs available in DEL resulted in codes that used the
algorithmic minimum amount of memory bandwidth, or perfect bandwidth efficiency. In addition, DEL
allows for a bandwidth utilization level of 100%, since computational capacity can be added until the
available memory bandwidth is completely exhausted. The overall performance improvement attainable
with DEL is hence inversely related to the overall efficiency value observed on the microprocessor. The
improvement in performance when compared with the implicit computation carried out on an Intel
Pentium 4 ranged from 6X for the median filter code to 216X for a triple DES encryption code. Note that
the performance comparisons assume a similar memory bus structure for both the implicit and explicit
models; hence the difference in performance is purely associated with the difference in the computational
model. A complete description of these programming examples as coded for SRC's DEL architecture can
be found in Appendix C. The performance of these codes is presented in Table 2.
Table 2: Microprocessor / DEL performance comparison.

Benchmark                      Problem size        Algorithmic  Data transferred,  Percent of data  Percent of  Performance
                                                   requirement  Intel Pentium 4    motion that was  bus peak    improvement with
                                                   (Mbytes)     measurement        algorithmically  utilized    DEL using the
                                                                (Mbytes)           necessary                    same memory bus
64-bit Matrix Multiplication   M=1024, N=512,      5            31                 16%              20%         31X
(Intel MKL library)            K=64                                                                             (1/.16 x 1/.20 = 31)
Triple DES                     N=10^6              16           24                 66%              0.7%        216X
Image Processing               600 x 472 bytes     .57          8.4                7%               28%         51X
Edge Detection
(2-D 9-point stencil)
Image Processing               600 x 472 bytes     .57          .93                61%              28%         6X
Median Filter
(2-D 9-point stencil)
2-D Stencil Grid Generation    512 x 512           7.8          22.9               37%              20%         14X
(TOMCATV)
3-D 27-point stencil           300 x 300 x 300     432          1125               38%              24%         11X
Median filter + Edge detector  600 x 472 bytes     .57          8.82               6%               22%         76X
(pipelined 2-D stencils)
Irregular Updates              Loop length: 10^6   32           205                15%              22%         30X
(64-bit elements)              Table size: 10^6
Matrix Multiplication

Matrix multiplication was measured using the Intel MKL library. This algorithm is generally considered
to run very efficiently on microprocessor-based systems, yet in this case we see that an additional factor of
31X can be achieved with DEL. Note that this figure can be pushed higher simply by running larger
problems, which exhibit higher data reuse.
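For reference, the measured kernel is a single MKL call of the following shape (a sketch: cblas_dgemm is
the real MKL entry point, but the timing/VTune harness and compiler flags are omitted; the dimensions
follow Table 2).

    #include <stdlib.h>
    #include <mkl_cblas.h>

    /* C(M x N) = A(M x K) * B(K x N), 64-bit elements, sized as in Table 2. */
    int main(void)
    {
        const int M = 1024, N = 512, K = 64;
        double *A = malloc(M * K * sizeof *A);
        double *B = malloc(K * N * sizeof *B);
        double *C = malloc(M * N * sizeof *C);
        for (int i = 0; i < M * K; i++) A[i] = 1.0;
        for (int i = 0; i < K * N; i++) B[i] = 2.0;

        /* The memory bus traffic of this one call is what was measured. */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K, 1.0, A, K, B, N, 0.0, C, N);

        free(A); free(B); free(C);
        return 0;
    }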
Triple DES

For the Triple DES example, we used a highly optimized, assembly-coded routine for the Intel processor.
The bandwidth consumed was 1.5X higher than the DEL algorithm required, most likely due to the
unnecessary cache-line fill traffic that is generated during the store operation. The Intel processor falls
short in memory utilization, however, because the computational resources within the processor are fixed
and cannot keep up with the available memory bandwidth. As such, the Intel processor utilized a tiny
fraction of peak memory bandwidth (less than 1%). Since DEL can run this algorithm at full memory
bandwidth speed, the overall performance advantage of a DEL-based implementation is 216X.
2-D Stencils

We looked at 3 different 2-D stencils. Performance improvement of the DEL algorithms ranged from 6X
to 51X. Both image-processing examples were coded using the optimized Intel graphics libraries and had
a single array for input and output. The TOMCATV code had two input arrays and four output arrays.
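As a concrete shape, a 2-D 9-point stencil has the access pattern sketched below (a generic 3x3
convolution; the actual Intel library filters and the DEL codings in Appendix C differ in detail). In the
DEL version, the two previous rows arrive from FIFO queues rather than from repeated memory loads.

    /* Generic 2-D 9-point (3x3) stencil over an 8-bit image. */
    void stencil9(const unsigned char *in, unsigned char *out,
                  int w, int h, const int k[3][3], int div)
    {
        for (int y = 1; y < h - 1; y++)
            for (int x = 1; x < w - 1; x++) {
                int acc = 0;
                for (int dy = -1; dy <= 1; dy++)      /* rows y-1, y, y+1; on   */
                    for (int dx = -1; dx <= 1; dx++)  /* DEL, prior rows come   */
                        acc += k[dy + 1][dx + 1]      /* from FIFO delay queues */
                             * in[(y + dy) * w + (x + dx)];
                acc /= div;
                out[y * w + x] = (unsigned char)(acc < 0 ? 0 : acc > 255 ? 255 : acc);
            }
    }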
3-D Stencils

The 3-D stencil code is based on an isotropic stencil algorithm. It contained a single source array and a
single destination array. The computation consists of 26 floating-point additions and 4 floating-point
multiplications. Performance improvement for the DEL implementation was 11X.
Pipelined Stencils

The pipelined stencil example used the median filter stencil followed by an edge detection stencil. On the
Intel system, the Intel libraries were used for each of these steps. The DEL algorithm requires no
additional bandwidth, since the computations can be chained together without using an intermediate
storage array. The DEL version is 76X faster using the same memory bus.
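Schematically, the fusion looks like the sketch below; the 1-D stages are simplified stand-ins for the real
2-D filters, and the point is only that stage-1 results feed stage 2 through registers, never through an
intermediate array.

    /* 3-point median: a tiny stand-in for the median filter stage. */
    static unsigned char med3(unsigned char a, unsigned char b, unsigned char c)
    {
        unsigned char t;
        if (a > b) { t = a; a = b; b = t; }
        if (b > c) { t = b; b = c; c = t; }
        if (a > b) { t = a; a = b; b = t; }
        return b;
    }

    void fused(const unsigned char *in, unsigned char *out, int n)
    {
        for (int i = 2; i + 2 < n; i++) {
            /* stage 1: median outputs held in registers */
            unsigned char left  = med3(in[i-2], in[i-1], in[i]);
            unsigned char right = med3(in[i],   in[i+1], in[i+2]);
            /* stage 2: edge (gradient) on the median stream; no intermediate
               array is ever written to memory */
            int g = (int)right - (int)left;
            out[i] = (unsigned char)(g < 0 ? -g : g);
        }
    }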
Irregular Updates

The irregular update benchmark was 30X more efficient with DEL. This benchmark is somewhat unique
among this group of codes in that it is not computationally intensive but data intensive. The Intel implicit
pre-fetch hardware is not effective in this case, and so the full Intel bus bandwidth cannot be utilized. The
DEL version is able to effectively pre-fetch these irregular memory patterns and achieves a 30X
performance advantage.
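The benchmark's shape is the classic gather/scatter update loop sketched below, with the loop length and
table size from Table 2; the index distribution and the update operation itself are our assumptions.

    #define TABLE_SIZE 1000000    /* 10^6 64-bit elements */
    #define LOOP_LEN   1000000

    /* Data-dependent indices defeat cache lines and hardware pre-fetch; a DEL
       pre-fetch unit can stream idx[] ahead and gather/scatter at bus speed. */
    void irregular_update(double *table, const long *idx, const double *val)
    {
        for (long i = 0; i < LOOP_LEN; i++)
            table[idx[i]] += val[i];      /* gather, add, scatter */
    }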
Conclusions
For the past 20 years, microprocessors have employed increasingly complex techniques to implicitly
exploit instruction level parallelism and locality in order to gain performance. The performance
improvement trend in these devices has slowed, because the implicit techniques available have reached a
point of diminishing returns. Implicit constructs are simply running out of steam.

The locality-exploiting mechanisms within a microprocessor are fixed and implicit, and are designed with
many tradeoffs considered. They are designed for the average case, and frequently do not exhibit the
desired behavior. Bandwidth is frequently wasted on unnecessary data motion, resulting in low bandwidth
efficiency, even for fully optimized code such as was illustrated in our examples. In addition, the fixed
computational and locality-exploiting features within the implicit microprocessor limit the speed, and
hence the bandwidth utilization, observed with most programs.

By contrast, the scalable locality mechanisms available to the DEL programmer can be used to exploit all
available algorithm locality and hence achieve 100% bandwidth efficiency. In addition, the scalable
computational resources available to the DEL programmer can be leveraged to attain 100% bandwidth
utilization. The end result is a programmable, computational system that delivers the maximum possible
performance for any given memory bus speed. This combination of efficiency and utilization yields an
orders-of-magnitude performance benefit compared with implicit models when using an equivalent
memory bus. More importantly, however, DEL will continue to scale with increased memory bus
performance into the future. We believe that this combination of explicit programming with denser DEL
devices and memory bandwidth enhancements will yield Teraflop-class processors by 2008.
Appendix A: A Brief History of the Microprocessor
In 1971, the Intel 4004 had about 2,300 transistors, supported 45 instructions and ran at 740 kHz. Today,
the latest Itanium 2 processor from Intel has about 200 million transistors running at over a gigahertz, and
the hottest topic at microprocessor design conferences is what to do with the billion transistors that are
likely to be available to chip designers in a few years' time. What follows is a summary of the key
performance innovations that have taken place in microprocessors over the last 30 years.
Pipelining

The earliest microprocessors had no pipelining: instructions were issued and completed before a new
instruction was issued. Instruction pipelining had been around for years in mainframe systems such as the
Control Data 6600. The feature allowed multiple instructions to be in process at once, thereby increasing
instruction throughput. This feature did not appear in microprocessors until the Intel 8086 processor in
1978. The pipelining was fairly rudimentary: a second instruction could be fetched and decoded while the
first was executing. Pipelining proved to be a very effective performance technique, but it also increased
the processor's appetite for instruction and data bandwidth.
Caches

For the first few years of the microprocessor, instruction and data loads were satisfied from main memory,
and logic speeds were compatible with memory access times. But logic performance was increasing at a
faster rate than main memory latency and bandwidth. To respond to this, microprocessor designers
adopted cache memories. Caches are designed to keep copies of recently used instructions and data in
nearby SRAM instead of far-away DRAM. In 1984, Motorola introduced the 68020, the first
microprocessor with an on-chip cache: a mere 256 bytes, for instructions only. The 68030 introduced a
separate 256-byte cache for instructions and one for data.
Cache memories can either be fast or large, but not both, so designers began to build cache hierarchies,
starting with small, fast caches near the processor core and adding larger, slower caches between the fast
cache and memory. The DEC Alpha 21164 was the first microprocessor with two levels of on-chip cache:
a 1024-word direct-mapped primary cache and a 96 KB secondary cache. As we shall see in Appendix B,
the specifics of cache design are full of tradeoffs and compromises.
Multiple Instruction Issue

In 1989, Intel introduced the first superscalar microprocessor, the i860, capable of issuing three
instructions per clock cycle. This was followed by the DEC Alpha 21064, capable of two instructions per
cycle. Later processors would add out-of-order execution capability. Superscalar techniques increased the
demand on bandwidth for both instructions and data.
Multiple Data Per Instruction

Another technique used by designers to speed up processing is to have instructions perform more work.
Good examples of this are the Single Instruction Multiple Data (SIMD) instructions that Intel has added to
its microprocessors over the years. The initial SIMD extensions, called MMX, were primarily for
accelerating graphics. Later, the SSE and SSE2 extensions were added to accelerate a wider variety of
operations. This technique increases the demand for operand bandwidth and reduces the instruction
bandwidth required for a given calculation.
Speculative Execution, Branch Prediction

The Pentium Pro was among the first microprocessors that supported speculative execution and branch
prediction. Speculative execution allows instructions that probably will be executed to start early, based
on branch prediction. Results are stored in a special cache called a reorder buffer and are retired if the
branch was correctly predicted. Mis-predictions can and do occur, and this can waste instruction and data
bandwidth.
Appendix B: Cache Design Tradeoffs
Microprocessors are fixed logic devices. As such, decisions have to be made at design time that optimize
for the "average" case, if such a thing exists. Of course, the corollary of this is that the design will not be
optimal for most applications. In this section, we discuss some of the design tradeoffs that have to be made
when constructing cache systems on microprocessors.
Size / Latency Tradeoff

Data caches can be small with low-latency access, or large with comparatively longer-latency access. It is
not possible to design a large-capacity, low-latency cache; hence microprocessor designers have introduced
multiple levels of caching into their designs. Each level out from the registers is larger, but takes longer to
access. At each level, some decision is made with regard to capacity and latency.
Line Width Tradeoff

In order to hide memory latency and exploit spatial locality, caches are arranged in lines. When a load
suffers a cache miss, a full cache line is loaded from main memory into the data cache. The idea is that
there is a high degree of spatial locality in programs, and it is quite likely that the other memory locations
in the cache line will also be required.
For programs with a high degree of spatial locality, or stride-one access, wide cache lines are best because
they minimize the number of times the processor suffers the latency of a memory access. For programs
with a low degree of spatial locality, or random access, narrow lines are best, as this minimizes the wasted
bandwidth used to bring in the unused neighbors in the cache line.
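The arithmetic behind this tradeoff is simple (our worked example, using the extreme line widths cited in
the next paragraph): a random stream of 8-byte loads against 256-byte lines keeps only

    \[
    \frac{8\ \text{bytes used}}{256\ \text{bytes transferred}} \approx 3\%
    \]

of the bus traffic, wasting 97%, while an 8-byte line wastes nothing on random access but gives up all
latency hiding for stride-one code.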
The widest cache lines are up to 256 bytes long, on IBM Power 3 processors. Needless to say, this machine
is not well known for its gather/scatter performance. At the other extreme, the Cray SV1 had a cache line
that was only one word, or 8 bytes, wide. On this architecture, it was assumed that explicit vector loads
would exploit spatial locality, and so the data cache could be designed exc
