Attorney Docket No.: SRC028 PRO
Client/Matter No.: 80404.0033
Express Mail No.: EV335405698US
PROVISIONAL APPLICATION FOR PATENT COVER SHEET
PTO/SB/16 (10-01)

This is a request for filing a PROVISIONAL APPLICATION FOR PATENT under 37 CFR 1.53(c).

INVENTOR(S)

Given Name (first & middle, if any) | Family Name or Surname | Residence (City and State/Country)
Jon M.        | Huppenthal | Colorado Springs, CO USA
Daniel        | Poznanovic | Colorado Springs, CO USA
Jeffrey       | Hammes     | Colorado Springs, CO USA
Lisa          | Krause     | Minneapolis, MN USA
Jon           | Steidel    | Minneapolis, MN USA
David E.      | Caliga     | Colorado Springs, CO USA
Thomas R.     | Seeman     | Colorado Springs, CO USA
Lee A.        | Burton     | Divide, CO USA
Jeffrey Paul  | Brooks     | St. Louis, MN USA

[ ] Additional inventors are named on the __ separately numbered sheets attached hereto
TITLE OF THE INVENTION (500 characters max):

BANDWIDTH EFFICIENCY AND UTILIZATION USING DIRECT EXECUTION LOGIC

CORRESPONDENCE ADDRESS

Direct all correspondence to:

[X] Customer Number: 25235 (Place Customer Number Bar Code Label Here)

[ ] Firm or Individual Name: __________
    Address: __________
    City: __________  State: __________  ZIP: __________
    Country: __________  Phone: __________  Fax: __________
ENCLOSED APPLICATION PARTS (check all that apply)

[X] Specification. Number of Pages: 23
[ ] CD(s), Number: ___
[ ] Drawing(s). Number of Sheets: ___
[ ] Other (specify): ___
[ ] Application Data Sheet. See 37 CFR 1.76
METHOD OF PAYMENT OF FILING FEES FOR THIS PROVISIONAL APPLICATION (check one)

[ ] Applicant claims small entity status. See 37 CFR 1.27.
[X] A check or money order is enclosed to cover the filing fees.
[X] The Commissioner is hereby authorized to charge any additional filing fees or credit any overpayment to Deposit Account No. 50-1123.

The invention was made by an agency of the U.S. Government or under a contract with an agency of the U.S. Government.
[X] No.  [ ] Yes, the name of the agency and contract no. are: __________

FILING FEE AMOUNT ($): 160.00
Date: [illegible]

Signature: William J. Kubida
Reg. No.: 9,664
Telephone: 719-448-5909
Docket No.: SRC028 PRO

Attorney Docket No.: SRC028 PRO
Client/Matter No.: 80404.0033
Express Mail Label No.: EV335405698US

IN THE UNITED STATES PATENT AND TRADEMARK OFFICE

In re Provisional Application of:

Jon M. Huppenthal, Daniel Poznanovic, Jeffrey Hammes, Lisa Krause, Jon Steidel, David E. Caliga, Thomas R. Seeman, Lee A. Burton, and Jeffrey Paul Brooks

Serial No.: -----------

Filed: Herewith

For: BANDWIDTH EFFICIENCY AND UTILIZATION USING DIRECT EXECUTION LOGIC

CERTIFICATE OF MAILING BY EXPRESS MAIL

MAIL STOP PROVISIONAL PATENT APPLICATION
Commissioner for Patents
P.O. Box 1450
Alexandria, VA 22313-1450

Sir:

The undersigned hereby certifies that the attached:
1. Provisional Application for Patent Cover Sheet;
2. Specification - 23 pages;
3. Check in the Amount of $160.00;
4. Certificate of Mailing by Express Mail; and
5. Return Card
relating to the above application, were deposited as "Express Mail," Mailing Label No. EV335405698US, with the United States Postal Service, addressed to MAIL STOP PROVISIONAL PATENT APPLICATION, Commissioner for Patents, P.O. Box 1450, Alexandria, VA 22313-1450, on

Date: [illegible] 2003

William J. Kubida

HOGAN & HARTSON LLP
One Tabor Center
1200 17th Street, Suite 1500
Denver, Colorado 80202
(719) 448-5909 Tel
(303) 899-7333 Fax

Bandwidth Efficiency and Utilization Using Direct Execution Logic

SRC Computers, Inc.

Introduction

In a previous paper, Petaflop Computing Using Direct Execution Logic, we showed that Teraflop-class processors will be available to direct execution logic (DEL) programmers in 2008. The challenge with this kind of computational capacity, of course, is how to keep it fed with operands. One key facet of this is the ability to exploit all available locality in a program or algorithm, or bandwidth efficiency. A second facet is the ability to run calculations at such a rate as to use all available memory bandwidth, or bandwidth utilization.

In this paper, we will show that the implicit features in modern microprocessor-based systems fail to achieve full bandwidth efficiency or utilization, even for fully optimized and tuned codes. By contrast, we will show that a programmer can use a high-level language to directly manipulate the memory hierarchy and calculation pipeline via DEL, achieving both maximum bandwidth efficiency and utilization across a wide variety of algorithms. The end result is a programmable, computational system that delivers the maximum possible performance for any given memory bus speed. This combination of efficiency and utilization yields an orders-of-magnitude performance benefit even when utilizing an equivalent memory bus. The diverse set of examples presented in this paper demonstrates between 6X and 216X performance gain through DEL.

The ability of DEL to use 100% of available bandwidth with near-100% efficiency is exciting today, but becomes even more exciting in the future as new interconnect technologies such as stacked die promise to substantially increase available memory bandwidth. We believe that performance via DEL will continue to scale perfectly into these bandwidth-rich environments.

Microprocessors: Reaching the Limits of Implicit Methods

Over the past 30 years, microprocessors have enjoyed annual performance improvements of about 50%. Designers have been taking advantage of higher frequencies, and have added a great deal of complexity, primarily in two areas: memory hierarchies and searching for instruction-level parallelism (ILP). For a short history of the microprocessor and these implicit design optimizations, see Appendix A.

Instruction Level Parallelism
Techniques which automatically exploit ILP at execution time have reached the point of diminishing returns. Of the 10 million transistors in an Intel Pentium III processor, for example, more than half are dedicated to instruction cache, branch prediction, out-of-order execution and superscalar logic. In addition, complex ILP techniques require chip-wide wires. In a few short years, it will take several clocks to travel across a microprocessor die, so these complex techniques do not scale with integrated circuit technology. It appears that viable techniques to extract implicit parallelism from a serial instruction stream have simply run out of steam. According to Bill Dally, Pat Hanrahan and Ron Fedkiw at Stanford University:
    The performance of the microprocessors ... is no longer scaling at the historic rate of 50% per year. Microprocessors have reached a point of diminishing returns in terms of gates per clock and clocks per instruction. As we enter an era of billion-transistor chips, there is not enough explicit parallelism in conventional programs to efficiently use these resources... It is expected that without new innovations in parallel processor designs, microprocessor performance will only increase with the increase in gate speed, at a rate of about 20% per year. Such a change would have a major effect on the computer business, and the entire economy.¹

Attaining high performance on microprocessors is beginning to require more explicit techniques from the compiler and programmer. For example, on the Intel Itanium processor, the compiler must carefully arrange instructions to attain 6-way instruction issue. Future microprocessor chips will have multiple processors on the same die, and programmers and compilers will have to modify coding techniques to take advantage of parallelism at this level.

Attaining high performance using direct execution logic is much more explicit. A programmer and compiler explicitly identify parallelism, and the compiler builds the necessary parallel logic pipelines. This technique has proven to be very effective. We have seen simple examples constructed in DEL where many hundreds of operations are executed in parallel each clock cycle. The available parallelism via DEL will increase linearly with integrated circuit density, and so the technique will continue to scale.

DEL also allows the programmer to explicitly manipulate the memory hierarchy, avoiding the tradeoffs and complexities that are common in cache designs today. This allows the programmer to make the most efficient use of on-chip storage and fully exploit available algorithm locality. In the next two sections, we will compare the implicit memory hierarchies found in microprocessors with the explicit memory hierarchy available to programmers through DEL.

Data and Instruction Cache
The design goal of any cache memory system is to minimize memory bus transactions by taking advantage of application locality. Microprocessors are fixed-logic devices. As such, decisions have to be made at design time that optimize for the "average" case, if such a thing exists. Of course, the corollary of this is that the design will not be optimal for most applications.

Typical microprocessors exploit data locality implicitly via a rather complex data cache structure fraught with tradeoffs. Some of the design parameters include cache size (which has an inverse relationship to cache latency), the number of cache levels, cache-line width, the associativity and replacement policy, the write policy, the number of physical registers and the instruction cache design. A cache hierarchy that is optimized for the SPEC benchmark, for example, may not be a good choice for database applications or sparse matrix calculations. No cache design works well with irregular access patterns and strided memory references. For a full discussion of cache design tradeoffs, see Appendix B.

Because there are so many design choices when constructing caches, it is difficult to write portable code that will be optimal across cache designs. An example of an attempt in this regard is the Automatically Tuned Linear Algebra Software (ATLAS) project. This software configures its blocking strategy based on what it can learn about the underlying cache hierarchy of a processor.

Just as the limits of effective ILP are being reached, the limits of implicit data locality are also being reached. We are seeing more explicit data-motion constructs creeping into microprocessor instruction sets in the form of pre-fetch instructions and cache "hints". In the next section, we will see that the tradeoffs made in today's implicit devices can become programming choices in a DEL environment, essentially creating optimized hardware tuned for individual programs.

¹ B. Dally, P. Hanrahan, R. Fedkiw, A Streaming Supercomputer, September 18, 2001.

Memory Hierarchies in a DEL Environment

In this section, we'll look at the explicit memory management constructs available in SRC's DEL environment. These constructs allow a user to essentially instantiate locality when and where it is needed within an algorithm. Unlike conventional data caches, these mechanisms are parallel and extensible in nature, with no fixed bandwidth limitation or sequential access restrictions.

The SRC-6E is a general-purpose computing system with reconfigurable processing elements, called MAP processors. The SRC compilation system is designed to turn user codes and algorithms into direct execution logic within the MAP processor. The memory hierarchy on the SRC-6E processor is described below:

1. System Memory (DDR DRAM): The MAP processor can access memory that is shared with the microprocessor.

2. On-Board Memory (SRAM): The MAP processor is configured with 6 banks of SRAM memory, each capable of delivering one word per clock cycle to the MAP (4.8 GB/sec). Each of the 6 banks is 4 Mbytes in size, for a total of 24 Mbytes of storage. This memory can be randomly addressed with gather- and scatter-type operations.

3. On-Chip Block RAM (BRAM): The MAP processor has 288 sections of Block RAM memory. Each section is 36 bits wide and 512 elements deep and can deliver two words per cycle. When configured to deliver 64-bit operands, the 288 banks of Block RAM can deliver up to 288 64-bit operands per cycle (210.4 GB/sec). Note that this memory can be randomly addressed.

4. Register Storage: If the Configurable Logic Blocks (CLBs) within the MAP are configured for storage, 1000s of registers can be configured. These can be used to pass operands from one computational unit to the next, or to construct other storage elements such as FIFO queues and look-up tables. Registers are created as needed on the SRC MAP processor, and available bandwidth in this construct can reach into the 100s of GB/sec.

On the SRC MAP processor, memory is managed explicitly using constructs in either C or Fortran. Table 1 shows how each level of the hierarchy is expressed in these high-level languages and how these constructs are built in hardware; a code sketch follows the table.

Table 1: Memory hierarchy programming constructs on MAP.

Storage Construct            Source Code Construct                        Built from
System Memory                Pre-fetch subroutine or function call        DDR DRAM chips
On-Board Memory              COMMON blocks in Fortran;                    On-board SRAM chips
                             static structures in C
Addressable on-chip memory   Local arrays within DEL routine              On-chip BRAM
FIFO queues, etc.            Called as subroutine or function             On-chip CLBs or on-chip BRAM
                             within DEL routine
Registers                    Scalar variables; also used internally       On-chip CLBs
                             by hardware macros
Wires                        Scalar variables; also used internally       On-chip routing resources
                             by hardware macros
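
To make Table 1 concrete, the sketch below shows how the storage classes might appear in a C-coded MAP routine. The declaration style and all names here are our own illustration, not SRC's exact Carte syntax; the point is only that storage class follows directly from ordinary declarations rather than from cache heuristics.

    /* Hypothetical sketch of the Table 1 mapping in C.
       Names and declaration style are illustrative, not SRC's exact API. */

    #define N 512

    /* File-scope (static) structures map to on-board SRAM; separate
       objects let the compiler place them in different banks so both
       can be referenced in the same clock cycle. */
    static double bank_a[N];
    static double bank_b[N];

    void map_kernel(const double *sys_in, double *sys_out)
    {
        double window[N];   /* local array: on-chip BRAM          */
        double acc = 0.0;   /* scalar local: a register or a wire */

        for (int i = 0; i < N; i++)
            window[i] = bank_a[i] + bank_b[i];  /* one reference per bank per clock */

        for (int i = 0; i < N; i++)
            acc += window[i] * sys_in[i];       /* sys_in: shared system memory */

        sys_out[0] = acc;
    }

In hardware, the compiler would turn such loops into pipelines; in plain C the code simply runs sequentially, which is why this is a behavioral sketch only.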
Instruction Bandwidth
Logic for DEL is downloaded into the MAP processor before execution can begin. There is no need to provide complex facilities for instruction bandwidth because no "instructions" are issued during execution. All the bandwidth facilities on the MAP processor can be dedicated to program data requirements.

System Memory
In cache-based systems, the only way to move data into the higher levels of the memory hierarchy is to issue a memory load. DEL allows the programmer to instantiate a data pre-fetch functional unit. This functional unit can run autonomously from the rest of the computational logic; hence pre-fetching can be explicitly overlapped with computation. Pre-fetch functional units do not have any implied cache-line structure. As such, pre-fetch units can also effectively perform gather/scatter-type memory operations as well as strided memory access.
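
The paper does not give the pre-fetch call's signature, so the sketch below models the idea in portable C with a hypothetical prefetch_block() helper and double buffering: while block b is being consumed, block b+1 is being fetched. In a DEL implementation the two halves run concurrently as independent functional units; in sequential C they merely alternate.

    #include <string.h>

    #define BLK 256

    /* Hypothetical stand-in for an autonomous pre-fetch unit (a DMA-like
       copy from system memory into a fast local buffer). */
    static void prefetch_block(double *dst, const double *src, int n)
    {
        memcpy(dst, src, n * sizeof *dst);
    }

    /* Double-buffered consumption: fetch block b+1 while computing on
       block b. Assumes nblocks >= 1 and mem holds nblocks * BLK doubles. */
    double process(const double *mem, int nblocks)
    {
        double buf[2][BLK];
        double sum = 0.0;

        prefetch_block(buf[0], mem, BLK);               /* prime buffer 0 */
        for (int b = 0; b < nblocks; b++) {
            if (b + 1 < nblocks)                        /* fetch ahead ... */
                prefetch_block(buf[(b + 1) & 1], mem + (b + 1) * BLK, BLK);
            for (int i = 0; i < BLK; i++)               /* ... while computing */
                sum += buf[b & 1][i];
        }
        return sum;
    }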

On-Board Memory
On-board memory is referenced in 64-bit words with a simple array reference. There is a 4-clock latency to access on-board memory, but the resulting data access functional units are fully pipelined, and gather/scatter operations run at full speed. The 6 banks are explicitly managed via COMMON block key names in Fortran or static structures in C. This feature allows a programmer to make up to 6 memory references per clock cycle to on-board memory.

On-Chip BRAM
The SRC MAP processor has over 600 KB of on-chip, addressable memory, or BRAM, arranged in 288 banks of 512 elements, each 36 bits wide. Unlike conventional caches, the parallel nature of explicit BRAM allows programmers to create on-chip storage that is both large and fast. This storage is allocated via local arrays in C or Fortran. Simply declaring and accessing multiple local arrays will obtain parallel access to this memory structure.

Fixed-Length FIFO Queues
FIFO queues are simple and effective storage structures designed to delay a stream of operands in time by the number of stages in the queue. They are the ideal structure for "remembering" a previous column in a matrix calculation, for example. They are accessed in C or Fortran by explicitly calling a FIFO function or subroutine. The macro itself can be built out of logic blocks or BRAM, depending on the length.
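
The FIFO call signature is not given in the paper; the following C model (our own, built on a ring buffer) shows the behavior such a macro provides: a value pushed now re-emerges WIDTH pushes later, which is exactly the delay needed to hold one image row or matrix column.

    #define WIDTH 600          /* one image row, e.g. for a 600 x 472 image */

    typedef struct { double q[WIDTH]; int head; } fifo_t;

    /* Push a value in; get back the value pushed WIDTH calls ago.
       The queue should start zero-initialized: fifo_t row = {0}; */
    static double fifo_push(fifo_t *f, double in)
    {
        double out = f->q[f->head];   /* value from WIDTH pushes ago */
        f->q[f->head] = in;
        f->head = (f->head + 1) % WIDTH;
        return out;
    }

Streaming an image through one such FIFO gives a stencil simultaneous access to the current pixel and the pixel one row above it without re-reading the previous row from memory; in DEL the same structure is built from CLBs or BRAM.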

Registers
An arbitrary number of registers can be used to connect computational functional units and to hold temporary variables. This allows the compiler to connect localized kernels directly, without a store followed by a load. Because the registers can be placed and routed near the functional units, and because each register has a limited number of inputs and outputs, they are much easier to implement than the register set on a microprocessor. There is no negative pressure on clock frequency from the number of registers; on the contrary, adding registers where appropriate actually helps increase clock frequency with DEL. Finally, the logic blocks that are used to create registers are flexible: if they are not needed for storage, they can be used to construct computational units, and vice versa.

Wires
When possible, the compiler will simply use wires to connect functional units. For the programmer, local variables that are assigned and subsequently used in a loop can become wires in a DEL circuit.

Giving the programmer direct access to the memory hierarchy through DEL has several advantages compared with the implicit nature of memory traffic found on microprocessors:

1. Overhead is reduced. There is no register spill traffic, no cache coherency traffic, no cache write-allocate traffic, no cache conflict traffic, no instruction traffic, etc.

2. The programmer can create exactly the kind of storage structure that is necessary and sufficient. This may be on-chip addressable memory, FIFO queues, registers, wires, etc.

3. Local data structures can be created to facilitate parallel access. There is no fixed on-chip bandwidth limitation such as is inherent in fixed-logic microprocessors.

4. Explicitly coded algorithms can take advantage of 100% of the available algorithm locality.

5. Much of the on-chip storage resource is flexible because it is constructed of logic blocks. If it is not needed for storage, it can be used for computation.

6. The programmer has the ability to create and use pre-fetch functional units, which is much more latency tolerant than the "load and use" model found in implicit devices.

7. There is no reliance on cache lines to hide memory latency. DEL architectures can be designed to efficiently address the irregular access and strided reference patterns that run poorly on microprocessors.

The best way to illustrate how locality and peak bandwidth are fully exploited with DEL is with a few small programming examples. In the next section, we compare and contrast the performance of several example programs on an implicit, microprocessor-based system and on SRC's DEL architecture.

Performance Measurements

In order to highlight the overall bandwidth efficiency and utilization of DEL-based programs, it is useful to consider a set of programs and make some measurements on an existing, implicit architecture.

A set of codes was selected and optimized for the Intel Pentium 4 and for SRC's DEL architecture. For each example, the problem size was specified in such a manner as to eliminate cache size as a performance-limiting factor. In addition, highly tuned performance libraries were used where possible on the Intel system to show the implicit architecture in the best possible light.

In each experiment, a sample code was compiled with the Intel Fortran or C compiler and run using the Intel VTune performance analysis software. This package allowed us to measure the memory bus traffic for each benchmark. This value can be compared with the algorithmic minimum to get a measure of bandwidth efficiency. In addition, the raw bandwidth speed was measured on the memory bus during the calculation. Comparing this value with the peak memory bandwidth of our specific system (3.2 GB/second) gives us a measure of bandwidth utilization. Combining these two factors yields an overall efficiency value for the microprocessor.

In each case, programming with the constructs available with DEL resulted in codes that used an algorithmically minimum amount of memory bandwidth, or perfect bandwidth efficiency. In addition, DEL allows for a bandwidth utilization level of 100%, since computation capacity can be added until the available memory bandwidth is completely exhausted. The overall performance improvement that is attainable with DEL is hence inversely related to the overall efficiency value observed on the microprocessor. The improvement in performance when compared with the implicit computation carried out on an Intel Pentium 4 ranged from 6X for the median filter code to 216X for a triple DES encryption code. Note that performance comparisons assume a similar memory bus structure for both the implicit and explicit models; hence the difference in performance is purely associated with the difference in the computational model. A complete description of these programming examples as coded with SRC's DEL architecture can be found in Appendix C. The performance of these codes is presented in Table 2.
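
The improvement figures in Table 2 follow from a simple model: if a code moves 1/e times more data than necessary (bandwidth efficiency e) and sustains a fraction u of peak bus bandwidth (utilization u), a DEL version that is perfectly efficient and fully utilizes the bus should gain a factor of 1/(e x u). A small check of the matrix multiplication row:

    /* Predicted DEL speedup = 1 / (bandwidth efficiency x bus utilization).
       Values below are the matrix multiplication row of Table 2. */
    #include <stdio.h>

    int main(void)
    {
        double efficiency  = 0.16;  /* fraction of data motion that was necessary */
        double utilization = 0.20;  /* fraction of bus peak utilized */
        printf("predicted speedup: %.0fX\n", 1.0 / (efficiency * utilization));
        return 0;                   /* prints: predicted speedup: 31X */
    }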

Table 2: Microprocessor / DEL performance comparison.

Benchmark                      Problem size          Algorithmic   Data transferred,  % of data motion  % of BUS  Performance improvement
                                                     requirement   Intel Pentium 4    algorithmically   peak      with DEL using the
                                                     (Mbytes)      measurement        necessary         utilized  same memory BUS
                                                                   (Mbytes)

64-bit Matrix Multiplication   M=1024, N=512, K=64   5             31                 16%               20%       31X (1/0.16 x 1/0.20 = 31)
(Intel MKL library)

Triple DES                     N = 10^6              16            24                 66%               0.7%      216X

Image Processing               600 x 472 bytes       0.57          7.8                7%                28%       51X
Edge Detection
(2-D 9-point stencil)

Image Processing               600 x 472 bytes       0.57          0.93               61%               28%       6X
Median Filter
(2-D 9-point stencil)

2-D Stencil Grid Generation    512 x 512             8.4           22.9               37%               20%       14X
(TOMCATV)

3-D 27-point stencil           300 x 300 x 300       432           1125               38%               24%       11X

Median filter + edge detector  600 x 472 bytes       0.57          8.82               6%                22%       76X
(pipelined 2-D stencils)

Irregular Updates              loop length = 10^6,   32            205                15%               22%       30X
(64-bit elements)              table size = 10^6

Matrix Multiplication
Matrix multiplication was measured using the Intel MKL library. This algorithm is generally considered to run very efficiently on microprocessor-based systems, yet in this case we see that an additional factor of 30X can be achieved with DEL. Note that this figure can be pushed higher simply by running larger problems, which exhibit higher data reuse.

Triple DES
For the Triple DES example, we used a highly optimized, assembly-coded routine for the Intel processor. Its bandwidth consumption was 1.5X higher than that of the DEL algorithm, most likely due to the unnecessary cache-line fill traffic that is generated during the store operation. The Intel processor falls short in bandwidth utilization, however, because the computational resources within the processor are fixed and cannot keep up with available memory bandwidth. As such, the Intel processor utilized a tiny fraction of peak memory bandwidth (less than 1%). Since DEL can run this algorithm at full memory bandwidth speed, the overall performance advantage of a DEL-based implementation is 216X.

2-D Stencils
We looked at three different 2-D stencils. Performance improvement of the DEL algorithms ranged from 6X to 51X. Both image-processing examples were coded using the optimized Intel graphics libraries and had a single array for input and output. The TOMCATV code had two input arrays and four output arrays.
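
The exact filter coefficients used in the benchmarks are not given in this paper; the sketch below shows the generic shape of a 2-D 9-point stencil, using a stand-in Laplacian edge-detection kernel. Each output point touches nine inputs, eight of which are shared with neighboring output points; that reuse is the locality DEL captures with FIFO row buffers and that a cache may or may not capture.

    /* Generic 2-D 9-point stencil (boundary pixels left untouched).
       The Laplacian kernel is an assumed example, not the benchmark's
       actual coefficients. */
    void stencil9(const float *in, float *out, int w, int h)
    {
        static const float k[3][3] = {
            { -1, -1, -1 },
            { -1,  8, -1 },
            { -1, -1, -1 },
        };
        for (int y = 1; y < h - 1; y++)
            for (int x = 1; x < w - 1; x++) {
                float s = 0.0f;
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                        s += k[dy + 1][dx + 1] * in[(y + dy) * w + (x + dx)];
                out[y * w + x] = s;
            }
    }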

3-D Stencils
The 3-D stencil code is based on an isotropic stencil algorithm. It contained a single source array and a single destination array. The computation consists of 26 floating-point additions and 4 floating-point multiplications per point. Performance improvement for the DEL implementation was 11X.

Pipelined Stencils
The pipelined stencil example used the median filter stencil followed by an edge-detection stencil. On the Intel system, the Intel libraries were used for each of these steps. The DEL algorithm requires no additional bandwidth, since the computations can be chained together without using an intermediate storage array. The DEL version is 76X faster using the same memory bus.

Irregular Updates
The irregular update benchmark was 30X more efficient with DEL. This benchmark is somewhat unique among this group of codes in that it is not computationally intensive, but data intensive. The Intel implicit pre-fetch hardware is not effective in this case, and so the full Intel bus bandwidth cannot be utilized. The DEL version is able to effectively pre-fetch these irregular memory patterns and achieves a 30X performance advantage.
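
The paper specifies the loop length (10^6) and table size (10^6 64-bit elements) but not the exact update performed, so the following C loop is an assumed form of the benchmark: a gather, a modify, and a scatter driven by an index stream. This access pattern defeats cache lines and hardware pre-fetch, while a DEL pre-fetch unit can stream the index vector ahead of the updates.

    #include <stdint.h>
    #include <stddef.h>

    /* Assumed form of the irregular-update kernel: random 64-bit
       read-modify-writes driven by an index stream. */
    void irregular_update(double *table, const uint32_t *idx,
                          const double *val, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            table[idx[i]] += val[i];  /* each access lands on a random cache line */
    }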

Conclusions

For the past 20 years, microprocessors have employed increasingly complex techniques to implicitly exploit instruction-level parallelism and locality to gain performance. The performance improvement trend in these devices has slowed because the implicit techniques available have reached a point of diminishing returns. Implicit constructs are simply running out of steam.

The locality-exploiting mechanisms within a microprocessor are fixed and implicit, and designed with many tradeoffs considered. They are designed for the average case, and frequently do not exhibit the desired behavior. Bandwidth is frequently wasted on unnecessary data motion, resulting in low bandwidth efficiency, even for fully optimized code, as was illustrated in our examples. In addition, the fixed computational and locality-exploiting features within the implicit microprocessor limit the speed, and hence the bandwidth utilization, observed with most programs.

By contrast, the scalable locality mechanisms available to the DEL programmer can be used to exploit all available algorithm locality and hence achieve 100% bandwidth efficiency. In addition, the scalable computational resources available to the DEL programmer can be leveraged to attain 100% bandwidth utilization. The end result is a programmable, computational system that delivers the maximum possible performance for any given memory bus speed. This combination of efficiency and utilization yields an orders-of-magnitude performance benefit compared with implicit models when using an equivalent memory bus. More importantly, however, DEL will continue to scale with increased memory bus performance into the future. We believe that this combination of explicit programming with denser DEL devices and memory bandwidth enhancements will yield Teraflop-class processors by 2008.
Appendix A: A Brief History of the Microprocessor

In 1971, the Intel 4004 had about 2300 transistors, supported 45 instructions and ran at 740 kHz. Today, the latest Itanium 2 processor from Intel has about 200 million transistors running at over a gigahertz, and the hottest topic at microprocessor design conferences is what to do with the billion transistors that are likely to be available to chip designers in a few years' time. What follows is a summary of the key performance innovations that have taken place in microprocessors over the last 30 years.

Pipelining
The earliest microprocessors had no pipelining. Instructions were issued and completed before a new instruction was issued. Instruction pipelining had been around for years in mainframe systems such as the Control Data 6600. The feature allowed multiple instructions to be in process at once, thereby increasing instruction throughput. This feature did not appear in microprocessors until the Intel 8086 processor in 19xx. The pipelining was fairly rudimentary: a second instruction could be fetched and decoded while the first was executing. Pipelining proved to be a very effective performance technique, but it also increased the processor's appetite for instruction and data bandwidth.

Caches
For the first few years of the microprocessor, instruction and data loads were satisfied from main memory, and logic speeds were compatible with memory access times. Logic performance, however, was increasing at a faster rate than main memory latency and bandwidth. To respond to this, microprocessor designers adopted cache memories. Caches are designed to keep copies of recently used instructions and data in nearby SRAM instead of faraway DRAM. In 1984 Motorola introduced the 68020, the first microprocessor with an on-chip cache, a mere 256 bytes for instructions only. The 68030 introduced separate 256-byte caches, one for instructions and one for data.

Cache memories can either be fast or large, but not both, so designers began to build cache hierarchies, starting with small, fast caches near the processor core and adding larger, slower caches between the fast cache and memory. The DEC Alpha 21164 was the first microprocessor with 2 levels of on-chip cache: a 1024-word direct-mapped cache and a 96 KB secondary cache. As we shall see in the next section, the specifics of cache design are full of tradeoffs and compromises.

Multiple Instruction Issue
In 1989, Intel introduced the first superscalar microprocessor, the i860. It was capable of issuing three instructions per clock cycle. This was followed by the DEC Alpha 21064, which was capable of two instructions per cycle. Later processors would add out-of-order execution capability. Superscalar techniques increased the demand on bandwidth for both instructions and data.

Multiple Data Per Instruction
Another technique used by designers to speed up processing is to have instructions perform more work. Good examples of this are the Single Instruction Multiple Data (SIMD) instructions that Intel has added to its microprocessors over the years. The initial SIMD extensions, called MMX, were primarily for accelerating graphics. Later, SSE and SSE2 extensions were added to accelerate a wide variety of operations. This technique increases the demand for operand bandwidth and reduces the instruction bandwidth required for a given calculation.

Speculative Execution, Branch Prediction
The Pentium Pro was among the first microprocessors that supported speculative execution and branch prediction. Speculative execution allows instructions that probably will be executed to start early, based on branch prediction. Results are stored in a special cache called a reorder buffer and are retired if the branch was correctly predicted. Mis-predictions can and do occur, and this can waste instruction and data bandwidth.

Appendix B: Cache Design Tradeoffs

Microprocessors are fixed-logic devices. As such, decisions have to be made at design time that optimize for the "average" case, if such a thing exists. Of course, the corollary of this is that the design will not be optimal for most applications. In this section, we discuss some of the design tradeoffs that have to be made when constructing cache systems on microprocessors.

Size / Latency Tradeoff
Data caches can be small with low-latency access, or large with comparatively longer-latency access. It is not possible to design a large-capacity, low-latency cache; hence microprocessor designers have introduced multiple levels of caching into their designs. Each level out from the registers is larger, but takes longer to access. At each level, some decision is made with regard to capacity and latency.

Line Width Tradeoff
In order to hide memory latency and exploit spatial locality, caches are arranged in lines. When a load suffers a cache miss, a cache line is loaded from main memory into the data cache. The idea is that there is a high degree of spatial locality in programs, and it is quite likely that the other memory locations in the cache line will also be required.

For programs with a high degree of spatial locality, or stride-one access, wide cache lines are best because they minimize the number of times the processor suffers the latency of a memory access. For programs with a low degree of spatial locality, or random access, narrow lines are best, as this minimizes the wasted bandwidth used to bring in the unused neighbors in the cache line.
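
A quick worked example (our numbers, not drawn from the paper) makes the waste concrete: with 128-byte lines and a stride of sixteen 8-byte doubles, every miss fetches a full line but the program uses only one word of it.

    #include <stdio.h>

    int main(void)
    {
        int line_bytes = 128;  /* cache line width (assumed) */
        int word_bytes = 8;    /* one double */
        /* A stride of 16 doubles (128 bytes) means each fetched line
           contributes exactly one useful word. */
        printf("useful fraction of each line: %.2f%%\n",
               100.0 * word_bytes / line_bytes);   /* prints 6.25% */
        return 0;
    }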

The widest cache lines are up to 256 bytes long, on IBM Power3 processors. Needless to say, this machine is not well known for its gather/scatter performance.
