Attorney Docket No.: SRC028 PRO
Client/Matter No.: 80404.0033
Express Mail No.: EV335405698US
PROVISIONAL APPLICATION FOR PATENT COVER SHEET
This is a request for filing a PROVISIONAL APPLICATION FOR PATENT under 37 CFR 1.53(c)

PTO/SB/16 (10-01)
INVENTOR(S)

Given Name (first & middle, if any)   Family Name or Surname   Residence
Jon M.                                Huppenthal               Colorado Springs, CO USA
Daniel                                Poznanovic               Colorado Springs, CO USA
Jeffrey                               Hammes                   Colorado Springs, CO USA
Lisa                                  Krause                   Minneapolis, MN USA
Jon                                   Steidel                  Minneapolis, MN USA
David E.                              Caliga                   Colorado Springs, CO USA
Thomas R.                             Seeman                   Colorado Springs, CO USA
Lee A.                                Burton                   Divide, CO USA
Jeffrey Paul                          Brooks                   St. Louis, MN USA

[ ] Additional inventors are named on the ___ separately numbered sheets attached hereto
TITLE OF THE INVENTION (500 characters max):

BANDWIDTH EFFICIENCY AND UTILIZATION USING DIRECT EXECUTION LOGIC
CORRESPONDENCE ADDRESS

[X] Customer Number: 25235 (Place Customer Number Bar Code Label Here)
[ ] Firm or Individual Name:
    Address:
    City:
    Country:
ENCLOSED APPLICATION PARTS (check all that apply)
[X] Specification, Number of Pages: 23
[ ] CD(s), Number:
[ ] Drawing(s), Number of Sheets:
[ ] Other (specify):
[ ] Application Data Sheet. See 37 CFR 1.76
METHOD OF PAYMENT OF FILING FEES FOR THIS PROVISIONAL APPLICATION (check one)
[ ] Applicant claims small entity status. See 37 CFR 1.27
[X] A check or money order is enclosed to cover the filing fees    FILING FEE AMOUNT ($): $160.00
[X] The Commissioner is hereby authorized to charge any additional filing fees or credit any overpayment to Deposit Account No. 50-1123

Direct all correspondence to: [X] Customer Number

The invention was made by an agency of the U.S. Government or under a contract with an agency of the U.S. Government.
[X] No.
[ ] Yes, the name of the agency and contract no. are: ___

William J. Kubida, Reg. No. 39,664
Telephone: 719-448-5909
Date: ___

NAACS - 80404/0001 - 61483 v1
Docket No. SRC028 PRO

XILINX 1017
`
`
`
`
`
Attorney Docket No. SRC028 PRO
Client/Matter No. 80404.0033
Express Mail Label No. EV335405698US
IN THE UNITED STATES PATENT AND TRADEMARK OFFICE

In re Provisional Application of:

Jon M. Huppenthal, Daniel Poznanovic, Jeffrey Hammes,
Lisa Krause, Jon Steidel, David E. Caliga, Thomas R.
Seeman, Lee A. Burton, and Jeffrey Paul Brooks

Serial No. -----------

Filed: Herewith

For: BANDWIDTH EFFICIENCY AND UTILIZATION USING
DIRECT EXECUTION LOGIC

CERTIFICATE OF MAILING BY EXPRESS MAIL

MAIL STOP PROVISIONAL PATENT APPLICATION
Commissioner for Patents
P.O. Box 1450
Alexandria, VA 22313-1450

Sir:
`
The undersigned hereby certifies that the attached:
1. Provisional Application for Patent Cover Sheet;
2. Specification - 23 pages;
3. Check in the Amount of $160.00;
4. Certificate of Mailing by Express Mail; and
5. Return Card
relating to the above application, were deposited as "Express Mail," Mailing Label No. EV335405698US, with the United States Postal Service, addressed to MAIL STOP PROVISIONAL PATENT APPLICATION, Commissioner for Patents, P.O. Box 1450, Alexandria, VA 22313-1450, on ___.

Date: ___
`
William J. Kubida, Reg. No. 39,664

HOGAN & HARTSON LLP
One Tabor Center
1200 17th Street, Suite 1500
Denver, Colorado 80202
(719) 448-5909 Tel
(303) 899-7333 Fax
`
`
`
`
`
Bandwidth Efficiency and Utilization
Using Direct Execution Logic

SRC Computers, Inc.

Introduction
In a previous paper, Petaflop Computing Using Direct Execution Logic, we showed that Teraflop-class processors will be available to direct execution logic (DEL) programmers in 2008. The challenge with this kind of computational capacity, of course, is how to keep it fed with operands. One key facet of this is the ability to exploit all available locality in a program or algorithm, or bandwidth efficiency. A second facet is the ability to run calculations at such a rate as to use all available memory bandwidth, or bandwidth utilization.

In this paper, we will show that the implicit features in modern microprocessor-based systems fail to achieve full bandwidth efficiency or utilization, even for fully optimized and tuned codes. By contrast, we will show that a programmer can use a high level language to directly manipulate the memory hierarchy and calculation pipeline via DEL, achieving both maximum bandwidth efficiency and utilization across a wide variety of algorithms. The end result is a programmable, computational system that delivers the maximum possible performance for any given memory bus speed. This combination of efficiency and utilization yields an orders-of-magnitude performance benefit even when utilizing an equivalent memory bus. A diverse set of examples presented in this paper demonstrates between 6X and 216X performance gain through DEL.
`
The ability of DEL to use 100% of available bandwidth with near 100% efficiency is exciting today, but becomes even more exciting in the future as new interconnect technologies such as stacked die promise to substantially increase available memory bandwidth. We believe that performance via DEL will continue to scale perfectly into these bandwidth-rich environments.
`
Microprocessors: Reaching the Limits of Implicit Methods
`
Over the past 30 years, microprocessors have enjoyed annual performance improvements of about 50%. Designers have been taking advantage of higher frequencies, and have added a great deal of complexity, primarily in two areas: memory hierarchies and searching for instruction level parallelism (ILP). For a short history of the microprocessor and these implicit design optimizations, see the Appendix.
`
Instruction Level Parallelism
Techniques that automatically exploit ILP at execution time have reached the point of diminishing returns. Of the 10 million transistors in an Intel Pentium III processor, for example, more than half are dedicated to instruction cache, branch prediction, out-of-order execution and superscalar logic. In addition, complex ILP techniques require chip-wide wires. In a few short years, it will take several clocks to travel across a microprocessor die, so these complex techniques do not scale with integrated circuit technology. It appears that viable techniques to extract implicit parallelism from a serial instruction stream have simply run out of steam. According to Bill Dally, Pat Hanrahan and Ron Fedkiw at Stanford University:
`
`
`
`
`
`
`
`
`
`
    The performance of the microprocessors ... is no longer scaling at the historic rate of 50% per year. Microprocessors have reached a point of diminishing returns in terms of gates per clock and clocks per instruction. As we enter an era of billion transistor chips, there is not enough explicit parallelism in conventional programs to efficiently use these resources... It is expected that without new innovations in parallel processor designs, microprocessor performance will only increase with the increase in gate speed, at a rate of about 20% per year. Such a change would have a major effect on the computer business, and the entire economy.¹
`
`
Attaining high performance on microprocessors is beginning to require more explicit techniques from the compiler and programmer. For example, on the Intel Itanium processor, the compiler must carefully arrange instructions to attain 6-way instruction issue. Future microprocessor chips will have multiple processors on the same die, and programmers and compilers will have to modify coding techniques to take advantage of parallelism at this level.
`
Attaining high performance using direct execution logic is much more explicit. A programmer and compiler explicitly identify parallelism, and the compiler builds the necessary parallel logic pipelines. This technique has proven to be very effective. We have seen simple examples constructed in DEL where many hundreds of operations are executed in parallel each clock cycle. The available parallelism via DEL will increase linearly with integrated circuit density, and so the technique will continue to scale.
`
DEL also allows the programmer to explicitly manipulate the memory hierarchy, avoiding the tradeoffs and complexities that are common in cache designs today. This allows the programmer to make the most efficient use of on-chip storage and fully exploit available algorithm locality. In the next two sections, we will compare the implicit memory hierarchies found in microprocessors with the explicit memory hierarchy available to programmers through DEL.
`
Data and Instruction Cache
The design goal of any cache memory system is to minimize memory bus transactions by taking advantage of application locality. Microprocessors are fixed logic devices. As such, decisions have to be made at design time that optimize for the "average" case, if such a thing exists. Of course, the corollary of this is that the design will not be optimal for most applications.
`
Typical microprocessors exploit data locality implicitly via a rather complex data cache structure fraught with tradeoffs. Some of the design parameters include cache size (which has an inverse relationship to cache latency), the number of cache levels, cache-line width, the associativity and replacement policy, the write policy, the number of physical registers, and the instruction cache design. A cache hierarchy that is optimized for the SPEC benchmark, for example, may not be a good choice for database applications or sparse matrix calculations. No cache design works well with irregular access patterns and strided memory references. For a full discussion of cache design tradeoffs, see Appendix B.
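As a concrete illustration of why strided references defeat any fixed cache design, the following C sketch (ours, not SRC code; the line and element sizes are assumptions) counts the distinct cache lines a strided walk touches:

```c
#include <stddef.h>

/* Count the distinct cache lines touched when reading n elements of
 * elem_bytes each at a given stride (in elements), assuming the array
 * starts on a line boundary.  Illustrative model only. */
size_t lines_touched(size_t n, size_t stride, size_t elem_bytes, size_t line_bytes)
{
    size_t last_line = (size_t)-1;
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        size_t line = (i * stride * elem_bytes) / line_bytes;
        if (line != last_line) {
            count++;
            last_line = line;
        }
    }
    return count;
}
```

With 64-byte lines, reading 4096 stride-one doubles touches 512 lines (32 KB moved, all of it useful), while a stride of 8 doubles touches 4096 lines: the bus moves 256 KB to deliver the same 32 KB of operands, an 8x loss of bandwidth efficiency.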
`
Because there are so many design choices when constructing caches, it is difficult to write portable code that will be optimal across cache designs. An example of an attempt in this regard is the Automatically Tuned Linear Algebra Software (ATLAS) project. This software configures its blocking strategy based on what it can learn about the underlying cache hierarchy of a processor.
`
Just as the limits of effective ILP are being reached, the limits of implicit data locality are also being reached. We are seeing more explicit data motion constructs creeping into microprocessor instruction sets in the form of pre-fetch instructions and cache "hints". In the next section, we will see that the tradeoffs made in today's implicit devices can become programming choices in a DEL environment, essentially creating optimized hardware tuned for individual programs.
`
¹ B. Dally, P. Hanrahan, R. Fedkiw, A Streaming Supercomputer, September 18, 2001.

Copyright © 2003 SRC Computers, Inc.
ALL RIGHTS RESERVED
`
`
`
`
`
Memory Hierarchies in a DEL Environment
`
In this section, we'll look at the explicit memory management constructs available in SRC's DEL environment. These constructs allow a user to essentially instantiate locality when and where it is needed within an algorithm. Unlike conventional data caches, these mechanisms are parallel and extensible in nature, with no fixed bandwidth limitation or sequential access restrictions.
`
The SRC-6E is a general purpose computing system with reconfigurable processing elements, called MAP processors. The SRC compilation system is designed to turn user codes and algorithms into direct execution logic within the MAP processor. The memory hierarchy on the SRC-6E processor is described below:
`
1. System Memory (DDR DRAM): The MAP processor can access memory that is shared with the microprocessor.

2. On-board memory (SRAM): The MAP processor is configured with 6 banks of SRAM memory, each capable of delivering one word per clock cycle to the MAP (4.8 GB/sec). Each of the 6 banks is 4 Mbytes in size, for a total of 24 Mbytes of storage. This memory can be randomly addressed with gather and scatter type operations.

3. On-chip Block RAM (BRAM): The MAP processor has 288 sections of Block RAM memory. Each section is 36 bits wide and 512 elements deep, and can deliver two words per cycle. When configured to deliver 64-bit operands, the 288 banks of Block RAM can deliver up to 288 64-bit operands per cycle (230.4 GB/sec). Note that this memory can be randomly addressed.

4. Register Storage: If the Configurable Logic Blocks (CLB) within the MAP are configured for storage, 1000s of registers can be configured. These can be used to pass operands from one computational unit to the next, or to construct other storage elements such as FIFO queues and look-up tables. Registers are created as needed on the SRC MAP processor, and available bandwidth in this construct can reach into the 100s of GB/sec.
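The bandwidth figures in the list above follow from simple arithmetic once a clock rate is assumed; the 100 MHz MAP clock used below is our inference from the 4.8 GB/sec SRAM figure (6 banks x 8 bytes x clock), not a number stated in this paper:

```c
/* Peak bandwidth of a banked memory: words delivered per clock across
 * all banks, times word size, times clock rate.  The 100 MHz clock used
 * in the examples is inferred from the 4.8 GB/sec on-board figure. */
double peak_bytes_per_sec(double words_per_clock, double word_bytes, double clock_hz)
{
    return words_per_clock * word_bytes * clock_hz;
}
```

peak_bytes_per_sec(6, 8, 100e6) reproduces the 4.8 GB/sec on-board figure; the same arithmetic for 288 banks of 64-bit operands gives 230.4 GB/sec.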
`
On the SRC MAP processor, memory is managed explicitly using constructs in either C or Fortran. The following table shows how this is expressed in these high level languages and how these constructs are built in hardware:
`
Storage Construct             Source Code Construct                     Built From
System Memory                 Pre-fetch subroutine or function call     DDR DRAM chips
On-Board Memory               COMMON blocks in FORTRAN;                 On-board SRAM chips
                              static structures in C
Addressable on-chip memory    Local arrays within DEL routine           On-chip BRAM
FIFO queues, etc.             Called as subroutine or function          On-chip CLBs or on-chip BRAM
                              within DEL routine
Registers                     Scalar variables and internally           On-chip CLBs
                              used by hardware macros
Wires                         Scalar variables and internally           On-chip routing resources
                              used by hardware macros

Table 1: Memory hierarchy programming constructs on MAP.
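The constructs in Table 1 are ordinary C; what differs on MAP is where the compiler places them. The sketch below is plain C that compiles anywhere: the comments and the bank/array names are our illustrations of the Table 1 mappings, not SRC syntax.

```c
#define N 512

/* Static structures map to on-board SRAM (Table 1).  Two separate
 * structures let the compiler place the data in different banks so
 * both can be read in the same clock; bank_a/bank_b are illustrative
 * names, not SRC keywords. */
static struct { double a[N]; } bank_a;
static struct { double b[N]; } bank_b;

void map_init(void)
{
    for (int i = 0; i < N; i++) {
        bank_a.a[i] = 1.0;
        bank_b.b[i] = 2.0;
    }
}

double map_kernel(void)
{
    double buf[N];     /* local array  -> on-chip BRAM (Table 1)     */
    double sum = 0.0;  /* scalar       -> register or wire (Table 1) */

    for (int i = 0; i < N; i++)
        buf[i] = bank_a.a[i] + bank_b.b[i];  /* two on-board reads per clock */

    for (int i = 0; i < N; i++)
        sum += buf[i];                       /* BRAM feeds the reduction     */

    return sum;
}
```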
`
`
`
`
`
Instruction Bandwidth
Logic for DEL is downloaded into the MAP processor before execution can begin. There is no need to provide complex facilities for instruction bandwidth during execution because no "instructions" are issued during execution. All the bandwidth facilities on the MAP processor can be dedicated to program data requirements.
`
System Memory
In cache-based systems, the only way to move data into the higher levels of the memory hierarchy is to issue a memory load. DEL allows the programmer to instantiate a data pre-fetch functional unit. This functional unit can run autonomously from the rest of the computational logic; hence, pre-fetching can be explicitly overlapped with computation. Pre-fetch functional units do not have any implied cache-line structure. As such, pre-fetch units can also effectively perform gather/scatter type memory operations as well as strided memory access.
`
On-Board Memory
On-board memory is referenced in 64-bit words with a simple array reference. There is a 4-clock latency to access on-board memory, but the resulting data access functional units are fully pipelined, and gather/scatter operations run at full speed. The 6 banks are explicitly managed via COMMON block names in Fortran or static structures in C. This feature allows a programmer to make up to 6 memory references per clock cycle to on-board memory.
`
On-Chip BRAM
The SRC MAP processor has over 600 KB of on-chip, addressable memory, or BRAM, arranged in 288 banks, each 36 bits wide and 512 elements deep. Unlike conventional caches, the parallel nature of explicit BRAM allows programmers to create on-chip storage that is both large and fast. This storage is allocated via local arrays in C or Fortran. Simply declaring and accessing multiple local arrays will obtain parallel access to this memory structure.
`
Fixed Length FIFO Queues
FIFO queues are simple and effective storage structures designed to delay a stream of operands in time by the number of stages in the queue. They are the ideal structure for "remembering" a previous column in a matrix calculation, for example. They are accessed in C or Fortran by explicitly calling a FIFO function or subroutine. The macro itself can be built out of logic blocks or BRAM, depending on the length.
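A fixed-length FIFO is easy to model in C. The sketch below is a host-side stand-in for the hardware macro (the function names are ours, not SRC's): each push returns the value pushed FIFO_LEN calls earlier, which is exactly the "previous column" behavior described above when FIFO_LEN equals the column height.

```c
#include <string.h>

#define FIFO_LEN 16   /* delay depth; set to the column height in practice */

typedef struct {
    double q[FIFO_LEN];
    int head;
} fifo_t;

void fifo_init(fifo_t *f)
{
    memset(f, 0, sizeof *f);  /* queue starts primed with zeros */
}

/* Push x and return the value pushed FIFO_LEN calls earlier
 * (0.0 until the queue has been filled once). */
double fifo_push(fifo_t *f, double x)
{
    double out = f->q[f->head];
    f->q[f->head] = x;
    f->head = (f->head + 1) % FIFO_LEN;
    return out;
}
```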
`
Registers
An arbitrary number of registers can be used to connect computational functional units and to hold temporary variables. This allows the compiler to connect localized kernels directly, without a store followed by a load sequence. Because the registers can be placed and routed near the functional units, and because each register has a limited number of inputs and outputs, they are much easier to implement than the register set on a microprocessor. There is no negative pressure on clock frequency from the number of registers. On the contrary, adding registers where appropriate actually helps increase clock frequency with DEL. Finally, the logic blocks that are used to create registers are flexible. If they are not needed for storage, they can be used to construct computational units, and vice versa.
`
Wires
When possible, the compiler will simply use wires to connect functional units. For the programmer, local variables that are assigned and subsequently used in a loop can become wires in a DEL circuit.
`
`
`
`
`
Giving the programmer direct access to the memory hierarchy through DEL has several advantages compared with the implicit nature of memory traffic found on microprocessors:

1. Overhead is reduced. There is no register spill traffic, no cache coherency traffic, no cache write-allocate traffic, no cache conflict traffic, no instruction traffic, etc.

2. The programmer can create exactly the kind of storage structure that is necessary and sufficient. This may be on-chip addressable memory, FIFO queues, registers, wires, etc.

3. Local data structures can be created to facilitate parallel access. There is no fixed on-chip bandwidth limitation such as is inherent in fixed-logic microprocessors.

4. Explicitly coded algorithms can take advantage of 100% of the available algorithm locality.

5. Much of the on-chip storage resource is flexible because it is constructed of logic blocks. If it is not needed for storage, it can be used for computation.

6. The programmer has the ability to create and use pre-fetch functional units, which are much more latency tolerant than the "load and use" model found in implicit devices.

7. There is no reliance on cache lines to hide memory latency. DEL architectures can be designed to efficiently address the irregular access and strided reference patterns that run poorly on microprocessors.
`
The best way to illustrate how locality and peak bandwidth are fully exploited with DEL is with a few small programming examples. In the next section, we compare and contrast the performance of several example programs on an implicit, microprocessor-based system and on SRC's DEL architecture.
`
Performance Measurements
`
In order to highlight the overall bandwidth efficiency and utilization of DEL-based programs, it is useful to consider a set of programs and make some measurements on an existing, implicit architecture.
`
A set of codes was selected and optimized for the Intel Pentium 4 and for SRC's DEL architecture. For each example, the problem size was specified in such a manner as to eliminate cache size as a performance-limiting factor. In addition, highly tuned performance libraries were used where possible on the Intel system to show the implicit architecture in the best possible light.
`
In each experiment, a sample code was compiled with the Intel Fortran or C compiler and run using the Intel VTune performance analysis software. This package allowed us to measure the memory bus traffic for each benchmark. This value can be compared with the algorithmic minimum to get a measure of bandwidth efficiency. In addition, the raw bandwidth speed was measured on the memory bus during the calculation. Comparing this value with the peak memory bandwidth of our specific system (3.2 GB/second) gives us a measure of bandwidth utilization. Combining these two factors yields an overall efficiency value for the microprocessor.
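The two measures, and the way they combine, can be written down directly. This is a restatement of the methodology just described, not SRC tooling; crediting DEL with 100% on both axes is the argument made in the next paragraph.

```c
/* Bandwidth efficiency: fraction of the measured bus traffic that the
 * algorithm actually required. */
double bandwidth_efficiency(double required_mbytes, double measured_mbytes)
{
    return required_mbytes / measured_mbytes;
}

/* Bandwidth utilization: fraction of the bus's peak rate sustained
 * during the calculation (peak is 3.2 GB/sec on our Pentium 4 system). */
double bandwidth_utilization(double measured_gb_per_sec, double peak_gb_per_sec)
{
    return measured_gb_per_sec / peak_gb_per_sec;
}

/* If DEL reaches 100% on both axes over the same bus, its advantage is
 * the reciprocal of the microprocessor's combined efficiency. */
double del_improvement(double efficiency, double utilization)
{
    return 1.0 / (efficiency * utilization);
}
```

For the matrix multiplication row of Table 2, del_improvement(0.16, 0.20) gives about 31, matching the 31X table entry.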
`
In each case, programming with the constructs available with DEL resulted in codes that used an algorithmically minimum amount of memory bandwidth, or perfect bandwidth efficiency. In addition, DEL allows for a bandwidth utilization level of 100%, since computation capacity can be added until the available memory bandwidth is completely exhausted. The overall performance improvement that is attainable with DEL is hence inversely related to the overall efficiency value observed on the microprocessor. The improvement in performance when compared with the implicit computation carried out on an Intel Pentium 4 ranged from 6X for the median filter code to 216X for a triple DES encryption code. Note that performance comparisons assume a similar memory bus structure for both the implicit and explicit models; hence the difference in performance is purely associated with the difference in the
`
`
`
computational model. A complete description of these programming examples as coded with SRC's DEL architecture can be found in Appendix C. The performance of these codes is presented in Table 2.
Algorithm                         Problem Size           Data transferred,   Algorithmic   Percent of data    Percent of   Performance improvement
                                                         Intel Pentium 4     requirement   motion that was    bus peak     with DEL using the
                                                         measurement (MB)    (MB)          algorithmically    utilized     same memory bus
                                                                                           necessary
64-bit Matrix Multiplication      M=1024, N=512, K=64    31                  --            16%                20%          31X (1/.16 x 1/.20 = 31)
(Intel MKL library)
Triple DES                        N=10^6                 24                  --            66%                0.7%         216X
Image Processing Edge Detection   600 x 472 bytes        78                  --            7%                 28%          51X
(2-D 9-point stencil)
Image Processing Median Filter    600 x 472 bytes        93                  --            61%                28%          6X
(2-D 9-point stencil)
2-D Stencil Grid Generation       512 x 512              22.9                8.4           37%                20%          14X
(TOMCATV)
3-D 27-point stencil              300 x 300 x 300        1125                432           38%                24%          11X
Median filter + edge detector     600 x 472 bytes        8.82                0.57          6%                 22%          76X
(pipelined 2-D stencils)
Irregular Updates                 Loop length=10^6,      57                  --            15%                22%          30X
(64-bit elements)                 Table size=10^6

Table 2: Microprocessor / DEL performance comparison
`
Matrix Multiplication
Matrix multiplication was measured using the Intel MKL library. This algorithm is generally considered to run very efficiently on microprocessor-based systems, yet in this case we see that an additional factor of 30X can be achieved with DEL. Note that this figure can be pushed higher simply by running larger problems, which exhibit higher data reuse.
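The reuse argument can be made quantitative. Assuming, as a lower bound of our own construction, that each 8-byte element of A, B and C crosses the bus exactly once, the compulsory traffic and the flops-per-byte reuse for C = A x B are:

```c
/* Compulsory data motion for an (m x k)(k x n) multiply, assuming each
 * 8-byte element of A, B and C crosses the memory bus exactly once. */
double matmul_min_mbytes(double m, double n, double k)
{
    return (m * k + k * n + m * n) * 8.0 / 1e6;
}

/* Flops per byte of compulsory traffic: 2mnk flops over that traffic.
 * This ratio grows with problem size, which is why larger problems
 * push the DEL advantage higher. */
double matmul_reuse(double m, double n, double k)
{
    return (2.0 * m * n * k) / ((m * k + k * n + m * n) * 8.0);
}
```

For the Table 2 problem size (M=1024, N=512, K=64) the bound is about 5 MB and the reuse about 13 flops/byte; matmul_reuse(2048, 2048, 2048) is roughly 170 flops/byte.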
`
`
`
`
`
`
Triple DES
For the Triple DES example, we used a highly optimized, assembly coded routine for the Intel processor. The bandwidth efficiency was 1.5% higher compared with the DEL algorithm. This is most likely due to the unnecessary cache-line fill traffic that is generated during the store operation. The Intel processor falls short in memory utilization, however, because the computational resources within the processor are fixed and cannot keep up with available memory bandwidth. As such, the Intel processor utilized a tiny fraction of peak memory bandwidth (less than 1%). Since DEL can run this algorithm at full memory bandwidth speed, the overall performance advantage of a DEL-based implementation is 216X.
`
2-D Stencils
We looked at 3 different 2-D stencils. Performance improvement of the DEL algorithms ranged from 6X to 51X. Both image-processing examples were coded using the optimized Intel graphics libraries and had a single array for input and output. The TOMCATV code had two input arrays and four output arrays.
`
3-D Stencils
The 3-D stencil code is based on an anisotropic stencil algorithm. It contained a single source array and a single destination array. The computation consists of 26 floating-point additions and 4 floating-point multiplications. Performance improvement for the DEL implementation was 11X.
`
Pipelined Stencils
The pipelined stencil example used the median filter stencil followed by an edge detection stencil. On the Intel system, the Intel libraries were used for each of these steps. The DEL algorithm requires no additional bandwidth since the computation can be chained together without using an intermediate storage array. The DEL version is 76X faster using the same memory bus.
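A 1-D sketch shows the chaining idea in plain C (our own reduction of the 2-D case; med3 and the delay registers stand in for the DEL FIFO structure): the input is read once, and the median results flow straight into the edge detector through two scalar delay registers instead of an intermediate array.

```c
#include <math.h>

/* 3-point median via a small sorting network. */
static double med3(double a, double b, double c)
{
    double t;
    if (a > b) { t = a; a = b; b = t; }
    if (b > c) { t = b; b = c; c = t; }
    if (a > b) { t = a; a = b; b = t; }
    return b;
}

/* out[i] = |med[i+1] - med[i-1]|, where med is the 3-point median of in.
 * The median values are never stored to an array: m1 and m2 act as the
 * delay registers a DEL pipeline would synthesize. */
void fused_median_edge(const double *in, double *out, int n)
{
    double m1 = 0.0, m2 = 0.0;  /* med[i-1], med[i-2] */
    for (int i = 1; i <= n - 2; i++) {
        double m0 = med3(in[i - 1], in[i], in[i + 1]);  /* med[i]    */
        if (i >= 3)
            out[i - 1] = fabs(m0 - m2);                 /* edge[i-1] */
        m2 = m1;
        m1 = m0;
    }
}
```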
`
Irregular Updates
The irregular update benchmark was 30X more efficient with DEL. This benchmark is somewhat unique among this group of codes, as it is not computationally intensive but data intensive. The Intel implicit pre-fetch hardware is not effective in this case, and so the full Intel bus bandwidth cannot be utilized. The DEL version is able to effectively pre-fetch these irregular memory patterns and achieves a 30X performance advantage.
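The benchmark's access pattern is the classic gather/scatter update, sketched here in plain C (the array names are ours). Each iteration touches an unpredictable table location, so a cache-line-based machine moves a full line per 8-byte update, while a DEL pre-fetch unit can stream exactly the words the index vector calls for.

```c
/* Irregular update kernel: random 8-byte read-modify-writes driven by
 * an index vector.  Cache lines and hardware pre-fetch gain nothing
 * from this pattern. */
void irregular_update(double *table, const int *idx, const double *val, int n)
{
    for (int i = 0; i < n; i++)
        table[idx[i]] += val[i];
}
```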
`
Conclusions
`
`
`
For the past 20 years, microprocessors have employed increasingly complex techniques to implicitly exploit instruction level parallelism and locality to gain in performance. The performance improvement trend in these devices has slowed, because the implicit techniques available have reached a point of diminishing returns. Implicit constructs are simply running out of steam.
`
The locality-exploiting mechanisms within a microprocessor are fixed and implicit, and designed with many tradeoffs considered. They are designed for the average case, and frequently do not exhibit the desired behavior. Bandwidth is frequently wasted on unnecessary data motion, resulting in low bandwidth efficiency, even for fully optimized code, as was illustrated in our examples. In addition, the fixed computational and locality-exploiting features within the implicit microprocessor limit the speed, and hence the bandwidth utilization, observed with most programs.
`
By contrast, the scalable locality mechanisms available to the DEL programmer can be used to exploit all available algorithm locality and hence achieve 100% bandwidth efficiency. In addition, the scalable computational resources available to the DEL programmer can be leveraged to attain 100% bandwidth utilization. The end result is a programmable, computational system that delivers the maximum possible performance for any given memory bus speed. This combination of efficiency and utilization yields an orders-of-magnitude performance benefit compared with implicit models when using an equivalent memory
`
bus. More importantly, however, DEL will continue to scale with increased memory bus performance into the future. We believe that this combination of explicit programming with denser DEL devices and memory bandwidth enhancements will yield Teraflop-class processors by 2008.
`
`
`
`
`
`
Appendix A: A Brief History of the Microprocessor
`
In 1971, the Intel 4004 had about 2300 transistors, supported 45 instructions and ran at 740 kHz. Today, the latest Itanium 2 processor from Intel has about 200 million transistors running at over a gigahertz, and the hottest topic at microprocessor design conferences is what to do with the billion transistors that are likely to be available to chip designers in a few years' time. What follows is a summary of the key performance innovations that have taken place in microprocessors over the last 30 years.
`
Pipelining
The earliest microprocessors had no pipelining. Instructions were issued and completed before a new instruction was issued. Instruction pipelining had been around for years in mainframe systems such as the Control Data 6600. The feature allowed multiple instructions to be in process, thereby increasing instruction throughput. This feature did not appear in microprocessors until the Intel 8086 processor in 19xx. The pipelining was fairly rudimentary: a second instruction could be fetched and decoded while the first was executing. Pipelining proved to be a very effective performance technique, but it also increased the processor's appetite for instruction and data bandwidth.
`
Caches
For the first few years of the microprocessor, instruction and data loads were satisfied from main memory, and logic speeds were compatible with memory access time. But logic performance was increasing at a faster rate than main memory latency and bandwidth. To respond to this, microprocessor designers adopted cache memories. Caches are designed to keep copies of recently used instructions and data in nearby SRAM instead of far away DRAM. In 1984, Motorola introduced the 68020, the first microprocessor with an on-chip cache, a mere 256 bytes for instructions only. The 68030 introduced a separate 256-byte cache for instructions and one for data.
`
Cache memories can either be fast or large, but not both, so designers began to build cache hierarchies, starting with small, fast caches near the processor core and adding larger, slower caches between the fast cache and memory. The DEC Alpha 21164 was the first microprocessor with 2 levels of on-chip cache: a 1024-word direct-mapped cache and a 96 KB secondary cache. As we shall see in the next section, the specifics of cache design are full of tradeoffs and compromises.
`
Multiple Instruction Issue
In 1989, Intel introduced the first superscalar microprocessor, the i860. It was capable of issuing three instructions per clock cycle. This was followed by the DEC Alpha 21064, which was capable of two instructions per cycle. Later processors would add out-of-order execution capability. Superscalar techniques increased the demand on bandwidth for both instructions and data.
`
Multiple Data Per Instruction
Another technique used by designers to speed up processing is to have instructions perform more work. Good examples of this are the Single Instruction Multiple Data (SIMD) instructions that Intel has added to microprocessors over the years. The initial SIMD extensions, called MMX, were primarily for accelerating graphics. Later, SSE and SSE2 extensions were added to accelerate a wide variety of operations. This technique increases the demand for operand bandwidth and reduces the instruction bandwidth required for a given calculation.
`
Speculative Execution, Branch Prediction
The Pentium Pro was among the first microprocessors that supported speculative execution and branch prediction. Speculative execution allows instructions that probably will be executed to start early based on branch prediction. Results are stored in a special cache called a reorder buffer and are retired if the branch
`
was correctly predicted. Mis-predictions can and do occur, and this can waste instruction and data bandwidth.
`
Appendix B: Cache Design Tradeoffs
`
Microprocessors are fixed logic devices. As such, decisions have to be made at design time that optimize for the "average" case, if such a thing exists. Of course, the corollary of this is that the design will not be optimal for most applications. In this section, we discuss some of the design tradeoffs that have to be made when constructing cache systems on microprocessors.
`
Size / Latency Tradeoff
Data caches can be small with low-latency access, or large with comparatively longer latency access. It is not possible to design a large capacity, low latency cache; hence microprocessor designers have introduced multiple levels of caching into their designs. Each level out from the registers is larger, but takes longer to access. At each level, some decision is made with regard to capacity and latency.
`
Line Width Tradeoff
In order to hide memory latency and exploit spatial locality, caches are arranged in lines. When a load suffers a cache miss, a cache line is loaded from main memory into the data cache. The idea is that there is a high degree of spatial locality in programs, and it is quite likely that the other memory locations in the cache line will also be required.
`
For programs with a high degree of spatial locality, or stride-one access, wide cache lines are best because they minimize the number of times the processor suffers the latency of a memory access. For programs with a low degree of spatial locality, or random access, narrow lines are best, as this minimizes the wasted bandwidth used to bring in the unused neighbors in the cache line.
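For random single-word access, the wasted fraction is simply (line - word) / line. A two-line helper makes the two extremes below concrete (an illustrative calculation of ours, using only the published line sizes):

```c
/* Fraction of bus traffic wasted when each miss uses only one
 * word_bytes element out of a line_bytes cache line. */
double wasted_fraction(double line_bytes, double word_bytes)
{
    return (line_bytes - word_bytes) / line_bytes;
}
```

With 256-byte lines and 8-byte words (the Power3 case discussed next), nearly 97% of the traffic is wasted on a random access pattern; with 8-byte lines (the Cray SV1 case), none is.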
`
The widest cache lines are up to 256 bytes long, on IBM Power3 processors. Needless to say, this machine is not well known for its gather/scatter performance. At the other extreme, the Cray SV1 had a cache line that was only one word, or 8 bytes, wide. On this architecture, it was assumed that explicit vector loads would exploit spatial locality, and so the data cache could be designed exclusively to exploit temporal locality.
`
Associativity Tradeoff
Cache associativity refers to the mapping between memory locations and cache sectors. The two extreme design points are direct-mapped caches and fully associative caches. In a direct-mapped cache, a specific memory location can be mapped to only a single cache line. Direct-map