`.,
`~
`-
`
`C
`
UTILITY
PATENT APPLICATION
TRANSMITTAL
(Only for new nonprovisional applications under
37 CFR 1.53(b))
`
`APPLICATION ELEMENTS
`
PTO/SB/05 (04-04)
Approved for use through 07/31/2006
`
`Attorney Docket No.
`
`SRC028
`
`First Inventor
`
`Daniel Poznanovic et al.
`
`Title
`
`SYSTEM AND METHOD OF ENHANCING
`EFFICIENCY AND UTILIZATION OF
`~
`MEMORY BANDWIDTH IN
`a..O
`.0
`RECONFIGURABLE HARDWARE
`,A_
`·o,
`=>co
`v~
`&50
`,....
`
`(,0,-
`
Express Mail Label No. EV331755319US
`Commissioner for Patents
`P.O. Box 1450
`Alexandria, VA 22313-1450
`
`1.
`
`2.
`
`3.
`
`[81
`
`(cid:143)
`
`[81
`
`Fee Transmittal Form
`(submit an original and a duplicate for fee processing)
`Applicant claims small entity status.
`See 37 CFR 1.27
`total pages _26_ 1
`Specification [
`(preferred Arrangement set forth below)
`- Descriptive title of the Invention
`- Cross References to Related Applications
`- Statement Regarding Fed sponsored R&D
`- Reference to sequence listing, a table, or a
`computer program listing appendix
`- Background of the Invention
`- Brief Summary of the Invention
`
`4.
`5.
`
`- Brief Description of the Drawings
`- Detailed Description
`- Claim(s)
`- Abstract of the Disclosure
`[81 Drawing(s) [ total sheets 12 1
`total pages _3_ 1
`[81 Oath or Declaration [
`[81 Newly executed (original or copy)
`a.
`b. 0 Copy from prior appl. (37 C.F.R. § 1.63(d))
`(cid:143) DELETION OF INVENTOR(S)
`(for continuation/divisional with Box 18 completed)
`
`
6. [ ] Application Data Sheet (see 37 CFR 1.76)
7. [ ] CD-ROM or CD-R in duplicate, large table
       or Computer Program (Appendix)
8. Nucleotide and/or Amino Acid Sequence Submission
   (if applicable, all necessary)
   a. [ ] Computer Readable Form
   b. Specification Sequence Listing on:
      i.  [ ] CD-ROM or CD-R (2 copies); or
      ii. [ ] paper
   c. [ ] Statements verifying identity of above copies
ACCOMPANYING APPLICATION PARTS
9.  [X] Assignment Papers (coversheet/document(s))
10. [ ] 37 CFR 3.73(b) Statement (when there is an assignee)
    [X] Power of Attorney
11. [ ] English Translation Document
12. [ ] IDS & Form PTO/SB/08A
    [ ] Copies of IDS Citations
13. [ ] Preliminary Amendment
14. [X] Return Receipt Postcard (MPEP 503)
15. [ ] Nonpublication Request Under 35 USC 122(b)(2)(B)(i).
        Applicant must attach form PTO/SB/35
15. [ ] Certified Copy of Priority Document(s)
16. [ ] Nonpublication Request Under 35 USC 122(b)(2)(B)(i).
        Applicant must attach form PTO/SB/35
17. [X] Other: Certificate of Mailing by Express Mail
`
i. Signed statement attached deleting
   inventor(s) named in prior application,
   see 37 C.F.R. §§ 1.63(d)(2) and 1.33(b).
18. If a CONTINUING APPLICATION, check appropriate box, and supply the requisite information below and in a
preliminary amendment, or in an Application Data Sheet under 37 CFR 1.76:
[ ] Continuation   [ ] Divisional   [ ] Continuation-in-part (CIP)   of prior application No.:
Prior application information:   Examiner:   Group/Art Unit:
FOR CONTINUATION OR DIVISIONAL APPS only: The entire disclosure of the prior application, from which an oath or declaration is supplied under Box 5b, is considered a part of
the disclosure of the accompanying continuation or divisional application and is hereby incorporated by reference. The incorporation can only be relied upon when a portion has been
inadvertently omitted from the submitted application parts.
`
19. CORRESPONDENCE ADDRESS

[X] Customer Number 25235   or   [ ] Correspondence address below

Name:
Address:
City:           State:          ZIP:
Country:        Telephone:      Fax:

Name (Print/Type): William          Registration No. 29,664
(Signature)                         Date

\\\CS 80404/0033 - 68264 v I
`Intel Exhibit 1002 - 1
`
`
`
`EXPRESS MAIL NO. EV331755319US
`Attorney Docket No. SRC028
`Client/Matter No. 80404.0033.001
`
`IN THE UNITED STATES PATENT AND TRADEMARK OFFICE
`
`In re Application of:
`
`Daniel Poznanovic, David E. Caliga,
`and Jeffrey Hammes
`
`Serial No. NEW
`
`Filed: Herewith
`
`For: SYSTEM AND METHOD OF
`ENHANCING EFFICIENCY AND
`UTILIZATION OF MEMORY
`BANDWIDTH IN RECONFIGURABLE
`HARDWARE
`
`CERTIFICATE OF MAILING BY EXPRESS MAIL
`
`Commissioner for Patents
`P.O. Box 1450
`Alexandria, VA 22313-1450
`
`Sir:
`
`1.
`2.
`3.
`4.
`5.
`6.
`
`The undersigned hereby certifies that the following documents:
`Utility Patent Application Transmittal;
`Fee Transmittal and $928 filing fee;
`Utility Patent Application- 22 pgs. Spec, 3 pgs. Claims, 1 pg. Abstract;
`Executed Declaration for Utility Patent Application;
`12 sheets of drawings;
`Recordation Form Cover Sheet PTO 1595 with Executed
`Assignment and Recording Fee of $40.00;
`'J.
`Return postcard; and
`Certificate of Mailing By Express Mail
`8.
`relating to the above application, were deposited as "Express Mail", Mailing Label
`No. EV331755319US, with the United States Postal Service, addressed to Commissioner
`50.
`for Patents, P.O. Box 1450, Alexandria, VA 22313-
`
`One Tabor Center
`1200 17th Street, Suite lpOO
`Denver, Colorado 80202
`(719) 448-5909 Tel
`(303) 899-7333 Fax
`
`
`
`
`...
`
`O')
`0,
`O')
`(D
`
`FEE TRANSMITTAL
`for FY 2004
Effective 10/01/2003. Patent fees are subject to annual revision
`
`0
`.....
`0)
`0)
`0
`.i:,.
`
TOTAL AMOUNT OF PAYMENT ($) 968.00

[ ] Applicant claims small entity status. See 37 CFR 1.27
METHOD OF PAYMENT (check all that apply)
[X] check   [ ] credit card   [ ] money order   [ ] other   [ ] none
[ ] Deposit Account

The Director is authorized to: (check all that apply)
[ ] Charge fee(s) indicated below
[X] Credit any overpayments
[X] Charge any additional fee(s) or any underpayment of fee(s)
[ ] Charge fee(s) indicated below, except for the filing fee, to the
    above-identified deposit account
`
Deposit Account Number: 50-1123
Deposit Account Name: Hogan & Hartson L.L.P.

Complete if Known:
   Application Number:    --------
   Filing Date:           Herewith
   First Named Inventor:  Daniel Poznanovic et al.
   Examiner Name:
   Group/Art Unit:
   Attorney Docket No.:   SRC028

FEE CALCULATION

1. BASIC FILING FEE

   Large Entity   Small Entity
   Fee ($)        Fee ($)       Fee Description           Fee Paid
   770            385           Utility filing fee        770.00
   340            170           Design filing fee
   530            265           Plant filing fee
   770            385           Reissue filing fee
   160             80           Provisional filing fee

                                SUBTOTAL (1)              ($) 770.00

2. EXTRA CLAIM FEES FOR UTILITY AND REISSUE
`
FEE CALCULATION (continued)

3. ADDITIONAL FEES

   Large Entity   Small Entity
   Fee ($)        Fee ($)       Fee Description                              Fee Paid
   130            65            Surcharge - late filing fee or oath
   50             25            Surcharge - late provisional filing fee or
                                cover sheet
   130            130           Non-English specification
   2,520          2,520         For filing a request for ex parte
                                reexamination
   920*           920*          Requesting publication of SIR prior to
                                Examiner action
   1,840*         1,840*        Requesting publication of SIR after
                                Examiner action
   110            55            Extension for reply within first month
   420            210           Extension for reply within second month
   950            475           Extension for reply within third month
   1,480          740           Extension for reply within fourth month
   2,010          1,005         Extension for reply within fifth month
   330            165           Notice of Appeal
   330            165           Filing a brief in support of an appeal
   290            145           Request for oral hearing
   1,510          1,510         Petition to institute a public use
                                proceeding
   110            55            Petition to revive - unavoidable
   1,330          665           Petition to revive - unintentional
   1,330          665           Utility issue fee (or reissue)
   480            240           Design issue fee
`
2. EXTRA CLAIM FEES FOR UTILITY AND REISSUE (continued)

   Extra Claims                                        Fee from below   Fee Paid
   Total Claims        ___ - 20 = ___   ×   18   =
   Independent Claims  ___ - 3  = ___   ×   86   =                      86.00
   Multiple Dependent  ___

   * or number previously paid, if greater; For Reissues, see below

   Large Entity   Small Entity
   Fee ($)        Fee ($)       Fee Description
   18             9             Claims in excess of 20
   86             43            Independent claims in excess of 3
   290            145           Multiple dependent claim, if not paid
   86             43            ** Reissue independent claims over
                                original patent
   18             9             ** Reissue claims in excess of 20 and over
                                original patent

   SUBTOTAL (2)   ($) 158.00
`
SUBMITTED BY (Complete if applicable)
Name (Print/Type):
Signature:
`
`640
`
`130
`
`50
`180
`40
`
`770
`
`770
`
`770
`900
`
`320 Plant issue fee
`
`130 Petitions to the Commissioner
`
`50 Processing fee under 37 CFR 1.17(q)
`180 Submission of Info Disclosure Simi
`40 Recording each patent assignment per
`property (times number of properties)
`Filing a submission after final rejection (37
`CFR § 1.129(a))
`For each additional invention to be
`examined (37 CFR §1.129(b))
`385 Request for Continued Examination
`900 Request for expedited examination of a
`design application
`
`385
`
`385
`
`40.00
`
`Other fee (specify)
`
`• (cid:127)
`
`(cid:127)
`
`(cid:127)
`
`(cid:127) * • 8 II II II• II I• I•••• (cid:127)
`
`•Reduced by Basic Fling Fee Paid
`
`SUBTOTAL (3)
`
`($) 40.00
`
`I
`
`I
`
Registration No. (Attorney/Agent): 29,664
Telephone: (719) 448-5900
Date:
`
`
`
`
`-.
`
`PATENT APPLICATION
`ATTORNEY DOCKET No. SRC028
`Client/Matter No. 80404.0033.001
`Express Mail Label No. EV331755319US
`
`SYSTEM AND METHOD OF ENHANCING EFFICIENCY
`AND UTILIZATION OF MEMORY BANDWIDTH IN
`RECONFIGURABLE HARDWARE
`
1. Related Applications
`[0001] The present invention claims the benefit of U.S. Provisional Patent
`application Serial No. 60/479,339 filed on June 18, 2003, which is incorporated
`herein by reference in its entirety.
`
`BACKGROUND OF THE INVENTION
`
`1.
`
`Field of the Invention.
`
[0002] The present invention relates, in general, to enhancing the efficiency and
utilization of memory bandwidth in reconfigurable hardware. More specifically,
the invention relates to implementing explicit memory hierarchies in
reconfigurable processors that make efficient use of off-board, on-board, and
on-chip storage and available algorithm locality. These explicit memory
hierarchies avoid many of the tradeoffs and complexities found in the
traditional memory hierarchies of microprocessors.
`
2. Relevant Background

[0003] Over the past 30 years, microprocessors have enjoyed annual
`performance gains averaging about 50% per year. Most of the gains can be
`attributed to higher processor clock speeds, more memory bandwidth and
`increasing utilization of instruction level parallelism (ILP) at execution time.
`
[0004] As microprocessors and other dense logic devices (DLDs) consume
data at ever-increasing rates, it becomes more of a challenge to design memory
`
`hierarchies that can keep up.
`
`Two measures of the gap between the
`
`microprocessor and memory hierarchy are bandwidth efficiency and bandwidth
`
`
`
`
utilization. Bandwidth efficiency refers to the ability to exploit available locality
in a program or algorithm. In the ideal situation, when there is maximum
bandwidth efficiency, all available locality is utilized. Bandwidth utilization refers
to the amount of memory bandwidth that is utilized during a calculation.
Maximum bandwidth utilization occurs when all available memory bandwidth is
utilized.
`
`[0005] Potential performance gains from using a faster microprocessor can be
`reduced or even negated by a corresponding drop in bandwidth efficiency and
`bandwidth utilization. Thus, there has been significant effort spent on the
`development of memory hierarchies that can maintain high bandwidth efficiency
`and utilization with faster microprocessors.
`
`[0006] One approach to improving bandwidth efficiency and utilization in
`memory hierarchies has been to develop ever more powerful processor
`
`caches. These caches are high-speed memories (typically SRAM) in close
`
`proximity to the microprocessor that try to keep copies of instructions and data
`the microprocessor may soon need. The microprocessor can store and retrieve
`
`data from the cache at a much higher rate than from a slower, more distant
`
`main memory.
`
`[0007] In designing cache memories, there are a number of considerations to
`take into account. One consideration is the width of the cache line. Caches
`are arranged in lines to help hide memory latency and exploit spatial locality.
`When a load suffers a cache miss, a new cache line is loaded from main
`memory into the cache. The assumption is that a program being executed by
`the microprocessor has a high degree of spatial locality, making it likely that
`other memory locations in the cache line will also be required.
`
`[0008] For programs with a high degree of spatial locality (e.g., stride-one
`access), wide cache lines are more efficient since they reduce the number of
`
`times a processor has to suffer the latency of a memory access. However, for
`
`programs with lower levels of spatial locality, or random access, narrow lines
`
`
`
`
`are best as they reduce the wasted bandwidth from the unused neighbors in
`the cache line. Caches designed with wide cache lines perform well with
`programs that have a high degree of spatial locality, but generally have poor
`gather/scatter performance. Likewise, caches with short cache lines have good
gather/scatter performance, but lose efficiency executing programs with high
`spatial locality because of the additional runs to the main memory.
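The line-width tradeoff described above can be illustrated with a short simulation (an illustrative sketch only; the `fill_efficiency` function and its parameters are hypothetical and not part of the disclosure). A miss loads an entire line; efficiency is the fraction of loaded words the program actually touches.

```python
def fill_efficiency(addresses, line_words):
    """Fraction of words brought into the cache that are actually used."""
    touched_lines = {}
    for a in addresses:
        touched_lines.setdefault(a // line_words, set()).add(a)
    used = sum(len(words) for words in touched_lines.values())
    loaded = len(touched_lines) * line_words
    return used / loaded

stride_one = list(range(64))               # high spatial locality
scattered = [i * 17 for i in range(64)]    # gather/scatter-style access

# Wide lines: ideal for stride-one access, mostly wasted for scattered access.
assert fill_efficiency(stride_one, 8) == 1.0
assert fill_efficiency(scattered, 8) == 0.125
# Narrow lines waste nothing on scattered access, but need more fills.
assert fill_efficiency(scattered, 1) == 1.0
```

With 8-word lines the scattered pattern uses only one word per fetched line, matching the wasted-bandwidth behavior described above.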
`
[0009] Another consideration in cache design is cache associativity, which
refers to the mapping between locations in main memory and cache sectors.
At one extreme of cache associativity is a direct-mapped cache, while at
another extreme is a fully associative cache. In a direct-mapped cache, a
specific memory location can be mapped to only a single cache line. Direct-
mapped caches have the advantage of being fast and easy to construct in
logic. The disadvantage is that they suffer the maximum number of cache
conflicts. At the other extreme, a fully associative cache allows a specific
location in memory to be mapped to any cache line. Fully associative caches
tend to be slower and more complex due to the large amount of comparison
logic they need, but suffer no cache conflict misses. Oftentimes, caches fall
between the extremes of direct-mapped and fully associative caches. A design
point between the extremes is a k-set associative cache, where each memory
location can map to k cache sectors. These caches generally have less
overhead than fully associative caches, and reduce cache conflicts by
increasing the value of k.
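The set-mapping difference can be sketched as follows (a minimal illustration with hypothetical names, assuming simple modulo placement):

```python
def cache_set(block_addr, num_sets):
    """Index of the set a memory block maps to under modulo placement."""
    return block_addr % num_sets

# Direct-mapped cache of 8 lines = 8 sets of 1 line: blocks 3 and 11 collide,
# so they evict one another on alternating accesses.
assert cache_set(3, 8) == cache_set(11, 8)

# A 2-way set associative cache of the same capacity has 4 sets of 2 lines.
# Blocks 3 and 11 still share a set, but both can now be resident at once
# because the set holds k = 2 lines, eliminating this particular conflict.
assert cache_set(3, 4) == cache_set(11, 4)
```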
`
[0010] Another consideration in cache design is how cache lines are replaced
due to a capacity or conflict miss. In a direct-mapped cache, there is only one
possible cache line that can be replaced due to a miss. However, in caches
with higher levels of associativity, cache lines can be replaced in more than one
way. The way the cache lines are replaced is referred to as the replacement
policy.
`
`
`
`
`[0011] Options for the replacement policy include least recently used (LRU),
`random replacement, and first in-first out (FIFO). LRU is used in the majority
`of circumstances where the temporal locality set is smaller than the cache size,
`but it is normally more expensive to build in hardware than a random
`replacement cache. An LRU policy can also quickly degrade depending on the
`working set size. For example, consider an iterative application with a matrix
size of N bytes running through an LRU cache of size M bytes. If N is less than
M, then the policy has the desired behavior of 100% cache hits; however, if N is
only slightly larger than M, the LRU policy results in 0% cache hits as lines are
`
`removed just as they are needed.
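The degradation described above can be reproduced with a small LRU simulation (a sketch under assumed parameters; `lru_hit_rate` and its arguments are hypothetical names for illustration):

```python
from collections import OrderedDict

def lru_hit_rate(working_set, cache_lines, passes=3):
    """Hit rate for cycling `working_set` lines through an LRU cache."""
    cache = OrderedDict()
    hits = accesses = 0
    for _ in range(passes):
        for line in range(working_set):
            accesses += 1
            if line in cache:
                hits += 1
                cache.move_to_end(line)        # mark as most recently used
            else:
                if len(cache) == cache_lines:
                    cache.popitem(last=False)  # evict least recently used
                cache[line] = True
    return hits / accesses

# N <= M: after the warm-up pass, every access hits (2 of 3 passes hit).
assert lru_hit_rate(working_set=8, cache_lines=8, passes=3) == 2 / 3
# N slightly larger than M: every line is evicted just before it is needed
# again, so the hit rate collapses to 0%.
assert lru_hit_rate(working_set=9, cache_lines=8, passes=3) == 0.0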
`
[0012] Another consideration is deciding on a write policy for the cache. Write-
through caches send data through the cache hierarchy to main memory. This
`policy reduces cache coherency issues for multiple processor systems and is
`best suited for data that will not be re-read by the processor in the immediate
future. In contrast, write-back caches place a copy of the data in the cache, but
do not immediately update main memory. This type of caching works best
when data just written to the cache is quickly requested again by the
`processor.
`
[0013] In addition to write-through and write-back caches, another kind of write
policy is implemented in a write-allocate cache where a cache line is allocated
`
`on a write that misses in cache. Write-allocate caches improve performance
`when the microprocessor exhibits a lot of write followed by read behavior.
`However, when writes are not subsequently read, a write-allocate cache has a
`
`number of disadvantages: When a cache line is allocated, it is necessary to
`read the remaining values from main memory to complete the cache line. This
`adds unnecessary memory read traffic during store operations. Also, when the
`data is not read again, potentially useful data in the cache is displaced by the
`unused data.
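The extra read traffic described above can be quantified with a short sketch (illustrative only; the function and the streaming-store workload are assumptions, not part of the disclosure). On a write miss, a write-allocate cache fetches the untouched neighbors of the written word to complete the line:

```python
def write_allocate_read_traffic(writes, line_words):
    """Words read from main memory solely to complete allocated lines."""
    allocated = set()
    reads = 0
    for addr in writes:
        line = addr // line_words
        if line not in allocated:
            allocated.add(line)
            reads += line_words - 1  # fetch the never-read neighbors
    return reads

# Streaming stores to scattered addresses: each of 16 stores allocates a new
# 8-word line and drags in 7 words that are never used.
assert write_allocate_read_traffic([i * 64 for i in range(16)], 8) == 112
```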
`
`
`
`
[0014] Another consideration is the tradeoff between the size and the speed of
the cache: small caches are typically much faster than larger caches, but store
`
`less data and fewer instructions. Less data means a greater chance the cache
`will not have data the microprocessor is requesting (i.e., a cache miss) which
`
`can slow everything down while the data is being retrieved from the main
`
`memory.
`
[0015] Newer cache designs reduce the frequency of cache misses by trying to
`predict in advance the data that the microprocessor will request. An example of
`
`this type of cache is one that supports speculative execution and branch
`
`prediction. Speculative execution allows instructions that likely will be executed
`
`to start early based on branch prediction. Results are stored in a cache called
`a reorder buffer and retired if the branch was correctly predicted. Of course,
when mis-predictions occur, instruction and data bandwidth are wasted.
`
[0016] There are additional considerations and tradeoffs in cache design, but it
should be apparent from the considerations described hereinbefore that it is
very difficult to design a single cache structure that is optimized for many
different programs. This makes cache design particularly challenging for a
multipurpose microprocessor that executes a wide variety of programs. Cache
designers try to derive the program behavior of an "average" program
constructed from several actual programs that run on the microprocessor. The
cache is optimized for the average program, but no actual program behaves
exactly like the average program. As a result, the designed cache ends up
being sub-optimal for nearly every program actually executed by the
microprocessor. Thus, there is a need for memory hierarchies that have data
storage and retrieval characteristics that are optimized for actual programs
executed by a processor.
`
[0017] Designers trying to develop ever more efficient caches optimized for a
`variety of actual programs also face another problem: as caches add additional
`features, the overhead needed to implement the added features also grows.
`
`
`
`
`Caches today have so much overhead that microprocessor performance may
`
`be reaching a point of diminishing returns as the overhead starts to cut into
`
`performance. In the Intel Pentium Ill proc~ssor for example, more than half of
`
`the 10 million transistors are dedicated to instruction cache, branch prediction,
`
out-of-order execution and superscalar logic. The situation has prompted
`
`predictions that as microprocessors grow to a billion transistors per chip,
`
`performance increases will drop to about 20% per year. Such a prediction, if
`
`borne out, could have a significant impact on technology growth and the
`
`computer business.
`
`[0018] Thus, there is a growing need to develop improved memory hierarchies
`that limit the overhead of a memory hierarchy without also reducing bandwidth
`
`efficiency and utilization.
`
`SUMMARY OF THE INVENTION
`
[0019] Accordingly, an embodiment of the invention includes a reconfigurable
`
`processor that includes a computational unit and a data access unit coupled to
`
the computational unit, where the data access unit retrieves data from an

on-processor memory and supplies the data to the computational unit, and where
`
`the computational unit and the data access unit are configured by a program.
`
`[0020] The present invention also involves a reconfigurable processor that
`includes a first memory of a first type and a data prefetch unit coupled to the
`
`memory, where the data prefetch unit retrieves data from a second memory of
`
`a second type different from the first type, and the first and second memory
`
`types and the data prefetch unit are configured by a program.
`
[0021] Another embodiment of the invention includes a reconfigurable

hardware system that includes a common memory, also referred to as external
`
`memory, and one or more reconfigurable processors coupled to the common
`
`memory, where at least one of the reconfigurable processors includes a data
`
`prefetch unit to read and write data between the unit and the common memory,
`
`
`
`
`and where the data prefetch unit is configured by a program executed on the
`
`system.
`
`[0022] Another embodiment of the invention includes a method of transferring
`data that includes transferring data between a memory and a data prefetch unit
`in a reconfigurable processor, transferring data between the prefetch unit and a
`
`data access unit, and transferring the data between a computational unit and
`the data access unit, where the computational unit, data access unit and the
`data prefetch unit are configured by a program.
`
`[0023] Additional embodiments of the invention are set forth in part in the
`description that follows, and in part will become apparent to those skilled in the
`
`art upon examination of the following specification, or may be learned by the
`
`practice of the invention. The advantages of the invention may be realized and
`
`attained by means of the instrumentalities, combinations, compositions, and
`
`methods particularly pointed out in the appended claims.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`[0024] Figure 1 shows a reconfigurable processor in which the present
`
`invention may be implemented;
`
[0025] Figure 2 shows computational logic as might be loaded into a
reconfigurable processor;
`
`[0026] Figure 3 shows a reconfigurable processor as in Figure 1, but with the
`
addition of data access units;
`
`[0027] Figure 4 shows a reconfigurable processor as in Figure 3, but with the
`addition of data prefetch units;
`
[0028] Figure 5 shows a reconfigurable processor with the inclusion of external
memory;
`
`
`
`
`[0029] Figure 6 shows reconfigurable processors with external memory and
`
`with an intelligent memory controller;
`
`[0030] Figure 7 shows a reconfigurable processor having a combination of
`data prefetch units and data access units feeding computational logic;
`
[0031] Figure 8 shows the bandwidth efficiency and utilization gains obtained
`
`when utilizing a data prefetch unit and an intelligent memory controller to
`
`perform strided memory references;
`
[0032] Figure 9A and Figure 9B show the bandwidth efficiency and utilization
gains obtained when utilizing a data prefetch unit and an intelligent memory
controller to perform subset memory references in the X-Y plane;
`
[0033] Figure 10A and Figure 10B show the bandwidth efficiency and utilization
gains obtained when utilizing a data prefetch unit and an intelligent memory
controller to perform subset memory references in the X-Z plane;
`
[0034] Figure 11A and Figure 11B show the bandwidth efficiency and utilization
gains obtained when utilizing a data prefetch unit and an intelligent memory
controller to perform subset memory references in the Y-Z plane;
`
[0035] Figure 12A and Figure 12B show the bandwidth efficiency and utilization
`gains obtained when utilizing a data prefetch unit and an intelligent memory
`controller to perform subset memory references in a mini-cube;
`
`[0036] Figure 13 shows the bandwidth efficiency and utilization gains obtained
`when utilizing a data prefetch unit and an intelligent memory controller to
`perform indirect memory references;
`
`[0037] Figure 14 shows the bandwidth efficiency and utilization gains obtained
`
`when utilizing a data prefetch unit and an intelligent memory controller to
`perform strided memory reference together with computation.
`
`
`
`
`DETAILED DESCRIPTION
`
`1.
`
`Definitions:
`
[0038] Direct execution logic (DEL) - is an assemblage of dynamically
reconfigurable functional elements that enables a program to establish an
optimized interconnection among selected functional units in order to
implement a desired computational, data prefetch and/or data access
functionality for maximizing the parallelism inherent in the particular code.
`
[0039] Reconfigurable Processor - is a computing device that contains
reconfigurable components such as FPGAs and can, through reconfiguration,
instantiate an algorithm as hardware.
`
[0040] Reconfigurable Logic - is composed of an interconnection of
functional units, control, and storage that implements an algorithm and can be
loaded into a Reconfigurable Processor.
`
`[0041] Functional Unit -
`is a set of logic that performs a specific operation.
`The operation may for example be arithmetic,
`logical, control, or data
`
`movement. Functional units are used as building blocks of reconfigurable logic.
`
`[0042] Macro - is another name for a functional unit.
`
[0043] Memory Hierarchy - is a collection of memories.

[0044] Data Prefetch Unit - is a functional unit that moves data between
members of a memory hierarchy. The movement may be as simple as a copy,
or as complex as an indirect indexed strided copy into a unit stride memory.
`
[0045] Data Access Unit - is a functional unit that accesses a component of a
memory hierarchy, and delivers data directly to computational logic.
`
`
`
`
`[0046] Intelligent Memory Control Unit - is a control unit that has the ability to
`
`select data from its storage according to a variety of algorithms that can be
`
`selected by a data requestor, such as a data prefetch unit.
`
[0047] Bandwidth Efficiency - is defined as the percentage of contributory
data transferred between two points. Contributory data is data that actually
participates in the recipient's processing.
`
`[0048] Bandwidth Utilization -
`is defined as the percentage of maximum
`bandwidth between two points that is actually used to pass contributory data.
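The two definitions above can be sketched numerically (an illustrative example with assumed numbers; the function names are hypothetical, not part of the disclosure):

```python
def bandwidth_efficiency(contributory_words, total_words):
    """Percentage of transferred data that is contributory."""
    return 100.0 * contributory_words / total_words

def bandwidth_utilization(contributory_words, peak_words):
    """Percentage of peak bandwidth actually carrying contributory data."""
    return 100.0 * contributory_words / peak_words

# Assumed scenario: 800 words are fetched as full cache lines for a loop that
# uses only 400 of them, over a link that could have moved 1000 words.
assert bandwidth_efficiency(400, 800) == 50.0
assert bandwidth_utilization(400, 1000) == 40.0
```

Maximum efficiency and utilization (100% each) would require that every transferred word be used and that the link carry useful data at its full rate.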
`
`2.
`
`Description
`
[0049] A reconfigurable processor (RP) 100 implements direct executable logic
(DEL) to perform computation, as well as a memory hierarchy for maintaining
input data and computational results. DEL is an assemblage of dynamically
reconfigurable functional elements that enables a program to establish an
optimized interconnection among selected functional units in order to
implement a desired computational, data prefetch and/or data access
functionality for maximizing the parallelism inherent in the particular code. The
term DEL may also be used to refer to the set of constructs such as code,
data, configuration variables, and the like that can be loaded into RP 100 to
cause RP 100 to implement a particular assemblage of functional elements.
`
[0050] Figure 1 presents an RP 100, which may be implemented using field
programmable gate arrays (FPGAs) or other reconfigurable logic devices, that
can be configured and reconfigured to contain functional units and
interconnecting circuits, and a memory hierarchy comprising on-board memory
banks 104, on-chip block RAM 106, registers, wires, and a connection 108 to
external memory. On-chip reconfigurable components 102 create memory
structures such as registers, FIFOs, wires and arrays using block RAM. Dual-
ported memory 106 is shared between on-chip reconfigurable components 102.
The reconfigurable processor 100 also implements user-defined computational
`
`
`
`
logic (e.g., such as DEL 200 shown in Figure 2) constructed by programming
an FPGA to implement a particular interconnection of computational functional
units. In a particular implementation, a number of RPs 100 are implemented
within a memory subsystem of a conventional computer, such as on devices
that are physically installed in dual inline memory module (DIMM) sockets of a
computer. In this manner the RPs 100 can be accessed by memory operations
and so coexist well with a more conventional hardware platform. It should be
noted that, although the exemplary implementation of the present invention
illustrated includes six banks of dual ported memory 104 and two
reconfigurable components 102, any number of memory banks and/or
reconfigurable components may be used depending upon the particular
implementation or application.
`
`[0051] Any computer program,
`including complex graphics processing
`programs, word processing programs, database programs and the like, is a
`collection of algorithms that interact to implement desired functionality.
`In the
`common case in which static computing hardware resources are used (e.g., a
`
`conventional microprocessor), the computer program is compiled into a set of
`executable code (i.e., object code) units that are linked together to implement
`the computer program on the particular hardware resources. The executable
`
`code is generated specifically for a particular hardware platform.
`In this
`manner, the computer program is adapted to conform to the limitations of the
`static hardware platform. However, the compilation process makes many
`compromises based on the limitations of the static hardware platform.
`
[0052] Alternatively, an algorithm can be defined in a high level language then
compiled into DEL. DEL can be produced via a compiler from high level
programming languages such as C or FORTRAN or may be designed using a
hardware definition language such as Verilog, VHDL or a schematic capture
tool. Computation is performed by reconfiguring a reconfigurable processor
with the DEL and flowing data through the computation. In this manner, the
`
`
`
`
`hardware resources are essentially adapted to conform to the program rather
`than the program being adapted to conform to the hardware resources.
`
`[0053] For purposes of this description a single reconfigurable processor will be
`presented first. A sample of computational logic 201 is shown in Figure 2. This
`
`simple assemblage of functional units performs computation of two results
("A+B" and "A+B-(B*C)") from three input variables or operands "A", "B" and
`
`"C".
`
`In practice, computational units 201 can be implemented to perform very
`
`simple or arbitrarily complex computations. The input variables (operands) and
`output or result variables may be of any size necessary for a particular
`application. Theoretically, any number of operands and result variables may be
`
`used/generated by a particular DEL. Great complexity of computation can be
`supported by adding additional reconfigurable chips and processors.
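The assemblage of Figure 2 can be sketched in software as a simple dataflow pipeline (an illustrative model only; the `del_pipeline` function is a hypothetical stand-in for the hardware functional units, which in the actual RP produce a result set per clock pulse):

```python
def del_pipeline(operands):
    """Yield (A+B, A+B-(B*C)) for each (A, B, C) operand tuple."""
    for a, b, c in operands:
        s = a + b       # adder functional unit
        p = b * c       # multiplier functional unit
        yield s, s - p  # subtractor fed by both units

# Flow two operand sets through the pipeline.
results = list(del_pipeline([(1, 2, 3), (4, 5, 6)]))
assert results == [(3, -3), (9, -21)]
```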
`
`[0054] For greatest performance the DEL 200 is constructed as parallel
`pipelined logic blocks composed of computational functional units capable of
`
`taking data and producing results with each clock pulse. The highest possible
`
`performance that can be achieved is computation of a set of results with each
`
`clock pulse. To achieve this, data should be available at the same rate the
`
`computation can consume the data. The rate at which data can be supplied to
`
`DEL 200 is determined, at least in significant part, by the memory bandwidth
`utilization and efficiency. Maximal computational performance can be achieved
`
with parallel and pipelined DEL together with maximizing the memory

bandwidth utilization and efficiency. Unlike conventional static hardware

platforms, however, the memory hierarchy provided in an RP 100 is

reconfigurable. In accordance with the present invention, through the use of

data access units and associated memory hierarchy components,

computational demands and memory bandwidth can be matched.
`
`[0055] High memory bandwidth efficiency is achieved when only data re