UTILITY PATENT APPLICATION TRANSMITTAL
PTO/SB/05 (04-04)
Approved for use through 07/31/2006
(only for new nonprovisional applications under 37 CFR 1.53(b))

Express Mail Label No. EV331755319US
Attorney Docket No.: SRC028
First Inventor: Daniel Poznanovic et al.
Title: SYSTEM AND METHOD OF ENHANCING EFFICIENCY AND UTILIZATION OF MEMORY BANDWIDTH IN RECONFIGURABLE HARDWARE

Commissioner for Patents
P.O. Box 1450
Alexandria, VA 22313-1450

APPLICATION ELEMENTS

1. [X] Fee Transmittal Form (submit an original and a duplicate for fee processing)
2. [ ] Applicant claims small entity status. See 37 CFR 1.27
3. [X] Specification [total pages 25] (preferred arrangement set forth below)
       - Descriptive title of the invention
       - Cross references to related applications
       - Statement regarding federally sponsored R&D
       - Reference to sequence listing, a table, or a computer program listing appendix
       - Background of the invention
       - Brief summary of the invention
       - Brief description of the drawings
       - Detailed description
       - Claim(s)
       - Abstract of the disclosure
4. [X] Drawing(s) [total sheets 12]
5. [X] Oath or Declaration [total pages 3]
       a. [X] Newly executed (original or copy)
       b. [ ] Copy from prior application (37 C.F.R. § 1.63(d)) (for continuation/divisional with Box 18 completed)
              i. [ ] DELETION OF INVENTOR(S): signed statement attached deleting inventor(s) named in the prior application; see 37 C.F.R. §§ 1.63(d)(2) and 1.33(b)
6. [ ] Application Data Sheet (see 37 CFR 1.76)
7. [ ] CD-ROM or CD-R in duplicate, large table or computer program (appendix)
8. Nucleotide and/or Amino Acid Sequence Submission (if applicable, all necessary)
       a. [ ] Computer Readable Form
       b. [ ] Specification Sequence Listing on:
              i.  [ ] CD-ROM or CD-R (2 copies); or
              ii. [ ] paper
       c. [ ] Statements verifying identity of above copies

ACCOMPANYING APPLICATION PARTS

9.  [X] Assignment Papers (cover sheet & document(s))
10. [ ] 37 C.F.R. § 3.73(b) Statement (when there is an assignee); [X] Power of Attorney
11. [ ] English Translation Document
12. [ ] Information Disclosure Statement & Form PTO/SB/08A; [ ] Copies of IDS Citations
13. [ ] Preliminary Amendment
14. [X] Return Receipt Postcard (MPEP 503)
15. [ ] Certified Copy of Priority Document(s)
16. [ ] Nonpublication Request Under 35 U.S.C. § 122(b)(2)(B)(i). Applicant must attach form PTO/SB/35.
17. [X] Other: Certificate of Mailing by Express Mail

18. If a CONTINUING APPLICATION, check appropriate box, and supply the requisite information below and in a preliminary amendment, or in an Application Data Sheet under 37 C.F.R. 1.76:
    [ ] Continuation   [ ] Divisional   [ ] Continuation-in-part (CIP)   of prior application No.: ______
    Prior application information: Examiner: ______   Group/Art Unit: ______
    FOR CONTINUATION OR DIVISIONAL APPS ONLY: The entire disclosure of the prior application, from which an oath or declaration is supplied under Box 5b, is considered a part of the disclosure of the accompanying continuation or divisional application and is hereby incorporated by reference. The incorporation can only be relied upon when a portion has been inadvertently omitted from the submitted application parts.

19. CORRESPONDENCE ADDRESS
    [X] Customer Number 25235   or   [ ] Correspondence address below

\\\CS - 80'1041'0033 - 68264 v1

XILINX 1002
EXPRESS MAIL NO. EV331755319US

Attorney Docket No. SRC028
Client/Matter No. 804040033001

IN THE UNITED STATES PATENT AND TRADEMARK OFFICE

In re Application of:

Daniel Poznanovic, David E. Caliga,
and Jeffrey Hammes

Serial No.: NEW

Filed: Herewith

For: SYSTEM AND METHOD OF ENHANCING EFFICIENCY AND UTILIZATION OF MEMORY BANDWIDTH IN RECONFIGURABLE HARDWARE

CERTIFICATE OF MAILING BY EXPRESS MAIL

Commissioner for Patents
P.O. Box 1450
Alexandria, VA 22313-1450

Sir:

The undersigned hereby certifies that the following documents:

1. Utility Patent Application Transmittal;
2. Fee Transmittal and $928 filing fee;
3. Utility Patent Application: 22 pgs. Spec., 3 pgs. Claims, 1 pg. Abstract;
4. Executed Declaration for Utility Patent Application;
5. 12 sheets of drawings;
6. Recordation Form Cover Sheet PTO-1595 with Executed Assignment and Recording Fee of $40.00;
7. Return postcard; and
8. Certificate of Mailing by Express Mail

relating to the above application, were deposited as "Express Mail," Mailing Label No. EV331755319US, with the United States Postal Service, addressed to Commissioner for Patents, P.O. Box 1450, Alexandria, VA 22313-1450.

William J. Kubida, Reg. No. 29,664

HOGAN & HARTSON LLP
One Tabor Center
1200 17th Street, Suite 1500
Denver, Colorado 80202
(719) 448-5909 Tel
(303) 899-7333 Fax
FEE TRANSMITTAL for FY 2004
Effective 10/01/2003. Patent fees are subject to annual revision.

Complete if Known:
  First Named Inventor: Daniel Poznanovic et al.
  Attorney Docket No.: SRC028
  [ ] Applicant claims small entity status. See 37 CFR 1.27

TOTAL AMOUNT OF PAYMENT: ($) 968.00

METHOD OF PAYMENT (check all that apply)
  [X] check   [ ] credit card   [ ] money order   [ ] other   [ ] none
  [ ] Deposit Account
  The Director is authorized to (check all that apply):
    [ ] Charge fee(s) indicated below
    [X] Credit any overpayments
    [X] Charge any additional fee(s) or any underpayment of fee(s)
    [ ] Charge fee(s) indicated below, except for the filing fee, to the above-identified deposit account

FEE CALCULATION
  1. BASIC FILING FEE: Utility filing fee, Fee Paid: 770.00
  2. EXTRA CLAIM FEES FOR UTILITY AND REISSUE: SUBTOTAL (2): ($) 158.00
  3. ADDITIONAL FEES: Recording each patent assignment per property (times number of properties): SUBTOTAL (3): ($) 40.00

SUBMITTED BY
  Name (Print/Type): William J. Kubida (Attorney/Agent)
  Registration No.: 29,664
PATENT APPLICATION
ATTORNEY DOCKET NO. SRC028
Client/Matter No. 804040033001

Express Mail Label No. EV331755319US

SYSTEM AND METHOD OF ENHANCING EFFICIENCY
AND UTILIZATION OF MEMORY BANDWIDTH IN
RECONFIGURABLE HARDWARE

1. Related Applications.

[0001] The present invention claims the benefit of U.S. Provisional Patent Application Serial No. 60/479,339, filed on June 18, 2003, which is incorporated herein by reference in its entirety.

BACKGROUND OF THE INVENTION

1. Field of the Invention.

[0002] The present invention relates, in general, to enhancing the efficiency and utilization of memory bandwidth in reconfigurable hardware. More specifically, the invention relates to implementing explicit memory hierarchies in reconfigurable processors that make efficient use of off-board, on-board, and on-chip storage and available algorithm locality. These explicit memory hierarchies avoid many of the tradeoffs and complexities found in the traditional memory hierarchies of microprocessors.

2. Relevant Background.

[0003] Over the past 30 years, microprocessors have enjoyed annual performance gains averaging about 50% per year. Most of the gains can be attributed to higher processor clock speeds, more memory bandwidth, and increasing utilization of instruction level parallelism (ILP) at execution time.

[0004] As microprocessors and other dense logic devices (DLDs) consume data at ever-increasing rates, it becomes more of a challenge to design memory hierarchies that can keep up. Two measures of the gap between the microprocessor and memory hierarchy are bandwidth efficiency and bandwidth utilization. Bandwidth efficiency refers to the ability to exploit available locality in a program or algorithm. In the ideal situation, when there is maximum bandwidth efficiency, all available locality is utilized. Bandwidth utilization refers to the amount of memory bandwidth that is utilized during a calculation. Maximum bandwidth utilization occurs when all available memory bandwidth is utilized.

[0005] Potential performance gains from using a faster microprocessor can be reduced or even negated by a corresponding drop in bandwidth efficiency and bandwidth utilization. Thus, there has been significant effort spent on the development of memory hierarchies that can maintain high bandwidth efficiency and utilization with faster microprocessors.

[0006] One approach to improving bandwidth efficiency and utilization in memory hierarchies has been to develop ever more powerful processor caches. These caches are high-speed memories (typically SRAM) in close proximity to the microprocessor that try to keep copies of instructions and data the microprocessor may soon need. The microprocessor can store and retrieve data from the cache at a much higher rate than from a slower, more distant main memory.

[0007] In designing cache memories, there are a number of considerations to take into account. One consideration is the width of the cache line. Caches are arranged in lines to help hide memory latency and exploit spatial locality. When a load suffers a cache miss, a new cache line is loaded from main memory into the cache. The assumption is that a program being executed by the microprocessor has a high degree of spatial locality, making it likely that other memory locations in the cache line will also be required.

[0008] For programs with a high degree of spatial locality (e.g., stride-one access), wide cache lines are more efficient since they reduce the number of times a processor has to suffer the latency of a memory access. However, for programs with lower levels of spatial locality, or random access, narrow lines are best as they reduce the wasted bandwidth from the unused neighbors in the cache line. Caches designed with wide cache lines perform well with programs that have a high degree of spatial locality, but generally have poor gather/scatter performance. Likewise, caches with short cache lines have good gather/scatter performance, but lose efficiency executing programs with high spatial locality because of the additional runs to the main memory.

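The line-width tradeoff above can be made concrete with a small simulation. The following Python sketch is purely illustrative (the trace lengths, strides, and line widths are invented, and the model assumes each miss fills a whole line with no capacity evictions); it counts the words a line-fill memory moves for a stride-one trace versus a scattered gather:

```python
# Estimate line-fill bandwidth efficiency for wide vs. narrow cache lines.

def words_transferred(addresses, line_words):
    """Words fetched from memory, assuming each miss fills a whole line
    and the cache is large enough that no line is fetched twice."""
    lines = {a // line_words for a in addresses}   # distinct lines touched
    return len(lines) * line_words

stride_one = list(range(1024))                     # high spatial locality
gather = [(i * 389) % 16384 for i in range(1024)]  # scattered accesses

for line in (4, 32):                               # narrow vs. wide line
    for name, trace in (("stride-1", stride_one), ("gather", gather)):
        useful = len(set(trace))                   # contributory words
        moved = words_transferred(trace, line)
        print(f"line={line:2d} {name:8s} efficiency={useful / moved:.2f}")
```

For the stride-one trace every word of every fetched line is used, so efficiency stays at 1.0 regardless of line width; for the scattered trace, widening the line only drags in more unused neighbors.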
[0009] Another consideration in cache design is cache associativity, which refers to the mapping between locations in main memory and cache sectors. At one extreme of cache associativity is a direct-mapped cache, while at another extreme is a fully associative cache. In a direct-mapped cache, a specific memory location can be mapped to only a single cache line. Direct-mapped caches have the advantage of being fast and easy to construct in logic. The disadvantage is that they suffer the maximum number of cache conflicts. At the other extreme, a fully associative cache allows a specific location in memory to be mapped to any cache line. Fully associative caches tend to be slower and more complex due to the large amount of comparison logic they need, but suffer no cache conflict misses. Oftentimes, caches fall between the extremes of direct-mapped and fully associative caches. A design point between the extremes is a k-set associative cache, where each memory location can map to k cache sectors. These caches generally have less overhead than fully associative caches, and reduce cache conflicts by increasing the value of k.

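The effect of associativity on conflict misses can be sketched in software. This Python model (the cache sizes and the two-line access pattern are hypothetical, chosen so the lines collide) compares a direct-mapped cache with a 2-way set-associative cache of equal capacity, using LRU within each set:

```python
from collections import OrderedDict

def misses(trace, n_sets, ways):
    """Count misses for a trace of line numbers in an n_sets x ways cache."""
    sets = [OrderedDict() for _ in range(n_sets)]
    miss = 0
    for line in trace:
        s = sets[line % n_sets]          # set index = line mod n_sets
        if line in s:
            s.move_to_end(line)          # refresh LRU position
        else:
            miss += 1
            if len(s) == ways:
                s.popitem(last=False)    # evict least recently used
            s[line] = True
    return miss

# Two hot lines that map to the same set in a direct-mapped cache:
trace = [0, 64, 0, 64] * 8
print(misses(trace, n_sets=64, ways=1))  # direct-mapped: every access conflicts
print(misses(trace, n_sets=32, ways=2))  # 2-way, same capacity: only cold misses
```

With k = 1 the two lines evict each other on every access; doubling k (halving the set count, keeping capacity constant) lets both reside in the set, leaving just the two compulsory misses.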
[0010] Another consideration in cache design is how cache lines are replaced due to a capacity or conflict miss. In a direct-mapped cache, there is only one possible cache line that can be replaced due to a miss. However, in caches with higher levels of associativity, cache lines can be replaced in more than one way. The way the cache lines are replaced is referred to as the replacement policy.

[0011] Options for the replacement policy include least recently used (LRU), random replacement, and first in-first out (FIFO). LRU is used in the majority of circumstances where the temporal locality set is smaller than the cache size, but it is normally more expensive to build in hardware than a random replacement cache. An LRU policy can also quickly degrade depending on the working set size. For example, consider an iterative application with a matrix size of N bytes running through an LRU cache of size M bytes. If N is less than M, then the policy has the desired behavior of 100% cache hits; however, if N is only slightly larger than M, the LRU policy results in 0% cache hits as lines are removed just as they are needed.

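The N-versus-M behavior just described is easy to reproduce in a toy simulator. In this Python sketch (sizes and pass count chosen arbitrarily for illustration), an LRU cache of M lines is swept repeatedly by a working set of N lines:

```python
from collections import OrderedDict

def lru_hit_rate(n_lines, cache_lines, passes=4):
    """Hit rate for repeated sequential sweeps of n_lines through an LRU
    cache holding cache_lines lines."""
    cache = OrderedDict()
    hits = total = 0
    for _ in range(passes):
        for line in range(n_lines):
            total += 1
            if line in cache:
                hits += 1
                cache.move_to_end(line)          # mark most recently used
            else:
                if len(cache) == cache_lines:
                    cache.popitem(last=False)    # evict the LRU line
                cache[line] = True
    return hits / total

print(lru_hit_rate(60, 64))  # N < M: every access after the first pass hits
print(lru_hit_rate(65, 64))  # N slightly > M: no hits at all after cold start
```

When N exceeds M by even one line, LRU evicts each line exactly one step before the sweep returns to it, so the hit rate collapses to zero rather than degrading gracefully.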
[0012] Another consideration is deciding on a write policy for the cache. Write-through caches send data through the cache hierarchy to main memory. This policy reduces cache coherency issues for multiple processor systems and is best suited for data that will not be re-read by the processor in the immediate future. In contrast, write-back caches place a copy of the data in the cache, but do not immediately update main memory. This type of caching works best when data just written to the cache is quickly requested again by the processor.

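The difference between the two write policies can be quantified with a simple traffic count. The Python sketch below is an idealized model (one-word lines, no read traffic, arbitrary eviction order, and a final flush; all parameters invented for the example) that tallies main-memory write transactions for the same store stream under each policy:

```python
def write_through(stores):
    """Every store is forwarded to main memory immediately."""
    return len(stores)

def write_back(stores, cache_lines=64):
    """Stores dirty the cache; memory is written only on eviction/flush."""
    dirty, writes = set(), 0
    for addr in stores:
        if addr not in dirty and len(dirty) == cache_lines:
            dirty.pop()              # evict an arbitrary dirty line
            writes += 1              # its contents are written back
        dirty.add(addr)
    return writes + len(dirty)       # plus the final flush of dirty lines

stores = [i % 8 for i in range(1000)]  # the same 8 words rewritten repeatedly
print(write_through(stores))           # one memory write per store
print(write_back(stores))              # one write per dirty line at flush
```

For a stream that rewrites a small hot set, write-back coalesces a thousand stores into a handful of memory transactions, which is exactly the case the paragraph describes as favoring it.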
[0013] In addition to write-through and write-back caches, another kind of write policy is implemented in a write-allocate cache, where a cache line is allocated on a write that misses in cache. Write-allocate caches improve performance when the microprocessor exhibits a lot of write-followed-by-read behavior. However, when writes are not subsequently read, a write-allocate cache has a number of disadvantages: when a cache line is allocated, it is necessary to read the remaining values from main memory to complete the cache line. This adds unnecessary memory read traffic during store operations. Also, when the data is not read again, potentially useful data in the cache is displaced by the unused data.

[0014] Another consideration is made between the size and the speed of the cache: small caches are typically much faster than larger caches, but store less data and fewer instructions. Less data means a greater chance the cache will not have the data the microprocessor is requesting (i.e., a cache miss), which can slow everything down while the data is being retrieved from the main memory.

[0015] Newer cache designs reduce the frequency of cache misses by trying to predict in advance the data that the microprocessor will request. An example of this type of cache is one that supports speculative execution and branch prediction. Speculative execution allows instructions that likely will be executed to start early based on branch prediction. Results are stored in a cache called a reorder buffer and retired if the branch was correctly predicted. Of course, when mis-predictions occur, instruction and data bandwidth are wasted.

[0016] There are additional considerations and tradeoffs in cache design, but it should be apparent from the considerations described hereinbefore that it is very difficult to design a single cache structure that is optimized for many different programs. This makes cache design particularly challenging for a multipurpose microprocessor that executes a wide variety of programs. Cache designers try to derive the program behavior of an "average" program constructed from several actual programs that run on the microprocessor. The cache is optimized for the average program, but no actual program behaves exactly like the average program. As a result, the designed cache ends up being sub-optimal for nearly every program actually executed by the microprocessor. Thus, there is a need for memory hierarchies that have data storage and retrieval characteristics that are optimized for actual programs executed by a processor.

[0017] Designers trying to develop ever more efficient caches optimized for a variety of actual programs also face another problem: as caches add additional features, the overhead needed to implement the added features also grows. Caches today have so much overhead that microprocessor performance may be reaching a point of diminishing returns as the overhead starts to cut into performance. In the Intel Pentium III processor, for example, more than half of the 10 million transistors are dedicated to instruction cache, branch prediction, out-of-order execution and superscalar logic. The situation has prompted predictions that as microprocessors grow to a billion transistors per chip, performance increases will drop to about 20% per year. Such a prediction, if borne out, could have a significant impact on technology growth and the computer business.

[0018] Thus, there is a growing need to develop improved memory hierarchies that limit the overhead of a memory hierarchy without also reducing bandwidth efficiency and utilization.

SUMMARY OF THE INVENTION

[0019] Accordingly, an embodiment of the invention includes a reconfigurable processor that includes a computational unit and a data access unit coupled to the computational unit, where the data access unit retrieves data from an on-processor memory and supplies the data to the computational unit, and where the computational unit and the data access unit are configured by a program.

[0020] The present invention also involves a reconfigurable processor that includes a first memory of a first type and a data prefetch unit coupled to the memory, where the data prefetch unit retrieves data from a second memory of a second type different from the first type, and the first and second memory types and the data prefetch unit are configured by a program.

[0021] Another embodiment of the invention includes a reconfigurable hardware system that includes a common memory, also referred to as external memory, and one or more reconfigurable processors coupled to the common memory, where at least one of the reconfigurable processors includes a data prefetch unit to read and write data between the unit and the common memory, and where the data prefetch unit is configured by a program executed on the system.

[0022] Another embodiment of the invention includes a method of transferring data that includes transferring data between a memory and a data prefetch unit in a reconfigurable processor, transferring data between the prefetch unit and a data access unit, and transferring the data between a computational unit and the data access unit, where the computational unit, data access unit and the data prefetch unit are configured by a program.

[0023] Additional embodiments of the invention are set forth in part in the description that follows, and in part will become apparent to those skilled in the art upon examination of the following specification, or may be learned by the practice of the invention. The advantages of the invention may be realized and attained by means of the instrumentalities, combinations, compositions, and methods particularly pointed out in the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[0024] Figure 1 shows a reconfigurable processor in which the present invention may be implemented;

[0025] Figure 2 shows computational logic as might be loaded into a reconfigurable processor;

[0026] Figure 3 shows a reconfigurable processor as in Figure 1, but with the addition of data access units;

[0027] Figure 4 shows a reconfigurable processor as in Figure 3, but with the addition of data prefetch units;

[0028] Figure 5 shows a reconfigurable processor with the inclusion of external memory;

[0029] Figure 6 shows reconfigurable processors with external memory and with an intelligent memory controller;

[0030] Figure 7 shows a reconfigurable processor having a combination of data prefetch units and data access units feeding computational logic;

[0031] Figure 8 shows the bandwidth efficiency and utilization gains obtained when utilizing a data prefetch unit and an intelligent memory controller to perform strided memory references;

[0032] Figure 9A and Figure 9B show the bandwidth efficiency and utilization gains obtained when utilizing a data prefetch unit and an intelligent memory controller to perform subset memory references in the X-Y plane;

[0033] Figure 10A and Figure 10B show the bandwidth efficiency and utilization gains obtained when utilizing a data prefetch unit and an intelligent memory controller to perform subset memory references in the X-Z plane;

[0034] Figure 11A and Figure 11B show the bandwidth efficiency and utilization gains obtained when utilizing a data prefetch unit and an intelligent memory controller to perform subset memory references in the Y-Z plane;

[0035] Figure 12A and Figure 12B show the bandwidth efficiency and utilization gains obtained when utilizing a data prefetch unit and an intelligent memory controller to perform subset memory references in a mini-cube;

[0036] Figure 13 shows the bandwidth efficiency and utilization gains obtained when utilizing a data prefetch unit and an intelligent memory controller to perform indirect memory references;

[0037] Figure 14 shows the bandwidth efficiency and utilization gains obtained when utilizing a data prefetch unit and an intelligent memory controller to perform strided memory references together with computation.

DETAILED DESCRIPTION

1. Definitions:

[0038] Direct execution logic (DEL) — is an assemblage of dynamically reconfigurable functional elements that enables a program to establish an optimized interconnection among selected functional units in order to implement a desired computational, data prefetch and/or data access functionality for maximizing the parallelism inherent in the particular code.

[0039] Reconfigurable Processor — is a computing device that contains reconfigurable components such as FPGAs and can, through reconfiguration, instantiate an algorithm as hardware.

[0040] Reconfigurable Logic — is composed of an interconnection of functional units, control, and storage that implements an algorithm and can be loaded into a Reconfigurable Processor.

[0041] Functional Unit — is a set of logic that performs a specific operation. The operation may for example be arithmetic, logical, control, or data movement. Functional units are used as building blocks of reconfigurable logic.

[0042] Macro — is another name for a functional unit.

[0043] Memory Hierarchy — is a collection of memories.

[0044] Data prefetch Unit — is a functional unit that moves data between members of a memory hierarchy. The movement may be as simple as a copy, or as complex as an indirect indexed strided copy into a unit stride memory.

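The two extremes named in this definition, a plain copy and an indirect indexed strided copy into unit-stride storage, can be sketched as follows. Python is used purely to illustrate the data movement; the function name, signature, and example values are invented:

```python
def prefetch_strided(memory, base, stride, count, index=None):
    """Gather `count` elements into a dense, unit-stride list.

    Without `index`: element i comes from memory[base + i * stride].
    With `index` (indirect indexed strided copy): element i comes from
    memory[base + index[i] * stride].
    """
    if index is None:
        return [memory[base + i * stride] for i in range(count)]
    return [memory[base + index[i] * stride] for i in range(count)]

mem = list(range(100))
print(prefetch_strided(mem, base=0, stride=10, count=5))  # [0, 10, 20, 30, 40]
print(prefetch_strided(mem, 0, 10, 3, index=[7, 2, 5]))   # [70, 20, 50]
```

In either form the destination is unit-stride, so downstream computational logic can consume the gathered operands sequentially regardless of how scattered they were in the source memory.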
[0045] Data access Unit — is a functional unit that accesses a component of a memory hierarchy, and delivers data directly to computational logic.

[0046] Intelligent Memory Control Unit — is a control unit that has the ability to select data from its storage according to a variety of algorithms that can be selected by a data requester, such as a data prefetch unit.

[0047] Bandwidth Efficiency — is defined as the percentage of contributory data transferred between two points. Contributory data is data that actually participates in the recipient's processing.

[0048] Bandwidth Utilization — is defined as the percentage of maximum bandwidth between two points that is actually used to pass contributory data.

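Expressed numerically, the two definitions are straightforward percentages; the following Python fragment uses made-up example figures (a 64-byte line fill of which 8 bytes are used, and a hypothetical 1 GB/s link) rather than data from the application:

```python
def bandwidth_efficiency(contributory_bytes, transferred_bytes):
    """Percentage of transferred data that the recipient actually uses."""
    return 100.0 * contributory_bytes / transferred_bytes

def bandwidth_utilization(contributory_bytes, seconds, peak_bytes_per_s):
    """Percentage of peak link bandwidth spent passing contributory data."""
    return 100.0 * contributory_bytes / (seconds * peak_bytes_per_s)

# A cache-line fill moves 64 bytes but the computation uses only 8 of them:
print(bandwidth_efficiency(8, 64))           # 12.5
# 2 GB of contributory data moved in 4 s over a 1 GB/s link:
print(bandwidth_utilization(2e9, 4.0, 1e9))  # 50.0
```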
2. Description

[0049] A reconfigurable processor (RP) 100 implements direct executable logic (DEL) to perform computation, as well as a memory hierarchy for maintaining input data and computational results. DEL is an assemblage of dynamically reconfigurable functional elements that enables a program to establish an optimized interconnection among selected functional units in order to implement a desired computational, data prefetch and/or data access functionality for maximizing the parallelism inherent in the particular code. The term DEL may also be used to refer to the set of constructs such as code, data, configuration variables, and the like that can be loaded into RP 100 to cause RP 100 to implement a particular assemblage of functional elements.

[0050] Figure 1 presents an RP 100, which may be implemented using field programmable gate arrays (FPGAs) or other reconfigurable logic devices, that can be configured and reconfigured to contain functional units and interconnecting circuits, and a memory hierarchy comprising on-board memory banks 104, on-chip block RAM 106, registers, wires, and a connection 108 to external memory. On-chip reconfigurable components 102 create memory structures such as registers, FIFOs, wires and arrays using block RAM. Dual-ported memory 106 is shared between on-chip reconfigurable components 102. The reconfigurable processor 100 also implements user-defined computational logic (e.g., such as DEL 200 shown in Figure 2) constructed by programming an FPGA to implement a particular interconnection of computational functional units. In a particular implementation, a number of RPs 100 are implemented within a memory subsystem of a conventional computer, such as on devices that are physically installed in dual inline memory module (DIMM) sockets of a computer. In this manner the RPs 100 can be accessed by memory operations and so coexist well with a more conventional hardware platform. It should be noted that, although the exemplary implementation of the present invention illustrated includes six banks of dual ported memory 104 and two reconfigurable components 102, any number of memory banks and/or reconfigurable components may be used depending upon the particular implementation or application.

[0051] Any computer program, including complex graphics processing programs, word processing programs, database programs and the like, is a collection of algorithms that interact to implement desired functionality. In the common case in which static computing hardware resources are used (e.g., a conventional microprocessor), the computer program is compiled into a set of executable code (i.e., object code) units that are linked together to implement the computer program on the particular hardware resources. The executable code is generated specifically for a particular hardware platform. In this manner, the computer program is adapted to conform to the limitations of the static hardware platform. However, the compilation process makes many compromises based on the limitations of the static hardware platform.

[0052] Alternatively, an algorithm can be defined in a high level language then compiled into DEL. DEL can be produced via a compiler from high level programming languages such as C or FORTRAN, or may be designed using a hardware definition language such as Verilog, VHDL or a schematic capture tool. Computation is performed by reconfiguring a reconfigurable processor with the DEL and flowing data through the computation. In this manner, the hardware resources are essentially adapted to conform to the program rather than the program being adapted to conform to the hardware resources.

[0053] For purposes of this description a single reconfigurable processor will be presented first. A sample of computational logic 201 is shown in Figure 2. This simple assemblage of functional units performs computation of two results ("A+B" and "A+B-(B*C)") from three input variables or operands "A", "B" and "C". In practice, computational units 201 can be implemented to perform very simple or arbitrarily complex computations. The input variables (operands) and output or result variables may be of any size necessary for a particular application. Theoretically, any number of operands and result variables may be used/generated by a particular DEL. Great complexity of computation can be supported by adding additional reconfigurable chips and processors.

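A software stand-in for the Figure 2 assemblage can make the dataflow explicit. The Python function below only models the computation (in the RP this is parallel pipelined hardware producing one result pair per clock, with the A+B adder's output feeding the second adder):

```python
def compute(a, b, c):
    """Model of the Figure 2 functional-unit assemblage."""
    s = a + b            # adder functional unit: A+B
    p = b * c            # multiplier functional unit: B*C
    return s, s - p      # second adder reuses the A+B result: A+B-(B*C)

print(compute(2, 3, 4))  # (5, -7)
```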
[0054] For greatest performance the DEL 200 is constructed as parallel pipelined logic blocks composed of computational functional units capable of taking data and producing results with each clock pulse. The highest possible performance that can be achieved is computation of a set of results with each clock pulse. To achieve this, data should be available at the same rate the computation can consume the data. The rate at which data can be supplied to DEL 200 is determined, at least in significant part, by the memory bandwidth utilization and efficiency. Maximal computational performance can be achieved with parallel and pipelined DEL together with maximizing the memory bandwidth utilization and efficiency. Unlike conventional static hardware platforms, however, the memory hierarchy provided in an RP 100 is reconfigurable. In accordance with the present invention, through the use of data access units and associated memory hierarchy components, computational demands and memory bandwidth can be matched.

[0055] High memory bandwidth efficiency is achieved when only data required for computation is moved within the memory hierarchy. Figure 3 shows a simple logic block 300 comprising computational functional units 301, control (not shown), and data access functional units 303. The data access unit 303 presents data directly to the computational logic 301. In this manner, data is moved from a memory device 305 to the computational logic and from the computational logic back into a memory device 305 or block RAM memory 307 within an RP 100.

[0056] Figure 4 illustrates the logic block 300 with the addition of a data prefetch unit 401. The data prefetch unit 401 moves data from one member of the memory hierarchy 305 to another 308. Data prefetch unit 401 operates independently of other functional units 301, 302 and 303 and can therefore operate prior to, in parallel with, or after computational logic. This independence of operation permits hiding the latency associated with obtaining data for use in computation. The data prefetch unit deposits data into the memory hierarchy within RP 100, where computational logic 301, 302 and 303 can access it through data access units. In the example of Figure 4, prefetch unit 401 is configured to deposit data into block RAM memory 308. Hence, the prefetch units 401 may be operated independently of logic block 300 that uses the prefetched data.

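The latency hiding that this independence permits can be illustrated with a simple cycle count. The figures below are invented for the example, and a double-buffered steady state is assumed (the fetch of block i+1 overlaps the computation on block i):

```python
def total_cycles(blocks, fetch_cycles, compute_cycles, prefetch):
    """Cycles to process `blocks` data blocks with or without a prefetch
    unit running concurrently with the computational logic."""
    if not prefetch:
        # Without a prefetch unit, fetch and compute strictly alternate.
        return blocks * (fetch_cycles + compute_cycles)
    # With a prefetch unit, each steady-state step costs the longer of the
    # two overlapped activities; only the first fetch latency is exposed.
    return fetch_cycles + blocks * max(fetch_cycles, compute_cycles)

print(total_cycles(100, fetch_cycles=50, compute_cycles=60, prefetch=False))  # 11000
print(total_cycles(100, fetch_cycles=50, compute_cycles=60, prefetch=True))   # 6050
```

Once fetch latency is fully overlapped, throughput is limited by the slower of the memory and the computation rather than by their sum, which is the effect the paragraph attributes to the independent prefetch unit.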
[0057] An important feature of the present invention is that many types of data prefetch units can be defined so that the prefetch hardware can be configured to conform to the needs of the algorithms currently implemented by the computational logic. The specific characteristics of