`
`3.
`
`2. ©)
`
`EK)
`
`REECEECR CLEMENTS
`& Fee Transmittal Form
`(submit an original and a duplicate for fee processing)
`Applicant claims small entity status.
`See 37 CFR 1.27
` totalpages_26 J
`Specification[
`(preferred Arrangementsetforth below)
`- Descriptivetitle of the Invention
`- Cross References to Related Applications
`- Statement Regarding Fed sponsored R&D
`- Reference to sequence listing, a table, ora
`computer programlisting appendix
`- Background of the Invention
`- Brief Summary of the Invention
`
`
`
`
`
`
`
`Statements verifying identity of above copies
`[]
`
`
`ACCOMPANYING APPLICATION PARTS
`
`
`- Brief Description of the Drawings
`[xX] Assignment Papers (coversheet/document(s))
`9.
`
`
`Power of
`10.0)
`37 CFR. 3.73(b) Statement
`—<]
`- Detailed Description
`
`
`(when there is an assignee)
`Attorney
`- Claim(s)
`
`
`- Abstract of the Disclosure 11.(]—English Translation Document
`
`
`4.
`[J Drawing(s)[ total sheets 12 ]
`12.[]
`‘IDS & Form PTO/SB/08A
`(7) Copies of IDS
`
`
`
`Citations
`13. oO
`Preliminary Amendment
`
`5. & Oathor Declaration[
`totalpages_3_ ]
`
`
`a. J Newly executed (original or copy)
`
`
`
`14. [x] Return Receipt Postcard (MPEP 503)
`b. [J Copy from prior appl. (37 C.F.R. § 1.63(d))
`
`15. L] Certified Copy of Priority Document(s)
`
`
`(for continuation/divisional with Box 18 completed)
`16. [1] Nonpublication Request Under 35 USC
`[JDELETION OF INVENTOR(S)
`
`
`Signed statement attached deleting
`122(b)(2)(B)(i). Applicant must attach form PTO/SB/35
`:
`aes
`en
`
`
`inventor(s} namedin prior application,
`17. &X Other: Certificate of Mailing by Express Mail
`see 37 C.F.R. §§ 1.63(d)(2
`
`
`
`18. Ifa CONTINUING APPLICATION, check appropriate box, and supply the requisite information below and ina
`preliminary amendment, or in an Application Data Sheet under 37 CFR 1.76:
`
`vO9L90IQ
`PTO/SB/05 (04-04)
`Approved for use through 07/31/2006
`
`
`
`Attorney Docket No.
`SRC028
`UTILITY
`
`PATENT APPLICATION
`Daniel Poznanovic etal.
`
`
`TRANSMITTAL
`SYSTEM AND METHOD OF ENHANCING
`
`(Only for new nonprovisional applications under
`EFFICIENCY AND UTILIZATION OF
`061604ae
`37 CFR 1.53(b))
`MEMORY BANDWIDTHIN
`
`RECONFIGURABLE HARDWARE
`
`
`Express Mail Label No.|EV331755319US 3S
`
`Commissionerfor Patents
`S
`
`
`P.O. Box 1450
`©
`Alexandria, VA 22313-1450
`om
`
`
`6. (J Application Data Sheet. (See 37 CFR 1.76
`
`
`
`
`
`
`
`
`
`
`First Inventor
`
`
`
`
`
`
`7. (J CD-ROMor CD-Rin duplicate, large table
`or Computer Program (Appendix)
`8. Nucleotide and/or Amino Acid Sequence Submission
`(if applicable, all necessary)
`a.
`[1
`Computer Readable Form
`b.
`C1
`Specification Sequence Listing on:
`i. (J CD-ROM or CD-R(2 copies); or
`ii. paper
`
`Cc.
`
`i.
`
`[] Continuation [] Divisional [ Continuation-in-part (CIP)
`
`of prior application No.: __/
`
`
`
`
`
`Group/Art Unit:
`Examiner:
`Prior application information:
`FOR CONTINUATION OR DIVISIONAL APPS only: The entire disclosure of the prior application, from which an oath or declaration is supplied under Box 5b, is considered a part of
`
`
`the disclosure of the accompanying continuation or divisional application andis hereby incorporated by reference. The incorporation can only be relied upon when a portion has been -
`
`
`inadvertently omitted form the submitted applicalion parts.
` 19. CORRESPONDENCE ADDRESS
`
`
`Customer Number 25235
`or [) Correspondence address below
`
`
`
`
`CS
`aa
`
`Name(Print/Type)|RegistrationNo.=No.|RegistrationNo.==|29,6640664
`
`oomITae
`a4 cy 4
`
`
`
`AACS - 80404/0039 - 68264 v1
`
`1
`
`XILINX 1002
`
`XILINX 1002
`
`1
`
`
`
`EXPRESS MAIL NO. EV331755319US
`Attorney Docket No. SRC028
`Client/Matter No. 80404.0033.001
`
`IN THE UNITED STATES PATENT AND TRADEMARK OFFICE
`
`In re Applicationof:
`
`Daniel Poznanovic, David E. Caliga,
`and Jeffrey Hammes
`
`Serial No. NEW
`
` Filed: Herewith
`
`For: SYSTEM AND METHOD OF
`ENHANCING EFFICIENCY AND
`UTILIZATION OF MEMORY
`BANDWIDTH IN RECONFIGURABLE
`HARDWARE
`
`CERTIFICATE OF MAILING BY EXPRESS MAIL
`
`Commissioner for Patents
`P.O. Box 1450
`Alexandria, VA 22313-1450
`
`The undersigned herebycertifies that the following documents:
`Utility Patent Application Transmittal;
`Fee Transmittal and $928 filing fee;
`Utility Patent Application- 22 pgs. Spec, 3 pgs. Claims, 1 pg. Abstract;
`Executed Declaration for Utility Patent Application;
`12 sheets of drawings;
`Recordation Form Cover Sheet PTO 1595 with Executed
`Assignment and Recording Fee of $40.00;
`Return postcard; and
`a.
`Certificate of Mailing By Express Mail
`8.
`relating to the above application, were deposited as "Express Mail", Mailing Label
`No. EV331755319US, with the United States Postal Service, addressed to Commissioner
`for Patents, P.O. Box 1450, Alexandria, VA 22313-1450.
` <
`
`ee
`WOK
`ML
`
`ZYLPSE
`6
`
`Wilham J. Kubida, Reg. No. 29,664
`
`
`
`1
`2
`3.
`4,
`5
`6
`
`Da
`
`HOGAN & HARTSON up
`One Tabor Center
`1200 17th Street, Suite 1500
`Denver, Colorado 80202
`(719) 448-5909 Tel
`(803) 899-7333 Fax
`
`N\AAGS- 80404/0033 - 68264 v1
`
`2
`
`
`
`for FY 2004
`
`
`
`
`
`
` Ld Apglicant claims small entity status. See 37 CFR 1.27
`TOTAL AMOUNT OF PAYMENT
`($) 968.00
`METHOD OF PAYMENT(checkall that apply)
`fH check CF) credit card (] money order (1) other (J none
`C1 Deposit Account
`ccount
`Number
`
`
`
`Application Number
`Filing Date
`First Named Inventor
`
`Group / Art Unit
`
`Complete if Known
`
`
`
`
`FEE CALCULATION (continued)
`3. ADDITIONAL FEES
`
`Fee Description
`
`Fee Paid
`
`Surcharge — late filing fee or oath
`Surcharge — late provisionalfiling fee or
`cover sheet
`Non-English specification
`Forfiling a request for ex parte
`reexamination
`Requesting publication of SIR prior to
`Examineraction
`Requesting publication of SRI after
`Examiner action
`
`FEE TRANSMITTAL
`
`e 10/01/2003. Patent fees are subject to annual revision
`
`aA
`
`Deposit
`Account
`Name
`
`Hogan & HartsonL.L.P.
`
`The Director is authorized to: (check all that apply)
`(1 Charge fee(s) indicated below Bd Credit any overpayments
`EX] Charge any additionalfee(s) or any underpaymentoffee(s)
`(I Chargefee(s) indicated below, except forthefiling fee to the above-
`identified deposit account
`
`
`
`Extension for reply within first month
`Extension for reply within second month
`Extensionfor reply within third month
`4, BASIC FILING FEE
`Extension for reply within fourth month
`Large
`Small
`Fee Description
`Fee Paid
`Entity Fee|Entity Fee
`$
`$
`
`FEE CALCULATION
`
`770.00
`
`Utility Filing Fee
`Designfiling fee
`
`Plantfiling fee
`
`Reissue filing fee
`
`Extension for reply within fifth month
`Notice of Appeal
`
`Filing a brief in support of an appeal
`Request for oral hearing
`
`Provisionalfiling fee
`
`Petition to institute a public use
`proceeding
`Petition to revive — unavoidable
`Petition to revive — unintentional
`SUBTOTAL (1)|($) 770.00
`Utility issue fee (or reissue)
`2. EXTRA CLAIM FEES FOR UTILITY AND REISSUE
`Fee from
`Foo Paid
`Design issue fee
`Extra Claims.
`Plantissue fee
`ie
`-
`
`Total Claims
`Independent
`Claims ,
`Multiple Dependent
`
`Petitions to the Commissioner
`
`Fee($)
`
`Fee Description
`
`Claimsin excess of 20
`
`Independentclaimsin excess of 3
`
`Multiple dependentclaim,if not paid
`**Reissue independentclaims over
`original patent
`“Reissue claims In excess of 20 and over
`original patent
`
`Other fee (specify)
`
`Processing fee under 37 CFR 1.17(q)
`Submission ofInfo Disclosure Stmt
`Recording each patent assignment per
`property (times number of properties)
`Filing a submission afterfinal rejection (37
`CFR § 1.129(a))
`For each additional invention to be
`examined (37 CFR §1.129(b))
`Request for Continued Examination
`Request for expedited examination of a
`design application
`
`SUBTOTAL (2
`
`“Reduced by Basic FlingFeePaid
`
`SUBTOTAL(3)
`
`
`SUBMITTED BY Completelifapplicable
`Registration No.
`Telephone
`Name (Print/Type
`(Attormey/Agent}
`ifioShaludibaefc\
`ill
`Se
`
`\W\CS - 80404/0033 - 68264 v1
`
`3
`
`
`
`PATENT APPLICATION
`ATTORNEY DOCKETNo. SRC028
`Client/Matter No. 80404.0033.001
`Express Mail Label No. EV331755319US
`
`SYSTEM AND METHOD OF ENHANCING EFFICIENCY
`AND UTILIZATION OF MEMORY BANDWIDTH IN
`RECONFIGURABLE HARDWARE
`
`1.
`
`Related Applications.
`
`[0001] The present
`
`invention claims the benefit of U.S. Provisional Patent
`
`application Serial No. 60/479,339 filed on June 18, 2003, whichis incorporated
`
`herein by referencein its entirety.
`
`BACKGROUNDOF THE INVENTION
`
`1.
`
`Field of the Invention.
`
`[0002] The present invention relates, in general, to enhancing the efficiency and
`
`utilization of memory bandwidth in reconfigurable hardware. More specifically,
`
`the
`
`invention
`
`relates
`
`to implementing
`
`explicit memory hierarchies
`
`in
`
`reconfigurable processors that make efficient use of off-board, on-board, on-
`
`chip storage and available algorithm locality. These explicit memory hierarchies
`
`avoid many of the tradeoffs and complexities found in the traditional memory
`
`hierarchies of microprocessors.
`
`2.
`
`Relevant Background.
`
`. [0003] Over
`
`the past 30 years, microprocessors have enjoyed annual
`
`performance gains averaging about 50% per year. Most of the gains can be
`
`attributed to higher processor clock speeds, more memory bandwidth and
`
`increasing utilization of instruction level parallelism (ILP) at execution time.
`
`[0004]As microprocessors and other dense logic devices (DLDs) consume
`
`data at ever-increasing rates it becomes more of a challenge to design memory
`
`hierarchies that can keep up.
`
`Two measures of the gap between the
`
`microprocessor and memory hierarchy are bandwidth efficiency and bandwidth
`
`\WCS- 80404/0033 - 68254 v2
`
`1
`
`4
`
`
`
`utilization. Bandwidth efficiency refers to the ability to exploit available locality
`
`in a program or algorithm.
`
`In the ideal situation, when there is maximum
`
`bandwidth efficiency, all available locality is utilized. Bandwidth utilization refers
`
`to the amount of memory bandwidth that
`
`is utilized during a calculation.
`
`Maximum bandwidth utilization occurs when all available memory bandwidth is
`
`utilized.
`
`[0005] Potential performance gains from using a faster microprocessor can be
`
`reduced or even negated by a corresponding drop in bandwidth efficiency and
`
`bandwidth utilization. Thus,
`
`there has been significant effort spent on the
`
`development of memory hierarchies that can maintain high bandwidthefficiency
`
`andutilization with faster microprocessors.
`
`[0006]One approach to improving bandwidth efficiency and utilization in
`
`memory hierarchies has been to develop ever more powerful processor
`
`caches. These caches are high-speed memories (typically SRAM) in close
`
`proximity to the microprocessorthat try to keep copies ofinstructions and data
`
`the microprocessor may soon need. The microprocessorcan store and retrieve
`
`data from the cache at a much higher rate than from a slower, more distant
`
`main memory.
`
`[0007]In designing cache memories, there are a numberof considerations to
`
`take into account. One consideration is the width of the cache line. Caches
`
`are arrangedin lines to help hide memory latency and exploit spatial locality.
`
`Whena load suffers a cache miss, a new cacheline is loaded from main
`
`memory into the cache. The assumption is that a program being executed by
`
`the microprocessor has a high degree of spatial locality, making it likely that
`
`other memory locations in the cacheline will also be required.
`
`[0008] For programs with a high degree of spatial
`
`locality (e.g., stride-one
`
`access), wide cachelines are moreefficient since they reduce the numberof
`
`times a processor has to suffer the latency of a memory access. However, for
`
`programs with lower levels of spatial locality, or random access, narrow lines
`
`WWCS - 80404/0033 - 68254 v2
`
`:
`
`-2-
`
`5
`
`
`
`are best as they reduce the wasted bandwidth from the unused neighbors in
`
`the cache line. Caches designed with wide cache lines perform well with
`
`programs that have a high degree of spatial locality, but generally have poor
`gather/scatter performance. Likewise, caches with short cache lines have good
`gather/scatter performance, but loose efficiency executing programs with high
`
`spatial locality because of the additional runs to the main memory.
`
`[0009] Another consideration in cache design is cache associativity, which
`
`refers to the mapping between locations in main memory and cache sectors.
`
`At one extreme of cache associativity is a direct-mapped cache, while at
`another extreme is a fully associative cache.
`In a direct mapped-cache, a
`specific memory location can be mappedto only a single cache line. Direct-
`
`mapped caches have the advantage of being fast and easy to construct in
`
`logic. The disadvantage is that they suffer the maximum number of cache
`
`conflicts. At the other extreme, a fully associative cache allows a specific
`
`location in memory to be mappedto any cacheline. Fully associative caches
`
`tend to be slower and more complex due to the large amount of comparison
`
`logic they need, but suffer no cache conflict misses. Oftentimes, caches fall
`
`between the extremes of direct-mapped and fully associative caches. A design
`
`point between the extremesis a k-set associative cache, where each memory
`
`location can map to k cache sectors. These caches generally have less
`
`overhead than fully associative caches, and reduce cache conflicts by
`
`increasing the value of k.
`
`[0010] Another consideration in cache design is how cachelines are replaced
`
`due to a capacity or conflict miss.
`
`In a direct-mapped cache, there is only one
`
`possible cache line that can be replaced due to a miss. However, in caches
`
`with higher levels of associativity, cache lines can be replaced in more that one
`
`way. The way the cachelines are replaced is referred to as the replacement
`
`policy.
`
`WWCS - 80404/0033 - 68254 v2
`
`-3-
`
`6
`
`
`
`[0011] Options for the replacement policy include least recently used (LRU),
`
`random replacement, andfirst in—first out (FIFO). LRU is used in the majority
`
`of circumstances where the temporallocality set is smaller than the cachesize,
`
`but
`
`it
`
`is normally more expensive to build in hardware than a random
`
`replacement cache. An LRU policy can also quickly degrade depending on the
`
`working set size. For example, consider an iterative application with a matrix
`
`size of N bytes running through a LRU cacheof size M bytes.
`
`If N is less than
`
`M, then the policy has the desired behavior of 100% cachehits, however,if N is
`
`only slightly larger than M, the LRUpolicy results in 0% cachehits as lines are
`
`removedjust as they are needed.
`
`[0012] Another consideration is deciding on a write policy for the cache. Write-
`
`through caches send data through the cache hierarchy to main memory. This
`
`policy reduces cache coherency issues for multiple processor systems and is
`
`best suited for data that will not be re-read by the processor in the immediate
`
`future.
`
`In contrast, write-back caches place a copyofthe data in the cache, but
`
`does not immediately update main memory. This type of caching works best
`
`when a data just written to the cache is quickly requested again by the
`
`processor.
`
`[0013] In addition to write-through and write-back caches, another kind of write
`policy is implemented in a write-allocate cache where a cacheline is allocated
`
`on a write that misses in cache. Write-allocate caches improve performance
`
`when the microprocessor exhibits a lot of write followed by read behavior.
`
`However, when writes are not subsequently read, a write-allocate cache has a
`
`number of disadvantages: When a cacheline is allocated,
`
`it is necessary to
`
`read the remaining values from main memory to complete the cache line. This
`
`adds unnecessary memory readtraffic during store operations. Also, when the
`
`data is not read again, potentially useful data in the cache is displaced by the
`
`unused data.
`
`ACS- 0404/0033 - 68254 v2
`
`-4-
`
`7
`
`
`
`[0014] Another consideration is made between the size and the speed of the
`
`cache:
`
`small caches are typically much faster than larger caches, but store
`
`less data and fewerinstructions. Less data means a greater chance the cache
`
`will not have data the microprocessor is requesting (i.e., a cache miss) which
`
`can slow everything down while the data is being retrieved from the main
`
`memory.
`
`[0015] Newer cache designs reduce the frequency of cache missesbytrying to
`
`predict in advance the data that the microprocessorwill request. An example of
`
`this type of cache is one that supports speculative execution and branch
`
`prediction. Speculative execution allows instructions thatlikely will be executed
`
`to start early based on branch prediction. Results are stored in a cache called
`
`a reorder buffer and retired if the branch was correctly predicted. Of course,
`
`when mis-predictions occur instruction and data bandwidth are wasted.
`
`[0016] There are additional considerations and tradeoffs in cache design, butit
`
`should be apparent from the considerations described hereinbefore thatit is
`
`very difficult
`
`to design a single cache structure that is optimized for many
`
`different programs. This makes cache design particularly challenging for a
`
`multipurpose microprocessor that executes a wide variety of programs. Cache
`
`designerstry to derive the program behavior of “average” program constructed
`
`from several actual programs that run on the microprocessor. The cache is
`
`optimized for the average program, but no actual program behaves exactly like
`
`the average program. As a result, the designed cache ends up being sub-
`
`optimal for nearly every program actually executed by the microprocessor.
`
`Thus,
`
`there is a need for memory hierarchies that have data storage and
`
`retrieval characteristics that are optimized for actual programs executed by a
`
`processor.
`
`[0017] Designers trying to develop ever more efficient caches optimized for a
`
`variety of actual programs also face another problem: as caches add additional
`
`features, the overhead needed to implement the added features also grows.
`
`WCS - SO404/0033 - 68254 v2
`
`-5-
`
`8
`
`
`
`Caches today have so much overhead that microprocessor performance may
`
`be reaching a point of diminishing returns as the overhead starts to cut into
`
`performance.
`
`In the Intel Pentium Ill processor for example, more than half of
`
`the 10 million transistors are dedicated to instruction cache, branch prediction,
`
`out-of-order execution and superscalar logic. The situation has prompted
`
`predictions that as microprocessors grow to a billion transistors per chip,
`
`performanceincreases will drop to about 20% per year. Such a prediction, if
`
`borne out, could have a significant
`
`impact on technology growth and the
`
`computerbusiness.
`
`[0018] Thus, there is a growing need to develop improved memory hierarchies
`
`that limit the overhead of a memory hierarchy without also reducing bandwidth
`
`efficiency and utilization.
`
`SUMMARYOF THE INVENTION
`
`[0019] Accordingly, an embodiment of the invention includes a reconfigurable
`
`processorthat includes a computational unit and a data access unit coupled to
`the computational unit, where the data access unit retrieves data from an on-
`processor memory and supplies the data to the computational unit, and where
`
`the computational unit and the data access unit are configured by a program.
`
`[0020] The present
`
`invention also involves a reconfigurable processor that
`
`includes a first memory of a first type and a data prefetch unit coupled to the
`
`memory, where the data prefetch unit retrieves data from a second memory of
`
`a second type different from the first type, and the first and second memory
`
`types and the data prefetch unit are configured by a program.
`
`[0021]Another embodiment of
`
`the
`
`invention includes
`
`a_
`
`reconfigurable
`
`hardware system that includes a common memory, also referred to as external
`
`memory, and one or more reconfigurable processors coupled to the common
`
`memory, where at least one of the reconfigurable processors includes a data
`
`prefetch unit to read and write data between the unit and the common memory,
`
`WCS- 80404/0033 - 68254 v2
`
`-6-
`
`9
`
`
`
`and where the data prefetch unit is configured by a program executed on the
`
`system.
`
`[0022] Another embodimentof the invention includes a method of transferring
`
`data that includes transferring data between a memory and a data prefetch unit
`
`in a reconfigurable processor, transferring data between the prefetch unit and a
`
`data access unit, and transferring the data between a computational unit and
`
`the data access unit, where the computational unit, data access unit and the
`
`data prefetch unit are configured by a program.
`
`[0023] Additional embodiments of the invention are set forth in part
`
`in the
`
`description that follows, and in part will become apparent to thoseskilled in the
`
`art upon examination of the following specification, or may be learned by the
`
`practice of the invention. The advantages of the invention may be realized and
`
`attained by means of the instrumentalities, combinations, compositions, and
`
`methodsparticularly pointed out in the appendedclaims.
`
`BRIEF DESCRIPTION OF THE DRAWINGS
`
`[0024] Figure 1
`
`shows a reconfigurable processor in which the present
`
`invention may be implemented;
`
`[0025] Figure 2 shows computational
`
`logic as might be loaded into a
`
`reconfigurable processor;
`
`[0026] Figure 3 shows a reconfigurable processor as in Figure 1, but with the
`
`addition of data accessunits;
`
`[0027] Figure 4 shows a reconfigurable processor as in Figure 3, but with the
`
`addition of data prefetch units;
`
`[0028] Figure 5 shows reconfigurable processor with the inclusion of external
`
`memory;
`
`WCS- 80404/0033 - 68254 v2
`
`-f-
`
`10
`
`10
`
`
`
`[0029] Figure 6 shows reconfigurable processors with external memory and
`
`with an intelligent memory controller;
`
`[0030] Figure 7 shows a reconfigurable processor having a combination of
`
`data prefetch units and data access units feeding computational logic;
`
`[0031] Figure 8 shows the bandwidth efficiency and utilization gains obtained
`
`when utilizing a data prefetch unit and an intelligent memory controller to
`
`perform strided memory references;
`
`[0032] Figure 9A and Figure 9B show the bandwidth efficiency and utilization
`
`gains obtained when utilizing a data prefetch unit and an intelligent memory
`
`controller to perform subset memory references in X-Y plane;
`
`[0033] Figure 10A and Figure 10B show the bandwidth efficiency and utilization
`
`gains obtained whenutilizing a data prefetch unit and an intelligent memory
`
`controller to perform subset memory references in X-Z plane;
`
`[0034] Figure 11A and Figure 11B show the bandwidth efficiency and utilization
`
`gains obtained whenutilizing a data prefetch unit and an intelligent memory
`
`controller to perform subset memory referencesin Y-Z plane;
`
`[0035] Figure 12A and Figure 12B show the bandwidth efficiency and utilization
`gains obtained whenutilizing a data prefetch unit and an intelligent memory
`controller to perform subset memory references in a mini-cube;
`
`[0036] Figure 13 shows the bandwidth efficiency and utilization gains obtained
`
`when utilizing a data prefetch unit and an intelligent memory controller to
`
`perform indirect memory references;
`
`[0037] Figure 14 shows the bandwidth efficiency and utilization gains obtained
`
`when utilizing a data prefetch unit and an intelligent memory controller to
`
`perform strided memory reference together with computation.
`
`WES - 80404/0033 - 68254 v2
`
`-8-
`
`11
`
`11
`
`
`
`DETAILED DESCRIPTION
`
`1.
`
`Definitions:
`
`[0038] Direct execution logic (DEL)
`
`-
`
`is an assemblage of dynamically
`
`reconfigurable functional elements that enables a program to establish an
`
`optimized interconnection among selected functional units
`
`in order
`
`to
`
`implement a desired computational, data prefetch and/or data access
`
`functionality for maximizing the parallelism inherent in the particular code.
`
`[0039] Reconfigurable Processor — is a computing device that contains
`
`reconfigurable components such as FPGAs and can, through reconfiguration,
`
`instantiate an algorithm as hardware.
`
`_
`
`[0040] Reconfigurable Logic -—
`is composed of an interconnection of
`functional units, control, and storage that implements an algorithm and can be
`loaded into a Reconfigurable Processor.
`
`[0041} Functional Unit — is a set of logic that performs a specific operation.
`
`The operation may for example be arithmetic,
`
`logical, control, or data
`
`movement. Functional units are used as building blocks of reconfigurable logic.
`
`[0042] Macro—is another namefora functionalunit.
`
`[0043] Memory Hierarchy — is a collection of memories
`
`[0044] Data prefetch Unit — is a functional unit that moves data between
`
`members of a memory hierarchy. The movement may be as simple as a copy,
`
`or as complex as anindirect indexed strided copy into a unit stride memory.
`
`[0045] Data access Unit — is a functional unit that accesses a componentof a
`
`memory hierarchy, and delivers data directly to computationallogic.
`
`WiCS - BO404/0033 - 68254 v2
`
`aQ.
`
`12
`
`12
`
`
`
`[0046] Intelligent Memory Control Unit — is a control unit that has the ability to
`
`select data from its storage according to a variety of algorithms that can be
`
`selected by a data requestor, such as a data prefetch unit.
`
`[0047] Bandwidth Efficiency — is defined as the percentage of contributory
`
`data transferred between two points. Contributory data is data that actually
`
`participates in the recipients processing.
`
`[0048] Bandwidth Utilization — is defined as the percentage of maximum
`bandwidth betweentwopoints that is actually used to pass contributory data.
`
`2.
`
`Description
`
`[0049]A reconfigurable processor (RP) 100 implements direct executable logic
`
`(DEL) to perform computation, as well a memory hierarchy for maintaining input
`
`data and computational
`
`results.
`
`DEL is an assemblage of dynamically
`
`reconfigurable functional elements that enables a program to establish an
`
`optimized interconnection among selected functional units
`
`in order
`
`to
`
`implement a desired computational, data prefetch and/or data access
`
`functionality for maximizing the parallelism inherent in the particular code. The
`
`term DEL mayalso be used to refer to the set of constructs such as code,
`
`data, configuration variables, and the like that can be loaded into RP 100 to
`
`cause RP 100 to implement a particular assemblage of functional elements.
`
`[0050] Figure 1 presents an RP 100, which may be implemented using field
`
`programmable gate arrays (FPGAs) or other reconfigurable logic devices, that
`
`can be
`
`configured and reconfigured to contain functional units
`
`and
`
`interconnecting circuits, and a memory hierarchy comprising on-board memory
`
`banks 104, on-chip block RAM 106, registers wires, and a connection 108 to
`
`external memory. On-chip reconfigurable components 102 create memory
`
`structures such as registers, FIFOs, wires and arrays using block RAM. Dual-
`
`ported memory 106 is shared between on-chip reconfigurable components 102.
`
`The reconfigurable processor 100 also implements user-defined computational
`
`WWCS - BOS04A0033 - 68254 v2
`
`-1 0-
`
`13
`
`13
`
`
`
`logic (e.g., such as DEL 200 shownin Figure 2) constructed by programming
`
`an FPGAto implementa particular interconnection of computational functional
`
`units.
`
`In a particular implementation, a number of RPs 100 are implemented
`
`within a memory subsystem of a conventional computer, such as on devices
`
`that are physically installed in dual inline memory module (DIMM) sockets of a
`
`computer.
`In this manner the RPs 100 can be accessed by memory operations
`and so coexist well with a more conventional hardware platform.
`It should be
`noted that, although the exemplary implementation of the present invention
`
`illustrated
`
`includes
`
`six banks
`
`of dual ported memory 104 and two
`
`reconfigurable components 102, any number of memory banks and/or
`
`reconfigurable components may be used depending upon the particular
`
`implementation or application.
`
`[0051]Any computer program,
`
`including
`
`complex graphics processing
`
`programs, word processing programs, database programs and thelike, is a
`
`collection of algorithms that interact to implement desired functionality.
`
`In the
`
`common casein which static computing hardware resources are used (€.g.,
`a
`conventional microprocessor), the computer program is compiled into a set of
`
`executable code (i.e., object code) units that are linked together to implement
`
`the computer program onthe particular hardware resources. The executable
`
`code is generated specifically for a particular hardware platform.
`
`In this
`
`manner, the computer program is adapted to conform to the limitations of the
`
`static hardware platform. However,
`
`the compilation process makes many
`
`compromises basedonthelimitations of the static hardware platform.
`
`[0052] Alternatively, an algorithm can be defined in a high level language then
`compiled into DEL. DEL can be produced via a compiler from high level
`
`programming languages such as C or FORTRANor maybe designed using a
`
`hardware definition language such as Verilog, VHDL or a schematic capture
`
`tool. Computation is performed by reconfiguring a reconfigurable processor
`
`with the DEL and flowing data through the computation.
`
`In this manner, the
`
`WS - 80404/0033 - 68254 v2
`
`-1 1 -
`
`14
`
`14
`
`
`
`hardware resources are essentially adapted to conform to the program rather
`
`than the program being adapted to conform to the hardware resources.
`
`[0053] For purposesof this description a single reconfigurable processorwill be
`
`presented first. A sample of computational logic 201 is shown in Figure 2. This
`
`simple assemblage of functional units performs computation of two results
`
`("A+B" and "A+B-(B*C)) from three input variables or operands "A", "B" and
`
`"C".
`
`In practice, computational units 201 can be implemented to perform very
`
`simple or arbitrarily complex computations. The input variables (operands) and
`output or result variables may be of any size necessary for a particular
`application. Theoretically, any number of operands and result variables may be
`
`used/generated by a particular DEL. Great complexity of computation can be
`
`supported by adding additional reconfigurable chips and processors.
`
`[0054] For greatest performance the DEL 200 is constructed as parallel
`
`pipelined logic blocks composed of computational functional units capable of
`
`taking data and producing results with each clock pulse. The highest possible
`
`performance that can be achieved is computation of a set of results with each
`
`clock pulse. To achieve this, data should be available at the same rate the
`computation can consumethe data. The rate at which data can be supplied to
`DEL 200 is determined, at least in significant part, by the memory bandwidth
`
`utilization and efficiency. Maximal computational performance can be achieved
`
`with parallel and pipelined DEL together with maximizing the memory
`
`bandwidth utilization and efficiency.
`
`Unlike - conventional static hardware
`
`platforms, however,
`
`the memory hierarchy provided in
`
`a RP 100 is
`
`reconfigurable.
`
`In accordance with the present invention, through the use of
`
`data
`
`access
`
`units
`
`and
`
`associated memory
`
`hierarchy
`
`components,
`
`computational demands and memory bandwidth can be matched.
`
`[0055] High memory bandwidth efficiency is achieved when only data required
`
`for computation is moved within the memory hierarchy. Figure 3 shows a
`
`simple logic block 300 comprising computational functional units 301, control
`
`AWCS - 80404/0033 - 68254 v2
`
`-1 2-
`
`15
`
`15
`
`
`
`(not shown), and data access functional units 303. The data access unit 303
`
`presents data directly to the computational logic 301.
`
`In this manner, data is
`
`moved from a memory device 305 to the computational logic and from the
`
`computational logic back into a memory device 305 or block RAM memory 307
`
`within an RP 100.
`
`[0056] Figure 4 illustrates the logic block 300 with an addition of a data prefetch
`
`unit 401. The data prefetch unit 401 moves data from one memberof the
`
`memory hierarchy 305 to another 308. Data prefetch unit 401 operates
`
`independently of other functional units 301, 302 and 303 and can therefore
`
`operate prior
`
`to,
`
`in parallel with, or after computational
`
`logic.
`
`This
`
`independence of operation permits hiding the latency associated with obtaining
`
`data for use in computation. The data prefetch unit deposits data into the
`
`memory hierarchy within RP 100, where computational logic 301, 302 and 303 .
`
`can access it through data access units.
`
`In the example of Figure 4, prefetch
`
`unit 401 is configured to deposit data into block RAM memory 308. Hence, the
`
`prefetch units 401 may be operated independently of logic block 300 that uses
`
`prefetched data.
`
`[0057] An important feature of the present invention is that many types of data
`
`prefetch units can be defined so that the prefetch hardware can be configured
`
`to conform to the needs of the algorithms currently implemented by the
`
`computational
`
`logic.
`
`The specific characteristics of the prefetch can be
`
`matched with the needs of the computational logic and the format and location
`
`of data in the memory hierarchy. For example, Figure 9A and Figure 9B show
`
`an external memory that is organi